Some days ago I attended a series of lectures organized by Telefonica Research (TID) where they explained several projects that have been developing in the last years in the field of Big Data. These projects or use cases mostly are related to the use of data gathered from their mobile phone communications, besides other sources such as credit card transactions and social networks (e.g. twitter and facebook). In general, the talks showed interesting information both at content and format level. In addition, two key concepts, as it might be expected, were mentioned repeatedly during the explanation: “anonymity” and “aggregation”, conveying that personal data they collected are protected, in order to ensure the privacy of their users. Although I don’t want to doubt about this, we must recognize that this is a controversial issue both for Telcos and OTTs and whose discussion isn’t over; still a lot of water under the bridge must flow in order to clarify the suitable use of personal data. I mean It’s necessary the existence of a strict legal framework that protects users worldwide, but that’s another topic for another post.
Well then, I understand and guess that in general the focus of these talks was simply to present compelling and novel visualizations, so audience can glimpse the power behind the data and the endless options that the use of Big Data technology can bring us in the future. Visualizations such as the movement of Russian tourists or cruise passengers through Barcelona, where they sleep, eat or buy luxury items, food, etc, (all geo-located) or also sentiment analysis by means of twitter of a determined event, etc. etc. Also to mention some research projects with social scope where TID is involved in several countries as could be the analysis of crowd movement after an earthquake or during a flood, i.e. migration events. On the other hand, they highlighted also something very important and revealing: analyze the behaviour of people by means of their movements through cellular radio system (and social networks) provides a more accurate and less biased notion of the users (potential clients) than an opinion survey. Anyway, it gets the feeling that Big Data is a world of possibilities and as Henry Ford said: “If I had asked people what they wanted, they would have said faster horses”. See Smart Steps project by TID.
However, it’s logical to think all this is the tip of the iceberg of an emerging business that could be very lucrative: “sales data”…well in fact, it already is the case. This reminds me a news of October 2012 where Von McConnell, director of technology at Sprint said, in relation to if Telcos became nothing more than a dumb pipe, “we could make a living just out of analytics”, this is, Telcos can survive on Big Data alone. Besides, I remember last year at Telecom Big Data conference (Barcelona), Telcos were aware that they are “sitting” on a goldmine of data and already are working on mechanisms to get useful business information, at all level, with a main goal: data monetization. However, here I would like to mention briefly an aspect that can modify this scenario: there exists a war Telco vs OTT players for dominance over the data, but it’s another story which we must be alert. Anyway, currently some relevant topics in a telco are: marketing analytics, M2M solutions, voice analytics, operational management (network and devices), advertising models, recommendation systems (cross/ up selling) etc. This give us an idea about the topics that Telcos are currently working. By the way, I recommend to check Okapi project by TID (Tools for large-scale Machine Learning and Graph Analytics).
Configuring RHIPE and SDN-Mininet
Well, actually this preamble was only a pretext to present a simple example where is possible to see an application of Big Data & analytics tools (e.g. Map/Reduce Hadoop and R) over data gathered from a network. It’s true however, these are well-known issues that I already had mentioned in previous posts, but my intention this time (as well as to repeat my speech) is to place Big Data in a context purely of network. Typically when we talk about Big Data or analytics in a Telco, some common examples appear such as customer churn analysis or pattern analysis over a cellular radio system. SDN (and NFV) in this sense, by decoupling control and data plane, offers a clear opportunity to manage network communications in a centralized way, with which now it’s possible to have server a farm (Data Center) that can process several network metric in real time using Big Data analytics, i.e. now can be possible to do an advanced network tomography: huge network matrix, delay matrix, loss matrix, link state, alarms, etc
Anyway… currently however, I don’t have access to real traffic data of a Telco, which would be ideal, but as proof of concept, a simple network created with Mininet is enough from my point of view. So, I programmed a tree-based topology with Python, with an external POX controller and a series of Open Flow switches and hosts. In this tree-based topology is possible to configure the fanout (number of ports) and some characteristics for the links such as bandwidth and delay. It’s very easy to add packet loss rate or CPU load, but this time I only used the first two features. Moreover, it isn’t very complicated to programme a fat-tree or jellyfish topologies or inclusive random networks, if you prefer to work with more complex networks.
On the other hand, I used wireshark tool to gather network data. In any case, I only want to capture ICMP packets in order to calculate simply the latencies between nodes and then construct a “Delay Matrix”. Actually this is very simple, but now all analysis will be done with RHIPE package, in order to apply a Map/Reduce & HDFS scheme. According to Tessera project: “RHIPE is a R-Hadoop Integrated Programming Environment. RHIPE allows an analyst to run Hadoop MapReduce jobs wholly from within R. RHIPE is used by datadr when the back end for datadr is Hadoop. You can also perform D&R (Divide and Recombine) operations directly through RHIPE MapReduce jobs, as MapReduce is sufficient for D&R, although in this case you are programming at a lower level than for datadr.” So, basically RHIPE is a R library that acts as “wrapper” which allows interact directly with Hadoop.
Configuring Mininet (see Github for test_tree.py)
# Loading wireshark sudo wireshark & # Filter ICMP (hiding Open Flow messages) icmp && !(of) && ip.addr == 10.0.0.0/24 #Loading POX controller ~/pox$ ./pox.py forwarding.l2_learning #Loading tree-based topology sudo python test_tree.py