Data Blast

Data, Telecom, Maths, Astronomy, Origami, and so on



Remembrances, Exoplanets, and Data Mining in Astronomy

Just a couple of weeks ago I read an interesting article in Scientific American (Feb 26) about the Kepler mission and the search for extrasolar planets, or exoplanets. The report discussed the number of exoplanets validated to date by the Kepler telescope team: 715, a revealing figure given that the mission was only launched in 2009 and stopped taking science data in 2013 because of technical failures. All of this is promising, because to date only a negligible region of the Universe has been explored, and in a short time. Many planets remain undiscovered, and the most important goal is to find planets with habitable conditions. In this sense, another approach that comes to mind is Project Phoenix (SETI), devoted to the search for extraterrestrial intelligence through the analysis of patterns in radio signals. In 2004, however, it was announced that after checking some 800 stars the project had failed to find any evidence of extraterrestrial signals. In any case, all these attempts are hugely valuable for improving our knowledge of the Universe.

Also, two weeks ago, on a boring Sunday, I bumped into a TV show called "How the Universe Works: Planets from Hell" (S02E03, Discovery Max). It explained the features and conditions of the planets detected to date: most of them are gas giants and massive planets, i.e. new worlds with extreme environments at both ends, cold and heat, which make life as we know it impossible. However, an interesting concept was mentioned, the "circumstellar habitable zone" (Goldilocks zone): the region around a star, neither too cold nor too hot, where an orbiting planet similar to the Earth can support liquid water, an essential ingredient for life (as we know it). Continuing with this idea, another concept emerged: the "Earth Similarity Index" (ESI). Although it isn't a measure of habitability, it is at least a reference point, because it measures how physically similar a planetary-mass object is to the Earth. The ESI is a formula that depends on four planetary parameters: radius, density, escape velocity and surface temperature. Its output is a value on a scale from 0 to 1, with the Earth having a reference value of one.
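As a rough illustration, the ESI is a weighted geometric mean of per-parameter similarities between the planet and the Earth. The snippet below is a hedged Python sketch of that idea; the weight exponents and the sample planet are illustrative assumptions, not the officially calibrated values.

```python
# Hedged sketch of an Earth Similarity Index: a weighted geometric mean of
# per-parameter similarities. Weights and the sample planet are illustrative only.
import numpy as np

def esi(planet, earth, weights):
    n = len(planet)
    terms = [(1.0 - abs((planet[k] - earth[k]) / (planet[k] + earth[k]))) ** (weights[k] / n)
             for k in planet]
    return float(np.prod(terms))

# Earth reference values (radius, density and escape velocity in Earth units,
# surface temperature in kelvin), so esi(earth, earth, weights) == 1 by construction.
earth = {"radius": 1.0, "density": 1.0, "escape_velocity": 1.0, "surface_temp": 288.0}

# Assumed weight exponents for this sketch (the published index uses calibrated values).
weights = {"radius": 0.57, "density": 1.07, "escape_velocity": 0.70, "surface_temp": 5.58}

# A hypothetical super-Earth.
planet = {"radius": 1.6, "density": 0.9, "escape_velocity": 1.4, "surface_temp": 305.0}

print(round(esi(planet, earth, weights), 3))
```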

Also in the same TV show, Michio Kaku, the well-known theoretical physicist, said something along these lines: the science-fiction writers of the 1950s fell short in their descriptions of these strange worlds, and here reality has clearly exceeded fiction. At this point I would like to recall an amazing book, "The Face of the Deep" (1943) by Edmond Hamilton, in which "Captain Future" (my childhood hero, I admit it, especially his Japanese cartoon) falls, along with a group of outlaws, onto a planet on the verge of extinction. Actually, this "planetoid" was at the Roche limit, i.e. the closest that a moon (or a planetoid in this case) can come to a more massive planet without being broken up by the planet's gravitational forces. It lies at about 2½ times the radius of the planet from the planet's centre. Well, better to go to the source as a tribute:

 “He swept his hand in a grandiloquent gesture. “Out there beyond Pluto’s orbit is a whole universe for our refuge! Out there across the interstellar void are stars and worlds beyond number. You know that exploring expeditions have already visited the worlds of Alpha Centauri, and returned. They found those worlds wild and strange, but habitable.

The Martian’s voice deepened. “I propose that we steer for Alpha Centauri. It’s billions of miles away, I know. But we can use the auxiliary vibration-drive to pump this ship gradually up to a speed that will take it to that other star in several months.

Two months from now, this planetoid will be so near the System that its tidal strains will burst it asunder. Roche’s Limit, which determines the critical distance at which a celestial body nearing a larger body will burst into fragments, operates in the case of this worldlet as though the whole System were one great body it was approaching.

The fat Uranian’s moonlike yellow face twitched with fear, and his voice was husky. “It’s true that Roche’s Limit will operate for the whole System as though for one body, in affecting an unstable planetoid like this. If this planetoid gets much nearer than four billion miles, it will burst”.
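As a side note, the "about 2½ times the radius" figure mentioned above corresponds to the classical fluid-body approximation of the Roche limit; for a fluid satellite of density ρ_m orbiting a primary of radius R_M and density ρ_M it reads roughly:

```latex
d \;\approx\; 2.44 \, R_M \left( \frac{\rho_M}{\rho_m} \right)^{1/3}
```

so the often-quoted "2.5 radii" holds when the two densities are comparable.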

My childhood memories and my old astronomical dictionary (by Ian Ridpath). A modern edition now exists, the Oxford Dictionary of Astronomy (2012).


Continuing with my memories, I must say that Astronomy has always been a very interesting topic for me, even more so during my engineering-student days, when I was fortunate to work as a summer intern in the electronics labs of two major observatories in Chile: the ESO-La Silla Observatory in 1995 and the Cerro Tololo Inter-American Observatory in 1996. I also worked on a university project to develop an autoguider system for a telescope, i.e. an electronic system that improves astrophotography sessions by keeping stars perfectly round during long exposures. I still remember working with elements such as the TC-211 CCD image sensor (165 x 192 pixels; laughable today), programming microprocessors in assembler, a Peltier cell as a cooling system, and a servomotor controller for an equatorial mount, among other things.


With all this, at the time I felt like an amateur astronomer, or at least an Astronomy enthusiast. Looking back, however, I have the feeling that an amateur astronomer in 1995 or 1996 could only take astronomical photos and maybe apply some filtering technique (e.g. the Richardson-Lucy algorithm) to improve the image, or perhaps build a homemade telescope with a robotic dome. In this sense, I remember some amazing reports in Sky & Telescope magazine (I subscribed for five years; I recommend it) where people shared their astrophotography experiences, gave tips on telescope construction, and occasionally some "visionary" took the stage to explain, for example, how to build a spectrograph with optical fiber or some other advanced technique. But beyond this, I don't remember any remarkable report on astronomical data analysis done by an amateur.

At that time, of course, the Internet wasn't as developed as it is today, the available technical resources were scarce, and external information from astronomical organizations, i.e. astronomical open data, was minimal or non-existent. Unfortunately, after many years of being disconnected from Astronomy, I have only recently discovered a "new world": many astronomical organizations have opened their databases and enabled APIs, so anyone can download an astronomical file, and there are even websites that include powerful analysis tools. Someone may tell me this isn't new, and he or she is absolutely right, but for me it's a real discovery, since today I spend my time analyzing data; all this is a gold mine and a great opportunity to continue with my hobby. Using tools like Python or R, it's possible to extract interesting information about exoplanets or any other astronomical object of interest, and so anyone can contribute to enriching and expanding general knowledge in Astronomy. However, as rewarding as it is to analyze data "from scratch and from home", looking up at a starry night is a sensation comparable to nothing else, especially in the Southern hemisphere (the Atacama desert) with the Magellanic Clouds.

In Python there are many packages and libraries for astronomical data analysis. AstroPython, for example, is a great website where it's possible to find resources ranging from mailing lists and tutorials to Machine Learning and Data Mining tools like AstroML, whose authors recently (January 2014) published an excellent book, "Statistics, Data Mining, and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data" (2014) by Zeljko Ivezic et al. For R there is a similar book, "Modern Statistical Methods for Astronomy: with R Applications" (2012) by Eric D. Feigelson and G. Jogesh Babu. In general, much astronomical data comes in FITS (Flexible Image Transport System) format, and once you decode it you can use any programming language (e.g. Python, C, Java, R) and any Machine Learning tool, such as Weka, KNIME, RapidMiner, Orange, or work directly with Python (using scikit-learn and pandas) or R scripts (here I also recommend RStudio).
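As a minimal sketch of what "decoding" a FITS file looks like in practice, this snippet opens a Kepler light-curve file with PyFITS (nowadays shipped as astropy.io.fits) and lists the table columns; the file name is just a placeholder.

```python
# Minimal FITS inspection sketch; the file name below is a placeholder.
import pyfits  # or: from astropy.io import fits as pyfits

hdulist = pyfits.open("kplr_example_llc.fits")
hdulist.info()                    # list the HDUs contained in the file
table = hdulist[1].data           # Kepler light-curve tables live in the first extension
print(table.columns.names)        # e.g. TIME, SAP_FLUX, PDCSAP_FLUX, ...
hdulist.close()
```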

Well, coming back to the initial topic, exoplanets, I would like to share some things you can do with astronomical open data and Python, but first let me comment briefly on exoplanets themselves.

A brief glance at Exoplanets

As mentioned before, the Kepler mission was launched in 2009 and from that moment until 2013 it sent back valuable information about possible exoplanets. The Kepler satellite orbits the Sun (not the Earth) and points its photometer at a field in the northern constellations of Cygnus, Lyra and Draco. In the NASA Exoplanet Archive it's possible to find plenty of information about the mission, its technical characteristics and current statistics, as well as access to interesting tools for data analysis.

I merely want to add that there are two main approaches for detecting exoplanets: direct and indirect methods. The former is simply based on direct observation of a planet, i.e. direct imaging. It's a complicated method because, as we know, a planet is an extremely faint light source compared to its star, and its light tends to be lost in the glare of the host star. An indirect method, in contrast, consists of observing the effects that the planet produces (or exhibits) on the host star (see the table below).

Table: Exoplanet detection methods

As can be seen in the following figures, the transit method, based on light curves, is currently the most common technique used in exoplanet detection. A light curve is simply an astronomical time series: the brightness of a celestial object over time. Light-curve analysis is an important tool in astrophysics, used to estimate stellar masses and distances to astronomical objects. Additional information in these links: Link1, Link2 and Link3.

Number of exoplanets detected to date (Source: NASA, more plots in this link)

Mass of the detected planets with respect to Jupiter (Source: NASA)

On the MAST Kepler Public Light Curve website it's possible to download light curves from the Kepler mission, as a tarfile or individually in FITS format, for different quarters and cadence types. For easier access, however, I recommend using Python packages such as kplr and PyFITS; a minimal sketch follows the parameter list below. The following figure shows a simple example of getting the light curve for the confirmed exoplanet Kepler-7b (KepID: 5780885, KOI Name: K00097.01, Q5, long cadence (lc), normalized data). You can also use the NASA Exoplanet Archive tools to visualize the data. A light curve contains several data columns (see more detail here) but, in this example, I only used the following parameters:

  • TIME (64-bit floating point): the time at the mid-point of the cadence in BKJD. Kepler Barycentric Julian Day is the Julian day minus 2454833.0 (UTC = January 1, 2009 12:00:00), corrected to the arrival time at the barycenter of the Solar System.
  • PDCSAP_FLUX (32-bit floating point): the flux contained in the optimal aperture, in electrons per second, after the PDC (Presearch Data Conditioning) module has applied its detrending algorithm to the PA (Photometric Analysis) light curve. In short, it's a preprocessed and calibrated flux. Generating light curves from raw images isn't a trivial process: first the images must be calibrated, because there are many systematic sources of error in the detector (e.g. bias and flat field); then reference stars must be selected and image motion corrected; finally a method such as aperture photometry is applied. Fortunately, in this case all of that work has already been done.
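As promised above, here is a hedged sketch of pulling those light curves with kplr (which wraps the MAST archive and hands back PyFITS-readable files). Quarter selection is left implicit and the normalization is just a division by each file's median flux.

```python
# Hedged sketch: download the Kepler-7b (KOI 97.01) long-cadence light curves with kplr,
# drop NaNs, and normalize each quarter by its median PDCSAP flux.
import kplr
import numpy as np

client = kplr.API()
koi = client.koi(97.01)                              # K00097.01 = Kepler-7b
curves = koi.get_light_curves(short_cadence=False)   # long-cadence datasets only

time, flux = [], []
for lc in curves:
    with lc.open() as hdus:
        data = hdus[1].data
        t, f = data["time"], data["pdcsap_flux"]
        ok = np.isfinite(t) & np.isfinite(f)         # drop gaps and NaNs
        time.append(t[ok])
        flux.append(f[ok] / np.median(f[ok]))        # simple per-quarter normalization

time = np.concatenate(time)
flux = np.concatenate(flux)
```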

Figure: Kepler-7b light curve (Q5, long cadence, normalized)

Figure: Kepler-7b light curve from the NASA Exoplanet Archive tools

Now, given these light curves, you can extract interesting data about exoplanets just "from the plot", as a proof of concept. I used only one curve, from a specific quarter and cadence, although for more accuracy (and statistical rigor) it's really necessary to combine many light curves in the measurement. Anyway, using the same example, the Kepler-7b light curve (Q5, lc), we can roughly estimate: the orbital period (T), the total transit duration (T_t), the flat transit duration (T_f, the duration of the "flat" part of the transit), and the planet-to-star radius ratio. The latter can also be read off the phase plot, i.e. flux vs. phase. In Python you can simply compute "phase = (time % T)/T" and create a new column in your DataFrame, as in the sketch below.
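A minimal version of that folding, reusing the time and flux arrays from the previous sketch; the period is an assumed value, not a fitted one, and the depth estimate is deliberately crude.

```python
# Fold the light curve on an assumed period and read a rough radius ratio off the fold.
import numpy as np
import pandas as pd

df = pd.DataFrame({"time": time, "flux": flux})   # arrays from the kplr sketch above

T = 4.885                                         # assumed orbital period [days]
df["phase"] = (df["time"] % T) / T                # phase in [0, 1)

# Bin the folded curve; the transit depth is roughly the out-of-transit level (median
# of the bins) minus the deepest bin, and the planet/star radius ratio is sqrt(depth).
bins = np.linspace(0.0, 1.0, 201)
binned = df.groupby(np.digitize(df["phase"], bins))["flux"].median()
depth = float(np.median(binned) - binned.min())
print(round(np.sqrt(depth), 3))
```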

Figure: Orbital period and transit durations estimated from the light curve

Figure: Phase plot (flux vs. phase)

So, applying some assumptions and simplifications from "A Unique Solution of Planet and Star Parameters from an Extrasolar Planet Transit Light Curve" (2003) by S. Seager and G. Mallén-Ornelas, we can derive additional parameters such as "a" (the orbital semi-major axis) or the impact parameter "b". With other formulas it's possible to obtain the stellar density, the stellar mass-radius relation, and so on.
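The sketch below is my own reading of the simplified circular-orbit relations in that paper, so treat it as an approximation to be checked against the original; the input numbers are illustrative placeholders rather than a real Kepler-7b fit.

```python
# Rough transit geometry under the Seager & Mallen-Ornelas (2003) simplifications:
# circular orbit, opaque dark planet, M_planet << M_star. Inputs are placeholders.
import numpy as np

P   = 4.885    # orbital period [days] (assumed)
t_T = 0.21     # total transit duration [days] (assumed)
t_F = 0.16     # duration of the flat part of the transit [days] (assumed)
dF  = 0.007    # fractional transit depth (assumed)

k = np.sqrt(dF)                                               # planet-to-star radius ratio
s = np.sin(np.pi * t_F / P) ** 2 / np.sin(np.pi * t_T / P) ** 2

b = np.sqrt(((1 - k) ** 2 - s * (1 + k) ** 2) / (1 - s))      # impact parameter
a_over_Rstar = np.sqrt(((1 + k) ** 2 - b ** 2 * (1 - np.sin(np.pi * t_T / P) ** 2))
                       / np.sin(np.pi * t_T / P) ** 2)        # scaled semi-major axis

# Stellar density from Kepler's third law (SI units), assuming M_planet << M_star.
G = 6.674e-11
rho_star = 3 * np.pi * a_over_Rstar ** 3 / (G * (P * 86400.0) ** 2)   # [kg/m^3]

print(round(float(b), 2), round(float(a_over_Rstar), 2), round(float(rho_star), 1))
```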

Figure: Parameters estimated for Kepler-7b

Figure: Kepler-7b parameters from the NASA Exoplanet Archive

Machine Learning for Exoplanets

As a starting point, I would like to mention an interesting project called PlanetHunters, a citizen-science initiative from the Zooniverse project, launched on December 16, 2010, to detect exoplanets in the light curves of stars recorded by the Kepler mission. It's a tool that exploits the fact that humans are better than computers at recognizing certain visual patterns. Each user contributes an assessment of whether a light curve shows evidence of a planet orbiting the star.

Classifying a light curve isn't an easy task. For example, a small planet could be undetectable because its effect on the dip of the light curve is imperceptible. Often, variations in the intensity of a star are due to internal stellar processes (variable stars) or to the presence of an eclipsing binary system, i.e. a pair of stars that orbit each other. It's worth mentioning that a light curve associated with the transit of an exoplanet will be relatively constant, with a certain regularity, and with small dips corresponding to the transit.

In fact, NASA defines three categories of Kepler Objects of Interest (KOI): confirmed, candidate and false positive. According to them, "a false positive has failed at least one of the tests described in Batalha et al. (2012). A planetary candidate has passed all prior tests conducted to identify false positives, although this does not a priori mean that all possible tests have been conducted. A future test may confirm this KOI as a false positive. False positives can occur when 1) the KOI is in reality an eclipsing binary star, 2) the Kepler light curve is contaminated by a background eclipsing binary, 3) stellar variability is confused for coherent planetary transits, or 4) instrumental artifacts are confused for coherent planetary transits."

Taking as inspiration the paper "Astronomical Implications of Machine Learning" (2013) by Arun Debray and Raymond Wu, I decided to try some supervised ML models to classify light curves and determine whether they correspond to exoplanets (confirmed) or non-exoplanets (false positives). As a footnote: all this is just a proof of concept; I don't intend to write an academic paper and my interest is only at hobby level. I selected 112 confirmed exoplanets and 112 false positives, and considered only the PDCSAP_FLUX and TIME parameters from each light curve, using Q12, long cadence, and normalized data. A key point here is to characterize a light curve in terms of attributes (features) that allow it to be classified.

According to Matthew Graham in his talk “Characterizing Light Curves” (March 2012, Caltech), “light curves can show tremendous variation in their temporal coverage, sampling rates, errors and missing values, etc., which makes comparisons between them difficult and training classifiers even harder. A common approach to tackling this is to characterize a set of light curves via a set of common features and then use this alternate homogeneous representation as the basis for further analysis or training. Many different types of features are used in the literature to capture information contained in the light curve: moments, flux and shape ratios, variability indices, periodicity measures, model representations”.

In my case, I simply used the following attributes: basic dispersion measures (e.g. the 25th/50th/75th percentiles and the standard deviation), a shape ratio (the fraction of the curve below the median), periodicity measures (e.g. amplitudes, frequencies/harmonics, and amplitude ratios between harmonics, via a periodogram based on the Lomb-Scargle algorithm), and the distance from a baseline light curve (a false positive) computed with the Dynamic Time Warping (DTW) algorithm.
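A minimal sketch of the first two groups, assuming a normalized flux array as built earlier (the names mirror the attribute names used later in this post):

```python
# Dispersion and shape-ratio features for one normalized light curve.
import numpy as np

def basic_features(flux):
    p25, p50, p75 = np.percentile(flux, [25, 50, 75])
    return {
        "p25": p25,
        "p50": p50,
        "p75": p75,
        "sd": float(np.std(flux)),
        "sr": float(np.mean(flux < p50)),   # shape ratio: fraction of points below the median
    }
```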

To compute the periodogram I used the pYSOVAR module and the SciPy signal-processing module. According to Peter Plavchan, "A periodogram calculates the significance of different frequencies in time series data to identify any intrinsic periodic signals. A periodogram is similar to the Fourier Transform, but is optimized for unevenly time-sampled data, and for different shapes in periodic signals. Unevenly sampled data is particularly common in Astronomy, where your target might rise and set over several nights, or you have to stop observing with your spacecraft to download the data". For DTW, I used the mlpy module. Alternatively, it's also possible to call R from Python using the Rpy2 module. On this topic, in the paper "Pattern Recognition in Time Series" (2012), Jessica Lin et al. mention that the classical Euclidean distance is very brittle, because it requires the input sequences to be of the same length and it's sensitive to distortions (e.g. shifting) along the time axis; such problems can generally be handled by elastic distance measures such as DTW, which searches for the best alignment between two time series, attempting to minimize the distance between them. For this reason, the DTW distance can be an interesting metric for characterizing a light curve. Finally, two examples, one of a periodogram and one of DTW, are shown below.
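Before the plots, here is a hedged sketch of how those two feature groups can be computed. scipy.signal.lombscargle expects angular frequencies, and the DTW below is a small textbook dynamic-programming implementation rather than the mlpy call used in the post, so the numbers won't match mlpy exactly (and it is only practical on short or downsampled curves).

```python
# Periodicity (Lomb-Scargle) and DTW-distance features for one light curve.
import numpy as np
from scipy.signal import lombscargle

def periodogram_features(time, flux, max_freq=2.0, n_freq=2000):
    freqs = np.linspace(0.01, max_freq, n_freq)              # cycles per day
    power = lombscargle(np.asarray(time, float),
                        np.asarray(flux, float) - np.mean(flux),
                        2.0 * np.pi * freqs)                 # angular frequencies
    k = int(np.argmax(power))
    return {"f1": freqs[k], "Amax": float(power[k])}

def dtw_distance(x, y):
    # Plain O(n*m) dynamic-programming DTW with absolute-difference cost.
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])
```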

Figure: DTW example

Figure: Periodogram example

After generating a DataFrame and writing a csv file (or .tab), I used the Orange tool to apply different classification techniques to the dataset. At this point, as I said above, there are many tools you could use. I tested the available classifiers with a 10-fold cross-validation scheme. Beforehand, it's possible to examine some relationships between attributes such as sd (standard deviation), sr (shape ratio), f1 (frequency of the periodogram maximum), Amax (maximum periodogram amplitude), and dist1 (DTW distance). The shape ratio, as we can see, is an attribute that can give an advantage in the classification task because it allows a clearer separation of the classes.

Figure: Relationships between the light-curve attributes

As a first approach, I used 5 basic methods: Classification Tree, Random Forest, k-NN, Naïve Bayes and SVM. The results are presented in the following tables and plots (ROC curves). There is a large variety of measures that can be used to evaluate classifiers, and some measures used in a particular research project may not be appropriate for a particular problem domain. Choosing the best model is almost an art, because it's necessary to consider the effect of different features and the operating conditions of the algorithms, such as the type of data (e.g. categorical, discrete, continuous), the number of samples per class (a large or small imbalance between classes can bias the classification), performance (e.g. execution time), complexity (to avoid overfitting), accuracy, etc. In order to simplify the selection, and considering for example CA (classification accuracy) and AUC, the Classification Tree, Random Forest and k-NN (k=3) are the best options. Applying performance formulas to the confusion matrix leads to the same conclusion. Moreover, according to the literature an AUC value between 0.8 and 0.9 is considered "good", so the three methods are in that range. By the way, for more detail on how Orange calculates some of these indices, I recommend reading this link. As a final comment, it's true that a more detailed study is needed (e.g. statistical tests, more advanced models, etc.), but my intention was only to show that with a few scripts it's possible to do amateur Astronomy of "certain" quality.
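Orange drives all of this from its GUI; an equivalent hedged sketch with scikit-learn, assuming a hypothetical features.csv holding the attributes above plus a label column, would look like this:

```python
# Hedged 10-fold cross-validation sketch; features.csv and its column names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("features.csv")                      # hypothetical feature table
X = df.drop(columns=["label"]).values
y = (df["label"] == "confirmed").astype(int).values   # 1 = confirmed, 0 = false positive

models = {
    "tree": DecisionTreeClassifier(),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "knn_3": KNeighborsClassifier(n_neighbors=3),
    "naive_bayes": GaussianNB(),
    "svm": SVC(probability=True),
}

for name, model in models.items():
    ca = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name:14s}  CA={ca:.3f}  AUC={auc:.3f}")
```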

Table: Classification results

Figure: ROC curves



Mobile World Congress 2014: A Deep Dive Into Our Digital Future

(Published in Barcinno  on March 07, 2014).

MWC 2014 ended about a week ago and I thought it appropriate to share some of the things I saw, heard, and read during the event. It was my fourth year attending, always with high expectations. To the point: although it's common, especially at this type of event, to discuss which is the best mobile device launched by this or that manufacturer, or which are the most promising apps for the year, my personal interest is in the technological trends of the mobile industry and a few other things that I find interesting, such as the state of Telco APIs, what is happening in the OTT-Telco relationship, and Dr. Genevieve Bell's talk from Intel, which I'll comment on briefly. I should say that at first sight this post doesn't have a defined storyline, because my initial idea was just to mention noteworthy topics without a particular order in mind; still, all these topics are part of the mobile communications ecosystem and are therefore related to each other somehow.

I attended just a couple of talks and the main exhibition, but unfortunately I was unable to attend Mark Zuckerberg's talk in situ. His presence was clearly the most anticipated, thanks to the acquisition of WhatsApp by Facebook for USD 19 billion, but the key point in my view was to understand, among other things, the new scenario that opens up in the fight for supremacy in instant messaging and especially in mobile voice calls at the global level, once WhatsApp announced that in Q2 of this year it will include this last feature for free. It's true that apps such as Line or Viber already have voice services, but WhatsApp is a giant, and in this sense Jan Koum, WhatsApp's CEO, talked figures: "To date, we have 330 million daily and 465 million monthly active users. We also have detected an increase of 15 million in the number of users when the Facebook acquisition became known". This suggests that in the coming months there will be significant movements in the OTT-Telco battle, which will be fought not only here but also in other scenarios, such as Netflix vs. US Telcos (who should pay for the upgrade of the network? is network neutrality obsolete?), but that's another interesting story.

An inspirational Talk: Dr. Genevieve Bell from Intel

I went to Dr. Bell's talk with high expectations. I wanted to know her vision of technology today and I wasn't disappointed. Admittedly, much of what she said is somewhat expected, even obvious; she doesn't reinvent the wheel. But her vision is valuable because, as an anthropologist and director of Intel Corporation's Interaction and Experience Research, she presented a clear and inspiring idea of what we should keep in mind when developing applications or services today. Remember that this talk was given within the scope of the WIPjam workshop, where the audience is typically technical.

At times, app developers and designers tend to focus on technology and forget the real needs; that is, "technology must grasp human nature" in order to produce successful apps. In this sense, she explained that, despite the passage of time, there are 5 things that haven't changed in human behaviour: "1) We need friends and family, 2) We need shared interests, 3) We need a big picture, 4) We need objects that talk for us, and 5) We need our secrets". With this in mind, it's logical to see why social networks have been successful. On the other hand, she also mentioned 5 ways in which technology is reshaping human behaviour and where new questions arise: "1) How to guard our reputations?, 2) How to be surprised or bored?, 3) How to be different?, 4) How to have a sense of time?, and 5) How to be forgotten?".

Figure: Dr. Genevieve Bell at the WIPjam workshop

It's true that these are generic issues with multiple implications, such as privacy and anonymity, but they also point to the fact that human beings "want to be human, not digital", and this must be considered a starting point in the development of new services and applications.

Coming back to MWC…

Telco APIs

The Telco API workshop was interesting in general terms: different visions and one common goal, to provide a flexible and robust API ecosystem that allows Telcos to develop solutions more quickly under the premise of interoperability, scalability and, above all, security. Telcos want to close the gap with the OTTs, trying to recover territory in businesses such as instant messaging and proposing advanced solutions for mobile calls and video streaming by means of the RCS (Rich Communications) suite. To this end, the GSMA, a global association of mobile operators, wants to give a new boost to solutions such as Joyn (an app for chat, voice calls, file sharing, etc.), which so far hasn't delivered the expected results. On the other hand WebRTC, a multimedia framework focused on real-time communications in web browsers, has been gaining momentum in recent months. It isn't clear that WebRTC is going to be fully embraced by Telcos, but they are surely planning some synergies with RCS; at least that's what I perceived at MWC.

The slowness of operators to innovate is well known, whether due to internal bureaucracy, interoperability and integration issues, or standardization delays; but according to different speakers in this workshop, Telcos are aware of this shortcoming and, as one would expect, are working with manufacturers on solutions around RCS and VoLTE (Voice over LTE). Moreover, new business models are needed because, with free competing solutions such as Line, Viber or soon WhatsApp for voice calls, they can no longer charge for services such as SMS or perhaps even mobile calls. It's true that there is a big difference in the targeted markets: I don't know of a big company that uses WhatsApp for its corporate communications, for security reasons, and when you pay for a service you demand quality, security, etc., so Telcos play in another league (for now). In any case, Telcos are determined to create a robust Telco API ecosystem that allows them to gain an advantage over the OTTs.

Some interesting companies present at the Telco API workshop were Apigee, Aepona and Solaiemes (from Spain). In general, Telco APIs are focused on improving interoperability, integration, monetization and service creation, among other things, and can be grouped as follows: Communications (Voice, Messaging, Conferencing), Commerce (Payment, Settlement, Refund, and Identity), Context (Location, Presence, Profile, Device Type), and Control (QoS, Data Connection Profile, Policy, and Security & Trust). For more detail, I also recommend reading about OneAPI, a global GSMA initiative to provide APIs that enable applications to exploit mobile network capabilities.

Big Data and SDN

Since my recent posts about Big Data and SDN nothing has changed in my appraisal, so I would just like to add some ideas to complement the information given previously. I can see that Big Data and SDN/NFV are on track to become key elements supporting OSS (Operational Support Systems) and BSS (Business Support Systems) within Telcos. OSS and BSS are considered main assets whose importance in current and future business is unquestionable. For example, churn is a challenging problem for Telcos, and its predictive analysis is key to avoiding customer attrition. Another example, not without controversy, is the sale of anonymized and aggregated data: thanks to information gathered from cell sites, a telco can indicate the areas where an enterprise might locate this or that type of business, along with other demographic data. I also saw some interesting solutions from SAP (HANA platform) and Hitachi Data Systems. Unfortunately, I was unable to attend the keynote panel "Up Close and Personal: The Power of Big Data", where representatives from Alcatel-Lucent, EMC and SK Planet debated the convergence of ubiquitous connectivity, cloud computing and Big Data analytics. I assume they talked about these issues and the challenges in the industry; it's a new world to explore.

Although SDN and NFV are solutions mainly focused on data centers and backbone networks (also SDN WAN), it seems that mobile networks haven't escaped the charm of SDN either. The proposal is to use SDN's capabilities for centralizing/abstracting the control plane in order to apply traffic-optimization functions (also for radio resources). From a technical point of view this is very interesting because, as we all know, mobile traffic is increasing, Telcos and researchers are searching for practical solutions to mitigate capacity problems, and SDN and NFV could be a real alternative. At MWC, HP also launched its OpenNFV strategy for telcos, with the purpose of helping them accelerate innovation and the generation of new services.

Hetnets, Backhaul, Small Cells and Hotspot 2.0

MWC is the great meeting point where Telcos (mainly mobile carriers) and manufacturers can discuss their network needs and present their services and products. As in 2013, many well-known concepts were heard again in the talks, halls, and booths of the exhibition. However, one key question is always in the air: how to provide more bandwidth (network capacity) and better QoS under increasing traffic and radio spectrum depletion. In this sense, many people are currently talking about a "mobile data crunch", which indicates the need to search for possible solutions. Small cells, for example, are a solution to improve mobile coverage and expand network capacity for indoor and outdoor locations. They can be used to complement a macro cell or to cover a new zone faster, over licensed or unlicensed spectrum. Depending on the needs there are many options when choosing a small cell: pico cells, femto cells, etc.

On the other hand, Wifi networks are seeing the light of day again thanks to Hotspot 2.0, a set of protocols that enables cellular-like roaming (i.e. automatic Wifi authentication and connection). This makes Wifi a real alternative for improving coverage, and many Telcos are already exploring this solution or even planning alliances with established Wifi operators like Fon or Boingo.

All this brings us to another recurrent concept: the Hetnet, i.e. a heterogeneous network focused on the use of multiple types of access nodes in a wireless network, in other words integration and interoperability between traditional macro cells, small cells and Wifi. And how can we connect all these types of access points to the backbone? The answer is simple: through the mobile backhaul, the part of the mobile network that interconnects cell towers with the core network. At MWC different solutions were presented: wired and wireless, based on Ethernet IP/MPLS or microwave, with line-of-sight (LoS), near line-of-sight (nLoS), and non-line-of-sight (NLoS) links, etc. It's a very active and broad area, but it isn't worth going into too much detail in this post.

Wearables and Digital Life

Finally, I would like to comment on two ideas that were highlighted at MWC 2014. First, "wearables", a buzzword that refers to small sensor-based electronic devices (body-borne computing) that may be worn under, with, or on top of clothing, i.e. watches, wristbands, glasses and other fitness sensors. Many companies presented the products that will set the trend this 2014: Fitbit, Smartband (Sony), Gear 2/Gear 2 Neo (Samsung), etc. According to Dr. Bell, wearable computing has a lot of potential because it also fits human nature; although all this is at an early stage (even though it's an old idea), it's clear that technology, as she said, "is changing some of our behaviours and preferences".


Digital Life is another buzzword. Basically, it's the commercial name given to an advanced home automation system by AT&T. I suppose Digital Life, as a name or concept, is related to MIT research on rethinking the human-computer interaction experience. AT&T's solution basically uses the Internet of Things (IoT), Augmented Reality (AR) and many other technologies that "supposedly" will make our life at home more secure, comfortable and easy. I mention all this because "wearables" and "Digital Life" are two old concepts that are gaining strength again and are expected to be trends in the coming months or years. I have left many things out (e.g. LTE evolution, VoLTE, fronthaul, 5G, the Internet of Things, Connected Car/Home, the 4YFN initiative, some local startups, and many others). Perhaps in an upcoming post I'll come back to some of these topics.



Beautiful and Elegant Maths

Some days ago, thanks to my brother Jorge, I found out that British scientists had conducted a study whose results showed that "the experience of mathematical beauty correlates parametrically with activity in the same part of the emotional brain, namely field A1 of the medial orbito-frontal cortex (mOFC), as the experience of beauty derived from other sources (e.g. visual and musical)". At first glance this can sound, say, "weird", but it's an interesting conclusion because it highlights the fact that the beauty of a formula or equation is perceived in the same zone of the brain where we perceive the beauty of a painting or a piece of music. The paper is "The experience of mathematical beauty and its neural correlates" by Semir Zeki et al. (Frontiers in Human Neuroscience, 13 February 2014).

I'm not a neurobiologist and, for now, I don't feel qualified to assess or question the technical procedures (e.g. functional magnetic resonance imaging, fMRI) that led to this conclusion, all the more so since one of the authors is Michael F. Atiyah, a Fields Medal and Abel Prize laureate. I just want to express my satisfaction that the study links concepts such as "Mathematics" and "Beauty", which at first sight are perceived as distant or unrelated terms. From childhood, Mathematics is presented as a difficult and complex (sometimes abstract) subject to understand, and maybe this ends up shaping people's perception.

The study was conducted on a sample of 15 mathematicians, who were asked to rate different equations with a value coded as "-1" for ugly, "0" for neutral, and "1" for beautiful. These responses were then analyzed and contrasted with the fMRI measurements. Perhaps, from a statistical point of view, it's logical to think that this study could be considered incomplete due to the limited sampling or the heterogeneity of the population; on the other hand, it's clearly a first approach and a great starting point for addressing a broader study of the neurobiology of beauty. Moreover, "understanding the meaning" of the numbers and symbols that form an equation is key to appreciating the sought-after beauty, so the level of the experience will be bounded by that understanding; that is, everyone must be "trained" to appreciate the beauty. The following figure shows some of the equations used in the study.

Figure: Equations used in the study

So far, however, I haven't said anything new or amazing that wasn't already mentioned in the paper. I would simply like to comment on some things I have read previously where adjectives such as "elegant" and "beautiful" were attached to theorem proofs, equations and studies of symmetry. My idea is just to complement the information given by the authors.

Mathematical Beauty: additional data

Bertrand Russell in his book “Mysticism and Logic and other essays” (1917) declared “Mathematics, rightly viewed, possesses not only truth, but supreme beauty — a beauty cold and austere, like that of sculpture, without appeal to any part of our weaker nature, without the gorgeous trappings of painting or music, yet sublimely pure, and capable of a stern perfection such as only the greatest art can show”.

James McAllister, in his essay "Mathematical Beauty and the Evolution of the Standards of Mathematical Proof" (2005), said "the beauty of mathematical entities plays an important part in the subjective experience and enjoyment of doing mathematics. Some mathematicians claim also that beauty acts as a guide in making mathematical discoveries and that beauty is an objective factor in establishing the validity and importance of a mathematical result. The combination of subjective and objective aspects makes mathematical beauty an intriguing phenomenon for philosophers as well as mathematicians".

In this sense, there are different aspects in which mathematical beauty can be appreciated, for example in the "method" used in a proof: a proof is considered "elegant" when it's remarkably brief, uses a minimum of additional assumptions or previous results, reaches a result in an unexpected way from apparently unrelated theorems, is based on an original insight, and can be easily generalized. All this makes me think of the book "The Man Who Loved Only Numbers" (1998) by Paul Hoffman, where Paul Erdös, the outstanding and prolific Hungarian mathematician, would say of an elegant and perfect proof of a theorem: "It's straight from the Book". The Book is simply an imaginary book in which God keeps the most wonderful and beautiful proofs. Erdös, who was an atheist, also said: "You don't have to believe in God, but you should believe in the Book".

For many mathematicians, Erdös included, the first proof that deserves to be in the Book is Euclid's proof that there are infinitely many primes. However, Erdös and some other mathematicians consider that Andrew Wiles's proof of Fermat's Last Theorem (1995) isn't worthy of the Book, because it is long and difficult to understand, and therefore isn't elegant. In a metaphorical sense, mathematicians seem to believe in the existence of the Book because they yearn for elegant and perfect proofs. As a curious fact, there is now a book "in flesh and bone" called "Proofs from THE BOOK" (2010, fourth edition), in which mathematicians pay tribute to Erdös and collect some of the most beautiful theorems in different areas.

As an anecdote, John D. Cook said on his blog that a simple, short proof of Euclid's theorem can almost be written within Twitter's 140-character limit: "There are infinitely many primes. Proof: If not, multiply all primes together and add 1. Now you've got a new prime". In any case, it's clear that Fermat himself, in his own proof, would have needed more than 140 characters, otherwise he wouldn't have claimed: "I have a proof but this is too large to fit in the margin". For those brave people who aren't afraid to face "modular elliptic curves", or to fight the "Taniyama-Shimura-Weil" conjecture, Andrew Wiles's proof is here.
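Spelled out a bit more carefully (the product plus one need not itself be prime; it just has a prime factor outside the assumed list), the argument behind that tweet goes:

```latex
\textbf{Theorem (Euclid).} There are infinitely many primes.

\textbf{Proof.} Suppose there were only finitely many primes $p_1, p_2, \dots, p_k$.
Let $N = p_1 p_2 \cdots p_k + 1$. Since $N > 1$, it has at least one prime factor $q$.
But $q$ cannot equal any $p_i$, because dividing $N$ by $p_i$ leaves remainder $1$.
Hence $q$ is a prime not in the list, a contradiction. $\blacksquare$
```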

According to mathematicians it's also possible to see beauty in a "result", that is, when a formulation joins different areas together in an unexpected way and, by means of a simple equation, presents a truth with "universal validity". The best example is Euler's identity, which brings together Physics, Mathematics and Engineering, and in which 5 fundamental constants share the stage with 3 basic arithmetic operations. The physicist Richard Feynman (Nobel Prize in Physics, 1965) called this equation "our jewel" and "the most remarkable formula in mathematics".
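For reference, the identity in question is

```latex
e^{i\pi} + 1 = 0
```

which ties together the constants e, i, π, 1 and 0 through addition, multiplication and exponentiation.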

Finally, many books invoke the golden ratio or the divine proportion to define beauty, but I would like to refer to an aspect related to symmetry. In the book "The Language of Mathematics: Making the Invisible Visible" (2000), Keith Devlin, a professor at Stanford University, says that geometry can describe some of the visual patterns we see in the world around us; these are "patterns of shape", and the study of symmetry captures one of the deepest and most abstract aspects of shape. He says "we often perceive these deeper, abstract patterns as beauty, their mathematical study can be described as the mathematics of beauty".

The study of symmetry is carried out by observing transformations of objects. These transformations can be seen as a kind of function (e.g. rotation, translation, reflection, stretching, and shrinking). A symmetry of a figure is a transformation that leaves the figure invariant, i.e. the figure looks the same after the transformation (e.g. a circle under any rotation about its centre). In nature there are many examples of shapes described by geometry and symmetry which are a symbol of beauty.
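A tiny Python illustration of symmetry as invariance, using the vertex set of a square: a 90-degree rotation about the centre maps the set onto itself, while a 45-degree rotation does not.

```python
# Symmetry as invariance: the vertex set of a square centred at the origin is unchanged
# by a 90-degree rotation, but not by a 45-degree one.
import numpy as np

square = {(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)}

def rotate(points, degrees):
    a = np.radians(degrees)
    rot = np.array([[np.cos(a), -np.sin(a)],
                    [np.sin(a),  np.cos(a)]])
    # Round to kill floating-point noise so the set comparison is meaningful.
    return {tuple(np.round(rot @ np.array(p), 9)) for p in points}

print(rotate(square, 90) == square)   # True: this rotation is a symmetry of the square
print(rotate(square, 45) == square)   # False: the rotated vertices no longer coincide
```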

As a conclusion, if I have one: it's easy to appreciate mathematics, because everything is mathematics; we just have to look carefully.

Recommended Videos:

“Beautiful Equations” BBC by Matt Collings (2012)

“Paul Erdös, N is a number”, BBC (2013)

“Fermat’s Last Theorem”, BBC Horizon by Simon Singh (1996)

And finally, I would like to recommend some books on my bookshelf related to this topic:

  • The Man Who Loved Only Numbers by Paul Hoffman
  • Fermat's Last Theorem by Simon Singh
  • Prime Obsession: Bernhard Riemann and the Greatest Unsolved Problem in Mathematics by John Derbyshire
  • The Man Who Knew Infinity: A Life of the Genius Ramanujan by Robert Kanigel
  • The Music of the Primes: Searching to Solve the Greatest Mystery in Mathematics by Marcus du Sautoy
  • Symmetry: A Journey into the Patterns of Nature by Marcus du Sautoy
  • Perfect Rigour: A Genius and the Mathematical Breakthrough of the Century by Masha Gessen
  • The Language of Mathematics by Keith Devlin



Some Data & Maths Jokes

For Python enthusiasts…very inspirational (link)

Python (source xkcd)

Correlation doesn’t imply causation (link)

Correlation (source xkcd)

Data Scientists as a Sexy Profession (link)

Framework (source Dilbert)

Today, anonymity is a joke (link)

Anonymity (source Dilbert)

From Susan Stepney website:

About results:

  • The problem with engineers is that they tend to cheat in order to get results.
  • The problem with mathematicians is that they tend to work on toy problems in order to get results.
  • The problem with program verifiers is that they tend to cheat at toy problems in order to get results.

(from science jokes, ver 6.7 mar 1, 1995 )

About reality:

  • An engineer thinks that equations are an approximation to reality.
  • A physicist thinks reality is an approximation to equations.
  • A mathematician doesn’t care.

(from Canonical List of Math Jokes )

About prime numbers:

Various proofs that every odd number is prime :

  • Mathematician: “3 is prime, 5 is prime, 7 is prime. The result follows by induction.”
  • Physicist: “3 is prime, 5 is prime, 7 is prime, 9 is experimental error…”
  • Engineer: “3 is prime, 5 is prime, 7 is prime, 9 is prime…”
  • Computer programmer: “2 is prime, 2 is prime, 2 is prime, 2 is prime, …”
  • Economist: “2 is prime, 4 is prime, 6 is prime, 8 is prime…”



Programmable Networks: Separating the hype and the reality

(Published in Barcinno on February 20, 2014).

Each year MIT Technology Review presents its annual list of 10 breakthrough technologies that can change the way we live; technologies that outstanding researchers believe will have the greatest impact on the shape of innovation in the years to come. In 2009, Software Defined Networking (SDN) was one of them. This is significant because this technology promises to make computer networks more programmable, changing the way we have been designing and managing them. But let's start from the beginning: what is a Programmable Network (PN)? According to the SearchSDN website, a PN is a network "in which the behavior of network devices and flow control is handled by software that operates independently from network hardware. A truly programmable network will allow a network engineer to re-program a network infrastructure instead of having to re-build it manually".

A lot of water has passed under the bridge since then, but that doesn't mean SDN is a consolidated technology today. According to the Gartner Hype Cycle (2013), SDN and related technologies like NFV (Network Function Virtualization) are still at the peak of inflated expectations, waiting to fall into the trough of disillusionment (see the Hype Cycle definitions). And although it seems to be a recent topic, it isn't. Here I recommend reading an interesting report called "The Road to SDN: An Intellectual History of Programmable Networks" by Nick Feamster et al. (Dec 2013), where the authors point out that SDN and NFV aren't novel ideas at all, but an evolution of a set of ideas and concepts related to PNs over at least the past 20 years.

PNs interest me in particular because for years I have had the opportunity to work on networking from two points of view, industry (Telco) and academia (research university), and in both areas I have seen in situ some of the limitations of current network infrastructures, which have already been widely described in the literature (e.g. this report). I therefore find it interesting to explore current network capabilities and the industry/academia proposals (and challenges) for tackling the migration towards PNs.

For example, in order to improve network aspects as diverse as speed, reliability, management, security, and energy efficiency, researchers need to test new protocols, algorithms, and techniques at a realistic, large scale. That is hard to do on existing infrastructure, because routers and switches run complex, distributed, and closed (generally proprietary) control software. Trying a new scheme or approach can become a cumbersome task, especially when each time you need to change the software (firmware) on every network element.

On the other hand, network administrators, for both management and troubleshooting, need to configure each network device individually, which is also an annoying task when there are many devices. It's true that there are management tools on the market that handle these elements in a centralized manner, but they typically use limited protocols and interfaces, so it's common to find interoperability problems between vendors, which adds complexity to the solution.

Without going further, last June I attended the SDN World Conference & Exhibition in Barcelona, where the telecom community discussed several aspects of SDN and Network Virtualisation in general (note: the two next events this year will be in May in London and in September in Nice). I remember, among other things, a recurrent idea in the talks: in recent years the computing industry has actively evolved towards an open environment based on abstractions, thanks to the cloud paradigm; this has allowed progress in several respects, such as virtualization, automation, and elasticity (e.g. pay-per-use and on-demand infrastructure). Networking, on the contrary, has evolved towards a complex, rigid, and closed scheme, lacking abstractions (e.g. of the control plane) and open network APIs.

Consequently, from a Telco's point of view it's necessary to consider a change in network design that facilitates network management. It's also urgent to work on the integration of computing and networking by means of virtualization, so that the network is seen as a pool of resources rather than as separate functional entities, and it's desirable to avoid hardware dependence and vendor lock-in. All this suggests that the simple idea, or promise, of a PN that helps solve these drawbacks would be a big leap in quality and a great opportunity for network innovation.

Currently there are two main approaches to PNs that industry is fostering: SDN and NFV. The former is mainly (though not exclusively) focused on data center networks and the latter on operator networks. Below, some definitions of SDN and NFV are given for a better understanding of the text; however, the main goal of this post is to review some characteristics of SDN and NFV from an innovation point of view, so I won't go deep into technical aspects. For more detail, I recommend the following websites: ONF, SDNCentral and TechTarget (SearchSDN). For experienced readers, I recommend checking this website for academic papers.

Some definitions: What is SDN?

According to the Open Networking Foundation (ONF), an organization dedicated to the promotion and adoption of SDN through the development of open standards, SDN is an approach to network architecture in which the control plane and the data plane (forwarding plane) are decoupled. In simple terms, the control plane decides how to handle the traffic, and the data plane forwards traffic according to the decisions that the control plane makes.

In SDNCentral, Prayson Pate describes SDN as "Born on the Campus, Matured in the Data Center". Indeed, SDN began on campus networks at UC Berkeley and Stanford University around 2008 and thereafter made the leap to data centers, where it has since shown itself to be a promising architecture for cloud services. In simplified form, he indicates that the basic principles that define SDN (at least today) are: "separation of control and forwarding functions, centralization of control, and ability to program the behavior of the network using well-defined interfaces". Through Figure 1, I will try to clarify these concepts as far as possible.

Figure 1: Traditional Scheme vs. SDN Architecture


Figure 1a shows a traditional network scheme where each network device (e.g. router, switch, firewall) has its own control plane and data plane, which today are vertically integrated. This implies extra cost and incompatibilities among manufacturers. Moreover, each device runs closed, proprietary firmware/control software (its operating system) supporting diverse modules (features) that implement protocols related to routing, switching, QoS, etc.

Figure 1b, on the other hand, depicts a logical view of the SDN architecture with its three well-defined layers: control, infrastructure, and application. The first layer consolidates the control plane of all network devices into a network control that is "logically" centralized and programmable. Here the main entity is the SDN controller, whose key function is to set up the appropriate connections to transmit flows between devices and therefore to control the behavior of all network elements. There may also be more than one SDN controller in the network, depending on the configuration and on scalability and virtualization requirements. The second layer comprises the infrastructure formed by the packet-forwarding hardware of all network devices, which is abstracted from the other layers; that is, the physical interface of each device is seen simply as a generic element by applications and network services. Finally, the third layer is the most important from an innovation point of view because it contains the applications that provide added value to the network: access control (e.g. firewalling), load balancing, network virtualization, energy-efficient networking, failure recovery, security, and so on.

In this architecture, an important aspect to highlight is the communication between layers, which is carried out via APIs (Application Programming Interfaces). Conceptually, two terms are used in computing and telecoms to describe these interfaces: the southbound interface and the northbound interface. The former refers to the interface used to communicate with lower layers; in the case of SDN, OpenFlow is the prominent example of this type of API. The latter refers to the interface used to communicate with higher layers. Today, however, there is no consolidated standard for this kind of interface, which is the key to facilitating innovation and enabling efficient service orchestration and automation. I will come back to this later.
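To make the southbound side concrete, here is a minimal sketch written for the POX controller (which I mention again below); it follows POX's canonical "hub" example, installing a single flow that floods every packet, and is illustrative rather than production code.

```python
# Minimal POX component: when a switch connects, install one OpenFlow rule that
# floods every packet out of all ports (a "dumb hub"). Run it as a POX module.
from pox.core import core
import pox.openflow.libopenflow_01 as of

log = core.getLogger()

def _handle_ConnectionUp(event):
    msg = of.ofp_flow_mod()                                    # a new flow-table entry
    msg.actions.append(of.ofp_action_output(port=of.OFPP_FLOOD))
    event.connection.send(msg)                                 # push it to the switch
    log.info("Hub behaviour installed on %s", event.connection)

def launch():
    core.openflow.addListenerByName("ConnectionUp", _handle_ConnectionUp)
```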

What is NFV?

In simple words, NFV is a new approach to designing, deploying and managing networking services. Its main characteristic is that network functions are decoupled from specific, proprietary hardware appliances. This means that functions such as firewalling, NAT (Network Address Translation), IDS (Intrusion Detection Systems), DNS (Domain Name Service), DPI (Deep Packet Inspection), etc. are virtualized on commodity hardware, i.e. on high-performance standard servers, by independent software vendors. It's applicable to any data plane or control plane function in both wired and wireless network infrastructures (see Figure 2).

Figure 2: Vision for Network Function Virtualisation (source ETSI)


Although SDN is perhaps the buzzword when talking about PNs, the term NFV hasn't lagged far behind in popularity. In fact, it has become a recurrent term among Service Providers, given that a group of them formed a consortium in October 2012 dedicated to analyzing the best way to provide PN solutions in the field of operator networks. This consortium later created a committee under the umbrella of ETSI (the European Telecommunications Standards Institute) to propose and promote virtualization technology standards for many types of network equipment. In the paper "Network Functions Virtualisation: An Introduction, Benefits, Enablers, Challenges & Call for Action" (October 2012), the NFV ETSI working group describes the problems they are facing along with their proposed solution.

Trends and some Comments

Beyond the obvious differences between SDN and NFV, such as focus (data centers vs. service provider networks) and main characteristic (separation of control and data planes vs. relocation of network functions), analysts agree that the two approaches can co-exist and complement each other (i.e. there is synergy), although each one can also operate independently (note: I recommend reading "The battle of SDN vs. NFV" for more detail). In fact, Service Providers understand that there are too many works in progress (and open problems) and nothing completely defined, so it wouldn't be logical to dismiss any solution out of hand, even more so when they have a broad portfolio of services in several fields. Telefónica, Colt, and Deutsche Telekom are just three examples of Service Providers within ETSI that are working actively on these topics and developing pilot programs.

SDN and NFV are just tools, and they don't specify the "killer application" that could hypothetically boost their use; actually, SDN is meant to support every new application that comes along. Here, a key element in the development of applications is the northbound API. In recent months there has been movement within the ONF to form a group to accelerate the standardization of this interface, but it isn't clear that this effort will bear fruit in the short or medium term; at the moment there is no common position and many discordant voices. In this link there is an interesting discussion of the topic. For instance, it's mentioned that ONF apparently has more interest in developing the OpenFlow protocol than a northbound API, because "standardizing a northbound API would hamper innovation among applications developers". Dan Pitt of the ONF said in this regard: "We will continue to evaluate all of the northbound APIs available as part of our commitment to SDN end users, but any standard for northbound APIs, if necessary, should stem from the end users' implementation and market experience". Clearly, in the end, as on many other occasions, the market will give its verdict.

Meanwhile, a consortium called OpenDaylight, formed by “important” network hardware and SDN vendors, is gaining relevance. In April 2013 it launched an open source SDN controller together with a framework through which it's possible to develop SDN applications. OpenDaylight supports the OSGi framework and a bidirectional REST API as Northbound API. Its members expect the initiative to gain momentum, but for now they decline to call it a universal standard, although given the weight of its members it can't be ruled out that they will hold a high market share in the future.

Following on from this idea, we already know that the heart of SDN is the SDN controller, and today there are many alternatives to OpenDaylight. Personally, I have worked with NOX/POX (C++/Python-based) and Floodlight (Java-based). The former is suitable for learning about SDN controllers because it has, say, an academic character, while the latter has a more professional focus and exposes a REST API, which is a more common interface (a minimal POX example is sketched below). On the other hand, in addition to startups whose focus is developing an SDN controller, there are many others building SDN applications (some people use the term ADN, Application Defined Networking) in areas as diverse as security, management, energy-saving systems, etc. With all this I want to say that there are many open source tools available, and it's already possible to begin developing SDN applications on top of them. Some interesting startups are mentioned in CIO and IT World. In Spain, at the startup level (excluding Service Providers), there are few examples to mention. In fact, the first and only one that comes to mind is Intelliment Security, based in Seville, which provides centralized network security management offering “an abstract and intelligent control plane where changes are automatically orchestrated and deployed to the network without human intervention”.
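To give a flavour of what programming one of these controllers looks like, here is a minimal POX sketch, following the pattern used in the standard POX tutorials: a component that turns every connected OpenFlow switch into a simple hub by installing a single "flood everything" flow rule. The file name and log message are my own choices.

```python
# hub.py -- a minimal POX component: flood every packet on all ports.
# Assuming a POX checkout, drop this file into pox/ext/ and run: ./pox.py hub
from pox.core import core
import pox.openflow.libopenflow_01 as of

log = core.getLogger()

def _handle_ConnectionUp(event):
    # When a switch connects, install one flow entry that matches all
    # traffic and floods it out of every port (hub behaviour).
    msg = of.ofp_flow_mod()
    msg.actions.append(of.ofp_action_output(port=of.OFPP_FLOOD))
    event.connection.send(msg)
    log.info("Hub rule installed on switch %s", event.dpid)

def launch():
    # Register the handler with POX's OpenFlow event system.
    core.openflow.addListenerByName("ConnectionUp", _handle_ConnectionUp)
```

Floodlight offers equivalent functionality through Java modules and its REST API rather than Python components.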

Moving on to another topic, SDN and Big Data analytics are two technologies that are destined to understand each other. SDN is expected to make certain network management tasks easier (e.g. OSS/BSS), and it will therefore be necessary to have a technology that takes advantage of the huge amount of data about the network, which is where Big Data enters the scene. For instance, traffic pattern analysis and traffic demand prediction will help to enable intelligent management. On the other hand, a prickly topic that will certainly be on the table, for privacy reasons, is the use of DPI (Deep Packet Inspection) techniques. In general, since we are talking about centralizing control of communications, it's logical to think that SDN and Big Data will meet soon. A first approach can be found in the paper “Programming Your Network at Run-time for Big Data Applications” (2012) by G. Wang et al., where the authors explore an integrated network control architecture to program the network at run-time for Hadoop-based Big Data applications, using optical circuits with an SDN controller.
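As an illustration of the "program the network at run-time" idea, the sketch below asks an SDN controller, via a northbound REST call, to install a high-priority flow between two Hadoop nodes during a heavy shuffle phase. The endpoint path and JSON fields are placeholders loosely modelled on a Floodlight-style static flow pusher; they differ between controllers and versions, so treat the whole snippet as an assumption rather than a working recipe.

```python
# Hypothetical sketch: prioritise traffic between two Hadoop nodes by pushing
# a flow entry through a controller's northbound REST interface.
# The URL, endpoint path and field names are placeholders.
import json
import requests

CONTROLLER = "http://127.0.0.1:8080"        # assumed controller address
ENDPOINT = "/wm/staticflowpusher/json"      # placeholder, Floodlight-style path

flow = {
    "switch": "00:00:00:00:00:00:00:01",    # datapath ID of the edge switch
    "name": "hadoop-shuffle-boost",
    "priority": "32768",
    "eth_type": "0x0800",
    "ipv4_src": "10.0.0.11",                # mapper node (hypothetical address)
    "ipv4_dst": "10.0.0.12",                # reducer node (hypothetical address)
    "active": "true",
    "actions": "output=2",                  # port towards the reducer
}

resp = requests.post(CONTROLLER + ENDPOINT, data=json.dumps(flow))
print(resp.status_code, resp.text)
```

In a Hadoop-aware setup, a job scheduler or application manager would issue such calls automatically when it detects large transfers, which is roughly the kind of integration the paper explores.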

Another pertinent issue worth commenting on is how PNs will affect the development of OTT providers. Currently they are, say, helpless at the network control level because they cannot guarantee the delivery of their services by themselves. I know "it's the Internet and many Service Providers come into play", but today OTT providers like Skype, Netflix or Wuaki-TV in Spain can only make some QoS measurements with their clients in order to adapt the delivered content (transcoding) or simply indicate minimum requirements to guarantee a suitable quality of experience (QoE). SDN and NFV usually promise to help Service Providers improve the management and control of "their own" networks, but perhaps in the future this control can also be extended to third parties. OTT providers are injecting ever more traffic into the network and becoming major players on the Internet, so they will likely demand more control, and SDN or NFV can be the key to achieving it as well as to generating new business models. OTT providers would then be able to deliver services with better QoE, Service Providers would be able to adapt to the requirements of OTT providers, and OTT apps would be able to talk to the network in real time.

On the eve of MWC 2014 in February in Barcelona, SDN and NFV will undoubtedly be buzzwords again. Matthew Palmer of SDNCentral said about SDN and its trends in 2014: “2012 was the year for people to learn what is SDN?, 2013 was the year for people to understand how they use SDN, and 2014 will be the “show me” year; the year of the proof-of-concept for SDN, NFV, and Network Virtualization”. SDN and NFV have huge potential but are still at an early stage; we must stay attentive to the news in the upcoming months. SDN is much more than a “match/action” scheme in switches and a logically centralized control of multiple network devices. Moreover, there are still many open problems to solve, such as the Northbound API, control orchestration, security, etc.

Finally, for people who want to learn more about SDN and implement some practical examples using open source SDN controllers, a network simulator, Python programming, etc., I strongly recommend taking the free MOOC on Coursera called “Software Defined Networking” from Georgia Tech, which begins on June 24th. I took the same course last year and it really is an excellent starting point for understanding this topic. The course lasts only 6 weeks and has approximately 10 quizzes and 4 programming assignments. Nick Feamster, the instructor, said the content will be updated according to new developments and trends in the field.



Big Data In Barcelona: Your Intro And Guide To Its Promising Future

(Published in Barcinno on January 10, 2014).

According to the latest “Emerging Technologies Hype Cycle for 2013”, published annually by Gartner Research, Big Data is currently located near the Peak of Inflated Expectations on the hype cycle curve, and its Plateau of Productivity is expected to be reached in 5 to 15 years (Figure 1).

Figure 1: The 2013 Emerging Technologies Hype Cycle (Source: Gartner, August 2013)

If we interpret this graphic tool according to its formal definition, Big Data would currently be generating enthusiasm and sometimes unrealistic expectations in its environment. Moreover, some pioneering experiences would be providing promising results, although typically there would be more failures than successes. Furthermore, it is estimated that only from around 2020 could it deliver widely accepted benefits; therefore, it's important that we don't lose sight of its evolution in the upcoming months and years, because today it's difficult to predict whether its use will become widespread or remain limited to a few niche markets.

What is Big Data?

Big Data is a term that only began to be used in 1998 to describe the huge amount of data being created every day. Nowadays the predominant definition is an updated version of one coined by Doug Laney of Gartner, which says that the growth of these data is “high” and is associated with three variables known as the 3 V's: high Volume (an increasing amount of data), high Velocity (streaming data arriving at high speed, e.g. in real time), and high Variety (many different sorts of data such as text, audio, video, etc.). Additionally, some researchers have added three new V's that can also be found in the literature: Veracity (how much organizations can trust the data, in the sense of integrity and confidentiality), Variability (how the structure of the data can change), and Value (the business value that organizations assign to the data).

From a conceptual point of view, Big Data isn't new. For decades, companies have been gathering large amounts of data. In recent years, however, companies have been seeking ways to analyze data by means of new data mining and data analysis techniques, and new opportunities are being generated as a result. The book “Big Data: A Revolution That Will Transform How We Live, Work, and Think” (2013) by Mayer-Schönberger and Cukier describes how Big Data will have a fundamental effect on human thought and on decision-making processes, and will also bring challenges and risks that will surely affect issues as relevant as privacy and individual liberties. Moreover, one interesting idea the book describes is “datafication”. This term is conceptually analogous to the “electrification” of industrial processes, where the use of electricity as an energy source was fundamental. In this sense, datafication corresponds to the use of data as the fundamental fuel that will drive business.

On the other hand, some aspects of Big Data continue to generate controversy with regard to its scope. For example, Big Data is a term usually used to encompass technologies dedicated to the extraction, processing and visualization of “large datasets”. Nevertheless, sometimes either through ignorance or for commercial (marketing) reasons, the term is used indiscriminately to describe data mining processes on “little” datasets of just a few gigabytes gathered during a specific period of time. That said, this conceptual difference may fade in the coming years, because data volumes will keep growing.

Without going too far, currently in Spain there are only a few companies that, we can say, work effectively with large data volumes in more or less real time. You might think of some banks, insurance companies, telcos and little else, leaving aside research areas like astronomy, genomics or particle physics (the LHC experiments), where the processing of huge datasets is the daily bread. Also, Hadoop/MapReduce, the open-source software framework for storage and large-scale processing of datasets on clusters, is sometimes presented as the ideal entry point into the Big Data world, but it isn't always the best option, especially for SMEs. In simple words: “using a sledgehammer to crack a nut”.
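To make the MapReduce model concrete, here is the classic word-count example written as a pair of Hadoop Streaming scripts in Python. It's only a minimal sketch: the streaming jar location and the HDFS input/output paths in the usage note are installation-dependent placeholders.

```python
# mapper.py -- emit a (word, 1) pair for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t1" % word.lower())
```

```python
# reducer.py -- sum the counts per word (Hadoop delivers input sorted by key).
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

They would typically be launched with something like `hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py -input <hdfs-input> -output <hdfs-output>`, or tested locally with `cat file.txt | python mapper.py | sort | python reducer.py`.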

Finally, it's important to consider ethical aspects when personal and sensitive data are being used. Certainly, there are techniques to aggregate data and anonymize them, but sometimes commercial criteria prevail. An interesting story came to light in May of this year, when The Guardian revealed that a Harvard professor had re-identified the names of more than 40% of a sample of anonymous participants in a high-profile DNA study by cross-referencing with public records, highlighting the dangers of revealing large amounts of personal data on the Internet. It's logical to wonder, for example, how this situation could affect the value of insurance policies. For more details, I recommend reading the report “Mining Big Data: Current Status, and Forecast to the Future” by Wei Fan and Albert Bifet (2012), where the authors summarize and describe these and other aspects worth keeping in mind.

Moreover, the idea that bigger data is always better data is erroneous; it depends on the noise in the data and on how representative they are. In this sense, the real issue isn't gathering huge amounts of data; the key aim is to take data from any source and, by means of high-powered analytics, extract relevant information that helps us reduce costs and processing time, generate novel products and services, optimize processes, and improve decision-making. In summary, the definition of Big Data is a moving target: each new academic or commercial report adds a new nuance. See this link for more details. Also, the datascience@berkeley blog has an excellent infographic with real-life examples that help explain the scale of data sizes.

There are many other examples where data mining and predictive analytics solutions (and Big Data in general) are being used, for example to develop churn analysis (customer attrition), in order to plan quick responses to retain customers, because, as you know, it's much more expensive for a company to go after new customers than to serve existing clients, and also recommender systems for improving cross-selling and up-selling. Moreover, it's assumed that areas such as banking/insurance and healthcare are more advanced in the use of Big Data, with notable use cases such as credit, fraud, and risk analysis for the former, and personal health information management and clinical trials for the latter. In general, the final goal is to get a single view of each customer in order to offer personalized solutions.
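As a hint of what such a predictive task looks like in practice, here is a minimal churn-prediction sketch using scikit-learn. The CSV file, column names and features are hypothetical placeholders, not a real dataset.

```python
# Minimal churn-prediction sketch with scikit-learn (hypothetical data).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# "customers.csv" is a hypothetical file with per-customer usage features
# and a binary "churned" label.
df = pd.read_csv("customers.csv")
X = df[["monthly_spend", "support_calls", "tenure_months"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Test AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
```

In a real deployment the interesting part is not the model itself but feeding it fresh, well-curated features and acting on the predicted at-risk customers in time.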

It's surprising, at least to me, that an area that is, say, "less traditional or conventional", such as online videogames, is also using techniques and resources similar to those mentioned above. Here the main goal is to understand player behavior and enhance the player experience, as well as to develop new analyses and insights in order to make decisions and define strategies to acquire new users, improve retention, decrease the churn rate and, of course, improve monetization. I take this opportunity to recommend the book “Game Analytics: Maximizing the Value of Player Data” (2013) by Seif El-Nasr et al., which is an excellent introduction to this interesting and promising topic.

I mention all this because, if we take as a reference point recent job offers related to Big Data or data analytics in Barcelona and Catalonia published on LinkedIn and Infojobs, we can see two things: first, there are few offers, and second, most come almost exclusively from consulting firms seeking professionals in this field, generally to develop projects in the banking sector with an emphasis on business intelligence. It's also possible to find some open positions at online videogame startups. In simple words, the two most requested profiles are developers (Hadoop/NoSQL databases) and data scientists. At this point we must remember, however, that a bachelor's degree in Data Science, or one devoted exclusively to Big Data, doesn't exist yet. Meanwhile, at the postgraduate level some Master's degrees have recently appeared that are trying to fill this training gap, since for certain types of work it's necessary to have multidisciplinary skills in several fields like computer science, statistics, mathematics, hacking, etc. In fact, many data scientists were previously researchers in areas as diverse as astronomy, biology, etc.

What’s happening with Big Data in Barcelona?

Although in terms of business volume Barcelona cannot be compared to big cities like London or New York, where the concepts of Big Data are more widely embraced and the market is more mature, Barcelona hasn't been oblivious to this trend. In 2013, Barcelona hosted a series of events where Big Data was also a buzzword: Mobile World Congress (February), BigDataWeek (April), BDigital Global Congress - The challenges of Big Data (June), Telco Big Data Summit (November), Smart City Expo (November), IoT Forum (December), and, periodically, several meetups such as Data Tuesday, IoT-BCN, Big Data Developers, etc.

In general, in all these events Big Data has been introduced as a great opportunity and the next frontier for innovation, competition, and productivity. As a cornerstone for Smart Cities, the increase in the volume and detail of the information captured by companies in recent years has been highlighted, together with the rise of multimedia and social media data and the appearance on the scene of the IoT (Internet of Things).

The reality, however, is that today in Spain SMEs prioritize their tight IT budgets on improving the efficiency and productivity of their current IT platforms, and although Big Data is seen as a promising idea among IT managers, investment in technology related to Big Data is minimal or simply non-existent. Despite this scenario, there are interesting initiatives among startups that exploit the potential of public data (Open Data) or data generated by their own mobile apps: an opportunity to develop innovative services and solutions using tools related to data mining, predictive analysis, and so on.

Barcelona, like other major European cities, is trying to improve urban life and welfare by means of Big Data. Through OpenDataBCN it's possible to find historical data on different topics related to the economy, administration, population, etc. Most datasets are static, updated monthly, semiannually or annually, and published in standard and open formats. Also, the AMB (Àrea Metropolitana de Barcelona) has recently bet on open data, and its new website presents useful information for citizens that will be updated, "as they say", as soon as possible. According to press reports, it's expected that they will soon make more real-time open datasets for transport and other crucial services available to developers. Another interesting initiative is the iCity Project, a platform for accessing public infrastructures for the development of public-interest services by means of a standard REST API, in which Barcelona participates along with London, Bologna, and Genoa. At this point, it's worth reviewing the London Datastore as a reference in open data topics. Nevertheless, currently few real-time open datasets are available via a standard API (application programming interface).

The Open Source Revolution

An interesting example to highlight is the bike sharing service, Bicing, which has had a real-time open API for a long time. I think it's a good starting point to envisage the potential of this kind of solution: generating useful visualizations for citizens as well as uncovering patterns of human behavior and daily routines, and discovering hidden aspects of the city's dynamics. The picture shows a simple web application based on R-Shiny (and the Leaflet library) that uses this API. I chose this platform because it's extremely fast for developing interactive web apps, but there are many other open source tools for visualization, data mining, machine learning and NoSQL databases.

Figure 2: Example Web App

By means of a drop-down selection box, users can choose a station and plot the number of available bikes over the last two hours to see how the trend evolves. There is also a city map where it's possible to check the real-time bike availability of each station interactively.
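For readers who prefer Python, here is a minimal sketch of the data-fetching half of such an app: poll a real-time bike-sharing feed and list the stations closest to running empty. The URL and the JSON field names are placeholders and must be adapted to the actual format of the Bicing API response.

```python
# Hypothetical sketch: poll a real-time bike-sharing feed and list the
# ten stations with the fewest available bikes.
import requests

FEED_URL = "http://example.org/bicing/stations.json"   # placeholder endpoint

def fetch_stations():
    resp = requests.get(FEED_URL, timeout=10)
    resp.raise_for_status()
    return resp.json()["stations"]                      # assumed JSON layout

if __name__ == "__main__":
    stations = fetch_stations()
    emptiest = sorted(stations, key=lambda s: int(s["bikes"]))[:10]
    for s in emptiest:
        print(s["id"], s["street"], s["bikes"], "bikes /", s["slots"], "slots")
```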

There are many other things that could be analyzed, such as the most popular bike routes in the city. It also allows different spatio-temporal analyses, or the use of statistical forecasting to optimize the overnight redistribution of bikes (a naive baseline is sketched below). This scheme can be extrapolated to other scenarios following a similar process. In fact, the open source revolution clearly is, and will be, a key player in the future of Big Data.
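As a hint of what that forecasting could look like, the sketch below builds a naive hour-of-day availability profile per station with pandas and uses it as a baseline to flag stations likely to be empty in the morning. The CSV of logged snapshots is a hypothetical input produced by polling the feed over time.

```python
# Naive baseline forecast: mean bikes available per (station, hour of day),
# computed from a hypothetical log of snapshots (station_id, timestamp, bikes).
import pandas as pd

log = pd.read_csv("bicing_log.csv", parse_dates=["timestamp"])  # hypothetical file
log["hour"] = log["timestamp"].dt.hour

profile = (log.groupby(["station_id", "hour"])["bikes"]
              .mean()
              .rename("expected_bikes"))

# Stations expected to be nearly empty at 08:00 are candidates for overnight refills.
morning = profile.xs(8, level="hour").sort_values()
print(morning.head(10))
```

Anything more serious (seasonality, weather, events) would call for proper time-series models, but even a simple baseline like this can help prioritise the redistribution rounds.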

From much ado about nothing to promising new heights

In the end, launching Big Data projects will carry the same challenges facing any emerging industry or technology. First, it requires qualified professionals to get involved and promote growth; second, but no less important, it requires clarity of ideas when sizing up new projects, to avoid falling into the practice of "killing a fly with a cannonball". Finally, I remember that at the BDigital Global Congress here in Barcelona some months ago, a speaker asked the audience: "Can anyone explain with certainty how to monetize Big Data? Please stand up and explain!". No one stood up. It is still a quest many are pursuing. At this moment, and considering this last anecdote, we could say that Big Data is much ado about nothing. However, it's also true that Big Data as an idea has enormous potential and will spawn immense opportunities, all in due course.



An inspirational poem for Error Correcting Codes enthusiasts:

In Galois Fields, full of flowers
primitive elements dance for hours
climbing sequentially through the trees
and shouting occasional parities.

The syndromes like ghosts in the misty damp
feed the smoldering fires of the Berlekamp
and high flying exponents sometimes are downed
on the jagged peaks of the Gilbert bound.

S.B. Weinstein (IEEE Transactions on Information Theory, March 1971)
