Data Blast

Data, Telecom, Maths, Astronomy, Origami, and so on



Going off the subject: GPS tracks, Travel times and Hadoop

It’s been a long time since my last post, and I sincerely thought I wouldn’t update my blog again. I had many ideas and wanted to publish some results, or even reflections about some books I had read, but then, as always happens to me, after an initial exploratory analysis, some of them certainly very promising and interesting, I couldn’t find the time to pursue the subject. Unfortunately time is inexorable (at least in non-relativistic terms), and then a new issue would catch my attention, the former would lose its interest, and so on; I have many inconclusive Python and R files that will be forgotten on my hard drive. Not everything is lost, though, because along the way I’ve learnt many things, new techniques, etc., and that is what matters, at least for me!!!! Moreover, you can always recycle code for a new project, because there’s no need to reinvent the wheel every time. So, in the last months I’ve spent time fighting against quite dissimilar datasets (or topics), ranging from the LIGO Project (Laser Interferometer Gravitational-Wave Observatory) to Dublin Bus traffic frequency, passing through raster analysis for precision agriculture.

[Image: LIGO]

Digressions

When the press reported, in February of this year, the detection of gravitational waves produced by the collision of two black holes, confirming an old prediction of Einstein’s, I felt the need to research the matter. I was curious to know what gravitational waves were and how scientists could prove their existence. This also reminded me of when, around 2012, CERN confirmed the existence of the Higgs boson, a fact without precedent that helped to complete the Standard Model of particle physics. It’s interesting how a conjecture based on mathematical abstractions (e.g. it needs “something that has mass” to establish equality in an equation) can be confirmed by means of measurements with sophisticated instruments several years later. The same happened with the theory of relativity and the total eclipse of 1919. The other day I read that CERN had possibly discovered a new particle, but I understand that this time there wasn’t a previous theory to prove: they found differences in the masses in their measurements, and this could be explained by the existence of a new particle. According to scientists this discovery could prove the existence of “extra space-time dimensions” or explain “the enigma of dark matter”. It’s a recent result and it must still be confirmed, because it may be a measurement error, but I immediately thought this was curious; probably not so much for the scientists engaged in these topics, but as a novice reader I expected an “experiment confirms etc. etc.” and not a “we found something” with no previous theory to explain it. Surely, with the large projects being developed nowadays (LIGO, ALMA, LHC, E-ELT, etc.) and many others coming in the next few years, the experimentalists will get ahead of the theorists.

Well, in any case, I visited the LIGO Open Science Center website and discovered extraordinary material and Python code for diving into the world of gravitational waves. Obviously, it’s necessary to be a physicist to understand everything, but for me the interesting issue was the use of interferometry to detect variations in the signals. It’s incredible how, with all the instruments (in terms of calibration, I mean), they are able to isolate (filter out) seismic, thermal and shot noise at high frequencies to finally analyse the “real” signal. Here the most important thing is to analyse time series (e.g. time-frequency spectrogram analysis), applying cross-correlation and regression analysis in order to reduce the RMS noise and improve the SNR, or using hypothesis testing (the typical mantra “to be or not to be”: is the signal present vs. not present?), signal alignment, etc. Anyway, I did some tests and tried with some “inspiral signals”, i.e. “gravitational waves that are generated during the end of life stage of binary systems where the two objects merge into one”, but I didn’t have the time. I love this; maybe I got the wrong profession and should have studied Physics, but it’s too late. In any case, it’s interesting to see the use of the HDF5 file format (and the h5py package), which allows you to store huge amounts of numerical data and easily manipulate it from NumPy arrays.
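
Just to illustrate that last point, here is a minimal sketch of reading one of those files with h5py; the file name and the “strain/Strain” dataset path are assumptions for illustration, not something taken from the original post:

import h5py

# Open a strain file downloaded from the LIGO Open Science Center
# (file name and internal dataset path are illustrative assumptions)
with h5py.File("H-H1_LOSC_4_V1-1126259446-32.hdf5", "r") as f:
    print(list(f.keys()))            # inspect the top-level groups in the file
    dset = f["strain/Strain"]        # assumed path of the strain time series
    print(dict(dset.attrs))          # metadata stored alongside the dataset
    strain = dset[:]                 # read the whole series into a NumPy array

print(strain.shape, strain.dtype)    # from here on it is plain NumPy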

With regard to raster analysis, I was interested in a project that used UAVs (Unmanned Aerial Vehicles), or drones, in order to analyse the health of a crop (see this news item). The basic example in this area is to work with multispectral images and calculate, for example, the Normalised Difference Vegetation Index (NDVI). In this case, using a TIFF image from Landsat 7 (6 channels), it’s easy to calculate NDVI using band 4 (Near Infrared, NIR) and band 3 (Visible Red, R) and applying the formula NDVI = (NIR − R) / (NIR + R). Roughly, 0.9 corresponds to dense vegetation and 0.2 to nearly bare soil. Beyond this, I wanted to learn how to develop a raster analysis using R (raster, rasterVis and rgdal packages).
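
As a sketch of the calculation itself (the original analysis was done with the R packages above; this is just the same formula in Python with NumPy, and the toy band values are made up for illustration):

import numpy as np

# Toy reflectance values; in a Landsat 7 scene these would be band 4 (NIR)
# and band 3 (red), read into arrays e.g. with rasterio or GDAL
nir = np.array([[0.50, 0.45], [0.30, 0.10]])
red = np.array([[0.05, 0.06], [0.20, 0.08]])

# NDVI = (NIR - R) / (NIR + R), guarding against division by zero
ndvi = np.where((nir + red) == 0, 0.0, (nir - red) / (nir + red))

print(ndvi)   # values near 0.9 indicate dense vegetation, near 0.2 almost bare soil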

 

Finally, with regard to Dublin Bus traffic, I only did a basic exploratory analysis of the data from the Insight project, available on Dublinked. In this dataset it’s possible to see different bus routes (geotagged) over two months. I simply cleaned the data with R and then used CartoDB to generate an interactive visualisation. It’s interesting to see, for example, some patterns in the routes, how they change due to incidents, or how the last bus of the day does a shorter run, only to the city centre, etc. Maybe I’ll come back to this dataset, although I hope Dublin City Council releases a newer version of the data; the current one is from 2013. (See the example for Bus 9, 16-01-2013.)
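
The cleaning itself was done in R, but as a rough illustration of the kind of filtering involved, here is an equivalent sketch in Python/pandas; the file name and the column names are assumptions, since the raw Dublinked file ships without a header:

import pandas as pd

# Assumed layout of the raw AVL file (no header in the original data)
cols = ["timestamp_us", "line_id", "direction", "journey_pattern_id", "timeframe",
        "vehicle_journey_id", "operator", "congestion", "lon", "lat",
        "delay", "block_id", "vehicle_id", "stop_id", "at_stop"]
df = pd.read_csv("siri.20130116.csv", header=None, names=cols)   # hypothetical file name

# Keep a single route and convert the microsecond timestamps to datetimes
bus9 = df[df["line_id"].astype(str).str.strip() == "9"].copy()
bus9["time"] = pd.to_datetime(bus9["timestamp_us"], unit="us")

# Export the geotagged points, ready to be mapped interactively (e.g. in CartoDB)
bus9[["time", "lat", "lon", "vehicle_journey_id"]].to_csv("bus9_20130116.csv", index=False)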

GPS tracks

I was also working on other topics, for example analysing GPS tracks with R. I had two GPS tracks gathered with my mobile phone and the My Tracks app: one during a walk around Tonelagee mountain in Co. Wicklow, and another on the first stage of the Camino de Santiago, crossing the Pyrenees. The R code can be found on this Rpubs.
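
The R code linked above does the real work; just to give a flavour of the kind of computation involved, here is a minimal Python sketch that reads a GPX export and estimates the track length with the haversine formula (the gpxpy package and the file name are assumptions for illustration):

import math
import gpxpy   # pip install gpxpy

def haversine(lat1, lon1, lat2, lon2, r=6371000.0):
    # Great-circle distance in metres between two (lat, lon) points given in degrees
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

with open("tonelagee.gpx") as f:       # hypothetical GPX export from My Tracks
    gpx = gpxpy.parse(f)

points = [p for t in gpx.tracks for s in t.segments for p in s.points]
total = sum(haversine(a.latitude, a.longitude, b.latitude, b.longitude)
            for a, b in zip(points, points[1:]))
print("Track length: %.1f km" % (total / 1000.0))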

[Image: GPS track around Tonelagee mountain]

[Image: GPS track of the first stage of the Camino de Santiago]

Travel Times in Dublin

Searching on Dublinked I found an interesting dataset called “Journey Times” across Dublin city. The data were released on 2015-11-19 (just one day) from the Dublin City Council (DCC) TRIPS system. This system provides travel times at different points of the city based on information about road network performance. The data correspond to different files in CSV and KML format, which can be downloaded directly from this link.

DCC’s TRIPS system defines different routes across the city (around 50); each route consists of a number of links, and each link is a pair of geo-referenced Traffic Control Sites (sites.csv). On the other hand, “trips.csv” is updated once every minute. The route configuration is considered static data and provides context for the real-time journey details. The R code can be found on this Rpubs.
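
The actual analysis is in the Rpubs link; purely to illustrate how the static and real-time parts fit together, here is a small pandas sketch joining them (all column names below are assumptions, not the real TRIPS schema):

import pandas as pd

# Static data: geo-referenced traffic control sites and the links of each route
sites = pd.read_csv("sites.csv")   # assumed columns: site_id, lat, lon, description
trips = pd.read_csv("trips.csv")   # assumed columns: route_id, from_site, to_site, travel_time, timestamp

# Attach the coordinates of both ends of each link to the journey-time records
trips = (trips
         .merge(sites.add_prefix("from_"), left_on="from_site", right_on="from_site_id")
         .merge(sites.add_prefix("to_"), left_on="to_site", right_on="to_site_id"))

# Average travel time per route over the day covered by the data
print(trips.groupby("route_id")["travel_time"].mean().sort_values(ascending=False).head())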

[Image: Dublin journey times map]

Hadoop with Python: Basic Example

Thanks to Michael Noll’s website, I discovered that you can use plain Python code for basic examples with Hadoop Streaming. Usually the quintessential example is “Wordcount”, and from there it’s possible to make small changes depending on your dataset and, voilà, you have your toy Hadoop-Python example. I chose an old dataset used in a previous post. It’s about the relationship between investors and companies in a startup fundraising ecosystem, using data from Crunchbase (see IPython notebook).

Investor                          Startups
ns-solutions                      openet
cross-atlantic-capital-partners   arantech, openet, automsoft
trevor-bowen                      soundwave
…                                 …

In this example, ns-solutions invests only in openet, while cross-atlantic-capital-partners invests in three startups (arantech, openet and automsoft), and so on. In each line of the dataset, the first name is the investment company and the following names are its startups. The idea is to apply the Hadoop MapReduce model to count, for each startup, how many investment companies invest in it and which ones they are. I used a virtual machine based on Vagrant (CentOS 6.5, Hadoop 2.7.1).

#Create an HDFS input directory
hdfs dfs -mkdir inv_star-input
#Put the dataset (txt format) into the HDFS directory
hdfs dfs -put investor_startup.txt inv_star-input/investor_startup.txt
#Run the Hadoop Streaming job
hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.1.jar \
  -file /home/vagrant/test/mapper.py -mapper /home/vagrant/test/mapper.py \
  -file /home/vagrant/test/reducer.py -reducer /home/vagrant/test/reducer.py \
  -input inv_star-input/investor_startup.txt -output inv_star-output
#Show the results
hdfs dfs -cat inv_star-output/part*

mapper.py

#!/usr/bin/env python
import sys

# Input comes from STDIN (standard input)
for line in sys.stdin:
    # Remove commas and surrounding whitespace
    line = line.replace(',', '').strip()
    # Split the line into names: the investor first, then its startups
    names = line.split()
    # Print one "startup<TAB>investor" pair per startup
    for startup in names[1:]:
        print '%s\t%s' % (startup, names[0])

reducer.py

#!/usr/bin/env python
import sys
from collections import defaultdict

# Dictionary mapping each startup to the list of its investors
d = defaultdict(list)

# Input comes from STDIN (standard input): "startup<TAB>investor" lines from the mapper
for line in sys.stdin:
    # Remove surrounding whitespace
    line = line.strip()
    # Split the line into startup and investor
    startup, company = line.split()
    # Collect the investors of each startup
    d[startup].append(company)

# Print each startup, its number of investors and the list of their names
for startup, investors in d.items():
    print '%s\t%s\t%s' % (startup, len(investors), investors)

[Image: sample output of the Hadoop job: startup, number of investors, list of investors]
Books

See below a picture of my latest book purchases. Currently I’m reading “Birth of a Theorem: A Mathematical Adventure” by Cédric Villani (Fields Medal 2010). I have a good feeling about it, although at moments some of the concepts are a little indecipherable. I must say that I read up on the Boltzmann equation beforehand in order to enjoy the book. Anyway, so far its ideas are not as clear as those of “A Mathematician’s Apology” by G.H. Hardy (a classic, 1940), although I’m willing to give it a chance.

Well, my idea, however, is only to comment briefly on two books. “Simplexity” by Jeffrey Kluger disappointed me because the book lacks good references and bibliography and sometimes dwells on obvious things. It’s true that you learn about human behaviour, but I really expected more. In fact, I bought this book exclusively for its chapter 4: Why do the jobs that require the greatest skills often pay the least? Why do the companies with the least to sell often earn the most? In short, with or without “U complexity analysis”, it’s clear that many bosses and companies (the job market in general) don’t appreciate the complexity of many jobs, which is somewhat disappointing. Finally, I recommend reading “Pricing the Future” by George G. Szpiro. For anyone who wants to learn about mathematics applied to finance, it’s an excellent starting point. Before this book my knowledge of the Black-Scholes equation was anecdotal, and I liked how the writer mixes history with mathematical concepts. The other books are excellent too, such as “The Signal and the Noise”, already a classic on prediction, etc.

[Image: latest book purchases]




Lies, Damned Lies, and Statistics – and Visualisation

By chance, a few days ago I came across the famous Anscombe’s Quartet again, which, in case anyone doesn’t know or remember, corresponds to four datasets that seem to be identical when they are examined using simple summary statistics, but our perception changes considerably when they are analysed visually, i.e. through “graphs”.

Group 1        Group 2        Group 3        Group 4
x1     y1      x2     y2      x3     y3      x4     y4
10     8.04    10     9.14    10     7.46    8      6.58
8      6.95    8      8.14    8      6.77    8      5.76
13     7.58    13     8.74    13     12.74   8      7.71
9      8.81    9      8.77    9      7.11    8      8.84
11     8.33    11     9.26    11     7.81    8      8.47
14     9.96    14     8.10    14     8.84    8      7.04
6      7.24    6      6.13    6      6.08    8      5.25
4      4.26    4      3.10    4      5.39    19     12.50
12     10.84   12     9.13    12     8.15    8      5.56
7      4.82    7      7.26    7      6.42    8      7.91
5      5.68    5      4.74    5      5.73    8      6.89

[Image: scatterplots of the four Anscombe datasets]

                    x1      y1      x2      y2      x3      y3      x4      y4
Average             9       7.5     9       7.5     9       7.5     9       7.5
Variance            11      4.12    11      4.12    11      4.12    11      4.12
Correlation (x,y)       0.816           0.816           0.816           0.816
Linear regression   y1 = 0.5 x1 + 3    y2 = 0.5 x2 + 3    y3 = 0.5 x3 + 3    y4 = 0.5 x4 + 3

Doing some history: in 1973 the statistician Francis Anscombe presented this construction in his paper “Graphs in Statistical Analysis” in order to show the essential role that a graph plays in a good statistical analysis, for example, to show the effect of outliers on statistical measures. It’s revealing, however, to observe how he described the state of statistics at that time: “Few of us escape being indoctrinated with these notions: 1) Numerical calculations are exact, but graphs are rough, 2) For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis, and 3) Performing intricate calculations is virtuous, whereas actually looking at the data is cheating”. Nowadays, however, it seems that a “graph” paints a thousand “numerical calculations”; anyway, a simple moral of all this could be that, looking at the four graphs, it isn’t possible to describe reality with a single statistical metric, and so the use of graphs is key to presenting information more accurately and completely. In a certain sense, summary statistics don’t tell the whole story.
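
This claim is easy to check; as a minimal sketch (assuming pandas, NumPy and seaborn are installed, and using seaborn’s bundled copy of the quartet), the following should reproduce, up to rounding, the summary table above:

import numpy as np
import pandas as pd
import seaborn as sns

# seaborn ships Anscombe's quartet as a sample dataset: columns 'dataset', 'x', 'y'
df = sns.load_dataset("anscombe")

def summarise(g):
    # Slope and intercept of the least-squares line y = slope*x + intercept
    slope, intercept = np.polyfit(g["x"], g["y"], 1)
    return pd.Series({
        "mean_x": g["x"].mean(), "mean_y": g["y"].mean(),
        "var_x": g["x"].var(), "var_y": g["y"].var(),
        "corr_xy": g["x"].corr(g["y"]),
        "slope": slope, "intercept": intercept,
    })

# Nearly identical rows for the four groups, despite the very different scatterplots
print(df.groupby("dataset").apply(summarise).round(2))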

On the other hand, and with some relation to the latter, a great quote attributed to former British Prime Minister Benjamin Disraeli comes to my mind: “There are three kinds of lies: lies, damned lies, and statistics”. Actually it’s a great joke, still valid, especially when we think about politicians, who sometimes use statistical values (biased or nonsensical) as a “throwing weapon” to support their weak arguments. Unfortunately this isn’t exclusive to politics, because the press also sometimes falls into excess when it uses statistical values incorrectly to describe a fact. As Mark Twain said: “Facts are stubborn, but statistics are more pliable”. The truth is that today (and always, actually) there are many interests involved, and the line of ethics and independence is sometimes too fuzzy for some eyes. This reminds me of the book “A Mathematician Reads the Newspaper” by John Allen Paulos, where the author tries to explain the misuse of maths and statistics in the press. I don’t know if the author has had success in his crusade to evangelise readers of newspapers, but it’s clear that people with notions of maths or statistics will read a given piece of news more critically and so will demand more accuracy from journalists.

At this point, and although I’m straying slightly from the topic, I recommend reading the editorial in The Guardian (29th January 2015), and the subsequent comments, on the use of statistics in political debate. The editorial highlights that “Big data doesn’t settle the big arguments. Too many of the statistics thrown around reflect nothing but noise, confusion or damned lies”. It’s true that the arguments presented make sense, although in general they don’t say anything new, but it’s important, I think, to note that the problem itself isn’t in the data source, assuming that source is reliable like the UK Statistics Authority, but in the data interpretation (or vision, say), which can sometimes be too self-serving, simplistic and biased. With this I mean the misuse of statistical values in the press, because I guess data related to the census or health statistics are reliable in terms of calculation and methodology. In this sense Hetan Shah, executive director of the Royal Statistical Society, commented on how we can improve the quality of public debate using statistics: “Three things would help. To ensure transparency, government should publish the evidence base for any new policy. To build trust, we should end pre-release access to official statistics, whereby ministers can see the numbers before the rest of us. And to build capability, politicians and other decision-makers in Whitehall should take a short course in statistics, which we’d be more than happy to provide”. Another interesting post related to this was written by Matt Parker: “The simple truth about statistics”.

Coming back to the issue: times have changed and many events have happened since 1973, but the persuasive power of numbers hasn’t changed and remains a key element in any presentation, as does the use of visualisation to produce an immediate impact on the client, audience, etc. In this sense, I don’t know whether the slogan “Don’t let a graph ruin great news (fake news, I mean)” exists in some newsrooms, although sometimes it seems that it does. Anyway, visualisation is mainly about conveying information through graphical representations of data. You can use visualisation to record information, to analyse data to support reasoning (e.g. visual exploration, finding patterns and possible errors in data), and to communicate information to others in order to share and persuade by means of a visual explanation. Harvard University offers a visualisation course which unfortunately isn’t free, but its lecture slides and videos are available to everyone at the moment.

Anyway, surely I’m talking nonsense and mixing things, but my intention in this post was mostly to point to some ideas, websites and books related to the proper or improper use of statistics and visualisation; my goal was just to write a basic reflection on the subject. Now I’d like to mention some interesting things:

  • A hilarious website is “Spurious Correlations”. According to the Business Dictionary, a spurious correlation is “a mathematical relationship between two variables that don’t result from any direct relationship, but is wrongly inferred to be related to each other. The false assumption of correlation may be attributed to coincidence or to another unseen factor”. Here you can find crazy and funny correlations, but it’s important to remember that “correlation doesn’t imply causation”.

[Image: example of a spurious correlation]

  • WTF Visualizations is another website to take into account in order to avoid repeating the same errors. They define the site as “visualizations that make no sense”, but actually it’s a collection of poorly conceived graphics. The focus is more on infographics (i.e. graphic visual representations of information, data or knowledge) than on graphs generated with R or Matplotlib; I mean, these aren’t simple charts such as a line, scatterplot or bar chart. Here, say, you have the freedom to use colours and shapes; there is no limit to imagination and creativity, and usually much information is condensed into a small space, which can make the representation unintelligible. A key point is when the scales are modified, giving an erroneous idea of magnitude. Some 3D representations can produce the same effect because they distort areas depending on the point of view.

[Image: example of a poorly conceived visualisation]

  • To learn about visualisation and its design principles there are many books, but it’s always appropriate to revisit the classics: above all, Edward Tufte. He wrote: “Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design”. “Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”. “Clutter and confusion are failures of design, not attributes of information”. “Cosmetic decoration, which frequently distorts the data, will never salvage an underlying lack of content”. There are also the books of Nigel Holmes and Alberto Cairo, although I’m not very familiar with them. For information about E. Tufte and his work, click here.
  • Visual Complexity blog by Manuel Lima. He has a beautiful book called “Visual Complexity: Mapping Patterns of Information”, which compiles a series of amazing graphs.

To sum up, I’d say: “Use statistics accurately and visualisation in moderation”.