Data Blast

Data, Telecom, Maths, Astronomy, Origami, and so on

Lies, Damned Lies, and Statistics – and Visualisation

A few days ago I happened to come across the famous Anscombe’s Quartet again. In case anyone doesn’t know or remember it, it consists of four datasets that seem nearly identical when they are examined using simple summary statistics, but our perception changes considerably when they are analysed visually, i.e. through “graphs”.

Group 1       Group 2       Group 3       Group 4
x1     y1     x2     y2     x3     y3     x4     y4
10     8.04   10     9.14   10     7.46   8      6.58
8      6.95   8      8.14   8      6.77   8      5.76
13     7.58   13     8.74   13     12.74  8      7.71
9      8.81   9      8.77   9      7.11   8      8.84
11     8.33   11     9.26   11     7.81   8      8.47
14     9.96   14     8.10   14     8.84   8      7.04
6      7.24   6      6.13   6      6.08   8      5.25
4      4.26   4      3.10   4      5.39   19     12.50
12     10.84  12     9.13   12     8.15   8      5.56
7      4.82   7      7.26   7      6.42   8      7.91
5      5.68   5      4.74   5      5.73   8      6.89

[Image: p_stat2]

                   x1        y1        x2        y2        x3        y3        x4        y4
Average            9         7.5       9         7.5       9         7.5       9         7.5
Variance           11        4.12      11        4.12      11        4.12      11        4.12
Correlation        0.816               0.816               0.816               0.816
Linear regression  y1 = 0.5 x1 + 3     y2 = 0.5 x2 + 3     y3 = 0.5 x3 + 3     y4 = 0.5 x4 + 3
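The figures in the summary table can be checked with a few lines of code. What follows is only a minimal sketch in plain Python: the data are copied from the table above, while the names `QUARTET`, `summarise` and `SUMMARY` are my own for illustration, and I assume the sample variance (dividing by n - 1) and an ordinary least-squares line, which are the conventions that reproduce the values shown.

```python
from statistics import mean, variance

# Anscombe's quartet, copied from the data table above.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "Group 1": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                       7.24, 4.26, 10.84, 4.82, 5.68]),
    "Group 2": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                       6.13, 3.10, 9.13, 7.26, 4.74]),
    "Group 3": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                       6.08, 5.39, 8.15, 6.42, 5.73]),
    "Group 4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
                [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Mean, sample variance, Pearson correlation and least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)   # sum of squared deviations of x
    syy = sum((b - my) ** 2 for b in y)   # sum of squared deviations of y
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(x),             # sample variance (n - 1)
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARY = {name: summarise(x, y) for name, (x, y) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARY.items():
        print(name, {key: round(value, 2) for key, value in stats.items()})
```

Rounded to two decimals, the four groups come out practically the same, and that is exactly the trap: nothing in these numbers hints at how different the four scatterplots look.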

A bit of history: in 1973 the statistician Francis Anscombe presented this construction in his paper “Graphs in Statistical Analysis” in order to show the essential role that graphs play in a good statistical analysis, for example in revealing the effect of outliers on statistical measures. It’s revealing, however, to observe how he described the state of statistics at that time: “Few of us escape being indoctrinated with these notions: 1) Numerical calculations are exact, but graphs are rough, 2) For any particular kind of statistical data there is just one set of calculations constituting a correct statistical analysis, and 3) Performing intricate calculations is virtuous, whereas actually looking at the data is cheating”. Nowadays, however, it seems that a “graph” paints a thousand “numerical calculations”. In any case, a simple moral of all this could be that, looking at the four graphs, it isn’t possible to describe reality with a single statistical metric, and so the use of graphs is key to presenting information more accurately and completely. In short, summary statistics don’t tell the whole story.

On a related note, a great quote attributed to former British Prime Minister Benjamin Disraeli comes to mind: “There are three kinds of lies: lies, damned lies, and statistics”. It’s a great joke and still valid, especially when we think about politicians who sometimes use statistical values (biased or nonsensical) as a “throwing weapon” to support their weak arguments. Unfortunately this isn’t exclusive to politics, because the press also sometimes goes too far when it uses statistical values incorrectly to describe a fact. As Mark Twain said: “Facts are stubborn, but statistics are more pliable”. The truth is that today (and always, actually) there are many interests involved, and the line of ethics and independence is sometimes too fuzzy for some eyes. This reminds me of the book “A Mathematician Reads the Newspaper” by John Allen Paulos, in which the author tries to explain the misuse of maths and statistics in the press. I don’t know whether the author has succeeded in his crusade to evangelise newspaper readers, but it’s clear that people with some notion of maths or statistics will read a given news item more critically and so will demand more accuracy from journalists.

At this point, and although I’m straying slightly from the topic, I recommend reading the editorial of The Guardian (29th January 2015), and the subsequent comments, on the use of statistics in political debate. The editorial highlights that “Big data doesn’t settle the big arguments. Too many of the statistics thrown around reflect nothing but noise, confusion or damned lies”. The arguments presented in the editorial make sense, although in general they don’t say anything new. It’s important, I think, to note that the problem itself isn’t in the data source, assuming that the source is reliable like the UK Statistics Authority, but in the interpretation of the data (or the vision of it, say), which can sometimes be too selfish, simplistic and biased. By this I mean the misuse of statistical values in the press, because I guess data related to the census or health statistics are reliable in terms of calculation and methodology. In this sense Hetan Shah, executive director of the Royal Statistical Society, commented on how we can improve the quality of public debate using statistics: “Three things would help. To ensure transparency, government should publish the evidence base for any new policy. To build trust, we should end pre-release access to official statistics, whereby ministers can see the numbers before the rest of us. And to build capability, politicians and other decision-makers in Whitehall should take a short course in statistics, which we’d be more than happy to provide”. Another interesting post related to this was written by Matt Parker: “The simple truth about statistics”.

Coming back to the issue, times have changed and many things have happened since 1973, but the persuasive power of numbers hasn’t changed and remains a key element in any presentation, as does the use of visualisation to produce an immediate impact on the client, the audience, etc. In this sense, I don’t know whether some newsrooms have the slogan “Don’t let a graph ruin great news (fake news, I mean)”, although sometimes it seems that this is what happens. Anyway, visualisation is mainly about conveying information through graphical representations of data. You can use visualisation to record information, to analyse data in support of reasoning (e.g. visual exploration, finding patterns and possible errors in the data), and to communicate information to others in order to share and persuade by means of a visual explanation. Harvard University offers a visualisation course which unfortunately isn’t free, although its lecture slides and videos are currently available to everyone.

Anyway, I’m surely talking nonsense and mixing things up, but my intention in this post was mostly to point to some ideas, websites and books related to the proper or improper use of statistics and visualisation; my goal was just to write a basic reflection on both. Now I’d like to mention some interesting things:

  • A hilarious website is “Spurious Correlations”. Business Dictionary defines a spurious correlation as: “a mathematical relationship between two variables that don’t result from any direct relationship, but is wrongly inferred to be related to each other. The false assumption of correlation may be attributed to coincidence or to another unseen factor”. Here you can find crazy and funny correlations, but it’s important to remember that “correlation doesn’t imply causation”.

[Image: p_stat1]

  • WTF visualisation is another website worth keeping in mind in order to avoid repeating the same errors. They define their website as “visualisations that make no sense”, but what it really collects is a lot of poorly conceived graphics. The focus is more on infographics (i.e. graphic visual representations of information, data or knowledge) than on graphs generated by R or Matplotlib; I mean these aren’t simple graphs such as a line chart, scatterplot or bar chart. With infographics, say, you are free to use colours and shapes; there is no limit to imagination and creativity, and usually you condense a lot of information into a small space, which can make the representation unintelligible. A key problem arises when the scales are modified, giving an erroneous idea of magnitude. Some 3D representations can produce the same effect, because they distort areas depending on the point of view.

[Image: wtf_graph]

  • To learn about visualisation and its design principles there are many books, but it’s always appropriate to revisit the classics, Edward Tufte above all. He wrote: “Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design”. “Graphical excellence is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space”. “Clutter and confusion are failures of design, not attributes of information”. “Cosmetic decoration, which frequently distorts the data, will never salvage an underlying lack of content”. There are also the books of Nigel Holmes and Alberto Cairo, although I’m not very familiar with them. For information about E. Tufte and his work, click here.
  • Visual Complexity Blog by Manuel Lima. He has a beautiful book called “Visual Complexity: Mapping Patterns of Information”, which compiles a series of amazing graphs.
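Going back to the first item of the list, the “unseen factor” behind a spurious correlation is very often just time: two quantities that both happen to grow over the years will look correlated even if they have nothing to do with each other. The sketch below tries to illustrate this with two invented series (the numbers in `series_a` and `series_b` are made up for the example and are not taken from any real source, and the helper `pearson` is my own). It compares the correlation of the raw values with the correlation of their year-to-year changes, which is one simple way, under these assumptions, of removing the shared trend.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def changes(series):
    """Differences between consecutive values (removes a shared trend)."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Two invented yearly series that both drift upwards (hypothetical data).
series_a = [1, 3, 2, 4, 3, 5, 4, 6, 5]
series_b = [10, 13, 16, 16, 16, 19, 22, 22, 22]

corr_levels = pearson(series_a, series_b)
corr_changes = pearson(changes(series_a), changes(series_b))

if __name__ == "__main__":
    print("correlation of the raw values:", round(corr_levels, 3))
    print("correlation of the yearly changes:", round(corr_changes, 3))
```

With these invented numbers the raw values are strongly correlated, while the yearly changes are not correlated at all, so the apparent relationship comes only from the common upward drift. A real pair of series will rarely behave so neatly, and this is only one of several ways to look for a hidden common factor.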

To sum up, I’d say: “Use statistics accurately and visualisation in moderation”.
