If you have been following the progress of the information technology phenomenon known as “Big Data” – and it’s OK if you haven’t – you will likely have seen the solemn proclamations that it means the end of theory.
The idea, basically, is that given today’s computational power and cheap storage, you can take giant pools of data – the billions of hospital “events” logged over the decades the NHS has been operating; Google’s billions of search engine queries; or the records of billions of electronic transactions – and gain insights by finding patterns that are invisible to the naked eye. On this way of thinking, it is no longer necessary to study causality, because who cares about causality when you’ve got correlation?
I don’t find this idea terribly satisfying. I guess I am a prisoner of old-fashioned thinking, and favour the old science stuff of making a hypothesis, testing it, revising it, and then testing it again – and having other, independent people replicate the results. Believing that causality doesn’t matter can lead you down all sorts of rabbit holes. Approximately 430,000 people pass through Clapham Junction every day (number from Wikipedia). Do you want to draw the conclusion that people really like the place (because they keep going there) or that they hate it (because very few stay)?
In a 2008 piece for Wired called “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”, Chris Anderson wrote:
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
The big target here isn’t advertising, though. It’s science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of the correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
But faced with massive data, this approach to science – hypothesize, model, test – is becoming obsolete… Petabytes allow us to say: “Correlation is enough.”
The great thing about science is that six years later you can go back, treat Anderson’s statements as a hypothesis of their own, and study the results.
The examples Anderson cited are mostly about Google, which relies heavily on data and very little on scientific models – this is the company that once tested 41 shades of blue to find out which attracted the most clicks rather than let a designer choose intuitively. The PageRank algorithm weighs many different statistics to decide which Web pages are most relevant to a given search; no theory of relevance is involved. Google Translate was created by statisticians who analysed billions of Web pages that had already been translated by humans to calculate the most probable meaning. Google Flu Trends was quicker at spotting regional flu outbreaks than the US Centers for Disease Control and Prevention, which relies on doctors’ reports.
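The flavour of this statistics-over-theory approach is easy to convey with a toy sketch. This is not Google’s actual algorithm – the real thing is secret and vastly more elaborate – but the core published idea behind PageRank is simple enough: a page matters if pages that matter link to it, and you can compute that by repeatedly redistributing scores along the links. The four “pages” below are invented for illustration.

```python
# Toy sketch of the idea behind PageRank (hypothetical four-page web).
# A page's score is (mostly) the sum of the scores flowing in from the
# pages that link to it; iterating this "power iteration" until the
# scores settle gives a ranking, with no model of *why* pages link.

links = {          # page -> pages it links to (invented data)
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

damping = 0.85     # standard damping factor from the original paper
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}   # start with equal scores

for _ in range(50):                            # iterate until stable
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:                         # share p's score equally
            new[q] += damping * rank[p] / len(outs)
    rank = new

# "C" collects links from A, B, and D, so it ends up ranked highest.
print(max(rank, key=rank.get))
```

The point of the sketch is what is *absent*: nothing in the code knows what the pages are about. Relevance falls out of counting, which is exactly the property Anderson was celebrating.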
A later review is not quite so rosy. The rough translations Google Translate itself has produced have become Web pages and entered the database, degrading its quality – a problem Google is working to fix.
In 2013, Nature reported that Google Flu Trends had predicted approximately double the number of cases the CDC recorded. In a study of the reports, “The Parable of Google Flu: Traps in Big Data Analysis”, the authors David Lazer, Ryan Kennedy, Gary King, and Alessandro Vespignani point out several difficulties. First, Google’s secret algorithm is constantly being changed, ruling out detailed study and replication; even aggregated data is not made available for further study. Second, changes such as adding suggested search terms based on what others have typed in unevenly amplify some searches. Third, “big data” often does not include information collected as part of “small data”.
This last point is particularly interesting because work in some sciences depends on things like a small number of in-depth interviews. One example is computer security, which mixes technology, economics, and human psychology in the effort to develop effective controls. Analysing big data won’t predict that locking the access door nearest the visitors’ bathroom will lead staff to prop it open rather than get up constantly to open it for people.
The ability to make a hypothesis and predict the outcome is what lets us find our footing in the unknown. No doubt we will gain valuable insights from big data. But big data is a tool like any other, and we are going to have to learn to use it correctly.