Big data

Who cares about science? “Big Data” will solve everything ;)

Some of you may remember Chris Anderson’s article titled “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”. You may think it was published in The Onion, but no, it was in Wired magazine. It is a fun read for sure. He uses three examples to make his case. None of them made sense then, nor do they make sense now.

The first example relates to quantum physics. As you read it, keep in mind the 6 billion dollars spent on the Large Hadron Collider (LHC), a particle accelerator to test Unified Field Theory. This article was written 4 months before the LHC’s inauguration.

“Faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete….the energies are too high, the accelerators too expensive, and so on. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.”

The second example is related to web search. As you read this one, keep in mind the investments search engines are making and the direction the web is headed, such as Microsoft’s acquisition of Powerset and Google’s acquisitions of Applied Semantics and MetaWeb.

“Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required.”

The third example is related to biology, in particular sequence alignment.

“A sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page.”

Really? I thought they knew the function of the protein they’re matching against… which is more than what Google knows about my MySpace page using their PageRank algorithm. How do they know the function?… Ah… the scientific method, of course. And since when did comparing something you know with something you don’t become a new thing? New technologies that support handling more data only made the existing processes faster, more scalable, and as a result more feasible.

So why am I looking at this article written almost 4 years ago, again?

I read a recent interview with Vivek Ranadivé, the CEO of TIBCO, which is essentially in the “big data” market as a provider of analytics, visualization, and complex event processing (CEP) solutions. In this interview he claims that “Science is dead”. He continues by saying:

“I believe that math is trumping science. In the example of dropped calls, I don’t really need to understand human behavior or psychology, I just need to detect patterns. The pattern tells me that six dropped calls is the key number. Why it’s not eight, I don’t know. I don’t need to know. You just need to know that A plus B will lead to C. I can solve just about every problem in the world with that approach.”

which is very much along the lines of Chris Anderson’s article.

This is a very problematic point of view, especially when raised by people who are likely to be somewhat influential.

Here is a pattern: the number of pirates has gone down while global temperatures have gone up. So we need more pirates to solve the global warming problem :)
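To make the pirate joke concrete, here is a minimal sketch in Python. The numbers are entirely made up for illustration (they are not real pirate counts or temperature records), and the helper names `pearson` and `diff` are my own. The two series share nothing except that both drift over time, yet their correlation comes out very strong. Looking at year-to-year changes instead of raw levels is one simple way to check whether the trend alone is doing the work.

```python
import random

random.seed(0)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def diff(xs):
    """Year-to-year changes, which removes a linear trend."""
    return [b - a for a, b in zip(xs, xs[1:])]

years = range(50)
# Synthetic, independent series: one trends down, the other trends up.
pirates = [1000 - 15 * t + random.gauss(0, 30) for t in years]
temperature = [14.0 + 0.02 * t + random.gauss(0, 0.05) for t in years]

r_levels = pearson(pirates, temperature)
r_changes = pearson(diff(pirates), diff(temperature))

print(f"correlation of raw levels:        {r_levels:.2f}")
print(f"correlation of year-over-year changes: {r_changes:.2f}")
```

The pattern in the raw levels is real in the sense that the numbers do line up. It just says nothing about pirates causing anything, which is exactly what you cannot learn from the pattern alone.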

There are multiple sides to this. The first question is: how much data do you need? At some point people believed the stars and the planets revolved around a fixed Earth. People used to see faces on Mars. It took a better understanding of how the solar system works, or more data and less noise (e.g. a higher-quality image of Mars), to figure out what’s really going on.

If I have 1 petabyte of data, it takes up a lot of space on disk, but does that mean it is enough to solve my problem? If you have a better understanding of the problem, you can answer this question much better. If you have a hypothesis, you know what other data you need to verify it, which could lead you to a better result or at least make you realize the uncertainty in your results or the incompleteness of your analysis.

The second question is: how good is your data? With no good understanding or reasoning, just “letting the data speak for itself”, it is very easy to overfit and end up with terribly inaccurate predictions. It is also easy to throw away real, meaningful data points as errors or outliers. And you’d be blindly betting that past correlations in the data will hold up in the future.
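Here is a small sketch of what overfitting looks like, using only the Python standard library. Everything in it is an assumption for illustration: the "real" process is a straight line with some noise, and the two models (`fit_line` and `fit_interpolating_polynomial`) are hypothetical stand-ins for "a model with some reasoning behind it" and "a model that just lets the data speak". The polynomial passes through every training point, so it looks perfect on the data it has seen, and then falls apart on new points just outside that range.

```python
import random

random.seed(42)

def truth(x):
    # Assumed underlying process: a simple linear relationship.
    return 2.0 * x + 1.0

train_x = [float(i) for i in range(10)]
train_y = [truth(x) + random.gauss(0, 1.0) for x in train_x]  # noisy observations
test_x = [10.0, 11.0]                                          # unseen inputs
test_y = [truth(x) for x in test_x]

def fit_line(xs, ys):
    """Ordinary least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def fit_interpolating_polynomial(xs, ys):
    """Lagrange polynomial through every training point (memorizes the noise)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict

def mse(model, xs, ys):
    """Mean squared error of a model on a set of points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

line = fit_line(train_x, train_y)
poly = fit_interpolating_polynomial(train_x, train_y)

line_train, poly_train = mse(line, train_x, train_y), mse(poly, train_x, train_y)
line_test, poly_test = mse(line, test_x, test_y), mse(poly, test_x, test_y)

print(f"line:       train MSE {line_train:.3f}, test MSE {line_test:.3f}")
print(f"polynomial: train MSE {poly_train:.3f}, test MSE {poly_test:.3f}")
```

If you judged the two models only by how well they match the data you already have, you would pick the polynomial every time. Knowing something about why the data looks the way it does is what tells you to prefer the line.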

The third question is related to the principle of reflexivity in social theory. Once you react based on data and make changes to a system, will your straw-man model still be valid? Your behavior may affect the system in a way that makes your observations invalid or leads to unexpected results, both of which are more likely when your thinking relies only on available data and lacks the crucial question “Why?”.

I can go on and on, but you get the point.

“The numbers have no way of speaking for themselves, we speak for them.” political forecaster Nate Silver writes in his book, The Signal and the Noise: Why So Many Predictions Fail — But Some Don’t. I’ll finish with his words.

“Data-driven predictions can succeed — and they can fail. It is when we deny our role in the process that the odds of failure rise. Before we demand more of our data, we need to demand more of ourselves.”

