How “big” data can help…

An interesting article in the WSJ about big data and its use in tackling problems in developing countries, such as malaria. Having worked on a similar project, I wish the best of luck to Jake Porway.

I think they’re off to a very good start. Malaria is a well-defined, narrowly scoped but high-visibility problem, and for this kind of project visibility is one of the most important things, as it determines the project’s longevity, scope and support (funding, volunteers)… It is a great place to start, not only because it saves lives but also because it would help the project last longer and tackle many other problems in the long run.

I see the use of buzzwords like big data the same way. In most cases the data needed to solve a particular problem won’t be anywhere near even a terabyte. For example, here is a quote from the article: “It takes about 600 trillion pixels to cover the surface of the earth”. While this is a storage problem, it is not a big data analysis issue. For the issues discussed in the article, data analysis is not done at a global scale, and images are tiled so one can easily pick a very small subset.
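To put the quoted figure in perspective, here is a rough back-of-envelope sketch. Only the pixel count comes from the article; the 3 bytes per pixel (uncompressed RGB), the Earth surface area of roughly 510 million km², and the 100 km by 100 km study region are my own assumptions for illustration.

```python
# Back-of-envelope estimate. Every constant except TOTAL_PIXELS is an assumption.
TOTAL_PIXELS = 600e12        # "600 trillion pixels", quoted from the article
BYTES_PER_PIXEL = 3          # assumption: uncompressed 8-bit RGB
EARTH_SURFACE_KM2 = 510e6    # approximate total surface area of the Earth


def global_storage_bytes():
    """Uncompressed size of imagery covering the whole globe."""
    return TOTAL_PIXELS * BYTES_PER_PIXEL


def region_storage_bytes(region_km2):
    """Uncompressed size of the tiles covering a region of the given area."""
    pixels_per_km2 = TOTAL_PIXELS / EARTH_SURFACE_KM2
    return region_km2 * pixels_per_km2 * BYTES_PER_PIXEL


if __name__ == "__main__":
    region_km2 = 100 * 100  # hypothetical study area: 100 km x 100 km
    print(f"whole globe: {global_storage_bytes() / 1e15:.1f} PB")
    print(f"study region: {region_storage_bytes(region_km2) / 1e9:.1f} GB")
```

Under these assumptions the whole globe is on the order of petabytes, which is a storage problem, while the study region comes to tens of gigabytes, which fits comfortably on a single machine.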

I think the real issue here is data, not big data.

Here is a quote from the article that I completely agree with: “Democratization of data is a real issue, and people do try to protect data for good reasons, or bad. But once they have seen the value their data can generate when combined with other sources, then the walls start to crumble.”

Many of these datasets will be very small.

The tough problem is not dealing with large datasets. It is developing a sense of community and, more importantly, making data contribution frictionless and disparate datasets useful. Once people start contributing datasets, the problem will become their use of different formats, different terminologies, different temporal/spatial resolutions, different measurement units, different projections (in the case of the maps/imagery discussed in the article) and insufficient documentation.
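As a small illustration of what making disparate datasets useful involves, the sketch below brings two hypothetical rainfall datasets onto a common unit and temporal resolution. The dataset contents, their layout, and the choice of millimetres per ISO week as the target are all made up for this example.

```python
from collections import defaultdict
from datetime import date

MM_PER_INCH = 25.4

# Hypothetical contributor A: daily rainfall in inches.
daily_inches = [
    (date(2013, 1, 7), 0.5),
    (date(2013, 1, 8), 0.0),
    (date(2013, 1, 9), 1.0),
    (date(2013, 1, 14), 0.25),
]

# Hypothetical contributor B: weekly totals in mm, keyed by (ISO year, ISO week).
weekly_mm = {(2013, 2): 40.0, (2013, 3): 5.0}


def to_weekly_mm(daily_records):
    """Convert daily inch readings into weekly totals in millimetres."""
    totals = defaultdict(float)
    for day, inches in daily_records:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week)] += inches * MM_PER_INCH
    return dict(totals)


def side_by_side(a, b):
    """Align two weekly series on their keys; missing weeks become None."""
    return {week: (a.get(week), b.get(week)) for week in sorted(set(a) | set(b))}


if __name__ == "__main__":
    merged = side_by_side(to_weekly_mm(daily_inches), weekly_mm)
    for week, (mm_a, mm_b) in merged.items():
        print(week, mm_a, mm_b)
```

Even in this toy case, the two sources only become comparable after agreeing on a unit and a time bucket. Projections, terminology and documentation add further steps of the same kind.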

For such a project to deliver the best value, the outcome should be more than the sum of its parts. When there is a collection of datasets, different experts can use them to solve different problems. An environmental scientist would look at a different combination of the datasets from a different angle than a climatologist and solve a different problem. Contributors of some of those datasets may have nothing to do with those domains and may never have thought their data would be used to address these problems. So the eventual goal should be to give people a large collection of datasets that they can easily discover, browse and analyze for problems they define, and to let them form communities within the framework around common goals, and evangelize and spread the word within their own professional circles.

And that is the real issue here. Not the size of the datasets.

Once again, best of luck.
