Big data, GIS, Mapping

New Beginnings

After five exhilarating years at Tableau, on December 29th I said my goodbyes and walked out of the Seattle offices for the last time.

Tableau was a great learning experience for me. I watched the company grow nearly sevenfold, go through an IPO, and become the gold standard for business intelligence…

I worked with very talented people, delivered great features and engaged with thousands of enthusiastic customers.

It was a blast.

But there is something about starting anew. It is the new experiences and challenges we face that make us grow. And there are very few things I like more in life than a good challenge 🙂

I flew out of Seattle on December 30th to start a new life in Santa Fe, New Mexico and join a small startup called Descartes Labs on January 2nd.


At Descartes Labs, we’re building a data refinery for remote sensing data. It is used for a growing set of scenarios, from detecting objects in satellite images using deep learning to creating the most accurate crop forecasts on the market and understanding the spread of dengue fever. All of this runs on a massively scalable computational infrastructure with over 40 petabytes of image data and 50+ terabytes of new data added daily. If you haven’t heard of us already, check us out. You won’t be disappointed.

I will continue writing, but Tableau won’t be the primary focus of new content on the blog anymore. I will try to answer Tableau-related questions as time permits. I should say that several of Descartes Labs’ customers also use Tableau, so I am using Tableau quite often in my new job as well.

To the new year and new beginnings!


Mapping, Visualization

At a store near you…

Wal-Mart, McDonald’s, Starbucks… There seems to be one around every corner. Can you guess how close you are to one? I had the 2010 Census data, and when I came across store locations for Wal-Mart at GeoCommons, I built this viz in Tableau. Each pixel is the centroid of a census tract. It turns out 12 percent of the US population lives within 2 miles of a Wal-Mart store! This is an interactive viz, but unfortunately WordPress doesn’t currently allow embedding it in this page 😦 You can click on the image to open the interactive version in a new window.

Just how close are you to a Walmart? (Click to open interactive viz)

Store density

In addition to the distance to the closest store, I calculated a density metric by adding up the inverse distances to all Wal-Mart store locations for each census tract, which is shown on the left. Areas with a higher density of stores are shown in light green.
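The inverse-distance metric can be sketched in a few lines. This is a Python illustration, not the original Tableau calculation; the store coordinates are made up, and real tract centroids would need projected coordinates (or a haversine distance) rather than plain Euclidean distance:

```python
import math

def store_density(centroid, stores, eps=1e-9):
    """Sum of inverse distances from one tract centroid to every store.

    Coordinates are assumed to be in a projected, distance-preserving
    unit (e.g. miles); eps guards against division by zero when a
    centroid coincides with a store location.
    """
    cx, cy = centroid
    return sum(1.0 / (math.hypot(cx - sx, cy - sy) + eps)
               for sx, sy in stores)

# Illustrative data: three hypothetical store locations
stores = [(0, 0), (10, 0), (0, 10)]
print(store_density((1, 1), stores))
```

Because each term is an inverse distance, nearby stores dominate the sum, so tracts close to clusters of stores light up in the density map.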

GIS, Mapping, Visualization

Simplifying multiple neighboring polygons

The complexity of polygons is often an issue when using online mapping applications that rely on JavaScript. Lengthy initial loading times, browser unresponsiveness, and even crashes can occur when dealing with polygons that consist of tens of thousands of points. Real maps are generally complex, but for web mapping purposes accuracy is not always a necessity. For example, you probably wouldn’t worry about topographic fidelity if your goal is to show census data using a choropleth map, and neither would your end users. Not to mention that at times high fidelity serves no purpose at all, as the zoom level may not allow users to see the details in the first place. When the user zooms in, the shrinking viewing area allows adding more detail, since you won’t have to worry about points outside the viewing area.

The Douglas-Peucker algorithm is a very popular algorithm for curve simplification. It is offered in many spatial packages, including SQL Server’s spatial library. However, a common problem is that the algorithm simplifies polygons one at a time, without taking into account shared boundaries between neighboring polygons. This results in gaps and overlaps between polygons, as seen below.
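For reference, the core of Douglas-Peucker is short: keep the two endpoints, find the intermediate point farthest from the line between them, and recurse if it exceeds the tolerance. A minimal recursive Python sketch (the utility described below is C#, and library implementations such as SQL Server’s handle edge cases this toy version ignores):

```python
import math

def perp_dist(pt, a, b):
    """Perpendicular distance from pt to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = pt, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:  # degenerate segment: distance to the point a
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, tolerance):
    """Simplify a polyline, keeping points farther than tolerance."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the endpoint-to-endpoint line
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax > tolerance:
        # Keep the farthest point and recurse on both halves
        left = douglas_peucker(points[:idx + 1], tolerance)
        right = douglas_peucker(points[idx:], tolerance)
        return left[:-1] + right
    return [points[0], points[-1]]
```

The tolerance is in the same units as the coordinates; as discussed below, values that are too large can collapse a polygon to a line or a point.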

Multiple polygon simplification resulting in gaps

I wrote a small utility in C# that helps address this problem, which I’d like to share here. It took me about an hour, so don’t be surprised if you find bugs. At least with the maps I used it on, and with reasonable tolerance factors, the results were great. Keep in mind that the Douglas-Peucker algorithm can reduce polygons to lines or even points if the provided tolerance factor is high, so you may want to try different values to find what’s ideal.

The code looks for two types of shared edges:

  • Start-end points of the polylines that make up a polygon overlap with those on other polygons (the most common case).
  • Lines overlap without their start-end points necessarily overlapping (a rare case for most maps).

The algorithm is applied to all polygons, then the simplification applied to shared edges is synced between polygons. The first and last shared points between two polygons are always kept to maintain the edge.
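One plausible way to sync shared edges is a union of “kept” flags over shared points: a point that appears in more than one polygon is kept everywhere if any polygon’s simplification kept it, so both sides of a shared edge end up with identical vertices. This is a hedged Python sketch of that idea, not the actual C# utility:

```python
from collections import defaultdict

def sync_shared_points(polygons, kept):
    """Sync simplification results across polygons that share points.

    polygons: list of point lists; kept: parallel boolean masks produced
    by simplifying each polygon independently. A point shared by two or
    more polygons is kept in all of them if any one of them kept it,
    so shared edges stay identical and no gaps or overlaps appear.
    """
    owners = defaultdict(list)  # point -> [(polygon index, point index), ...]
    for pi, poly in enumerate(polygons):
        for qi, pt in enumerate(poly):
            owners[pt].append((pi, qi))
    for pt, locs in owners.items():
        if len(locs) > 1 and any(kept[pi][qi] for pi, qi in locs):
            for pi, qi in locs:
                kept[pi][qi] = True
    return kept
```

For example, with two squares sharing an edge, a midpoint on the shared edge that one polygon dropped but the other kept is restored in both, so the common boundary stays in sync.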

The code doesn’t do any special handling for multiple geometries. They need to be broken into their pieces, e.g. multi-polygons into individual polygons. The same applies to interior/exterior rings.

Getting the results takes a call to the function “Reduce” and a loop over the array of points returned. The function returns the list of polygons (a list of arrays of points) in which the removed points are set to null. Just filter out the nulls and you’re good to go.
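As an illustration only (the real API is the C# “Reduce” function in the download, not this Python stand-in), the null-filtering step looks like:

```python
# Hypothetical output of a Reduce-style call: each polygon keeps its
# original point count, with removed points set to null/None.
reduced = [[(0, 0), None, (2, 0), None, (2, 2), (0, 2)]]

# Filter out the nulls to get the simplified polygons
simplified = [[pt for pt in poly if pt is not None] for poly in reduced]
print(simplified)  # [[(0, 0), (2, 0), (2, 2), (0, 2)]]
```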

Topology aware polygon simplification - Click to see before and after side-by-side

Above is a very simplified version of the map of California (orange), overlaid on top of the source map (in gray). Dashed lines indicate the new boundaries between polygons. You can see that the islands completely disappeared and San Francisco Bay isn’t there anymore, while most counties are reduced to rectangles and even triangles, yet the borders are still maintained. Ideally you’d pick a tolerance value that won’t simplify the polygons this much. You can download the code here. Feedback appreciated. Maybe somebody can make this into a SQL Server UDF 🙂

Big data, Mapping

How “big” data can help…

There is an interesting article in the WSJ about big data and its uses in helping with issues in developing countries, such as malaria. Having worked on a similar project, I wish Jake Porway the best of luck.

I think they’re off to a very good start. Malaria is a well-defined, narrowly scoped but highly visible problem, and for these kinds of projects, one of the most important things is visibility, as it defines the longevity of the project, its scope, and its support (funding, volunteers)… It is a great place to start, not only because it saves lives but also because it would help the project last longer and tackle many other problems in the long run.

I see the use of buzzwords like big data the same way. In most cases the data needed to solve a particular problem won’t be anywhere remotely near even a terabyte. For example, here is a quote from the article: “It takes about 600 trillion pixels to cover the surface of the earth.” While this is a storage problem, it is not a big data analysis issue. Data analysis is not done at global scale for the issues being discussed in the article, and images are tiled, so one can easily pick a very small subset.

I think the real issue here is data, not big data.

Here is a quote from the article that I completely agree with: “Democratization of data is a real issue, and people do try to protect data for good reasons, or bad. But once they have seen the value their data can generate when combined with other sources, then the walls start to crumble.”

Many of these datasets will be very small.

The tough problem is not dealing with large datasets. It is developing a sense of community and, more importantly, making data contribution frictionless and disparate datasets useful. Once people start contributing datasets, the problem will become their use of different formats, different terminologies, different temporal/spatial resolutions, different measurement units, different projections (in the case of the maps/imagery being discussed in the article) and insufficient documentation.

For such a project to deliver the best value, the outcome should be more than the sum of its parts. When there is a collection of datasets, different experts can use them to solve different problems. An environmental scientist would look at a different combination of the datasets from a different angle than a climatologist and solve a different problem. Contributors of some of those datasets may have nothing to do with those domains and may never have thought their data would be used to address these problems. So the eventual goal should be giving people a large collection of datasets they can easily discover, browse and analyze for problems they define, and letting them form communities within the framework around common goals, evangelize, and spread the word within their own professional circles.

And that is the real issue here. Not the size of the datasets.

Once again, best of luck.