Big data, GIS, Satellites

If you’re looking for a data prep challenge, look no further than satellite imagery

It has been almost 10 months since my last blog post. Probably time to write a little bit about what I’ve been up to.

As some of you might know, in January I joined Descartes Labs. As a company, one of our goals is to make spatial data more readily available and to make it easier to go from observations to actionable insights. In a way, just like Tableau, we’d like people to see and understand their data, but our focus is on sensor data, whether remote (satellite or drone imagery, video feeds) or in-situ (weather station data). And when we talk about big data we mean many Petabytes being processed using tens of thousands of CPU or GPU cores.

But at a high level, many common data problems that you’d experience with databases or Excel spreadsheets apply just the same. For example, it is hard to find the right data, and there are inconsistencies and data quality issues that become more obvious when you want to integrate multiple data sources.

Sound familiar?

We built a platform that aims to automatically address many of these issues, what one might call a Master Data Management (MDM) system in enterprise data management circles, but focused on sensor data. For imagery, many use cases, from creating mosaics to change detection and various other deep learning applications, require these data corrections for best results. And having an automated system shaves off what would otherwise be many hours of manual data preparation.

For example, to use multiple images in an analysis, they have to be merged into a shared data space. The specific requirements of the normalization are application dependent, but it often requires that the data be orthorectified, coregistered, and spectrally normalized, while also accounting for interference by clouds and cloud shadows. We use machine learning to automatically detect clouds and their shadows, and hence can filter them out on demand; an example is shown below.

Optical image vs. water/land/cloud/cloud shadow segmentation
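As a rough illustration of the filtering step, here is a minimal sketch in C# that drops cloud and cloud-shadow pixels from a band, assuming you have a per-pixel class map aligned with the image. The class codes and array layout here are made up for the example and are not how our platform actually exposes this.

```csharp
using System;

// Hypothetical per-pixel class codes for a segmentation like the one shown above.
enum PixelClass : byte { Land = 0, Water = 1, Cloud = 2, CloudShadow = 3 }

static class CloudMasking
{
    // Returns a copy of the band with cloud and cloud-shadow pixels replaced by NaN,
    // so downstream statistics (means, composites, indices) simply ignore them.
    public static float[] MaskClouds(float[] band, byte[] classMap)
    {
        if (band.Length != classMap.Length)
            throw new ArgumentException("Band and class map must have the same number of pixels.");

        var masked = new float[band.Length];
        for (int i = 0; i < band.Length; i++)
        {
            var cls = (PixelClass)classMap[i];
            masked[i] = (cls == PixelClass.Cloud || cls == PixelClass.CloudShadow)
                ? float.NaN
                : band[i];
        }
        return masked;
    }
}
```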

However, to truly abstract satellite imagery into an information layer, analysts must also account for a variety of effects that distort the satellite-observed spectral signatures. These spectral distortions have various causes, including geographic region, time of year, differences in satellite hardware, and the atmosphere.

The largest of these effects is often the atmosphere. Satellites are above the atmosphere looking down and therefore see sunlight reflected from the surface mixed with light scattered by the atmosphere. The physical processes at play are similar to the ones that make the sky look blue when we look up.

The process of estimating and removing these effects from satellite imagery is referred to as atmospheric correction.  Once these effects are removed from the imagery, the data is said to be in terms of “surface reflectance”. This brings satellite imagery into a spectral space that is most similar to what humans see every day on the Earth’s surface.  

By putting imagery into this shared spectral data space, it becomes easier to integrate multiple sources of spectral information – whether those sources be imagery from different satellites, from ground based sensors, or laboratory measurements.

Top of Atmosphere vs. Surface Reflectance: what a satellite sees (left) vs. the surface (right)
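For a bit of concreteness, the step before any atmospheric correction is usually converting the raw digital numbers a satellite records into top-of-atmosphere reflectance. The sketch below does that conversion for Landsat 8 using the rescaling coefficients published in each scene’s MTL metadata file; going from there to surface reflectance is the hard part, requiring radiative transfer modeling of the atmosphere, and is not shown here.

```csharp
using System;

static class Landsat8Toa
{
    // Converts a raw Landsat 8 digital number (Qcal) to top-of-atmosphere reflectance
    // using the per-band rescaling coefficients from the scene's MTL metadata
    // (REFLECTANCE_MULT_BAND_x and REFLECTANCE_ADD_BAND_x) plus the sun elevation angle.
    public static double ToaReflectance(
        double qcal, double reflectanceMult, double reflectanceAdd, double sunElevationDegrees)
    {
        double toa = reflectanceMult * qcal + reflectanceAdd;       // TOA reflectance without sun angle correction
        double sunElevation = sunElevationDegrees * Math.PI / 180.0;
        return toa / Math.Sin(sunElevation);                        // correct for solar elevation
    }
}
```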

We take a different tack than other approaches to surface reflectance in that our algorithms are designed to be a base correction that is applicable to any optical image. Other providers of surface reflectance data often focus on their own sensors and their own data, sometimes making it more difficult for users of multiple sensors to integrate the otherwise disparate observations.

We have already preprocessed staple data sources such as NASA’s Landsat 8 and the European Space Agency (ESA)’s Sentinel-2 data. This includes all global observations for the lifespan of the respective satellites. We also generate scenes for other optical sensors, including previous Landsat missions, on request. In addition to our own algorithms, we also support USGS’s (LaSRC) and ESA’s (Sen2Cor) surface reflectance data.

If you’re into serious geospatial analysis, you should definitely give our platform a try and see for yourself. If you’re not but know someone who is, spread the word! With our recently launched platform, we are very excited to help domain experts get to insights faster by helping them find the right datasets, smartly distribute their computations across thousands of machines, and reduce the burden of dealing with data quality issues and the technical nuances of satellite data. You can read more about our surface reflectance correction and how to use it in our platform here.

Big data, GIS, Mapping

New Beginnings

After 5 exhilarating years at Tableau, on December 29th I said my goodbyes and walked out of the Seattle offices for the last time.

Tableau was a great learning experience for me. Watching the company grow nearly 7-fold, going through an IPO and becoming the gold standard for business intelligence…

I worked with very talented people, delivered great features and engaged with thousands of enthusiastic customers.

It was a blast.

But there is something about starting anew. It is the new experiences and challenges we face that make us grow. And there are very few things I like more in life than a good challenge 🙂

I flew out of Seattle on December 30th to start a new life in Santa Fe, New Mexico and join a small startup called Descartes Labs on January 2nd.


At Descartes Labs, we’re building a data refinery for remote sensing data. It is being used for a growing set of scenarios, from detecting objects in satellite images using deep learning to creating the most accurate crop forecasts on the market and understanding the spread of Dengue fever, all on a massively scalable computational infrastructure holding over 40 Petabytes of image data, with 50+ Terabytes of new data added daily. If you haven’t heard of us already, check us out. You won’t be disappointed.

I will continue writing, but Tableau won’t be the primary focus for new content on the blog anymore. I will try to answer Tableau related questions as time permits. Although I must say that several of Descartes Labs’ customers also use Tableau, so I am using Tableau quite often in my new job as well.

To the new year and new beginnings!

 

GIS, Visualization

Matey, ye’ve been boarded!

These are good days for web mapping for sure. We’re moving from being passive consumers of tiles served by WMS servers to custom vector maps, with style sheets to build them! Kartograph and TileMill are my two favorites, and when I saw A.J. Ashton’s pirate map on TileMill’s website, it made me realize that I am almost 6 months late to celebrate Talk Like a Pirate Day. Since I’m so late, I thought I should keep the celebration modest and, instead of generating tiles, stick with a single background image. With a little bit of data from the International Chamber of Commerce on modern-day pirate activities, here is what I ended up with. Click on the image to open the actual visualization.

Hoist ye Jolly Roger! (Click to open the interactive viz)

These are the incidents from January 1st until last week. So lots of scurvy sea dogs out there, Arrr.

GIS

Fun with Census data, well… not so much

Recently, I downloaded TIGER/Census data for 2010 from the Census Bureau’s website for a demo. Inserting it into a SQL Server database was a breeze, thanks to FME. After creating spatial indexes I moved on to testing the performance. Even on a modest laptop, spatial joins (STIntersection), neighborhood searches (STBuffer, STDistance…) and bounding-box calculations (STEnvelope) performed very well. For reference, the TIGER dataset contains 74,002 polygons, each consisting of 508 points on average. I also had ~6,000 department store locations as points.
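To give a flavor of the operations being timed, here is a small sketch using the Microsoft.SqlServer.Types client library with made-up coordinates; the tests themselves ran as queries against the indexed tables, but the spatial method calls look the same.

```csharp
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Types;

static class SpatialChecks
{
    static void Main()
    {
        // A census tract polygon and a department store location (made-up coordinates).
        var tract = SqlGeometry.Parse(new SqlString(
            "POLYGON((-122.40 47.60, -122.30 47.60, -122.30 47.70, -122.40 47.70, -122.40 47.60))"));
        var store = SqlGeometry.Parse(new SqlString("POINT(-122.35 47.65)"));

        // Spatial join predicate: does the store fall inside the tract?
        Console.WriteLine(tract.STIntersects(store).Value);

        // Neighborhood search: distance to the tract and a buffer around the store.
        Console.WriteLine(store.STDistance(tract).Value);
        SqlGeometry neighborhood = store.STBuffer(0.05);
        Console.WriteLine(neighborhood.STArea().Value);

        // Bounding box of the tract, which is what STEnvelope returns.
        Console.WriteLine(tract.STEnvelope().STAsText().ToSqlString().Value);
    }
}
```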

That’s the good part. Then I started looking into the data.

First, let’s get rid of all these totals. This is a database table, and it is so easy to calculate totals that there’s no point in wasting storage space on them. Then, normalize the schema. Now it looks a bit more like a database.

Next step: why the wide format? This is not an HTML table or an Excel spreadsheet. Having a column that mashes up lots of things together is hardly useful. Instead of a column for males over 85, another one for females over 85, another column for males under 5 years and so on, it is better to have a column for age group and another one for gender. After lots of pivoting in T-SQL, that’s done, too. Another step towards making this a database table.
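The actual reshaping was done in T-SQL, but just to illustrate the wide-to-long idea, here is a rough sketch with made-up column names; the real table has many more columns, of course.

```csharp
using System.Collections.Generic;

// A wide census row: one column per gender/age-group combination (made-up names).
record WideTractRow(string TractId, int MalesUnder5, int FemalesUnder5, int MalesOver85, int FemalesOver85);

// The long ("unpivoted") shape: one row per tract, gender and age group.
record PopulationFact(string TractId, string Gender, string AgeGroup, int Count);

static class Unpivot
{
    public static IEnumerable<PopulationFact> ToLong(WideTractRow row)
    {
        yield return new PopulationFact(row.TractId, "Male",   "Under 5", row.MalesUnder5);
        yield return new PopulationFact(row.TractId, "Female", "Under 5", row.FemalesUnder5);
        yield return new PopulationFact(row.TractId, "Male",   "85+",     row.MalesOver85);
        yield return new PopulationFact(row.TractId, "Female", "85+",     row.FemalesOver85);
    }
}
```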

Now that we have age groups, let’s look at ethnicity.

Problem 1) Unlike age groups, there’s no gender info associated with ethnicity, which introduces a granularity mismatch. The same issue exists between age and ethnicity. This is one problem the wide format covered up. This data is surely collected, but sacrificed in the process of getting a compact subset. As a consequence I can’t find out, for example, the number of Asian males over 85 years of age in a census tract. I need to choose either ethnicity, or age group and gender: at most two dimensions at a given time, excluding the location. For example, I can find out the number of inmates for a given census tract but not their age breakdown or ethnicity (or name your own attribute here). One may argue that the coarseness of the data is for privacy reasons, but this is not true, since even at the existing granularity there will often be a single person in a category for a given census tract.

Problem 2) Wikipedia has a good article about race and ethnicity in the US Census describing how the methods changed over time, with a table breaking down the Hispanic population in the latest census. In the 2010 census, Hispanic vs. non-Hispanic is a category separate from race. Again there are multiple levels of granularity. For example, the Hispanic population is broken into origins (e.g. Mexican, Puerto Rican) in one field and into races (Asian, African American, White, etc.) in another, while in yet another field races are broken into sub-categories, e.g. Asian: Japanese, Korean, Chinese… Since the granularity is all over the place, you can’t get to the number for something like White Puerto Ricans or break everything down to the level of Cuban, Mexican, Chinese, Japanese, etc.

The worst part is that the numbers don’t add up. The best example is the following pair of fields.

  • DP0080020  Total Population of Two or More Races
  • DP0110017   Hispanic/Latino of Two or More Races 

In about 38,205 census tracts, the total population of two or more races is less than the Hispanic population who claim to be of two or more races! Issues like this would surely be a lot easier to discover if the data were in a proper format for analysis.
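This is the kind of sanity check that becomes a one-liner once the data is in an analyzable shape. A rough sketch of the comparison, with a made-up record type holding the two fields, looks like this:

```csharp
using System.Collections.Generic;
using System.Linq;

// One census tract with just the two fields in question (made-up record shape).
record TractRaces(string TractId, int DP0080020_TwoOrMoreRaces, int DP0110017_HispanicTwoOrMoreRaces);

static class SanityCheck
{
    // Counts tracts where the Hispanic two-or-more-races figure exceeds the
    // total two-or-more-races figure, which should never happen.
    public static int CountInconsistentTracts(IEnumerable<TractRaces> tracts) =>
        tracts.Count(t => t.DP0080020_TwoOrMoreRaces < t.DP0110017_HispanicTwoOrMoreRaces);
}
```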

So, there are discrepancies, and clearly the data is not meant for real multidimensional analysis. The question is, given how quickly and easily we can analyze relatively large sets of data these days, couldn’t the Census Bureau provide this data in an alternative, more useful shape?

The 2010 Census cost $13 billion. That is certainly a good enough reason to make the data more useful and usable than it currently is.

If the concern is bandwidth (aligning granularity properly by adding more detail would surely make the data much larger), the data could be normalized and broken down into groups of dimensions instead of one big chunk. Maybe a custom “table builder” could allow people to pick and choose, mix and match fields of interest. This way people could download a useful subset of the data, e.g. just housing statistics, instead of data that contains fields they don’t need in a format that’s not very useful, which may even save the Census Bureau some bandwidth at the end of the day.

GIS, Mapping, Visualization

Simplifying multiple neighboring polygons

Polygon complexity is often an issue when using online mapping applications that rely on JavaScript. Lengthy initial loading times, browser unresponsiveness, even crashes can occur when dealing with polygons that consist of tens of thousands of points. Real maps are generally complex, but for web mapping purposes accuracy is not always a necessity. For example, you probably wouldn’t worry about topographic fidelity if your goal is to show census data using a choropleth map, and neither would your end users. Not to mention, at times high fidelity does not serve any purpose, as the zoom level may not allow users to see the details in the first place. When the user zooms in, the shrinking viewing area allows adding more detail, since you won’t have to worry about points outside the view.

The Douglas-Peucker algorithm is a very popular method for curve simplification. It is offered in many spatial packages, including SQL Server’s spatial library. However, a common problem is that the algorithm simplifies polygons one at a time without taking into account shared boundaries between neighboring polygons. This results in gaps and overlaps between polygons, as seen below.

Multiple polygon simplification resulting in gaps
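For reference, here is a minimal sketch of the classic recursive Douglas-Peucker routine for a single polyline; note that it has no notion of neighboring polygons, which is exactly the limitation described above.

```csharp
using System;
using System.Collections.Generic;

struct Point2 { public double X, Y; public Point2(double x, double y) { X = x; Y = y; } }

static class DouglasPeucker
{
    // Keeps the point farthest from the chord between the first and last points whenever
    // that distance exceeds the tolerance; otherwise drops everything in between.
    public static List<Point2> Simplify(IReadOnlyList<Point2> pts, double tolerance)
    {
        if (pts.Count < 3) return new List<Point2>(pts);

        int farthest = 0;
        double maxDist = 0;
        for (int i = 1; i < pts.Count - 1; i++)
        {
            double d = PerpendicularDistance(pts[i], pts[0], pts[pts.Count - 1]);
            if (d > maxDist) { maxDist = d; farthest = i; }
        }

        if (maxDist <= tolerance)
            return new List<Point2> { pts[0], pts[pts.Count - 1] };

        // Split at the farthest point and simplify both halves recursively.
        var left = Simplify(new List<Point2>(pts).GetRange(0, farthest + 1), tolerance);
        var right = Simplify(new List<Point2>(pts).GetRange(farthest, pts.Count - farthest), tolerance);

        left.RemoveAt(left.Count - 1);   // avoid duplicating the split point
        left.AddRange(right);
        return left;
    }

    static double PerpendicularDistance(Point2 p, Point2 a, Point2 b)
    {
        double dx = b.X - a.X, dy = b.Y - a.Y;
        double length = Math.Sqrt(dx * dx + dy * dy);
        if (length == 0)
            return Math.Sqrt((p.X - a.X) * (p.X - a.X) + (p.Y - a.Y) * (p.Y - a.Y));
        // Twice the triangle area (a, b, p) divided by the base length gives the height.
        return Math.Abs(dx * (a.Y - p.Y) - dy * (a.X - p.X)) / length;
    }
}
```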

I wrote a small utility in C# that helps address this problem, which I’d like to share here. It took me about an hour, so don’t be surprised if you find bugs. At least with the maps I used it on, and with reasonable tolerance factors, the results were great. Keep in mind that the Douglas-Peucker algorithm can reduce polygons to lines or even points if the provided tolerance factor is high, so you may want to try different values to find what’s ideal.

The code looks for two types of shared edges:

  • Start-end points of polylines that make up the polygon overlap with those on other polygons (which is the most common case).
  • Lines overlap without start-end points necessarily overlapping (a rare case for most maps).

The algorithm is applied to all polygons, then the simplification applied to shared edges is synced between polygons. The first and last shared points between two polygons are always kept to maintain the edge.

The code doesn’t do any special handling for multi-part geometries; they need to be broken into their pieces, e.g. multi-polygons into individual polygons. The same applies to interior/exterior rings.

Getting results takes a call to the “Reduce” function and a loop over the array of points returned. The function returns the list of polygons (a list of arrays of points) in which the removed points are set to null. Just filter out the nulls and you’re good to go; a usage sketch is below.
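For example, the cleanup step could look like the sketch below; the point type, class names and the exact Reduce signature shown in the comment are guesses, so check them against the downloaded code.

```csharp
using System.Collections.Generic;
using System.Drawing;
using System.Linq;

static class SimplifiedPolygonCleanup
{
    // Filters out the null entries that Reduce leaves behind for removed points,
    // turning each polygon back into a plain array of points ready for rendering.
    public static List<PointF[]> DropRemovedPoints(List<PointF?[]> reduced) =>
        reduced.Select(polygon => polygon.Where(p => p.HasValue)
                                         .Select(p => p.Value)
                                         .ToArray())
               .ToList();
}

// Usage, assuming the utility exposes something like
//   List<PointF?[]> Reduce(List<PointF[]> polygons, double tolerance)
// (class name, point type and signature are guesses):
//   var simplified = Reduce(polygons, 0.05);
//   var cleaned = SimplifiedPolygonCleanup.DropRemovedPoints(simplified);
```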

Topology aware polygon simplification - Click to see before and after side-by-side

Above is a very simplified version of the map of California (orange) overlaid on top of the source map (gray). Dashed lines indicate the new boundaries between polygons. You can see that the islands completely disappeared and the San Francisco Bay isn’t there anymore, while most counties are reduced to rectangles or even triangles, yet the shared borders are still maintained. Ideally you’d pick a tolerance value that won’t simplify the polygons this much. You can download the code here. Feedback appreciated. Maybe somebody can make this into a SQL Server UDF 🙂
