Recently, I downloaded TIGER/Census data for 2010 from the Census Bureau’s website for a demo. Loading it into a SQL Server database was a breeze, thanks to FME. After creating spatial indexes I moved on to testing the performance. Even on a modest laptop, spatial joins (STIntersection), neighborhood searches (STBuffer, STDistance, …) and polygon-to-point conversions (STEnvelope) performed very well. For reference, the TIGER dataset contains 74,002 polygons, each consisting of 508 points on average. I also had ~6,000 department store locations as points.
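A typical neighborhood search looked roughly like the sketch below. The table and column names (`Tracts`, `Stores`, `Shape`, `Location`) are illustrative, not the actual schema, and the distance is in the units of the data's coordinate system:

```sql
-- Hypothetical schema: Tracts(TractId, Shape geometry), Stores(StoreId, Location geometry)
-- Find all census tracts within 5000 units of a given department store.
DECLARE @store geometry = (SELECT Location FROM Stores WHERE StoreId = 42);

SELECT t.TractId
FROM   Tracts AS t
WHERE  t.Shape.STDistance(@store) <= 5000;  -- units depend on the coordinate system
```

With a spatial index on `Tracts.Shape`, SQL Server can use it to prune candidates before the exact distance test, which is why queries like this stay fast even on a laptop.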
That’s the good part. Then I started looking into the data.
First, let’s get rid of all these totals. This is a database table, and totals are trivial to calculate on demand, so there’s no point in wasting storage space on them. Then, normalize the schema. Now it looks a bit more like a database.
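Dropping the stored totals is safe because they can always be recomputed with a simple aggregate. A minimal sketch, assuming a hypothetical normalized table called `Population` (the names are made up for illustration):

```sql
-- Hypothetical normalized table: Population(TractId, AgeGroup, Gender, PersonCount)
-- Total population per census tract, computed on demand instead of stored:
SELECT TractId,
       SUM(PersonCount) AS TotalPopulation
FROM   Population
GROUP  BY TractId;
```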
Next step, why the wide format? This is not an HTML table or an Excel spreadsheet. Having a column that mashes lots of things together is hardly useful. Instead of a column for males over 85, another for females over 85, another for males under 5 years, and so on, it is better to have one column for age group and another for gender. After lots of pivoting in T-SQL, that’s done, too. Another step towards making this a database table.
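The wide-to-long reshaping can be done in T-SQL with UNPIVOT. The sketch below uses a few invented wide column names rather than the real DP field codes, then derives the age group and gender columns from the unpivoted column name:

```sql
-- Hypothetical wide table:
--   CensusWide(TractId, MalesUnder5, FemalesUnder5, MalesOver85, FemalesOver85, ...)
SELECT u.TractId,
       CASE u.ColName
            WHEN 'MalesUnder5'   THEN 'Under 5'
            WHEN 'FemalesUnder5' THEN 'Under 5'
            ELSE '85 and over'
       END AS AgeGroup,
       CASE WHEN u.ColName LIKE 'Males%' THEN 'M' ELSE 'F' END AS Gender,
       u.PersonCount
FROM   CensusWide
UNPIVOT (PersonCount FOR ColName
         IN (MalesUnder5, FemalesUnder5, MalesOver85, FemalesOver85)) AS u;
```

In practice the real table has dozens of such columns per dimension, so the IN list (and the CASE mapping) is usually generated from metadata rather than written by hand.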
Now that we have age groups, let’s look at ethnicity.
Problem 1) Unlike age groups, there’s no gender info associated with ethnicity, which introduces a granularity mismatch. The same issue exists between age and ethnicity. This is one problem the wide format covered up. This data is surely collected, but it was sacrificed in the process of producing a compact subset. As a consequence I can’t find out, for example, the number of Asian males over 85 years of age in a census tract. I have to choose either ethnicity or age group and gender: at most two dimensions at a time, excluding the location. For example, I can find out the number of inmates for a given census tract but not their age breakdown or ethnicity or name your own attribute here. One may argue that the coarseness of the data is for privacy reasons, but this is not true, since even at the existing granularity there will often be a single person in a category for a given census tract.
Problem 2) Wikipedia has a good article about race and ethnicity in the US Census describing how the methods changed over time, with a table breaking down the Hispanic population in the latest census. In the 2010 census, Hispanic vs. non-Hispanic is a separate category from race. Again there are multiple levels of granularity. For example, the Hispanic population is broken into origins (e.g. Mexican, Puerto Rican) in one field and into races (Asian, African American, White, etc.) in another, while in yet another field races are broken into sub-categories, e.g. Asian: Japanese, Korean, Chinese… Since granularity is all over the place, you can’t get the number for something like White Puerto Ricans, or break everything down to the level of Cuban, Mexican, Chinese, Japanese, etc.
The worst part is that the numbers don’t add up. The best example is the following pair of fields.
DP0080020 Total Population of Two or More Races
DP0110017 Hispanic/Latino of Two or More Races
In about 38,205 census tracts, the total population of two or more races is less than the Hispanic population claiming to be of two or more races! Issues like this would surely be a lot easier to discover if the data were in a proper format for analysis.
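A check like the one below is all it takes to surface this kind of discrepancy. It assumes the DP fields were loaded as columns of a hypothetical `Tracts` table, which is how the FME import laid them out for me:

```sql
-- Tracts where "Hispanic/Latino of two or more races" (DP0110017) exceeds
-- "total population of two or more races" (DP0080020) -- which should be impossible.
SELECT COUNT(*) AS SuspectTracts
FROM   Tracts
WHERE  DP0110017 > DP0080020;
```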
So, there are discrepancies, and clearly the data is not meant for real multidimensional analysis. The question is, given how quickly and easily we can analyze relatively large sets of data these days, couldn’t the Census Bureau provide this data in an alternative, more useful shape?
The 2010 Census cost $13 billion. That is certainly a good enough reason to make the data more useful and usable than it currently is.
If the concern is bandwidth (aligning granularity properly by adding more detail would surely make the data much larger), it could be normalized and broken down into groups of dimensions instead of one big chunk. Maybe a custom “table builder” could let people pick and choose, mixing and matching fields of interest. This way people could download a useful subset of the data, e.g. just housing statistics, instead of data that contains fields they don’t need in a format that’s not very useful, which might even save the Census Bureau some bandwidth at the end of the day.