Big data, Collaboration, Economy, Retail

The Retail “Apocalypse”

Retail is changing.  

US online sales have been growing 15.6% year over year, roughly 4x the rate of the overall retail market, and amounted to 15% of all retail sales in 2019. Depending on the time of year, location and category, the change is even more pronounced.

1010data’s report shows that while overall Black Friday sales held to similar levels year over year, still representing almost 30 percent of the week’s sales, the in-store channel’s share of spend dropped from an average of 83 percent in November 2014 to an average of 68 percent in November 2019.

For books, roughly 15% digital penetration led to consolidation and eventually the bankruptcy of physical bookstore chains. Today, for apparel and electronics, the share of online sales has already surpassed 30%. On average, the share of online grocery sales is still in the single digits, but it varies greatly by locale: from a high of 12% in New York City to 5% in San Francisco and 2% in Des Moines.

While improved delivery times and easy returns have significantly changed consumer attitudes toward online shopping overall, growth in groceries has been slow due to thin margins: with an average item price of around $3 and a 30% gross margin, only $0.90 is left to cover all of handling, selling and delivery. With KPMG predicting that self-driving delivery vehicles will reduce the cost of delivery to between 4 and 7 cents per mile, financially viable online grocery businesses will be well within reach in the next few years.

And I’m not talking about the distant future here. A few months ago, UPS became the first company to receive FAA certification to build a “drone airline”. The certification allows them to fly drones out of the operator’s line of sight, during day or night, over people and with cargo weighing more than 55 pounds. Soon UPS trucks will become hubs for swarms of drones, while in densely populated areas, drones will deliver directly from distribution centers to homes. While waiting for FAA approval of its own drone services business, Amazon of course wasn’t resting on its laurels, piloting package delivery with its Scout autonomous vehicles. Startups like Cleveron, Postmates and Starship are also building their own commercial robots for last-mile delivery. This all spells trouble for companies like Instacart, and for retailers that plan on relying on them to solve their delivery problem in the longer run.

Another, often complementary, approach to reducing last-mile distribution costs is micro-fulfillment centers for deliveries and pickup orders, which make 1-day, even 1-hour, delivery feasible. California-based startup Farmstead has made over 100,000 deliveries in the San Francisco Bay Area in the past 3 years and recently announced that it is expanding its footprint into the Carolinas. Its micro-fulfillment centers cost 1% of an average supermarket to build and have much smaller footprints, allowing for a much larger number of locations, closer to customers. The UK’s Ocado, having shown with its $2+ billion annual revenue that it is possible to turn a profit as an online-only supermarket, is starting to transform itself into a technology company by licensing its automated fulfillment center tech. Truepill, Alto Pharmacy, Nimble RX and Capsule are applying the same formula to pharmacy, delivering not only your regular prescriptions but also offering same-day delivery at no extra cost, so you don’t have to endure long pharmacy lines while you’re sick.

Many large US retailers are already in various stages of pilot implementations. Kroger, through a joint venture with Ocado, is building 20 fulfillment centers for online orders and has so far committed to facilities near Dallas, Cincinnati, Orlando and Atlanta. Meijer recently announced that it will begin testing micro-fulfillment with the logistics company Dematic. Albertsons is piloting a fully automated micro-fulfillment center with the robotics company Takeoff Technologies. Fabric claims to have built the world’s smallest fulfillment center, which can process up to 600 orders per day out of 6,000 square feet. That is roughly twice the number of orders per square foot an average brick-and-mortar store would generate, per Readex Research’s 2019 Annual Retailer Survey. Considering that the average online grocery basket is also roughly twice that of brick-and-mortar, a fulfillment center operating at full capacity translates to roughly 4x the revenue per square foot of a traditional brick-and-mortar store. When you factor in the cost savings of distribution facilities relative to commercial retail space, the online model becomes even more compelling.
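The back-of-the-envelope math is easy to verify. In the sketch below, the order density and the 2x basket ratio are the figures cited above; the $50 in-store basket is an assumed, purely illustrative number (the 4x ratio holds regardless of its actual value):

```python
# Daily revenue: micro-fulfillment center (MFC) vs. an equally sized store.
mfc_orders_per_day = 600            # Fabric's claimed capacity
mfc_sqft = 6_000

# Per the survey figure, an average store generates roughly half the
# orders per square foot, so an equally sized store does ~300 orders/day.
store_orders_per_day = mfc_orders_per_day // 2

store_basket = 50                   # assumed in-store basket, in dollars
online_basket = 2 * store_basket    # online baskets are roughly 2x

mfc_daily_revenue = mfc_orders_per_day * online_basket      # 600 * 100
store_daily_revenue = store_orders_per_day * store_basket   # 300 * 50

print(mfc_daily_revenue // store_daily_revenue)  # -> 4
```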

Finally, many staples we’re used to buying in stores are very well suited to planned, automatic replenishment. You can schedule deliveries for your cereal and peanut butter once a week, toilet paper once a month, and adjust as needed to whatever cadence best suits your household. If you have a loyalty card and use the same retailers consistently, it is only a matter of time before this gets fully automated. Thanks to machine learning, it won’t be too long before Alexa asks whether you’d like your monthly pantry items delivered to your home or would rather pick them up from the nearest Amazon Locker on your drive back from work. Harry’s, Dollar Shave Club and Barkbox are a few companies that have been successful with this model. Not to mention Farmstead, with most of its customers enrolled in a weekly subscription program to save money on staples like milk, eggs, bacon and most packaged goods. Predictability allows Farmstead to better optimize its supply chain, reduce waste and pass on those savings to its customers. Meal kit vendors have also benefited from such predictability, resulting in a reduced carbon footprint.

What does all this mean for traditional brick-and-mortar retailers?  

UK-based Argos is a rare survivor from the catalog retail era that started in the early 1900s. While almost all of those retailers went bankrupt within the past 2 decades, Argos survived by transforming itself with e-commerce, click-and-collect and delivery options; the majority of its revenue is now generated through online sales. Today’s retailers have a similar choice, but it is easier said than done.

Transforming into an omni-channel retailer requires significant innovation and organizational change; a system-wide digital transformation that includes the supply chain:

1. Articulate your vision. Executing harmoniously across multiple channels requires a highly concerted effort that can be difficult in organizations used to operating in silos. A strong vision, framed from the point of view of the customer and their changing needs and expectations rather than by new technology or existing organizational boundaries, is key to aligning the various stakeholders.

2. Define your strategy. Retailers need to consider their market and make their own decisions about business and operations, as there is no single recipe for success. Different customer segments will value parts of the shopping experience differently, and different products will align better with different distribution channels, but there are plenty of success stories to draw inspiration from.

  • Listen to your customers. The Internet gave customers a voice, and they expect to be heard. This could be through customer support channels, social media, blogs, forums or indirect feedback from instrumenting customer experiences. Brands that pay attention to customer feedback have more engaged customers, higher customer satisfaction scores, and are able to identify new product opportunities. Two great recent examples of this are Coca-Cola and Soylent. Soylent made a name for itself with its “open-source” meal replacement products: it enabled its customer community to come up with DIY recipes and share them with each other, some of which inspired recipes currently sold by the company. Coca-Cola introduced its new Orange-Vanilla flavor because it was one of the most popular pairings in the data from its Freestyle fountain dispensers. Accenture found that 91% of consumers prefer to buy from brands that remember their choices and provide relevant offers and recommendations, while 83% are willing to share their data to enable personalized experiences. While today customization typically means coupons in stores or product recommendations on e-commerce sites, as Coca-Cola and Stitch Fix have shown, there are many more ways to personalize.
  • Give customers a reason to come to your store. For most customers, grocery shopping is a chore they would avoid if they could. Successful retailers find ways to draw them into their stores. TJ Maxx and Lidl understand that people love the thrill of a treasure hunt. Lidl, in addition to the usual meat, fruits and vegetables, offers a rotation of specials, “Lidl surprises”, released every Monday and Thursday. As soon as they’re gone, they’re gone. Replace groceries with apparel and you end up with TJ Maxx’s formula: new selections at least once a month, with deep discounts only available in store. Bonobos and Glossier took brick-and-mortar and turned it on its head with their successful showroom concepts, where customers visit the stores not to pick items off the shelves but for the experience, to try on products and get personalized fashion advice. This is an approach that can be generalized to fast-moving goods as well. Imagine yourself enjoying a wine flight, sampling food or even taking a cooking class in the store while robots in the automated warehouse at the back of the store get your order ready for pickup on your way out.
  • Meet your customers where they are. Today’s customers have high expectations of convenience and flexibility. They could be ordering through your website but exchanging at a local store, comparing product specs online but buying in store, or ordering online for in-store pickup. Success in omni-channel retailing requires a seamlessly integrated customer experience across all channels: physical stores, computers and mobile devices, apps, e-commerce sites and social media. It is important to understand which channels matter most to your customer base and start with those. Sometimes these are the usual suspects like subscriptions, free returns and curbside pickup, but sometimes it requires thinking outside the box. For example, way ahead of its time, Tesco opened the world’s first virtual store in the Seoul subway in 2011 to help time-pressed commuters shop on the go using their smartphones, with same-day delivery.
  • Improve your supply chain. Grocery giant ALDI (owner of ALDI stores and Trader Joe’s) is known for its bargain prices. It primarily owes this to ~90% of its products being private label, which means lower unit costs, and to a reduced selection, which in turn means smaller store footprints. The British online grocery retailer Ocado operates no stores and does all home deliveries from its warehouses, with an industry-leading 0.02% waste. For Ocado, fully automated warehouses not only allow it to be price-competitive; they also mean a better customer experience, reduced waste and a smaller carbon footprint. Walmart spent $4 billion in 1991 to create Retail Link to better collaborate with its suppliers; today there are many off-the-shelf platforms retailers can use for this purpose at a fraction of the cost. For customers, this means fewer out-of-stocks; for retailers, an additional revenue stream through data monetization. Retailers will need to redesign their supply chains based on the services they want to offer, which will often mean looking at the supply chain as having many possible starting and ending points rather than as a single flow to get products on the shelf. This also means a shift in focus from on-shelf availability to dynamic trade-offs between availability, margins and delivery times: reallocating products across channels based on sell-through rates and even testing demand for a product online before moving it to store shelves. This is only possible with shared inventory and end-to-end visibility across all distribution channels.

3. Assess infrastructure needs. Organizations must determine the technology capabilities needed to support their vision, from data management and machine learning to in-store sensors, warehouse management and delivery. Effectively executing an omni-channel strategy requires a fully unified stack. Successful retailers use machine learning to watch consumer trends and customer feedback, personalize offers, manage product assortment, decide optimal distribution center locations, forecast demand and inventory, and understand the user journey and marketing channel effectiveness; in-store IoT devices to react to user actions and inventory updates in real time; robotic automation to increase warehouse efficiency; and data sharing with suppliers to more effectively manage inventory and enable timely direct-to-store shipments. Making all the components work together, integrating different hardware and software, is often a multi-year project.

4. Identify necessary organizational changes. Retailers will need to restructure business processes and metrics, and define rules for shipping products and allocating revenue between channels. If a gift is ordered from a website but exchanged at a local store, where should the revenue go? What if the customer went to a store, saw a display model, found the product out of stock, and then ordered it from her smartphone to ship to her home? With omni-channel retail blurring the lines, the right incentives need to be put in place for business success, with all parties focusing on delivering customer value.

Retail is changing but in a way, everything old is new again.

E-commerce sites are the new catalog stores, Alexa is not the name of the server who knows how you like your steak but a voice assistant who will know about almost all your buyer preferences, convenience stores will become vending machines you can walk into, and the milkman will be a robot.

All in all, it will be better for the consumer and less taxing on the environment.  

So why call it an apocalypse? It is the retail renaissance. 

Big data, GIS, Satellites

If you’re looking for a data prep challenge, look no further than satellite imagery

It has been almost 10 months since my last blog post. Probably time to write a little bit about what I’ve been up to.

As some of you might know, in January I joined Descartes Labs. One of our goals as a company is to make spatial data more readily available and to make it easier to go from observations to actionable insights. In a way, just like Tableau, we’d like people to see and understand their data, but our focus is on sensor data, whether remote, such as satellite or drone imagery and video feeds, or in-situ, such as weather station data. And when we talk about big data, we mean many Petabytes being processed using tens of thousands of CPU or GPU cores.

But at a high level, many common data problems that you’d experience with databases or Excel spreadsheets apply just the same. For example, it is hard to find the right data, and there are inconsistencies and data quality issues that become more obvious when you want to integrate multiple data sources.

Sound familiar?

We built a platform that aims to automatically address many of these issues; what one might call a Master Data Management (MDM) system in enterprise data management circles, but focused on sensor data. For imagery, many use cases, from creating mosaics to change detection and various other deep learning applications, require these data corrections for best results. And having an automated system shaves off what would otherwise be many hours of manual data preparation.

For example, to use two or more images in an analysis, the images have to be merged into a shared data space. The specific requirements of the normalization are application dependent, but it often requires that the data be orthorectified, coregistered and their spectral signatures normalized, while also accounting for interference by clouds and cloud shadows. We use machine learning to automatically detect clouds and their shadows, and can therefore filter them out on demand; an example is shown below.

Optical image vs water/land/cloud/cloud shadow segmentation
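To make the idea concrete, here is a toy sketch of what per-pixel cloud/shadow filtering amounts to. The labels and the function are hypothetical illustrations, not the actual Descartes Labs API, and a real pipeline operates on raster arrays rather than Python lists:

```python
# Each pixel carries a reflectance value and a segmentation label produced
# by an ML classifier: "land", "water", "cloud" or "cloud_shadow".
CLEAR = {"land", "water"}

def mask_clouds(pixels):
    """Replace cloudy/shadowed pixels with None so downstream analyses
    (mosaics, change detection) can ignore or gap-fill them."""
    return [value if label in CLEAR else None for value, label in pixels]

scene = [(0.12, "land"), (0.80, "cloud"), (0.05, "water"), (0.07, "cloud_shadow")]
print(mask_clouds(scene))  # -> [0.12, None, 0.05, None]
```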

However, to truly abstract satellite imagery to an information layer, analysts must also account for a variety of effects that distort the satellite observed spectral signatures. These spectral distortions have various causes that include geographic region, time of year, differences in the satellite hardware, and the atmosphere.

The largest of these effects is often the atmosphere.  Satellites are above the atmosphere looking down and, therefore, mix the sunlight reflected from the surface with that scattered by the atmosphere. The physical processes at play are similar to why the sky is blue when we look up.

The process of estimating and removing these effects from satellite imagery is referred to as atmospheric correction.  Once these effects are removed from the imagery, the data is said to be in terms of “surface reflectance”. This brings satellite imagery into a spectral space that is most similar to what humans see every day on the Earth’s surface.  
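As a concrete, and deliberately naive, illustration, the classic dark-object subtraction heuristic captures the basic idea: the atmosphere adds a roughly constant offset to top-of-atmosphere values, which can be estimated and removed. Production algorithms model the scattering physics per band and are far more involved; this sketch is only illustrative:

```python
# Dark object subtraction: assume the darkest pixel in a band should be
# near zero reflectance at the surface, so its observed value approximates
# the atmospheric (path radiance) offset for the whole band.

def dark_object_subtraction(band):
    offset = min(band)                       # darkest observed pixel
    return [max(v - offset, 0.0) for v in band]

toa = [0.04, 0.10, 0.25, 0.07]               # top-of-atmosphere values
surface = dark_object_subtraction(toa)        # offset = 0.04
print(surface)
```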

By putting imagery into this shared spectral data space, it becomes easier to integrate multiple sources of spectral information – whether those sources be imagery from different satellites, from ground based sensors, or laboratory measurements.

Top of Atmosphere vs Surface Reflectance
What a satellite sees (left) vs the surface (right)

We take a different tack than other approaches to surface reflectance, in that our algorithms are designed to be a base correction applicable to any optical image. Other providers of surface reflectance data often focus on their own sensors and their own data, sometimes making it more difficult for users of multiple sensors to integrate the otherwise disparate observations.

We have already preprocessed staple data sources such as NASA’s Landsat 8 and the European Space Agency (ESA)’s Sentinel-2 data. This includes all global observations for the lifespan of the respective satellites. We also generate scenes for other optical sensors, including previous Landsat missions, on request. In addition to our own algorithms, we also support the USGS’s (LaSRC) and ESA’s (Sen2Cor) surface reflectance data.

If you’re into serious geospatial analysis, you should definitely give our platform a try and see for yourself. If you’re not but know someone who is, spread the word! With our recently launched platform, we are very excited to help domain experts get to insights faster by helping them find the right datasets, smartly distribute their computations across thousands of machines, and reduce the burden of dealing with data quality issues and the technical nuances of satellite data. You can read more about our surface reflectance correction and how to use it in our platform here.

Big data, GIS, Mapping

New Beginnings

After 5 exhilarating years at Tableau, on December 29th I said my goodbyes and walked out of the Seattle offices for the last time.

Tableau was a great learning experience for me. Watching the company grow nearly 7 fold, going through an IPO and becoming the gold standard for business intelligence…

I worked with very talented people, delivered great features and engaged with thousands of enthusiastic customers.

It was a blast.

But there is something about starting anew. It is the new experiences and challenges we face that make us grow. And there are very few things I like more in life than a good challenge 🙂

I flew out of Seattle on December 30th to start a new life in Santa Fe, New Mexico and join a small startup called Descartes Labs on January 2nd.


At Descartes Labs, we’re building a data refinery for remote sensing data. It is being used for a growing set of scenarios, from detecting objects in satellite images using deep learning to creating the most accurate crop forecasts on the market and understanding the spread of Dengue fever, all on a massively scalable computational infrastructure with over 40 Petabytes of image data and 50+ Terabytes of new data added daily. If you haven’t heard of us already, check us out. You won’t be disappointed.

I will continue writing, but Tableau won’t be the primary focus for new content on the blog anymore. I will try to answer Tableau-related questions as time permits. I should say that several of Descartes Labs’ customers also use Tableau, so I am using Tableau quite often in my new job as well.

To the new year and new beginnings!


Big data

End of the road for Spire

Database startup Drawn to Scale, creator of Spire, one of the first SQL-on-Hadoop solutions, is closing down. The company had some major paying customers, such as American Express, and offered SQL-92 compliance and real-time responsiveness. CEO Bradford Stephens announced the closure in a blog post on Friday.

While the operational SQL database space is still relatively small (FoundationDB, Splice Machine, F1), it is getting crowded* and more competitive for other SQL-on-Hadoop solutions. Who do you think will be the winner? Cloudera’s Impala, Greenplum’s HAWQ, Salesforce’s Phoenix, Hadapt’s AAP, Hortonworks’ Stinger…?

*Given that Peregrine and Cheetah are also taken, good luck finding a name if you have any startup ideas in this area : )

Big data

Speaking of SQL on Hadoop

Greenplum just announced Pivotal HD, their new Hadoop distribution that contains HAWQ, a high-performance relational database running on Hadoop. Here are some highlights.

  • Fully compliant and robust SQL92 and SQL99 support. We also support the SQL 2003 OLAP extensions. 100% compatible with PostgreSQL 8.2.
  • Columnar or row-oriented storage to provide benefits based on different workloads. This is transparent to the user and is specified when the table is created. HAWQ figures out how to shard, distribute, and store the data.
  • Seamless partitioning allows separating tables on a partition key, enabling fast scans of subsets by pruning off portions that are not needed in a query. Common partition schemes are on dates, regions, or anything commonly filtered on.
  • Parallel query optimizer and planner take SQL queries that look like any other, then intelligently look at table stats to figure out the best way to return data.
  • Table-by-table specification of distribution keys allows design of table schemas to take advantage of node-local JOINs and GROUP BYs.

HAWQ follows an MPP architecture. Greenplum claims that HAWQ is hundreds of times faster than Hive and orders of magnitude faster on some queries (GROUP BYs and JOINs) than competing SQL-on-Hadoop solutions.

Big data

ACID strikes back

At the root of most technological advances is an unaddressed need: a problem to which a good solution does not exist. Accidental discoveries (penicillin, saccharin, Teflon, the microwave, the pacemaker, the X-ray…), although some are quite impactful, don’t happen very often, especially in the domain of computer science. It is safe to say that the process is evolutionary, and it is very easy to see some form of what we may call heredity, recombination, gene flow, adaptation and/or extinction in action.

Take RDBMS and NoSQL, for example. Both solutions came out to address certain needs, started by solving an immediate problem and, over time, evolved by adapting to new requirements and cross-pollinating with other (sometimes competing) technologies to prevail. Navigational DBMS came out in the 1960s, followed by the emergence of relational DBMS in the 1970s. It was obvious that a set-oriented language was more suitable for retrieving records than dealing with loops, leading to SQL, a word that later became almost synonymous with RDBMS.

ACID (Atomicity, Consistency, Isolation, Durability) was a core piece, as it dealt with complexities such as race conditions (a problem in multi-threaded/multi-user transactional systems) that a data worker wouldn’t want to deal with, further helping the mass adoption of the technology. By the 1980s, distributed databases had already started appearing.

When Object Oriented systems (OODBMS) entered the stage in the 1980s, relational database vendors reacted by adding similar features (e.g. user-defined types/functions) and maintained their dominance.

Native XML databases of the 2000s shared a similar fate. Today most relational databases are XML-enabled (XML as a data type, queried using XPath/XQuery), and even built-in features like geospatial support rely on custom objects (polygons, lines etc.) and methods (distance, overlap, centroid), a legacy of object-oriented systems. Not to mention FileStream and FileTables, and full-text search with thesauri, stop words and stemming, reacting to the emerging need for unstructured data and document handling. Of course, it was not only the adaptations: Moore’s law was also on the side of RDBMS, as more CPU power meant less need for special-purpose solutions.

More recently, RDBMS have been adding better support for graphs (e.g. IBM DB2), while specialized solutions such as Neo4j or InfiniteGraph also exist. (Although it is quite a stretch to call it graph support, working with graphs has been “possible” to some extent in SQL using recursive self-JOINs for a while.)
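As an aside, the recursive self-JOIN trick is expressible today as a recursive common table expression, which most relational databases support. A minimal example using SQLite (which has shipped `WITH RECURSIVE` since version 3.8.3) finds all nodes reachable from a starting node:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    INSERT INTO edges VALUES ('a','b'), ('b','c'), ('c','d'), ('b','d');
""")

# Transitive closure from node 'a': the CTE repeatedly self-JOINs the
# edge table against the rows found so far; UNION deduplicates, which
# also guarantees termination on cyclic graphs.
rows = conn.execute("""
    WITH RECURSIVE reachable(node) AS (
        SELECT 'a'
        UNION
        SELECT e.dst FROM edges e JOIN reachable r ON e.src = r.node
    )
    SELECT node FROM reachable ORDER BY node;
""").fetchall()

nodes = [r[0] for r in rows]
print(nodes)  # -> ['a', 'b', 'c', 'd']
```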

Then came NoSQL and “Big Data”

To be clear, “big data” existed long before the term was coined, but it was not a concern for tech companies until startups needed an affordable way to manage their larger-than-average datasets. For example, at the BaBar detector in the Stanford Linear Accelerator Center, 250 million collisions were happening every second (roughly 8.75 Terabytes per second) before there was Hadoop or Google’s MapReduce. But the data was analyzed on the fly, and only about 5 events per second were stored on disk. Although not at particle physics scale, datasets have also been big in the field of astronomy. For example, the Sloan Digital Sky Survey (2008) was a 40 TB dataset, Pan-STARRS will collect 10 TB of data every night, and the Large Synoptic Survey Telescope (LSST) will collect 30 TB per night when it is completed. But none of this was interesting to startups.

First it was the large e-mail services (e.g. Hotmail) and the appetite to index the entire Internet, which was growing at an amazing rate. Then came social networks, which led to an urgent need to manage large amounts of data at low cost, and that started the NoSQL movement. Neither horizontal scalability nor unstructured data were RDBMS strengths. The easiest way to achieve both was to use lots of cheap hardware (more disks = more IO). Making sacrifices on ACID was one of the first steps toward the desired shared-nothing scalability on commodity hardware.

This is because, to ensure Atomicity, Durability and Isolation over a distributed transaction, all participating machines must be in agreement at commit time, which requires holding locks (how many depends on the isolation level). Two-phase commit requires multiple network round-trips between all participating machines, and as a result the time required to run the protocol is often much greater than the time required to execute the local transaction logic. With commodity networks and geographically distributed data centers, the numbers add up quickly.
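A toy sketch of two-phase commit makes the cost visible: one round of prepare/vote messages, then one round of commit/abort messages, with every participant holding its locks in between. (The class and method names here are illustrative, not from any particular system.)

```python
class Participant:
    """One node holding a piece of the distributed transaction."""
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):            # phase 1: vote and hold locks
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit):     # phase 2: apply decision, release locks
        self.state = "committed" if commit else "aborted"

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]   # network round-trip #1
    decision = all(votes)                          # one "no" aborts everyone
    for p in participants:                         # network round-trip #2
        p.finish(decision)
    return decision

ok = two_phase_commit([Participant(), Participant()])
print(ok)  # -> True

failed = two_phase_commit([Participant(), Participant(can_commit=False)])
print(failed)  # -> False
```

Every call above is a network round-trip in a real deployment, which is why the protocol often dwarfs the local transaction logic it protects.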

An alternative to ACID is BASE (Basic Availability, Soft state, Eventual consistency). You’ll remember Consistency from ACID; eventual consistency is a more relaxed version of the same rule, i.e. rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state. For systems in which it’s OK to serve stale data and give approximate answers, this can be a reasonable trade-off, and it is what most NoSQL systems do.

Is NoSQL going to dethrone RDBMS?

NoSQL is evolving at an arguably even faster pace than RDBMS did and is getting better every day, but let’s dig in a lot more before we start speculating about this question.

In early NoSQL systems, queries were written in languages like Java or C# or in domain-specific languages like Sawzall (CouchDB and MongoDB still do this), but it wasn’t long before query languages that resemble SQL, such as HiveQL, Pig Latin, Microsoft’s SCOPE and DryadLINQ (a .NET abstraction), started to emerge. Still, today most NoSQL systems rely on proprietary APIs or SQL-like languages, and no standard language or API exists.

While initially these systems were meant for batch processing as opposed to running real-time queries, Google’s Dremel, Cloudera’s Impala and Salesforce’s Phoenix reduce response times significantly, making the latter possible. However, they are still not ideal as operational systems that involve large numbers of simultaneous reads and writes. Another weakness of these systems is JOINs: if supported at all, one side of the JOIN is typically required to fit in memory. For example, Google’s BigQuery service has an 8 MB limit.

Denormalized, or more appropriately non-normalized, data is one of the important reasons for the speed and scalability of NoSQL systems. But the lack of JOINs requires data access patterns to be known at design time, i.e. the person publishing the data is responsible for bundling things together in a way that is usable. Users no longer have the flexibility to easily bring together data/columns from different “tables” inside a query. To support different but overlapping user scenarios, different datasets replicating the same data will need to co-exist. The most obvious consequence is the need for more storage. The other is dealing with potential inconsistencies resulting from inserts, updates and deletes: if changes are made in one copy, all the other datasets containing the same data will need to be modified.
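The update problem is easy to demonstrate. In the sketch below (with hypothetical datasets), a product name is embedded in every document that uses it, so a rename must touch every copy; missing one leaves the copies inconsistent:

```python
# Two denormalized datasets that both embed the product name, as a
# document store without JOINs would require.
orders = [
    {"order_id": 1, "product_id": 42, "product_name": "Widget"},
    {"order_id": 2, "product_id": 42, "product_name": "Widget"},
]
recommendations = [
    {"user": "bob", "product_id": 42, "product_name": "Widget"},
]

def rename_product(product_id, new_name):
    # Every dataset embedding the name must be updated in lockstep;
    # skip one and the copies disagree (an update anomaly). A normalized
    # schema would need a single UPDATE on one products table instead.
    for dataset in (orders, recommendations):
        for doc in dataset:
            if doc["product_id"] == product_id:
                doc["product_name"] = new_name

rename_product(42, "Widget Pro")
print({doc["product_name"] for doc in orders + recommendations})  # -> {'Widget Pro'}
```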

Of course, while all of this was happening, the relational world wasn’t resting on its laurels.

Enter NewSQL

The NewSQL movement started with the goal of offering NoSQL scalability for OLTP workloads with ACID guarantees. Drawn to Scale’s Spire, Stonebraker’s VoltDB (in-memory), as well as optimized MySQL engines such as ScaleDB, Akiban, TokuDB, MemSQL, Clustrix and NuoDB (previously NimbusDB), are some notable names in the space that are commercially available or can be downloaded for free. Google also developed a distributed RDBMS named F1 to provide the backend for its ad business, which was surprising news to many, since Google is where NoSQL essentially started.


Brewer’s CAP theorem (2000) was what a lot of NoSQL implementations used to justify the sacrifice of consistency. But NewSQL systems, and even recent NoSQL systems, contest this notion. In 2012, Brewer revised/clarified his theorem to address misunderstandings about some of the terms. The modern interpretation is: during a network partition, a distributed system must choose either Consistency or Availability, but choosing consistency doesn’t mean that the database will become unavailable to clients. In essence, the scenario is not much different from machine failure in fault-tolerant systems: some machines will be unable to execute writes while the database and the application using it remain up (in most cases).

FoundationDB and Google’s Spanner are two NoSQL systems that provide ACID guarantees.

From Spanner paper:

“We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”

When you try to do this at Google scale (in the number of data centers and the distance between them), the problem gets more complicated. Spanner accomplishes this using time-based serialization of events, partial locking and synchronous replication. Serialization of requests is a major problem at global scale, since synchronizing time within and between data centers is difficult due to clock drift. Google minimizes the uncertainty using atomic clocks and GPS while “embracing” it using a Paxos-replicated state machine with a time-based leasing protocol.
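The mechanism, greatly simplified: Spanner’s TrueTime reports time as an uncertainty interval rather than a single instant, and a transaction “commit waits” until its chosen timestamp is guaranteed to be in the past on every clock. The sketch below (illustrative names and microsecond units, not Spanner’s actual API) shows the arithmetic:

```python
# A TrueTime-style clock returns an interval [earliest, latest] because
# of clock drift. A transaction takes t_commit = latest, then waits until
# even the slowest clock agrees t_commit has passed before making its
# writes visible, so timestamp order matches real-time order.

def commit_timestamp_and_release(true_time_us, uncertainty_us):
    earliest = true_time_us - uncertainty_us   # interval lower bound
    latest = true_time_us + uncertainty_us     # interval upper bound
    t_commit = latest
    # Commit wait: the writes become visible only after the whole
    # uncertainty window has elapsed past t_commit.
    release_at = t_commit + uncertainty_us
    return t_commit, release_at

# Roughly 7 ms of clock uncertainty, the ballpark the Spanner paper cites.
t_commit, release_at = commit_timestamp_and_release(1_000_000, 7_000)
print(t_commit, release_at)  # -> 1007000 1014000
```

The wait is short but unavoidable, which is one reason Spanner’s read/write transactions are noticeably slower than its reads.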

Spanner seems to be targeted at classic 3-tier applications, specifically those that can accept mean latencies in the 1 to 100 milliseconds range. Typical read-only transactions are near 10 milliseconds, while read/write transactions are 8.3 to 11.9 times slower.

Google's F1 (the new relational database mentioned earlier) relies on sharded Spanner servers.

So ACID is back, and SQL is becoming the supported query language for many new systems (F1, Phoenix, Spire). There are also tools for many different tasks on NoSQL systems, e.g. Mahout for machine learning, Kafka and Flume for streaming/log data, Pegasus for graph analysis, and Sqoop for data movement. Is it time to jump on the NoSQL bandwagon?

There’s no correct answer.

If you're adventurous and have an interest in new technologies, definitely.

If your data is unstructured (or has a special structure, e.g. a graph), if you see no benefit in imposing a tabular schema on it (or don't think it is possible), and, more importantly, if you want to cut costs by relying on commodity hardware and open-source software to manage data beyond a few terabytes, then of course. For startups, it is not hard to see why the NoSQL route would be tempting.

But if scale is the only concern, relational systems (even off-the-shelf) can and do scale to tens of terabytes, despite all the noise implying the contrary. They're field-tested, give users the flexibility to assemble datasets at query time (via JOINs), and take care of issues like referential integrity for you.

Regardless, RDBMSs are here to stay in one shape or another. The technologies may change (e.g. F1 is relational but built on Spanner, which is NoSQL), but the principles will prevail because of the value they provide. Even if we reach a point where relational systems are not the choice for the average enterprise operation or web-app backend, they'll be running on your phone, MP3 player, TV, and gaming console (SQLite is the most widely deployed database engine in the world), helping you with daily tasks without you knowing.

Big data

Who cares about science? “Big Data” will solve everything ;)

Some of you may remember Chris Anderson's article titled "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete." You may think it was published in The Onion, but no, it was in Wired. It is a fun read for sure. He uses three examples to make his case. None of them made sense then, nor do they make sense now.

The first example relates to particle physics. As you read it, keep in mind the roughly $6 billion spent on the Large Hadron Collider (LHC), a particle accelerator built to test the predictions of our fundamental theories. The article was written four months before the LHC's inauguration.

“Faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete….the energies are too high, the accelerators too expensive, and so on. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.”

The second example is related to web search. As you read this one, keep in mind the investments search engines are making and the direction the web is headed, such as Microsoft's acquisition of Powerset and Google's acquisitions of Applied Semantics and MetaWeb.

“Google’s founding philosophy is that we don’t know why this page is better than that one: If the statistics of incoming links say it is, that’s good enough. No semantic or causal analysis is required. ”

The third example is related to biology, in particular sequence alignment.

“A sequence may correlate with other sequences that resemble those of species we do know more about. In that case, Venter can make some guesses about the animals — that they convert sunlight into energy in a particular way, or that they descended from a common ancestor. But besides that, he has no better model of this species than Google has of your MySpace page.”

Really? I thought they knew the function of the protein they're matching against, which is more than what Google knows about my MySpace page using its PageRank algorithm. And how do they know the function? Ah, the scientific method, of course. Since when did comparing something you know with something you don't become a new thing? New technologies for handling more data only made the existing processes faster, more scalable, and as a result more feasible.

So why am I looking at this article, written almost four years ago, again?

I read a recent interview with Vivek Ranadivé, the CEO of TIBCO, a company squarely in the "big data" market as a provider of analytics, visualization, and complex event processing (CEP) solutions. In the interview he claims that "science is dead." He continues:

“I believe that math is trumping science. In the example of dropped calls, I don’t really need to understand human behavior or psychology, I just need to detect patterns. The pattern tells me that six dropped calls is the key number. Why it’s not eight, I don’t know. I don’t need to know. You just need to know that A plus B will lead to C. I can solve just about every problem in the world with that approach.”

which is very much along the lines of Chris Anderson's article.

This is a very problematic point of view, especially when voiced by people who are likely to be influential.

Here is a pattern. We need more pirates to solve the global warming problem :)

There are multiple sides to this. The first question is: how much data do you need? At some point people believed the stars and the planets rotated around a fixed Earth. People used to see faces on Mars. It took a better understanding of how the solar system worked, or more data and less noise (e.g. a higher-quality image of Mars), to figure out what was really going on.

If I have a petabyte of data, it takes up a lot of space on disk, but does that mean it is enough to solve my problem? With a better understanding of the problem, you can answer this question much more reliably. If you have a hypothesis, you know what other data you need to verify it, which could lead you to a better result, or at least make you aware of the uncertainty in your results or the incompleteness of your analysis.

The second question is: how good is your data? With no real understanding or reasoning, just "letting the data speak for itself," it is very easy to overfit and end up with terribly inaccurate predictions. It is also easy to discard real, meaningful data points as errors or outliers. And you'd be blindly betting that past correlations in the data will hold up in the future.
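The overfitting risk is easy to demonstrate: given enough parameters, a model will fit the noise in a sample perfectly and then predict nonsense outside it. A small sketch with synthetic data (the data and model choices here are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 8)
y = 2 * x + rng.normal(0, 0.1, size=8)   # the true relationship is linear

# "Let the data speak": a degree-7 polynomial passes through all 8 points...
overfit = np.polynomial.Polynomial.fit(x, y, deg=7)
# ...while a linear fit captures the actual structure.
linear = np.polynomial.Polynomial.fit(x, y, deg=1)

# In-sample, the overfit model looks flawless. Out of sample (x = 2,
# where the true value is 4), it is far worse than the simple model.
assert abs(overfit(2.0) - 4.0) > abs(linear(2.0) - 4.0)
```

The flexible model "speaks for the data" perfectly and is still the wrong model; only a prior idea of the underlying structure tells you which fit to trust.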

The third question is related to the principle of reflexivity in social theory. Once you react based on data and make changes to a system, will your straw-man model still be valid? Your behavior may affect the system in a way that invalidates your observations or leads to unexpected results, both of which are more likely when your thinking relies only on the available data and skips the crucial question: "Why?"

I could go on, but you get the point.

"The numbers have no way of speaking for themselves. We speak for them," political forecaster Nate Silver writes in his book, The Signal and the Noise: Why So Many Predictions Fail — But Some Don't. I'll finish with his words.

“Data-driven predictions can succeed — and they can fail. It is when we deny our role in the process that the odds of failure rise. Before we demand more of our data, we need to demand more of ourselves.”

Big data, Mapping

How “big” data can help…

An interesting article in the WSJ covers big data and its uses in tackling issues in developing countries, such as malaria. Having worked on a similar project, I wish Jake Porway the best of luck.

I think they're off to a very good start. Malaria is a well-defined, narrowly scoped, but highly visible problem, and for these kinds of projects visibility is one of the most important things, as it determines the project's longevity, scope, and support (funding, volunteers). It is a great place to start, not only because it saves lives but also because it will help the project last longer and tackle many other problems in the long run.

I see the use of buzzwords like "big data" the same way. In most cases the data needed to solve a particular problem won't be anywhere near even a terabyte. For example, here is a quote from the article: "It takes about 600 trillion pixels to cover the surface of the earth." While that is a storage problem, it is not a big-data analysis issue. Data analysis is not done at global scale for the problems discussed in the article, and images are tiled, so one can easily pick out a very small subset.

I think the real issue here is data, not big data.

Here is a quote from the article that I completely agree with: "Democratization of data is a real issue, and people do try to protect data for good reasons, or bad. But once they have seen the value their data can generate when combined with other sources, then the walls start to crumble."

Many of these datasets will be very small.

The tough problem is not dealing with large datasets. It is developing a sense of community and, more importantly, making data contribution frictionless and disparate datasets useful. Once people start contributing datasets, the problem becomes their use of different formats, different terminologies, different temporal/spatial resolutions, different measurement units, different projections (in the case of the maps/imagery discussed in the article), and insufficient documentation.
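A toy sketch of the kind of normalization glue such a project inevitably ends up writing; the field names, date formats, and units below are hypothetical:

```python
from datetime import datetime

# Two contributed rainfall datasets: one in millimetres with ISO dates,
# one in inches with US-style dates. All names here are made up.
source_a = [{"date": "2012-06-01", "rain_mm": 12.7}]
source_b = [{"day": "06/02/2012", "rain_in": 0.25}]

def normalize_a(rec):
    return {"date": datetime.strptime(rec["date"], "%Y-%m-%d").date(),
            "rain_mm": rec["rain_mm"]}

def normalize_b(rec):
    return {"date": datetime.strptime(rec["day"], "%m/%d/%Y").date(),
            "rain_mm": rec["rain_in"] * 25.4}  # inches -> millimetres

combined = [normalize_a(r) for r in source_a] + [normalize_b(r) for r in source_b]
assert abs(combined[1]["rain_mm"] - 6.35) < 1e-9  # 0.25 in = 6.35 mm
```

Multiply this by every format, unit, projection, and resolution mismatch across contributors, and the harmonization work quickly dwarfs any "big data" concern.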

For such a project to deliver the best value, the outcome should be more than the sum of its parts. Given a collection of datasets, different experts can use them to solve different problems. An environmental scientist would look at a different combination of the datasets, from a different angle, than a climatologist would, and solve a different problem. Contributors of some of those datasets may have nothing to do with those domains and may never have imagined their data would be used to address these problems. So the eventual goal should be to give people a large collection of datasets they can easily discover, browse, and analyze for problems they define; to let them form communities within the framework around common goals; and to have them evangelize and spread the word within their own professional circles.

And that is the real issue here. Not the size of the datasets.

Once again, best of luck.