Big data

ACID strikes back

At the root of most technological advances is an unaddressed need…a problem to which a good solution does not exist. The accidental discoveries (Penicillin, Saccharin, Teflon, Microwave, Pacemaker, X-Ray…) although some are quite impactful don’t happen very often, especially in the domain of computer science. It is safe to say that the process is evolutionary, and it is very easy to see the some form of what we may call heredity, recombination, gene flow, adaptation and/or extinction in action.

Take  RDBMS and NoSQL for example. Both solutions came out to address certain needs, started with solving an immediate problem and over time evolved by adapting to new requirements, cross-pollinating with other (sometimes competing) technologies to prevail. Navigational DBMS came out in 1960s followed by the emergence of relational DBMS in 1970s. It was obvious that a set-oriented language was more suitable for retrieving records than dealing with loops, leading to SQL; a word that later became almost synonymous with RDBMS.

ACID (Atomicity, Consistency, Isolation, Durability) was a core piece as it dealt with the complexities such as race-conditions (a problem with multi threaded/multi user transactional systems) a data worker wouldn’t want to deal with, further helping with the mass adoption of the technology. In 1980s distributed databases had already started appearing.

When Object Oriented systems (OODBMS) entered the stage in the 1980s, relational database vendors reacted by adding similar features (e.g. user defined types/functions) and maintained their dominance.

Native XML databases of the 2000s shared a similar fate. Today most relational databases are XML-enabled (XML as data type and query using XPath/XQuery) and even built in features like geospatial support rely on custom objects (polygons, lines etc.) and methods (distance, overlap, centroid) legacy of object oriented systems. Not to mention FileStream and FileTables, full-text search with thesauri, stop words and stemming to react to emerging need for unstructured data and document handling.  Of course it was not only the adaptations. Moore’s law was also on the side of RDBMS as more CPU power meant, less need for special-purpose solutions.

More recently RDBMS adding better support for graphs (e.g. IBM DB2) while specialized solutions such as Neo4j or InfiniteGraph also exist. (Although it is quite a stretch to call it graph support, working with graphs has been “possible” to some extent in SQL using recursive self-JOINs for a while.)

Then came NoSQL and “Big Data”

To be clear “big data” existed long before the term was coined but was not a concern for tech companies until startups needed an affordable way to manage their larger than average datasets. For example at BaBar detector in Stanford Linear Accelerator Center, 250 million collisions were happening every second (roughly 8.75 Terabytes per second) before there was Hadoop or Google’s Map Reduce. But data was analyzed on the fly and only about 5 events per second were stored on disk. Although not at particle physics scale, datasets have also been big in the field of astronomy. For example Sloan Digital Sky Survey (2008) was a 40 TB dataset. Pan-STARRS will collect 10 TB of data every night while Large Synoptic Survey Telescope (LSST) will collect 30 TB per night when it is completed. But it wasn’t anything interesting for startups.

First it was the large e-mail services (e.g. Hotmail) and the appetite t0 index the entire Internet which is growing at an amazing rate. Then came social networks… which led to an urgent need to manage large amount of data at low cost which started the NoSQL movement. Neither horizontal scalability nor unstructured data were RDBMS’ strengths.  Easiest way to achieve this was using lots of cheap hardware (more disks = more IO). Making sacrifices on ACID  was one of the first steps to  achieve the desired shared-nothing scalability on commodity hardware.

Because to ensure Atomicity, Durability and Isolation over a distributed transaction all participating machines should be in an agreement at commit time which requires holding locks (number of locks depends on isolation level). Two-phase commit requires multiple network round-trips between all participating machines, and as a result the time required to run the protocol is often much greater than the time required to execute the local transaction logic. With commodity network and geographically distributed data centers, numbers add up quicker.

An alternative to ACID is BASE (Basic Availability, Soft-state, Eventual consistency). You’ll remember consistency from ACID. Eventual consistency is a more relaxed version of the same rule i.e. rather than requiring consistency after every transaction, it is enough for the database to eventually be in a consistent state. For systems in which it’s OK to serve stale data and give approximate answers, this could be a reasonable trade-off which is what most NoSQL systems do.

Is NoSQL going to dethrone RDBMS?

NoSQL is evolving at an arguably even faster pace than RDBMS did and getting better and better every day but let’s dig in a lot more before we start speculating about this question.

In early NoSQL systems queries were written in languages like Java or C# or domain specific languages like Sawzall (CouchDB and MongoDB still do this) but it wasn’t long before the query languages that resemble SQL such as HiveQL, Pig Latin, Microsoft’s SCOPE and DryadLINQ (.NET abstraction) started to emerge.  Still today most NoSQL systems rely on proprietary APIs or SQL-like languages and no standard language or API exists.

While initially these system were meant for batch processing as opposed to running real-time queries against, Google’s Dremel, Cloudera’s Impala and SalesForce’s Phoenix reduce the response times significantly making this possible. However they are still not ideal as operational systems that involve large number of simultaneous reads and writes. Another weakness of these systems is JOINs, if at all supported it is required that one side of the JOIN fit in memory. For example Google’s BigQuery service has an 8 MB limit.

Denormalized or more appropriately; non-normalized data is one of the important reasons for the speed and scalability of NoSQL systems. But lack of JOINs requires data access patterns to be known at design time i.e. person publishing the data is responsible to bundle things together in a way that is usable. User no longer has the flexibility to easily bring together data/columns from different “tables” inside a query. To support different but overlapping user scenarios, different datasets replicating the same data will need to co-exist. Of course the most obvious consequence is the need for more storage.  The other consequence is dealing with potential inconsistencies resulting from inserts, updates and deletes since if changes are made in one copy, all the other datasets containing the same data will need to be modified.

Of course while all of this was happening, relational world wasn’t resting on its laurels.

Enter NewSQL

NewSQL movement started with the goal to offer NoSQL scalability for OLTP workloads with ACID guarantees. Drawn to Scale’s Spire, Stonebraker’s VoltDB (in-memory) as well as optimized MySQL engines such as ScaleDB, Akiban, TokuDB, MemSQL, Clustrix and NuoDB (previously NimbusDB) are some notable names in the space that are commercially available or can be downloaded for free.  Google also developed a distributed RDBMS named F1 to provide the backend for its ad business which was surprising news to many, since it is where NoSQL essentially started.


Brewer’s CAP theorem (2000) was what a lot of NoSQL implementations used to justify the sacrifice of consistency. But NewSQL systems and even recent NoSQL systems contested this notion. In 2012 Brewer revised/clarified his theorem to address misunderstandings about some of the terms.  The modern interpretation is: during a network partition, a distributed system must choose either Consistency or Availability but choosing consistency doesn’t mean that the database will become unavailable for clients. In essence scenario is not much different from machine failure for fault-tolerant systems. Some machines will be unable to execute writes while the database and the application using it will remain up (in most cases).

FoundationDB and Google’s Spanner are two NoSQL systems that provide ACID guarantees.

From Spanner paper:

“We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions.”

When you try to do this at Google scale (number of data centers and distance between them), the problem gets a bit more complicated. Spanner accomplishes this using time-based serialization of events, partial locking, and synchronous replication. Serialization of requests is a major problem at global scale since synchronizing time within and between datacenters is difficult due to clock drift. Google tries to minimize uncertainty using atomic clocks and GPS while “embracing” it using a Paxos replicated state-machine with time-based leasing protocol.

Spanner seems to be targeted at classic 3-tier applications, specifically those that can accept mean latencies in the 1 to 100 milliseconds range. Typical read-only transactions are near 10 milliseconds, while read/write transactions are 8.3 to 11.9 times slower.

Google’s F1 (new relational database we talked about earlier) relies on sharded Spanner servers.

So ACID is back and SQL is becoming the supported query language for many new systems (F1, Phoenix, Spire). There are tools to accomplish many different tasks on NoSQL systems e.g. Mahout for machine learning and Kafka, Flume for streaming/log data, Pegasus for graph analysis, Sqoop for data movement… Is it time to jump on the NoSQL bandwagon?

There’s no correct answer.

If you’re adventurous and have interest in new technologies, definitely.

If your data is unstructured (or has special structure e.g. graph) and you don’t see any benefit (or think it is possible) to impose a tabular schema on it and more importantly if you want to cut costs by relying on commodity hardware and open source software to manage data beyond a few Terabytes, of course. For startups, it is not hard to see why it would be tempting to go the NoSQL route.

But if scale is the only concern, relational systems (even off-the-shelf) can and do scale to tens of Terabytes despite all the noise implying the contrary.  They’re field-tested, give user the flexibility to assemble datasets (via JOINs) at query time and take care of issues like referential integrity for you.

Regardless, RDBMS are here to stay in one shape or another. The technologies may change (e.g. F1 is relational but built on Spanner which is NoSQL) but principles will prevail because of the value they provide. Even if we hit a point where relational systems are not the choice for the average enterprise operation or backend for web apps,  they’ll be running on your phone, MP3-player, TV, gaming console… (SQLite is the most widely deployed database engine in the world) helping you with daily tasks without you knowing.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s