Tuesday, May 19, 2009

Database Sharding - the art of scaling the data layer based on a shared nothing architecture

One of the biggest challenges when scaling an application is how to scale the data layer. Getting bigger and stronger machines is yesterday’s news; it’s the era of commodity hardware again. The main problem is that a single machine can’t hold all your data, and a SAN is just too damn expensive.

We used to call it horizontal partitioning for ages, but the world now calls it sharding. (Horizontal vs. vertical partitioning will be covered in another post. Or not.)

The biggest question to ask when sharding data is: which record goes to which machine? How do we shard the data?

This question is a key factor to the system stability and scalability.

The no-no way - shard by low cardinality dimension

The first and simplest way to shard the data is by splitting it based on categories, geographic location, or any other low cardinality dimension.

Let’s take eBay for example:

eBay handles billions of transactions per month, a data intensive system for sure. Queries mostly run on a single product category (digital cameras, for example).

Some DBAs would install a database node per category (or bunch of categories) and shard the data horizontally by category name.
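A minimal sketch of what this routing looks like, with made-up category and node names (this is the anti-pattern, not a recommendation):

```python
# Hypothetical category-to-node routing -- the "no-no" way.
# Every query for a category lands on that category's single node.
CATEGORY_NODES = {
    "digital cameras": "db-node-0",
    "books": "db-node-1",
    "clothing": "db-node-2",
}

def node_for_category(category: str) -> str:
    """Route every query for a category to its dedicated node."""
    return CATEGORY_NODES[category]
```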


Why? Because it’s easy, because it’s logical, and mostly because they can fine-tune the queries in the nice little administration console they are so used to.

When do the problems begin?

When the December holidays arrive and the system gets a load of requests for digital cameras. All of a sudden, a single machine is not enough.

This is when startups face their first conflict of interests – the business grows fast, but the R&D guys wish to hold it back and chase their tails until they solve the scaling problems.

This way of sharding is the perfect way to end up with an unbalanced and inconsistent system.

The wrong way – fixed sharding by high cardinality dimension

So, category name is obviously not the optimal separator.

Another alternative is to shard by a high cardinality dimension (user id, for example) using a modulo or a range function.

A modulo function could look something like:


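A minimal sketch of such a modulo function, with a hypothetical list of node names:

```python
# Fixed modulo sharding: the node list is baked in, so adding or
# removing a node reshuffles almost every key (the "inflexible" part).
NODES = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def shard_for(user_id: int) -> str:
    """Map a user id to a database node with a modulo function."""
    return NODES[user_id % len(NODES)]
```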
The good: it’s balanced.

The bad: it’s expensive at the beginning (big infrastructure, low usage) and inflexible at the end. It sucks.

The common way – dynamic sharding by index table

A third alternative is based on an index table. Something like this:


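A hypothetical index table might map user-id ranges to nodes; in practice it would live in its own (heavily cached) database table. A sketch:

```python
# Hypothetical index table: each user-id range is mapped to one node.
INDEX_TABLE = [
    (0,       999999,  "db-node-0"),
    (1000000, 1999999, "db-node-1"),
    (2000000, 2999999, "db-node-2"),
]

def node_for(user_id: int) -> str:
    """Look up which node owns a given user id."""
    for low, high, node in INDEX_TABLE:
        if low <= user_id <= high:
            return node
    raise KeyError("no shard configured for user %d" % user_id)
```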
This way, when reading or writing a record, the client must first look at the (cached) index table to decide which node to run the query on.


  1. The index table might become a bottleneck.
  2. A precise design needs to be done in advance; you need to know how much load each machine can handle.
  3. Changing the index table is difficult, and will usually require system shutdowns.


The interesting way – wrapping index tables with a proxy

Tractors replaced laborers, and clouds replaced DBAs.

In this concept, a proxy layer decouples the application (client) from the database layer, and thus reduces the complexity of manually managing the index tables.

Queries and modification scripts go through the proxy (which encapsulates the sharding function) to the data layer, and the results then return to the client as if they had run on a single dedicated server.
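The idea can be sketched as below; the index table, node names, and `execute` interface are all hypothetical, and a real proxy would open connections and run queries rather than just return a routing decision:

```python
# Hypothetical sharding proxy: the client hands over a key and a
# query; the proxy consults the index table and routes the query.
INDEX_TABLE = [
    (0,       999999,  "db-node-0"),
    (1000000, 1999999, "db-node-1"),
]

class ShardingProxy:
    def route(self, user_id):
        for low, high, node in INDEX_TABLE:
            if low <= user_id <= high:
                return node
        raise KeyError(user_id)

    def execute(self, user_id, query):
        node = self.route(user_id)
        # A real proxy would run `query` on `node` and stream the
        # results back; here we just return the routing decision.
        return (node, query)
```

The client never touches the index table itself, so the table can change behind the proxy without redeploying the application.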


It’s still early, but some projects seem to be growing fast; projects such as HSCALE, Hibernate and HiveDB. Very cool stuff!


An interesting idea – Semi-Random sharding (optimized for insert/read only systems)

There’s a hidden assumption in all the above methods: that all of an entity’s data should live on a single host. For example, all of a specific user’s data should be hosted on one machine (plus a replication slave).

The obvious question would be: why? Says who?

My approach comes with an index table like the one below:


In this case, we give each range a few nodes to “play with”.

When reading, the client queries the proxy, which queries all the relevant nodes in parallel and aggregates the data before returning it to the client (much like a map-reduce task).

When writing, the client (through the proxy) writes the data to a randomly picked available write node.
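The two paths can be sketched as follows, with hypothetical node pools and an in-memory dict standing in for the actual databases:

```python
import random

# Semi-random sharding sketch: writes land on one randomly picked
# write node; reads fan out to every read node and aggregate the
# partial results (sequentially here, in parallel in a real system).
READ_NODES = ["db-0", "db-1", "db-2"]
WRITE_NODES = ["db-0", "db-1"]        # db-2 is a read-only "zombie"

def write(record, storage):
    node = random.choice(WRITE_NODES)  # any available write node
    storage.setdefault(node, []).append(record)

def read(storage):
    results = []
    for node in READ_NODES:            # fan out, then aggregate
        results.extend(storage.get(node, []))
    return results
```

Retiring an overloaded node is just removing it from `WRITE_NODES`; it keeps serving reads.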


This way, when reading, each db node has a small job to do, parallelism is maximized, and thus latency is not affected.

When a single node is overloaded with data, we can simply remove it from the write pool. It continues to function as a read-only “zombie” server.

This approach is useful for insert-and-query-only systems. When data modification is a requirement, this approach might not be the best solution.