Running an advertising company, collecting clickstream data, or operating a big social network application and thinking about ways to store, access, manipulate and monetize the collected data? This article is for you.
Choosing the right tools and making the right compromises is the key to success. In this post, I will share the set of tools and typical use cases for the different levels of the big data architecture stack.
Before drilling down into the stack, let's talk about Linearness.
Are you Linear?
In my last post I wrote about what I call Linearness. To make a long story short, the concept of Linearness comes from the simple idea that a company dealing with big data should be able to play on a three-axis matrix – cost, performance and accuracy.
If your architecture is linear at its various levels, you've got to be doing something right. For example, if your architecture enables you to easily throw more machines at the problem and thus linearly improve performance, or to reduce hardware cost by compromising on the accuracy of the results (sampling), you are in a good position.
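To make the accuracy-for-cost trade-off concrete, here is a minimal sketch (my own toy illustration, not from any specific product) of estimating an aggregate from a sample: lowering the sample rate cuts the work roughly linearly, at the price of a noisier answer.

```python
import random

def estimate_sum(records, sample_rate=0.1, seed=42):
    """Estimate the sum of a numeric field from a random sample.

    A lower sample_rate means less work (lower cost) but a less
    accurate estimate -- the linearness trade-off in miniature.
    """
    rng = random.Random(seed)
    sample = [r for r in records if rng.random() < sample_rate]
    if not sample:
        return 0.0
    # Scale the sampled sum back up by the inverse of the sampling rate.
    return sum(sample) / sample_rate

clicks = [1] * 1_000_000  # one click event per record
print(estimate_sum(clicks, sample_rate=0.01))  # close to 1,000,000
```

Touching 1% of the records here gets you within a few percent of the true total; whether that error is acceptable is exactly the business decision the three-axis matrix describes.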
The Big data building blocks
In the big data realm, there are three typical use cases that keep coming up at most companies dealing with big data – massive data processing, ad hoc reporting and real-time access.
- Massive data processing – typically ETL processes, data mining algorithms or the calculation of pre-defined aggregations. In the “old” world, we used tools like Informatica, SAS or DB-level queries. These approaches are far from linear – long processing cycles and expensive hardware costs made us compromise on functionality. Hadoop in its different forms (HDFS, HBase, Hive and MapReduce) is the right tool for these scenarios.
- Ad hoc reporting – an analyst's wildest dream (in terms of data) is to run aggregative queries on raw data, over any time range, with any filter. Over the years we used to call this problem OLAP. Since linear architectures are new, the existing technologies led us to compromise on pre-defined aggregations only. Today's new tools like Vertica, Greenplum, Aster Data and others open up new possibilities for linear-scale ad hoc reporting systems.
- Real-time access – when it comes to delivering big data to a big audience, the need for a fast and reliable fetching tool arises. Cassandra, HBase and other key-value stores play well here.
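The massive-processing block above usually boils down to MapReduce-style aggregation. Here is a tiny in-process sketch of the pattern (plain Python, not the actual Hadoop API; the log format and campaign names are made up for illustration):

```python
from collections import defaultdict

def map_phase(log_lines):
    """Map step: parse raw click-log lines and emit (key, value) pairs --
    here, one count per campaign."""
    for line in log_lines:
        campaign, _user, _timestamp = line.split(",")
        yield campaign, 1

def reduce_phase(pairs):
    """Reduce step: aggregate all values by key, as a Hadoop reducer would."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

log = ["camp1,u1,100", "camp2,u2,101", "camp1,u3,102"]
print(reduce_phase(map_phase(log)))  # {'camp1': 2, 'camp2': 1}
```

The point of Hadoop is that both phases run in parallel across many machines over HDFS blocks, which is what makes the throughput scale roughly linearly with hardware.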
The Big data architecture
Big data companies usually require the above three blocks (blue in the chart) in order to support the three big data use cases. Here's a typical architecture chart for such organizations:
- On the top left: the customer-facing tools – content or ad servers. These servers are the barrier between the end user and the backend technology. They generate tons of data, which is persisted into HDFS. In many cases they also consume real-time data from the key-value store.
- Hadoop is the core of the data in the organization. Every piece of data is stored there. MapReduce jobs constantly extract and aggregate the data into two typical forms – aggregated by key into the key-value store, and raw data extracted into the analytic warehouse. Make sure your jobs are delta-aware – there is no point reprocessing what you've just finished.
- The key-value store is where you keep your delivery-oriented data – user profiles, product metadata, conversion information, etc. When serving content, this database is king.
- The analytic data warehouse is where the gold, in its most polished form, is stored. The data here is usually somewhat (if not entirely) denormalized, or nicely separated into a star-schema form. When analysts look at the schema, they immediately get it, because it speaks the language the business people understand.
- For reporting, many BI tools work nicely on top of the analytic DB – starting with open source like Pentaho, continuing to sophisticated Excel-like tools such as Tableau, up to high-end solutions like MicroStrategy. These queries are usually combined with your OLTP database, where you store your metadata.
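A delta-aware batch job can be as simple as tracking which input partitions have already been processed and skipping them on the next run. A toy sketch (hypothetical partition keys and record format; real jobs would persist the processed set, e.g. in HDFS or a metadata store):

```python
def process_deltas(partitions, processed, totals):
    """Aggregate values per key, skipping partitions already processed."""
    for day, records in sorted(partitions.items()):
        if day in processed:
            continue  # delta-aware: skip work finished in a previous run
        for key, value in records:
            totals[key] = totals.get(key, 0) + value
        processed.add(day)
    return totals

processed, totals = set(), {}
# The first run sees one day of logs; the second run sees both days,
# but only the new day actually gets processed.
process_deltas({"2011-01-01": [("camp1", 3)]}, processed, totals)
process_deltas({"2011-01-01": [("camp1", 3)],
                "2011-01-02": [("camp1", 2)]}, processed, totals)
print(totals)  # {'camp1': 5} -- day one was not re-counted
```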
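To show why a star schema "immediately gets" understood, here is a minimal sketch using SQLite in place of a real analytic warehouse (the table and campaign names are invented for illustration): one fact table of clicks, one dimension table, and the kind of ad hoc aggregation a BI tool would issue.

```python
import sqlite3

# Toy star schema: a fact table joined to a dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_campaign (campaign_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_clicks (campaign_id INTEGER, day TEXT, clicks INTEGER);
    INSERT INTO dim_campaign VALUES (1, 'spring_sale'), (2, 'brand');
    INSERT INTO fact_clicks VALUES (1, '2011-01-01', 120),
                                   (1, '2011-01-02', 80),
                                   (2, '2011-01-01', 40);
""")
# An ad hoc query: total clicks per campaign, any filter or time range
# could be bolted on without pre-defined aggregations.
rows = conn.execute("""
    SELECT d.name, SUM(f.clicks)
    FROM fact_clicks f JOIN dim_campaign d USING (campaign_id)
    GROUP BY d.name ORDER BY d.name
""").fetchall()
print(rows)  # [('brand', 40), ('spring_sale', 200)]
```

In a linear-scale warehouse like Vertica or Greenplum the same query shape runs over billions of fact rows; the schema, not the engine, is what makes it readable to the business side.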
The above architecture might look like a total waste of money and tons of duplication. While the duplication claim is correct, every block here is needed to cover the three use cases.
One last note – there is an important movement now towards one-size-fits-all databases, but it is still too early to draw conclusions there.