Share everything about Technology and Business
Building big stuff with small pieces
Sunday, February 5, 2012
Big Data - Big Opportunities
Consulting giant McKinsey predicts 40% annual growth in the volume of data gathered worldwide. Storage giant EMC estimates that the market for Big Data currently amounts to some 70 billion dollars, and predicts that by 2020 the number of its customers holding more than a petabyte of information will grow to about 100,000, compared to only 1,000 now.
Sunday, October 9, 2011
The BIG data technology stack
Running an advertising company, collecting clickstream data or operating a big social network application, and contemplating ways to store, access, manipulate and monetize the collected data? This article is for you.
Choosing the right tools and making the right compromises is the key to success. In this post, I will try to share the set of tools and typical use cases for the different levels of the big data architecture stack.
Before drilling down into the stack, let’s talk about Linearness.
Are you Linear?
In my last post I wrote about what I call Linearness. To make a long story short, the concept of Linearness comes from the simple idea that a company dealing with big data should be able to play along three axes – cost, performance and accuracy.

If your architecture on its various levels is linear, you gotta be doing something right. For example, if your architecture lets you easily throw more machines at the problem and thus linearly improve performance, or reduce the cost of hardware by compromising on the accuracy of the results (sampling), you are in a good position.
The Big data building blocks
In the big data realm, there are 3 typical use cases that keep recurring at most companies dealing with big data – massive data processing, ad hoc reporting and real-time access.
- Massive data processing – typically refers to ETL processes, data mining algorithms or calculating predefined aggregations. In the “old” world, we used tools like Informatica, SAS or DB-level queries. These approaches are far from linear – long processing cycles and expensive hardware made us compromise on functionality. Hadoop in its different forms (HDFS, HBase, Hive and MapR) is the right tool for these scenarios.
- Ad Hoc reporting – an analyst’s wildest dream (in terms of data) is to run aggregate queries on raw data, over any time range and with any filter. Over the years, we’ve called this problem OLAP. Since linear architectures are new, the existing technologies led us to compromise on predefined aggregations only. Today’s new tools like Vertica, Greenplum, Asterdata and others open up new possibilities for linearly scaling ad hoc reporting systems.
- Real Time access – when it comes to delivering big data to a big audience, the need for a fast and reliable fetching tool arises. Cassandra, HBase and other key-value stores play well here.
The Big data architecture
Big data companies usually require the above 3 blocks (blue in the chart) in order to support the 3 use cases of big data. Here’s a typical architecture chart for such organizations:
- On the top left: the customer-facing tools – content or ad servers. These servers are the barrier between the end user and the backend technology. They generate tons of data, which is persisted into HDFS. In many cases they consume real-time data from the key-value store.
- Hadoop is the core of the data in the organization. Every piece of data is stored there. MapReduce jobs constantly extract and aggregate the data into two typical forms – aggregated by key into the key-value store, and extracted as raw data into the analytic warehouse. Make sure your jobs are delta-aware; there is no point reprocessing what you’ve just finished.
- The key-value store is the place where you store your delivery-oriented data – user profiles, product metadata, conversion information, etc. When serving content, this database is the king.
- The analytic data warehouse is where the gold is stored in its most polished form. The data here is usually somewhat (if not entirely) denormalized, or nicely separated into a star schema. When analysts look at the schema, they immediately get it, as it talks the way business people understand.
- For reporting – many BI tools work nicely on top of the analytics DB, starting with open source tools like Pentaho, continuing to sophisticated Excel-like tools such as Tableau, up to high-end solutions like MicroStrategy. These queries are usually combined with your OLTP database, where you store your metadata.
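To make the "delta-aware" advice for the aggregation jobs a bit more concrete, here is a minimal in-memory sketch in plain Java. It is not Hadoop code: the class `DeltaAggregator`, the `Event` type and the offset-based checkpoint are all hypothetical names chosen for illustration, and the sketch assumes the log is ordered by offset. The idea is simply that each run remembers how far it got and only aggregates what arrived since.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy stand-in for a delta-aware aggregation job; all names are hypothetical. */
final class DeltaAggregator {

    /** One raw log entry: an increasing offset, an aggregation key and a value. */
    static final class Event {
        final long offset;
        final String key;
        final long value;

        Event(long offset, String key, long value) {
            this.offset = offset;
            this.key = key;
            this.value = value;
        }
    }

    private final Map<String, Long> totalsByKey = new HashMap<>(); // plays the key-value store
    private long lastProcessedOffset = -1;                          // checkpoint kept between runs

    /** Aggregates only events newer than the checkpoint; returns how many were processed. */
    int run(List<Event> log) {
        int processed = 0;
        for (Event e : log) {
            if (e.offset <= lastProcessedOffset) {
                continue; // already aggregated by an earlier run
            }
            totalsByKey.merge(e.key, e.value, Long::sum);
            lastProcessedOffset = e.offset; // assumes the log is ordered by offset
            processed++;
        }
        return processed;
    }

    long total(String key) {
        return totalsByKey.getOrDefault(key, 0L);
    }
}
```

Calling `run` twice on a growing log touches each event once, so the cost of a run follows the size of the delta rather than the size of the whole history.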
Wrapping up
The above architecture might look like a total waste of money and tons of duplication. While the duplication statement is correct, every block here is needed to solve the 3 use cases.
One last note – there is an important movement now towards one-size-fits-all databases, but it’s still too early to draw conclusions on that.
Think Linearness!
Wednesday, September 14, 2011
Monday, August 29, 2011
A/B test for banners
Assuming you have two banners, simply fill in the number of impressions and goals (conversions, clicks, leads, registrations, etc...) and click compare:
How does it work?
This calculation is a common statistical test called the chi-square test.
One of its uses is to test the null hypothesis – the hypothesis that a specific treatment has no effect.
In our scenario, we test whether banner #1 is over- or under-performing banner #2 and validate the statistical significance of the result.
This calculation is based on JavaScript; you can view its source (inspired by tom@ucla).
The equivalent Java code looks something like this (using the Apache Commons Math library):
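import org.apache.commons.math.stat.inference.ChiSquareTestImpl; // Commons Math 2.x

public class BannerAbTest {
    public static void main(String[] args) throws Exception {
        long ad1imp = 13146, ad1conv = 72;    // banner #1: impressions, conversions
        long ad2imp = 996324, ad2conv = 6442; // banner #2: impressions, conversions
        // The bins must not overlap: {impressions without a conversion, conversions}
        long[] ad1 = new long[] {ad1imp - ad1conv, ad1conv};
        long[] ad2 = new long[] {ad2imp - ad2conv, ad2conv};
        ChiSquareTestImpl cs = new ChiSquareTestImpl();
        // p-value under the null hypothesis that both banners convert at the same rate
        double p = cs.chiSquareTestDataSetsComparison(ad1, ad2);
        System.out.println(1 - p); // the closer to 1, the stronger the evidence of a real difference
    }
}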
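For readers who want to see what sits behind the library call, here is a rough, dependency-free sketch of a Pearson chi-square statistic for a 2x2 table (banner by converted / not converted). It is an illustration, not the code behind the calculator: the class name `ChiSquare2x2` is made up, no continuity correction is applied, and instead of a p-value it compares the statistic with 3.841, the standard 5% critical value for 1 degree of freedom.

```java
/** Hand-rolled 2x2 chi-square sketch; class and method names are hypothetical. */
final class ChiSquare2x2 {

    /** 5% critical value of the chi-square distribution with 1 degree of freedom. */
    static final double CRITICAL_95 = 3.841;

    /** Pearson chi-square statistic for two banners, without continuity correction. */
    static double statistic(long imp1, long conv1, long imp2, long conv2) {
        double[][] observed = {
            {conv1, imp1 - conv1}, // banner #1: converted, not converted
            {conv2, imp2 - conv2}  // banner #2: converted, not converted
        };
        double total = imp1 + imp2;
        double[] rowTotals = {imp1, imp2};
        double[] colTotals = {conv1 + conv2, total - conv1 - conv2};

        double chi2 = 0.0;
        for (int r = 0; r < 2; r++) {
            for (int c = 0; c < 2; c++) {
                // Expected count if both banners shared the same conversion rate
                double expected = rowTotals[r] * colTotals[c] / total;
                double diff = observed[r][c] - expected;
                chi2 += diff * diff / expected;
            }
        }
        return chi2;
    }

    /** True when the difference between the banners is significant at the 95% level. */
    static boolean significantAt95(long imp1, long conv1, long imp2, long conv2) {
        return statistic(imp1, conv1, imp2, conv2) > CRITICAL_95;
    }
}
```

Two banners with identical conversion rates give a statistic of 0, and the statistic grows as the observed counts drift away from what a shared rate would predict.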
Thursday, August 25, 2011
Sunday, July 31, 2011
Real Time Analytics approaches at the BIG data era
Big data is here and it’s big. Real Time just makes it faster.
Many companies and technology providers are looking at the new possibilities that this tremendously growing industry is enabling. The new world requires agility, fast response to changes and the ability to make educated yet automated decisions.
The world wants Real Time!
Over the past 6 years, I’ve been diving deep into the analytics and big data world, including web analytics, advertising, business intelligence and machine learning algorithms. During this period I’ve come across different perspectives on Real Time. I’ve learned that when people talk about big data and real time, they usually refer to high freshness of data or ad hoc queries (or both).
High freshness
High freshness of data refers to the effort of lowering the latency between the time an event occurs and the time it’s available for reporting.
On the one hand, Facebook is publishing its architecture for the (super cool) Real Time insights product they launched a few months ago. On the other hand, Yahoo’s tech leader is complaining about the difficulties of developing what they call the “next-click” – affecting the experience of the visitor on the page right after the current click. It seems that even the big guys are struggling with the technology. Nati’s post nicely explains the difficulties and proposes an alternative approach.
My concern is different – at the end of the day, Facebook’s implementation is based on counters in HBase, aggregating metrics per (like) URL. This fairly simple approach is easy to implement but forces some compromises on the product itself – it’s fixed, it’s not drill-downable and it takes time to process.
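The "fixed, not drill-downable" point can be shown with a small plain-Java sketch. This is not Facebook's or HBase's code; `CountersVsRawEvents`, `LikeEvent` and the `country` dimension are hypothetical, picked only to contrast a counter per URL with a query over raw events.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Contrast between pre-aggregated counters and raw events; names are hypothetical. */
final class CountersVsRawEvents {

    /** A raw "like" event with one extra dimension an analyst might want to drill into. */
    static final class LikeEvent {
        final String url;
        final String country;

        LikeEvent(String url, String country) {
            this.url = url;
            this.country = country;
        }
    }

    /** Counter approach: one number per URL; every other dimension is lost at write time. */
    static Map<String, Long> countPerUrl(List<LikeEvent> events) {
        Map<String, Long> counters = new HashMap<>();
        for (LikeEvent e : events) {
            counters.merge(e.url, 1L, Long::sum);
        }
        return counters;
    }

    /** Raw-data approach: the filter is chosen at query time, so drill-downs stay possible. */
    static long countRaw(List<LikeEvent> events, String url, String country) {
        return events.stream()
                .filter(e -> e.url.equals(url) && e.country.equals(country))
                .count();
    }
}
```

Once only `countPerUrl` is stored, a question like "likes for this URL from one country" can no longer be answered without going back to the raw data.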
Aggregations are for wussies!

What if you could have a real-time analytics solution running on top of raw data?
Ad hoc queries
Over the last year, the world of data warehousing has gone through its most drastic changes in 30 years. While traditional databases (Oracle, Microsoft, MySQL) were all about scaling up a single server, the new technologies (Greenplum, Netezza, Asterdata, Vertica and others) are all about what I call Linearness.
Are you Linear?
I claim that big data companies should drive themselves to be linear – linear in cost of the hardware, linear in performance of the queries and linear in accuracy of the response. Yes, accuracy. Who cares if last month’s visits were 5483238 or 5483361 – sampling is in many cases the key to success.
Using this concept, Facebook could have developed its insights feature to allow cool drill-downs and flexibility.
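As a rough illustration of trading accuracy for cost, the sketch below estimates a count by looking at only a fraction of the events and scaling the result back up. Everything in it is an assumption made for the example: the class `SampledCount`, the `rate` and `seed` parameters, and the representation of events as a `boolean[]` of "does this event match the query".

```java
import java.util.Random;

/** Sampling sketch: read a fraction of the data, scale the count back up. Hypothetical names. */
final class SampledCount {

    /**
     * Estimates how many events match by inspecting roughly a `rate` share of them
     * (0 < rate <= 1) and dividing the sampled hit count by the rate.
     */
    static long estimate(boolean[] matches, double rate, long seed) {
        Random rnd = new Random(seed);
        long sampledHits = 0;
        for (boolean match : matches) {
            // Each event is kept with probability `rate`; the others are skipped
            if (rnd.nextDouble() < rate && match) {
                sampledHits++;
            }
        }
        return Math.round(sampledHits / rate);
    }
}
```

With `rate = 1.0` every event is inspected and the count is exact; lower rates give an estimate that scatters around the exact count, which is the accuracy axis of the trade-off. In a real system the saving comes from not reading or shipping the skipped rows at all, which this in-memory toy cannot show.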
To achieve linearness – three conditions must apply:
- Shared nothing - each node is independent and self-sufficient
- Massively parallel processing - many CPUs working in parallel to execute a single program
- Columnar orientation - stores content by column rather than by row
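The three conditions can be pictured with a toy Java sketch. It is only a single-process analogy, not how Vertica, Greenplum or the others are implemented: `LinearScanSketch` and `Node` are hypothetical names, each `Node` stands for an independent machine that owns its own partition, and a parallel stream stands in for many CPUs working on one query.

```java
import java.util.Arrays;
import java.util.List;

/** Toy picture of shared-nothing, parallel, columnar scanning; names are hypothetical. */
final class LinearScanSketch {

    /** One shared-nothing node: it owns its partition, stored column by column. */
    static final class Node {
        final long[] visitsColumn;  // columnar: all values of one column sit together
        final int[] countryColumn;  // a second column, never touched by the query below

        Node(long[] visitsColumn, int[] countryColumn) {
            this.visitsColumn = visitsColumn;
            this.countryColumn = countryColumn;
        }

        /** Scans only the column the query needs. */
        long sumVisits() {
            return Arrays.stream(visitsColumn).sum();
        }
    }

    /** Every node scans its own partition in parallel; only partial sums are combined. */
    static long totalVisits(List<Node> nodes) {
        return nodes.parallelStream().mapToLong(Node::sumVisits).sum();
    }
}
```

Because nodes share nothing and exchange only their partial sums, adding a node adds both storage and scan capacity, which is what makes the scaling roughly linear in this picture.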
Conclusion
Real time analytics is hot – advertising, personalization, stock trading, shift management and many other scenarios. Don’t wait for an invitation – hop in asap or step back.
Think Linearness!
Monday, May 23, 2011
Google IO 2011 – App Engine & MapReduce – a game changer in Google’s cloud services
Everyone knows that Google is the founder of the MapReduce framework; they even protected it with a patent filed in June 2004. Google has been using MR for probably more than 10 years now, gaining a huge competitive and technological edge over all relevant competitors in terms of product, costs and scale.
At the end of 2004, Google published a whitepaper, MapReduce: Simplified Data Processing on Large Clusters, that changed the “big data” industry as we know it. This paper laid the foundation for Hadoop, built by Doug Cutting.
Although Google was the true founder of the concept, today when you say MapReduce, you say Hadoop. Not any more. Google is working to regain its position as the leader of big data processing technology, also for their cloud services platform – Google App Engine.
At Google IO 2010, Google introduced its strategy for parallel processing in GAE, revealing the Mapper phase.
One year later, at Google IO 2011, we got the second piece of the puzzle (still in early stages) – the full MapReduce mechanism.
What fascinates me the most is Google’s message to the world. As I see it, GAE up until now was considered to be “just” an alternative for developing simple web sites, a competitor to GoDaddy and PHP if you will. You’ve got Python or Java, a NoSQL datastore, simple scalability - and you’re ready to go.
Batch processing was always a mystery on GAE, and thus, they probably missed the opportunities for really big stuff to be built on their infrastructure.
Their latest announcement puts Google in a new position. They are telling us: go ahead, we can take it. Go and build your super complicated applications on our platform – social networks, advertising companies, web crawlers, video processors, everything.
The giants are wrestling and the industry is winning. The race to build the richest cloud platform is reaching the finals and it’s tight – AWS, Azure, GAE and Force are leading.
The race for a $1B company built entirely on cloud services is entering a critical phase as well.
