Big data is here and it’s big. Real Time just makes it faster.
Many companies and technology providers are looking at the new possibilities that this tremendously growing industry is enabling. The new world requires agility, fast response to change and the ability to make educated yet automated decisions.
The world wants Real Time!
Over the past 6 years, I’ve been diving deep into the analytics and big data world, including web analytics, advertising, business intelligence and machine learning algorithms. During this period I’ve witnessed different perspectives on Real Time. I’ve learned that when people talk about big data and real time, they usually mean high freshness of data, ad hoc queries, or both.
High freshness
High freshness of data refers to the effort to lower the latency between the time an event occurs and the time it becomes available for reporting.
On the one hand, Facebook is publishing its architecture for the (super cool) Real Time insights product they launched a few months ago. On the other hand, Yahoo’s tech leader is complaining about how difficult it is to develop what they call the “next-click”: affecting the visitor’s experience on the page right after the current click. It seems that even the big guys are struggling with the technology. Nati’s post nicely explains the difficulties and proposes an alternative approach.
My concern is different. At the end of the day, Facebook’s implementation is based on counters in HBase that aggregate metrics per URL (for example, likes per URL). This fairly simple approach is easy to implement, but it forces some compromises on the product itself: the metrics are fixed, you can’t drill down into them, and they take time to process.
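To make the trade-off concrete, here is a minimal sketch in plain Python. It is not Facebook’s code and does not use HBase; the event tuples, URLs and the `drill_down` helper are all made up for illustration. It only contrasts a per-URL counter with an aggregation computed from raw events at query time.

```python
from collections import Counter

# Hypothetical raw "like" events as (url, country) pairs, invented for illustration.
events = [
    ("example.com/a", "US"),
    ("example.com/a", "DE"),
    ("example.com/b", "US"),
    ("example.com/a", "US"),
]

# Pre-aggregated approach: one counter per URL, updated as events arrive.
# The country dimension is thrown away at write time.
likes_per_url = Counter(url for url, _country in events)

# Raw-data approach: keep the events and aggregate at query time,
# so any stored dimension can still be used for a drill-down.
def drill_down(raw_events, url):
    return Counter(country for u, country in raw_events if u == url)

print(likes_per_url["example.com/a"])       # 3
print(drill_down(events, "example.com/a"))  # Counter({'US': 2, 'DE': 1})
```

The counter answers only the question it was built for. The raw events can answer the same question and also ones nobody planned for, at the cost of scanning more data per query.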
Aggregations are for wussies!

What if you could have a real-time analytics solution running on top of raw data?
Ad hoc queries
Over the last year, the world of data warehousing has gone through its most drastic changes in 30 years. While traditional databases (Oracle, Microsoft, MySQL) were all about scaling up a single server, the new technologies (Greenplum, Netezza, Asterdata, Vertica and others) are all about what I call Linearness.
Are you Linear?
I claim that big data companies should drive themselves to be linear: linear in the cost of the hardware, linear in the performance of the queries and linear in the accuracy of the response. Yes, accuracy. Who cares if last month’s visits were 5483238 or 5483361? In many cases sampling is the key to success.
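The two visit counts above differ by 123, which is roughly 0.002% of the total. As a rough sketch of the sampling idea, the snippet below estimates a count by scanning only a fraction of the events and scaling the result up. The `estimate_total` function, the synthetic data and the 5% rate are assumptions made for illustration, not something taken from a specific product.

```python
import random

def estimate_total(events, predicate, rate, seed=0):
    """Estimate how many events satisfy `predicate` from a uniform random sample.

    Each event is kept with probability `rate`; the number of sampled hits
    is then scaled by 1 / rate to estimate the full count.
    """
    rng = random.Random(seed)
    sampled_hits = sum(1 for e in events if rng.random() < rate and predicate(e))
    return sampled_hits / rate

# Relative gap between the two visit counts quoted in the text.
relative_gap = abs(5483361 - 5483238) / 5483238

# Synthetic data: 200,000 events, exactly 30% of which count as a visit.
events = [i % 10 < 3 for i in range(200_000)]
exact = sum(events)
approx = estimate_total(events, lambda is_visit: is_visit, rate=0.05)

print(f"relative gap in the text: {relative_gap:.6%}")
print(f"exact={exact}, estimate={approx:.0f}")
```

The estimate evaluates the predicate on only about one event in twenty, and the error shrinks as the sample grows, so accuracy becomes something you can buy with hardware like everything else.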
Using this concept, Facebook could have developed its insights feature and still allowed cool drill-downs and flexibility.
To achieve linearness, three conditions must apply:
- Shared nothing: each node is independent and self-sufficient
- Massively parallel processing: many CPUs working in parallel to execute a single program
- Columnar orientation: the database stores content by column rather than by row
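The three conditions can be sketched together in a few lines of Python. This is a toy model, not how Greenplum, Vertica or any other product is built: the `nodes` list, the column names and both functions are hypothetical, and threads merely stand in for separate machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared nothing: each "node" owns its own partition and no state is shared.
# Columnar orientation: every partition stores one list per column, so a
# query reads only the columns it actually needs.
nodes = [
    {"url": ["/a", "/b", "/a"], "visits": [10, 5, 7]},
    {"url": ["/b", "/a"], "visits": [3, 4]},
    {"url": ["/c"], "visits": [8]},
]

def local_sum(partition, column):
    """Partial aggregate computed by one node over a single column."""
    return sum(partition[column])

def parallel_sum(partitions, column):
    """Scatter the query to every node, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = pool.map(lambda p: local_sum(p, column), partitions)
        return sum(partials)

print(parallel_sum(nodes, "visits"))  # 37
```

Because every node works only on its own data and returns a small partial result, adding a node adds capacity without adding contention, which is what keeps cost and performance close to linear.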
Conclusion
Real time analytics is hot in advertising, personalization, stock trading, shift management and many other scenarios. Don’t wait for an invitation: hop in ASAP or step back.
Think Linearness!
