Saturday, November 16, 2013

Stay Authentic: How to avoid the pitfalls of big data analytics

Big data repositories were made to store huge volumes of data. This means they are optimized for “writing” – regardless of whether we’re talking about file servers, NAS, Amazon S3 or Hadoop.  When I say “optimized for writing” I don’t necessarily mean that queries are impossible. Using Hadoop, query flexibility can be exceptional - you can imagine any query and just execute it. But there are no free lunches. With Hadoop, even simple queries typically take several minutes to a few hours, especially if you run them on a large dataset.

To Aggregate or not to Aggregate?

A common solution is to query slowness is to structure the data, aggregate it and load it into an analytics database. This has a cost. Phil Shelley, CTO of Sears Holdings and CEO of Metascale says “ETL is expensive in terms of people, software licensing and hardware”. He also mentions the loss of data freshness as the ETL process happens.
The ugly truth about preparing big data for analytics is that the data, once aggregated, is very different than the original event that was dumped in your big data repository. This is because aggregation is about choosing the dimensions needed according to your current analytics requirements. This does not maintain an authentic representation of your original business events – and can become a problem when new questions come up; questions that were not taken into consideration when the  original filters were defined. This happens often.
Each time data is processed it loses authenticity. For instance, if you aggregate daily level traffic, you’ll lose any insights that are hidden in the hourly data. Similarly, if you aggregate by country, you’ll miss the interesting stuff that happens at the metro area level.
Here’s an example: let’s assume you're running an ad network business and you are maintaining 3 main events: impressions, clicks and conversions.
Your real time server dumps all the data into Hadoop. To make it query-ready you choose to aggregate all events into MySQL aggregation tables.
Without noticing, you've actually created another copy of the data. And depending on your aggregation, you may at best lose granularity and at worst end up with misleading results. A roll up of a day’s events will completely omit peaks and troughs. You may miss something exceptionally good or exceptionally bad: you won’t see the link between the hourly tuning of your ad network and the actual results. Sometimes even hourly aggregation can produce a misleading subset.
But the challenges of roll ups and aggregations go beyond data granularity. Sometimes the way aggregate needs to change when the underlying business does.  You may need new filters to make sense of the data. The problem is that this means you need to restructure the data model. Restructuring the data model is the most expensive thing in software development. This may take weeks to roll out – a nightmare.

Writing & Reading Hadoop

I started this post stating that Hadoop is a “write” optimized store. As a matter of fact, Hadoop and its ecosystem have gone a long way since Hadoop was first developed, back in 2005. It is no longer just a dumpster of your data or a place to run slow batch processes using MapReduce or Hive queries. 
Some interesting players are making Hadoop run on steroids. Executing queries over tens of billions of records directly on Hadoop data in seconds is not science fiction. Maintaining some sort of indexed structure living inside the HDFS for the purpose of real time querying is a promising method to achieve this goal. This way you continue to use all the benefits you get from Hadoop for free – its scalability, redundancy, availability and robustness - and also get to run super-fast queries on raw data. I’ve actually met a company that does just this (in closed beta as I write this): JethroData.

5 tips to avoid inauthenticity

Keep it raw – don’t lose your data authenticity.

Denorm is the norm: Normalization is good for transaction based systems, when you must maintain a single copy of each dataset, mostly because every update and delete has to be atomic. But currently, for business intelligence and analytics, data duplication is crucial for good performance. So in the ad network example, if you know that the impression is associated with a specific ad campaign, consider adding the campaign attributes to the impression event - its goal, name, capping and so forth.  Make sure you add these attributes directly when the event is generated, so this will help you avoid authenticity issues and reduce query time on joins. Denormalization lets you bring the different attributes of a record together, so they are all accessible in one read. This works best for append only databases, where updates are infrequent.

Avoid fact to fact joins as much as you can. Let’s assume again that each impression is associated with a campaign. You'd like to know how many impressions, clicks and conversions are relevant for each campaign. Your initial reaction may be to filter the impressions with the relevant campaign and then to connect it to its click stream and then to the relevant conversion. That is exactly the “fact to fact join” you better avoid. Remember the previous rule - denorm is the norm. You should add the campaign from the impression to the clicks and conversions as well. Now you no longer need to join the facts, instead you can run a query on each entity independently and simply join the result sets.

Statistics aren’t that bad. Sometime getting the accurate number is cool, but sometimes a statistical number might tell a good enough story.

If you had 32,348,534 impressions or 32,350,000 impressions might not mean much when you look at the weekly activity. But it might mean the world when you send an invoice to a client.  To make statistics work, a nice trick is to add a bucket column to each table - assuming you have a user id column, add a bucket column equals to user id % 100.
You can now simply query 10% of the data by filtering all impressions that their bucket is < 10 and extrapolate the result by multiplying it by 10.
Of course, you can do the same for all of your entities. The nice thing about using statistics this way is that you can decide at query time how accurate the result is – whether you want it to run on 10%, 20% or 100% of the data.

Index your data: Keeping your data as-is in Hadoop won’t help you much with query performance. Indexing the data as it arrives in Hadoop would improve your query performance dramatically using tools such as JethroData,, etc..


To conclude, as Phil Shelley from Sears says: “ETL's Days Are Numbered”.   One reason is that migrating Big Data from one place to another will result in loss of authenticity and may eventually lead to making the wrong decisions.

Instead of transforming the data out of Hadoop for the purpose of reporting, make sure it truly represents the facts and run your dashboards and reports directly inside Hadoop, as much as possible. 

Sunday, February 5, 2012

Big Data - Big Opportunities

Here is a translation from an article in the daily magazine - (Hebrew):

"Accumulation of massive amounts of data, coming from all sorts of activities in the net, social media, mobile phones, power and water systems, etc. - lays the foundation for creating industry of big data," said Shachar - Director of Data Services, LivePerson.
"The difficulties organizations dealing with this vast amount of data creates tremendous business opportunities and establishment of companies that develop data processing systems and turning them into income manufacturer for organizations"

By Hadas Geifman, 5 February 2012, 16:07
"Big Data field creates many opportunities," says Haggai Shachar, Director Data Services at LivePerson, interviewed to people and computers. For example, explains Shachar, "it creates opportunities for companies that allow dynamic personalization of content based on patterns of historical data and companies that develop systems that identify inefficiencies in processes and non-effective exploitation of opportunities facing the organization. These companies specialize in locating information relevant to specific functions, concentrating in one place and then proposed plan to operate in real time. "

"Big Data technologies currently gaining momentum because of the rapid development of mobile communications, social networking over the past two years, creating tremendous growth in quantities of digital data," adds Shachar. "Traditional tools in the field of databases, such as those of Oracle and Microsoft, are not designed to process and analyze masses of data coming from a variety of sources quickly enough, and certainly not in real time. Combining these two factors accelerates the growth of  technology companies developing appropriate solutions ".

Shachar notes that "many organizations have not internalized yet that new methods of doing business in the data world, and see the need to store and retrieve data - type of burden. The challenges of dealing with Big Data can be classify into three categories, known as the three V: the data Volume, the Velocity in which data is generated and the Variety of instruments and sources.

Consulting giant McKinsey predicts annual growth of 40% volume of data gathered the world. Storage giant EMC estimates, the market for Big Data currently amounts to some 70 billion dollars, and predicts that by 2020 will increase its number of customers, with more than Petabyte of information, about 100,000 compared to only a 1,000 now.

"The difficulty of organizations dealing with huge amounts of data that," says Shachar, "creates business opportunities and establishment of companies that develop systems for processing data and making them a new source of revenue for organizations. In addition, he noted, "new companies are founded to provide tools for organizations to analyze historical data and market trends, and allow their customers to view the information in different cross-sections. For example, a retailer that sells backpacks can get an indication of how many cases of a particular type are in high demand, which cities or malls this past month, past six months, etc. “

"This kind of systems providing real time information helps organization to make decisions when the event occurs - analyzing the data as it arrives and responding to it immediately. For example, if it turns out that at one point in time a great demand for backpacks is made by a particular brand or a certain color, the retailer will be able to receive alerts in real time in order to meet demand, possibly even automatically."

LivePerson, operates as a SaaS company, provides systems that analyze massive amounts of data for more than 8,500 customers such as users activity, habits and trends - enabling site owners to respond in real time using a variety of channels (like chat, content etc.), thereby maximizing the relationship with the customer through the different marketing channels for sales and technical support.

Sunday, October 9, 2011

The BIG data technology stack

Running an advertising company / collecting click stream data / operating a big social network application and contemplating on ways to store, access, manipulate and monetize the collected data ? This article is for you.

Choosing the right tools and making the right compromisations is the key to success. In this post, I will try to share the set of tools and typical use-cases for the different levels of the big data architecture stack.

Before drilling down into the stack, lets talk about Linearness.

Are you Linear ?

In my last post I wrote about what I call – Linearness. To make a long story short, the concept of Linearness comes from the simple idea that a company dealing with big data should be able to play on a 3 axis matrix – cost, performance and accuracy.


If your architecture on its various levels is linear, you gotta be doing something right. For example - if your architecture enables you to easily throw more machines on the problem and thus linearly improve the performance OR by compromising the accuracy of the results (sampling) you can reduce the cost of hardware – you are in a good position.

The Big data building blocks

In the big data realm, there are 3 typical use cases that keep returning at most of the companies dealing with big data – Massive data processing, Ad Hoc reporting and Real Time access.

  • Massive data processing – typically referring to ETL processes, Data Mining algorithms or calculating pre-defined aggregations. In the “old” world, we used tools like Informatica, SAS or DB level queries. These approaches are far from being linear – long processing cycles and expensive hardware cost made us compromise on the functionality. Hadoop in its different forms (HDFS, HBase, Hive and MapR) is the right tool for these scenarios.
  • Ad Hoc reporting – analysts wieldiest dream (in terms of data) is to run aggregative queries on raw data on any time range on any filter. Over the years, we’ve used to call this problem - OLAP. Since linearness architectures are new, the existing technologies led us to compromise on predefine aggregations only. Todays new tools like Vertica, Greenplum, Asterdata and others open up new possibilities for linear scale ad hoc reporting systems.
  • Real Time access – When it comes to delivery big data to big audience, a need for fast and reliable fetching tool rise. Cassandra, HBase and other key value stores play well here.

The Big data architecture

Big data companies usually require the above 3 blocks (blue in the chart) in order to support the 3 use cases of big data. Here’s a typical architecture chart for such organizations:


  • On the top left: the customer facing tools - content or ad servers. These servers are the barrier between the end user and the backend technology. They generate tones of data which is nicely persisted into the HDFS. In many cases they consumes real-time related data from the key-value store database.
  • Hadoop is the core of the data in the organization. Every piece of data is stored there. Map reduce jobs constantly extract and aggregate the data into two typical forms – aggregate by key into the key-value store and extract raw data into the analaytic warehouse. Make sure your jobs are deltas aware, no point reprocessing what you’ve just finished.
  • The Key-Value store is the place where you store your delivery oriented data – user profile, product metadata, conversions information, etc. When serving content, this database is the king.
  • The analytic data warehouse is where the gold in its most polished form is stored. The data here is usually somewhat (if not entirely) denormalized or nicely separated to a star schema form. When analysts look at the schema – they immediately get it as it talks the way the business people understand.
  • For reporting - many BI tools works nicely on top of the analytics DB – starting with open source like Pentaho, continuing to sophisticated Excel like tools – Tableau, up to high-end solutions like Microstrategy. This queries are usually combined with your OLTP database where you store your metadata.

Wrapping up

The above architecture might look like a total waste of money and tones of duplication. While the duplication statement is correct, every block here is needed to solve the 3 use cases.

One last note - there is an important movement now towards one-size-fits-it-all databases, but it’s still very early to conclude on that.

Think Linearness!

Wednesday, September 14, 2011

Monday, August 29, 2011

A/B test for banners

Here is a little tip - how to statistically compare two advertising banners using a method called "chi-square test".
Assuming you have two banners, simply fill in the number of impressions and goals (conversions, clicks, leads, registrations, etc...) and click compare:
Banner Impressions Goals % Goal Rate
Banner #1
Banner #2

Do we have a winner ?

How does it work?
This calculation is a common statistical test called Chi Square Test.

One of its uses is to test the null hypothesis - an hypothesis that proves or disproves that a specific treatment has effect.

In our scenario, we test if banner #1 is over / under performing banner #2 and validate the statistical significance of the result.
This calculation is based on javascript, you can view its source (inspired by tom@ucla).
The equivalent java code looks something like that (using apache comons math lib):

long ad1imp = 13146;
long ad1conv = 72;
long ad2imp = 996324;
long ad2conv = 6442;
long[] ad1 = new long[] {ad1imp, ad1conv};
long[] ad2 = new long[] {ad2imp, ad2conv};
ChiSquareTestImpl cs = new ChiSquareTestImpl();

double p = cs.chiSquareTestDataSetsComparison(ad1, ad2);


Thursday, August 25, 2011

Sunday, July 31, 2011

Real Time Analytics approaches at the BIG data era

Big data is here and it’s big. Real Time just makes it faster.

Many companies and technology providerimages are outlooking at the new possibilities the this tremendously growing industry is enabling. The new world requires agility, fast response to changes and ability to take educated yet automated decisions.

The world wants Real Time!

Over the past 6 years, I’ve been deep diving into the analytics and big data world, including - web analytics, advertising, business intelligence and machine learning algorithms. During this period I’ve been witnessing to different perspectives about Real Time. I’ve learned that when people talk about big data and real time, they usually refer to high freshness of data or ad hoc queries (or both).

High freshness

High freshness of data refers to the efforts of lowering the latency between the time the event occur till it’s available for reporting.

On the one hand, Facebook is publishing its architecture for the(super cool) Real Time insights product they launched few month ago. On the other hand Yahoo’s tech leader is complaining about the difficulties to develop what they call the “next-click” – effecting the experience of the visitor on the page right after the current click. Seems like that even the big guys are struggling with the technology. Nati’s post nicely explains the difficulties and proposes an alternative approach.

My concern is different - at the end of the road, Facebook implementation is based on counters in HBase, aggregate metrics per (like) URL. This fairly simple approach is easy to implement but holds some compromise on the product itself – it’s fixed, it’s not drill-downable and it takes time to process.

Aggregations are for wussies!


What if you could have a real-time analytics solution running on top of raw data ?

Ad hoc queries

The world of data warehousing has gone through last year the most drastic changes over the past 30 years. While traditional databases (Oracle, Microsoft, MySQL) were all about scaling up a single server, the new technologies (Greenplum, Netezza, Asterdata, Vertica and others) are all about what I call - Linearness.

Are you Linear ?

I claim that big data companies should drive themselves to be imagelinear – linear in cost of the hardware, linear in performance of the queries and linear in accuracy of the response. Yes, accuracy. who cares if last month visits were 5483238 or 5483361 – sampling is in many cases the key for success.

Using this concept – Facebook could have developed it’s insights feature and allow cool drill downs and flexibility.

To achieve linearness – three conditions must apply:

  • Shared nothing - each node is independent and self-sufficient
  • Massive Parallel processing - many CPUs working in parallel to execute a single program
  • Columnar orientation - stores content by column rather than by row


Real time analytics is hot – advertising, personalization, stock trading, shift management and many other scenarios. Don’t wait for an invitation – hop in asap or step back.

Think Linearness!

Enhanced by Zemanta