Google Percolator: MapReduce Demise?

Here is my early thoughts after quickly looking into  Google Percolator and skimming the paper .

Major take-away: massive transactional mutating of tens-petabyte-scale dataset on thousands-node cluster is possible!

MapReduce is still useful for distributed sorts of big-data and few other things, nevertheless it’s “karma” has suffered a blow. Beforehand you could end any MapReduce dispute by saying “well… it works for Google”, however, nowadays before you say it you would hear “well…. it didn’t work for Google”. MapReduce is particularly criticized by having 1) too long latency, 2)too wasteful, requiring full rework of the whole tens-of-PB-scale dataset even if only a fraction of it had been changed and 3) inability to support kinda-real-time data processing (meaning processing documents as they are crawled and updating index appropriately). In short: welcome to disillusionment stage of MapReduce saga. And luckily Hadoop is not only MapReduce, I’m convinced Hadoop will thrive and flourish beyond MapReduce and MapReduce being an important big data tool will be widely used where it really makes sense rather than misused or abused in various ways. Aster Data and remaining MPP startups can relax on the issue a bit.

Probably a topic for another post, but I think MapReduce is best leveraged as ETL tool.

See also for another view on the issue. There are few others posts already published  on Precolator but I haven’t yet looked into them.

I’m very happy about my SVLC-hypothesis, I think I knew it for a long time, but somehow only now, after I have put it on paper, I felt that the reasoning about different analytics approaches became easier. It is like having a map instead of visualizing it. So where is Percolator in the context of SVLC? If it is still considered analytics, Percolator is an SVC system – giving up latency for everything else, albeit to a lot lesser degree than its successor MapReduce. That said Percolator has a sizable part that is not analytics anymore but rather transaction processing. And transaction processing  is not usefully modeled by my SVLC-hypothesis. In summary: Percolator is essentially the trade-off as MapReduce – sacrificing latency for volume-cost-sophistication but more temperate, more rounded,  less-radical.

Unfortunately I haven’t enough time to enjoy the paper as it should be enjoyed with easy weekend-style reading. So inaccuracies may have been infiltrated in:

  • Percolator is big-data ACID-compliant transaction-processing non-relational DBMS.
  • Percolator fits most NoSQL definitions and therefore it is a NoSQL.
  • Percolator  continuously mutates dataset (called data corpus in the paper) with full transactional semantics and in the sizes of tens of petabytes on thousands of nodes.
  • Percolator uses a message-queue style approach for processing crawled data. Meaning, it processes the crawled pages continuously as they arrive updating the index database transactionaly.
  • BEFORE Percolator: Indexing was done in stages taking weeks. All crawled data was accumulated and staged first, then pass-after-pass transformed into index. 100-passes were quoted in the paper as I remember. When the cycle was completed a new one was initiated. Few weeks latency after content was published and before it appeared in Google search results were considered too long in twitter age, so Google implemented some shortcuts allowing preliminary results to show in search before the cycle is completed.
  • Percolator  doesn’t have declarative query language.
  • No joins.
  • Self-stated ~3% single node efficiency relative to the state-of-the-art DBMS system on single node. That’s the price for handling (which is transactional mutating) high-volume dataset… and relatively cheaply. Kudos for Google to being so open on this and not exercising in term-obfuscation. On the other hand, they can afford it… they don’t have to sell it tomorrow on rather competitive NoSQL market ;)
  • Thread-per-transaction model. Heavily threaded many-core servers as I understand.

Architecturally reminds me MoM (Message Oriented Middleware) with transactional queues and guarantied delivery.

Definitely to be continued…

other Percolator blog posts:

Analytics Patterns

Unsatisfied by my previous post‘s Advanced Analytics definition and giving it a thought of what is advanced methods in analytics I realized that analytics industry miss a good analytics pattern catalog. A list of common problems followed by a list of common industry-consensus solutions to them. An equivalent of GoF design patterns to analytics. The list, where each list item starts from brief description of common recurring analytics problem and follows by elaboration by commonly accepted solutions to this problem followed by mandatory example section illustrating the solution using widely available tools.

Software engineers stolen this idea from the real architects (those dealing with a concrete structures not an abstract ones ;) ) 15 years ago.  They haven’t avoided initial short period of mass obsession and abuse of the concept… who does?  But eventually it worked out quite well for them us. I wonder if analytics industry could leverage these experience and create a catalog of some 25-50 most common patterns. Pattern descriptions in a catalog not to exceed few pages and number of patterns limited to few tens, making it wide industry adoption feasible.

What you think? Any ideas? I’ll try to make a first step by dumping patterns from my head right now (it is by no way a finished work):

I’ll call it analytics patterns:

1. Predictive Analytics. That was the easiest for me. I was involved into it for the first time some 12 years ago and developing what is now The system was used mostly to forecast sales taking into account an array of causal factors like seasonality, marketing campaigns, historical growth rates and etc. The problem is that there is a lot of time-based historic data available and it is required to forecast future values in the context of given historic data. The basic mechanism of implementing Predictive Analytics is to find or less preferably to develop a suitable mathematical model that can model closely (but be  cautious about overfitting) existing data, usually a time-series data and then use the model to induce forecasted values. In simple terms it is a case of extrapolation. Correct me if I’m wrong. As it was the case in 90-ties I’m pretty sure it is the case now, that exotic hardcore AI approaches like neural networks & genetic programming are best kept exclusively for moonlighting experiments and as material for cooler conversation the next morning. With deadlines defined and limited budget it is best to stick to proven techniques to achieve quick wins. I think the value of working forecasting is self evident.

2. Clustering. Well not the heavy noisy one in a cold hall :) but the statistics sub-discipline called better cluster-analysis. The problem here is that a lot of high-dimensionality data is available and it is required to discover groups with similar observations in other words automatically classify them. It is implemented by searching for correlations grouping the records according to the discovered correlations. What it is good for? Well in simple terms it helps to discriminate different kinds of objects and observe the specific properties of each kind. Without such grouping, one would be able only to observe properties that all objects exhibit or alternatively go object by object and observe it in isolation.

3. Risk Analysis - particularly through Monte-Carlo simulation. It is not called Monte-Carlo because it is invented there :) it called so because of reliance on random numbers akin Monte-Carlo casinos. Random numbers are proved most effective way to simulate mathematical model with large number of free-variables. With advent of computers it became a whole lot easier than using the book.

4. Given telecom event stream, run events through the rules engine to detect and prevent telecom fraud in real-time. This is essentially CEP engine and usually implemented by creating a state-machine per rule and running the events through it. Special version of stream sql is used. Similar scheme can be used for real-time click fraud prevention.
5. Given serialized object data or nested data allow running ad-hoc interactive queries over it in BigQuery fashion.
6. Given normalized relational model, allow running any ad-hoc queries. For common joins create a materialized view to speed up joins.
7. Canned reports. I guess they are good also for some cases…….
8. OLAP/Star schema when to use? ……

What else?

Of course it is just a first step and to do it correctly it will be a project in itself, in form of a book most probably. However, as one Chinese proverb  goes “A journey of a thousand miles begins with a single step”.