Google Percolator: MapReduce Demise?

Here are my early thoughts after quickly looking into Google Percolator and skimming the paper.

Major take-away: massive transactional mutation of a tens-of-petabytes-scale dataset on a thousands-of-nodes cluster is possible!

MapReduce is still useful for distributed sorts of big data and a few other things; nevertheless, its “karma” has suffered a blow. Before, you could end any MapReduce dispute by saying “well… it works for Google”; nowadays, before you can say it, you will hear “well… it didn’t work for Google”. MapReduce is particularly criticized for 1) too-long latency, 2) wastefulness – requiring a full rework of the whole tens-of-PB-scale dataset even if only a fraction of it has changed – and 3) inability to support near-real-time data processing (meaning processing documents as they are crawled and updating the index accordingly). In short: welcome to the disillusionment stage of the MapReduce saga. Luckily, Hadoop is not only MapReduce; I’m convinced Hadoop will thrive and flourish beyond MapReduce, and MapReduce, being an important big-data tool, will be widely used where it really makes sense rather than misused or abused in various ways. Aster Data and the remaining MPP startups can relax on the issue a bit.

Probably a topic for another post, but I think MapReduce is best leveraged as an ETL tool.

See also for another view on the issue. There are a few other posts already published on Percolator, but I haven’t looked into them yet.

I’m very happy about my SVLC hypothesis. I think I have known it for a long time, but somehow only now, after putting it on paper, do I feel that reasoning about different analytics approaches has become easier. It is like having a map instead of visualizing one. So where is Percolator in the context of SVLC? If it is still considered analytics, Percolator is an SVC system – giving up latency for everything else, albeit to a far lesser degree than its predecessor, MapReduce. That said, Percolator has a sizable part that is not analytics anymore but rather transaction processing, and transaction processing is not usefully modeled by my SVLC hypothesis. In summary: Percolator makes essentially the same trade-off as MapReduce – sacrificing latency for volume, cost, and sophistication – but is more temperate, more rounded, less radical.

Unfortunately, I haven’t had enough time to enjoy the paper as it should be enjoyed, with easy weekend-style reading. So some inaccuracies may have crept in:

  • Percolator is a big-data, ACID-compliant, transaction-processing, non-relational DBMS.
  • Percolator fits most NoSQL definitions, and therefore it is NoSQL.
  • Percolator continuously mutates a dataset (called the data corpus in the paper) with full transactional semantics, at sizes of tens of petabytes, on thousands of nodes.
  • Percolator uses a message-queue-style approach for processing crawled data. Meaning, it processes crawled pages continuously as they arrive, updating the index database transactionally.
  • BEFORE Percolator: indexing was done in stages taking weeks. All crawled data was accumulated and staged first, then transformed pass-after-pass into the index. About 100 passes were quoted in the paper, as I remember. When a cycle was completed, a new one was initiated. A few weeks of latency between content being published and appearing in Google search results was considered too long in the Twitter age, so Google implemented some shortcuts allowing preliminary results to show up in search before the cycle completed.
  • Percolator doesn’t have a declarative query language.
  • No joins.
  • Self-stated ~3% single-node efficiency relative to a state-of-the-art DBMS on a single node. That’s the price for handling (that is, transactionally mutating) a high-volume dataset… relatively cheaply. Kudos to Google for being so open about this and not exercising in term obfuscation. On the other hand, they can afford it… they don’t have to sell it tomorrow on the rather competitive NoSQL market ;)
  • Thread-per-transaction model. Heavily threaded many-core servers, as I understand it.
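The message-queue-style, observer-driven flow in the bullets above can be sketched in miniature. This is a toy, in-memory model with hypothetical names (`Store`, `Percolator`, the callbacks) – the real system runs observers against Bigtable under snapshot-isolation transactions – but it shows the chaining: a write leaves a notification, the triggered observer runs its own update, which may trigger the next stage, with no batch pass over the whole corpus.

```python
# Toy sketch of Percolator-style observers (hypothetical API, not the real one).
# A write to a watched column leaves a "notification"; observers drain
# notifications and run their own updates, possibly triggering further stages.

class Store:
    def __init__(self):
        self.data = {}            # (row, column) -> value
        self.notifications = []   # (row, column) pairs pending observer work

    def write(self, row, column, value):
        self.data[(row, column)] = value
        self.notifications.append((row, column))

class Percolator:
    def __init__(self, store):
        self.store = store
        self.observers = {}       # column -> callback to run on writes

    def observe(self, column, callback):
        self.observers[column] = callback

    def run_pending(self):
        # Drain notifications; each observer invocation stands in for one
        # transaction (trivially atomic here, since this is single-threaded).
        while self.store.notifications:
            row, column = self.store.notifications.pop(0)
            callback = self.observers.get(column)
            if callback:
                callback(self.store, row)

# Stage 1: a crawled page arrives -> "parse" it (upper-casing as a stand-in).
def parse_document(store, row):
    raw = store.data[(row, "raw")]
    store.write(row, "parsed", raw.upper())

# Stage 2: a parsed page -> add it to a toy index (plain dict write,
# so no further notification is generated and the chain terminates).
def index_document(store, row):
    store.data.setdefault(("index", "all"), []).append(row)

store = Store()
p = Percolator(store)
p.observe("raw", parse_document)
p.observe("parsed", index_document)

store.write("page1", "raw", "hello world")   # simulate crawler output
p.run_pending()                              # both stages fire incrementally
```

A second crawled page would flow through the same two stages on arrival, touching only its own rows – the incremental contrast to reprocessing the whole corpus.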

Architecturally, it reminds me of MoM (Message-Oriented Middleware) with transactional queues and guaranteed delivery.

Definitely to be continued…

other Percolator blog posts:

3 thoughts on “Google Percolator: MapReduce Demise?”

  1. I would like to point out that Percolator is not a standalone technology, but a layer on top of BigTable, which indeed is a match for MapReduce in terms of scalability. What it did is add ACID-compliant transactions to BigTable. I do think that Percolator, together with Dremel, indeed proves that MapReduce is not the only way to process Big Data.

  2. Do you mean that Percolator as a technology is not universal and cannot be useful on top of other storage engines (particularly document-oriented and KV)?

  3. I mean that Percolator is not a technology by itself. I see most of the technology in BigTable and the underlying GFS. Percolator is a very interesting case of implementing ACID transactions over a DBMS with eventual consistency.
