Apache Hadoop over OpenStack Swift

This is a post by Constantine Peresypkin and David Gruzman.
Lately we have been working on integrating Hadoop with OpenStack Swift. Neither Hadoop nor OpenStack needs an introduction. Swift is an object-storage system and the technology behind Rackspace Cloud Files (and quite a few other services, such as Korea Telecom's object storage and Internap's).
Before we go into the details of Hadoop-Swift integration, let's get some relevant background:
  1. Hadoop already has integration with Amazon S3 and is widely used to crunch S3-stored data: http://wiki.apache.org/hadoop/AmazonS3
  2. The NameNode is a known SPOF in Hadoop. If it can be avoided, all the better.
  3. The current S3 integration stages all data as temporary files on local disk before uploading to S3. That's because S3 needs to know the content length in advance; it is one of the required headers.
  4. The current S3 integration also suffers from a 5GB maximum file size, which is slightly annoying.
  5. Hadoop requires seek support, which means HTTP range support is required when it runs over an object store. S3 supports it.
  6. Append support is optional for Hadoop, but it's required for HBase. S3 doesn't have any append support, so the native integration cannot run HBase over S3.
  7. While OpenStack Swift is compatible with S3, Rackspace CloudFiles is not, because Rackspace CloudFiles disables the S3-compatibility layer in Swift. This prevents existing Swift users from integrating with Hadoop.
  8. The only information available on the internet about Hadoop-Swift integration is that it should work using Apache Whirr. But to the best of our knowledge this is relevant only to rolling out a block filesystem on top of Swift, not a native filesystem. In other words, we haven't found any solution for processing data that is already stored in Rackspace CloudFiles without costly re-importing.
So, armed with the above information, let's examine what we have here:
  1. In general, we instrumented Hadoop to run over Swift natively, without resorting to the S3-compatibility layer. This means it works with CloudFiles, which lacks that layer.
  2. The CloudFiles client SDK doesn't support HTTP range requests. We hacked it to allow HTTP ranges; this is a must for Hadoop to work.
  3. We removed the need for the NameNode, in a similar way to how it is removed in the Amazon S3 integration.
  4. As opposed to the S3 implementation, we avoided staging files on local disk on the way to and from CloudFiles/Swift. In other words, data is streamed directly between compute-node RAM and CloudFiles/Swift.
  5. The data is still processed remotely, though: extensive data shipping takes place between compute nodes and CloudFiles/Swift. As frequent readers of this blog know, we are working on technology that will allow running code snippets directly in Swift. Look here for more details: http://www.zerovm.com. As a next step we plan to perform predicate-pushdown optimization to process most of the data completely locally, inside a ZeroVM-enabled object-storage system.
  6. Support for native Swift large objects is also planned (something that's absent in Amazon S3).
  7. We are also working on append support for Swift (this could easily be done through Swift's large-object support, which uses versioning), so even HBase will work on top of Swift; this is not the case with S3 today.
  8. As is the case with Hadoop on S3, storing big data in native format on Swift provides options for multi-site replication and CDN.
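The seek requirement (and the SDK hack in item 2) both come down to HTTP Range requests. Here is a minimal sketch of how a seekable reader can be built on top of ranged GETs; this is my own illustration, not code from the actual patch, and the backend below is a fake stand-in for Swift/CloudFiles:

```python
# Hypothetical sketch of seek-over-HTTP-Range, not code from the actual patch.

def range_header(offset, length=None):
    """Build an HTTP Range header value for a read starting at `offset`."""
    if length is None:
        return "bytes=%d-" % offset  # read from offset to end of object
    # HTTP byte ranges are inclusive on both ends
    return "bytes=%d-%d" % (offset, offset + length - 1)

class RangedObjectReader:
    """A seekable, read-only view over an object served via ranged GETs."""

    def __init__(self, get_range, size):
        self.get_range = get_range  # callable(range_value) -> bytes
        self.size = size
        self.pos = 0

    def seek(self, offset):
        self.pos = offset

    def read(self, n):
        data = self.get_range(range_header(self.pos, n))
        self.pos += len(data)
        return data

# A fake backend standing in for Swift/CloudFiles: serves byte ranges.
blob = bytes(range(256)) * 4  # a 1 KiB "object"

def fake_get(range_value):
    start, end = range_value[len("bytes="):].split("-")
    return blob[int(start):int(end) + 1]

reader = RangedObjectReader(fake_get, len(blob))
reader.seek(512)                         # no temp files, no full download
assert reader.read(16) == blob[512:528]  # a 16-byte read at offset 512
print(range_header(512, 16))             # bytes=512-527
```

The point is that each seek costs nothing until the next read, which fetches exactly the bytes Hadoop asked for instead of staging the whole object on local disk.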

Futility of “tooling” a proprietary cloud.

I've been pitched by a lot of entrepreneurs trying to make a better-than-original "tooling" for a proprietary cloud, particularly for AWS. Isn't the attempt futile from the beginning? Amazon is smart, innovative, and working hard to make its cloud offering comprehensive, and it has a much larger arsenal to outdo anyone who dares to compete on its own turf. That is their party; an invitation cannot be taken for granted.

Let's take NoSQL data stores and DBMS vendors as examples. There are VC-backed companies out there that are exclusively focused on outdoing Amazon at running MySQL/NoSQL on its own cloud (Xeround comes to mind), and many others are also hoping their products will catch fire on EC2.

Well, if branding and plain convenience alone are not enough, how about these two exclusive and "unfair" competitive advantages in Amazon's arsenal:

  • [My unverified assumption is] that DynamoDB has storage integrated into its fabric, whereas all the rest must use slower EBS.
  • Not only is it integrated: as announced by Amazon, it uses SSD-backed storage. SSD-backed storage is not available, as of today, to DynamoDB competitors running on AWS, so they must continue to use ordinary EBS. That is in fact a double kick: first, the mere fact of using different hardware for a competitive advantage, and second, the announcement itself as a catalyst to trigger migration.

So a future EMR may also get integrated storage, as well as other hardware optimizations, making Hadoop more efficient on AWS; good if so. The same goes for RDS and other current and future PaaS-related services.

Do I accuse Amazon of wrongdoing? Of course not! They brought the cloud to Main Street while others were only talking about it; they made large-scale computing affordable to all, keep dropping prices by passing their economies-of-scale savings on to customers, keep optimizing and enhancing their infrastructure constantly, and have been good to their shareholders too. However, like any proprietary and monopolistic platform, they do hinder some outside-of-Amazon innovation. No matter how good they are, we don't want only one company in the world doing cloud infrastructure for everyone else. That's why OpenStack is so extremely important for the industry. If OpenStack is widely adopted, then infrastructural and "tooling" innovation can go directly into OpenStack, for the greater good and with a fairer monetization model for the author.

OpenDremel update and Dremel vs. Tenzing

I haven't blogged for the whole of 2011… I'm not dead; quite the contrary, we were pretty active with the OpenDremel project in 2011. First, we are renaming it to Dazo to avoid using a trademarked name, and second, we did a good job implementing a secure generic execution engine and integrating it into OpenStack Swift. It also turned out that the engine is quite a useful virtualization technology in itself, and it could potentially deserve a better fate than being buried as an OpenDremel subcomponent. So we do plan to release it as an independent project and are quite busy with that now; the work on OpenDremel proper is, unfortunately, all but stalled.

As for storage infrastructure, we settled on OpenStack Swift. We fell in love with Swift from the day it was released, and now that we have integrated ZeroVM into it, we like it even more. So right now we have a fully scalable storage backend with the unique capability to run arbitrary native code securely inside it, close to the data. What's left is to take our old Metaxa query compiler and integrate it with that backend; after many iterations it should bake into something pretty similar to Google Dremel and BigQuery. Even better, it will always process data locally (not sure BigQuery does that now), and it will not be limited to BQL on nested records but will support any query on any data, with full multi-tenant semantics. That's how interesting 2011 was…

That was the preamble; now back to the main feature:

Google released a paper on Tenzing last year at VLDB. Tenzing is an SQL query system implemented on top of the MapReduce infrastructure; it can be thought of as Google's way of doing Hive and, as always, it is full of juicy details. There is already a quality post published on this, and another one here. On top of those, my additional takeaways are:

1. It is possible to build an MPP-grade system on top of MapReduce with relatively low latency (10 seconds). However, it would require quite a number of patches to MapReduce. Hive and Hadoop certainly have a lot to learn from Tenzing.

2. Even with Google's patched, leaner-than-Hadoop implementation of MapReduce, getting to Dremel latencies was not achievable. On the other hand, a 10-second minimum latency is not that bad, and in the same ballpark as Netezza/Greenplum/Aster and other MPP gear.

3. As a general Sawzall vs. Dremel vs. Tenzing comparison, there is a nice YouTube data-warehousing presentation published. In fact, Dremel beats both of them on latency, and if not for the limited expressive power of its query language, it would end up the complete winner on all metrics considered there. Sawzall, having an imperative query language, scores highest on the power metric. I guess when OpenDremel is released it will be a unique combination of low-latency querying with the full expressive power of imperatively-augmented SQL.

4. Tenzing can query MySQL databases as well as many other popular data formats. What we are witnessing here is query engines being decoupled from storage engines. Ten years ago this was only the case in the MySQL ecosystem, and anyone who has tried Oracle's external table interface knows how friendly past DBMSes were to external data sources. Dremel's columnar encoding component was released internally at Google as the separate ColumnIO storage engine. Then Google open-sourced their key-value LevelDB engine, a la Hadoop's RCFiles. So we can see the emergence of multiple storage engines working with multiple query engines, quite an interesting phenomenon.

5. The query is compiled into native code (with LLVM), and this gave significant acceleration, by a factor of six to twelve. This means that SQL-to-native-code compilation is a must for high-performance big-data query engines.
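To get a feel for why compilation helps, here is a toy sketch (my own illustration, not Tenzing's code; Tenzing emits real native code via LLVM, while here "compilation" just builds one Python function from the expression tree instead of re-walking the tree per row):

```python
# Toy contrast: interpreting a predicate tree per row vs. compiling it once.
# Hypothetical illustration of query compilation, not Tenzing's actual code.

# Predicate: price > 100 AND qty < 5, represented as an expression tree.
expr = ("and", (">", "price", 100), ("<", "qty", 5))

def interpret(e, row):
    """Walk the expression tree for every single row (slow path)."""
    op = e[0]
    if op == "and":
        return interpret(e[1], row) and interpret(e[2], row)
    if op == ">":
        return row[e[1]] > e[2]
    if op == "<":
        return row[e[1]] < e[2]
    raise ValueError("unknown op: %r" % op)

def compile_predicate(e):
    """Translate the tree into source once, then build a single function."""
    def gen(node):
        op = node[0]
        if op == "and":
            return "(%s and %s)" % (gen(node[1]), gen(node[2]))
        return "(row[%r] %s %r)" % (node[1], op, node[2])
    return eval("lambda row: " + gen(e))

rows = [{"price": p, "qty": p % 7} for p in range(10000)]

pred = compile_predicate(expr)           # one-time translation cost
n_compiled = sum(1 for r in rows if pred(r))
n_interp = sum(1 for r in rows if interpret(expr, r))
print(n_compiled == n_interp)  # True: same answer, no per-row tree walk
```

The per-row dispatch that the interpreter pays for disappears after compilation; with LLVM emitting machine code instead of bytecode, the reported 6x to 12x speedup is plausible.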


Upcoming hardware renaissance era: part #2.

Some examples of upcoming hardware renaissance era:

1. Virtually all server vendors are pitching modularized data centers (MDCs) by now. MDCs are boxes resembling shipping containers that accommodate a complete virtualized data center inside. With an MDC one just connects power, network, and chilled water, and gets access to the cloud in a box. Most MDCs can be deployed outdoors and have built-in protection against the elements. Of course, all current offerings are based on x86 commodity servers, but here is a hint: once the competition moves to comparing whose shipping container can stuff in more storage and computing power, and who has better price/performance and energy efficiency, we will see innovation in hardware skyrocket.

2. On the processor front… the ARM architecture has all the ingredients to become the next Intel. If I am not mistaken, ARM processors outnumber x86 ten to one, with tens of billions of processors shipped; 95% of cellphones and advanced gadgets use ARM. ARM's power efficiency puts x86 to shame. However, until now ARM was focused on gadgets and dismissed the data-center market. Not anymore! With the newer Cortex-A15, ARM has taken aim at x86 on datacenter territory. Calxeda has already raised ~$50M in venture money to commercialize ARM in the datacenter. However, ARM is not alone here: Tilera, with their server-vendor partner Quanta, are already shipping a 512-core server in a 2U form factor. Tilera took a lean MIPS processor core and put some 100 of them onto a single die, together with eight 10Gbit Ethernet channels and four DDR3 memory channels. Nvidia has also claimed that they are no longer a GPU-only vendor and are readying general-purpose processors based on the ARM architecture, with an ample amount of GPU muscle inside. That said, Intel and AMD are also far from stagnating and are moving into heterogeneous many-core designs. I think we have never witnessed more innovation in the processor space than now.

3. On the memory front… flash is making inroads to claim space in the memory hierarchy between DRAM and HDD, disrupting the DRAM market and the high-performance 15K RPM HDD market. I think 15K RPM HDDs and DRAM-based SSD products can already safely be declared dead. The same goes for HDDs smaller than the 2.5-inch form factor, and I think even 2.5-inch HDDs are at risk; only capacity-optimized HDDs will survive. Even without flash, DRAM has reached such capacities that most datasets fit in RAM completely, and if not in the RAM of a single server, then surely in shared cluster RAM. These solid-state advancements in DRAM and flash disrupt the storage market, especially by making high-performance SAN redundant. The only storage tomorrow's server will need is capacity-optimized and energy-optimized. That fact, among others, forced EMC to move into computing… and provide a complete cloud in a box instead of just storage in a box, as it did in the past.

4. Networking… in my view, networking is the most stagnant hardware market here. Infiniband is finally moving into the mainstream, and that is good. Or is it? Will it succumb to 10GbE? That remains to be seen; my bet is on Infiniband due to architectural superiority. Network virtualization is still on whiteboards… unfortunately. So in networking there are no signs of renaissance yet, but the potential is there.

Emerging Proprietary Hardware Renaissance

I cannot count the number of times I have heard that cloud computing means innovation stagnation in the proprietary hardware business, and that with cloud computing, hardware doesn't matter anymore and will sooner or later succumb into a boring, razor-thin-margin, oligopolistic commodity industry.

Why do folks think like that? Well… there is one reason that dominates their thinking: hardware products became components, and worst of all, well-standardized components. As such, certain low-wage countries can quickly master assembling them in large quantities and win the competition purely on cost; nothing else seriously matters in the component business. No one cares about a component's extended enterprise feature set, premium brand, or long list of ISV partnerships; what matters is very well-defined functionality and price, price, price. In fact, it has already happened to low-end gear like entry-level basic servers and routers. The situation is very different for enterprise IT products. I estimate that if IT costs tripled overnight for most enterprises, it would not matter to their bottom line. So the IT departments of most enterprises are cost-insensitive regarding IT gear: they will not blindly overpay in most cases, but cost is not high on their priority list either. IT for most enterprises is a more or less fixed cost amortized across a very large number of products or services. I guess if Coca-Cola's IT costs tripled, the price of a single can of Coke would rise by less than a cent; unlikely to be life-threatening to their business. Therefore the game was, and largely still is, to market hardware products directly to enterprise IT departments, competing on enterprise feature set rather than price. Now, with the emerging cloud computing paradigm, hardware products are components marketed to cloud operators, which are essentially server farmers. And they are extremely cost-sensitive: marketing premium computing gear to them will be about as successful as marketing premium booze as fuel to the owner of an alcohol-fueled car. If a cloud operator's server costs tripled overnight, the next morning it would be out of business. So without a doubt it is game over for fat margins in the hardware manufacturing/assembly business.
What happened to enterprise software 5 years ago is happening to enterprise hardware right now – commoditization.

So the common thinking goes: in a boring commodity business no one is going to invest in innovation, so no one is going to invest much in proprietary hardware, because no cloud vendor is going to buy premium hardware. The only hope is the private cloud, a freshly invented loophole for continuing to sell premium gear to enterprises. Well, let's consider the following situation. An XYZ startup manages to build a proprietary appliance, essentially a cloud-in-a-box solution, that through tight internal integration and optimization for one particular task achieves an order-of-magnitude better price/performance. Let's say it is a KV-store appliance. Would cloud operators be interested? I bet they would. From the outside it doesn't matter whether the functionality is backed by Cassandra on generic hardware or by a custom hardware appliance. So a cloud provider quietly rolling out such appliances can compete well, both on price and on functionality (like latency), with other cloud providers running a software-based KV store on generic hardware. Another startup, ABC, may produce a computing appliance that can run Ruby on Rails, Java, or Python applications an order of magnitude more efficiently than generic hardware, and a cloud provider deploying ABC's appliances could serve RoR clients more competitively. Yet another startup may build a custom rack-sized box filled with Fermi chips specifically designed for video processing. I could bring more examples, but the trend is obvious: use a large chunk of dedicated hardware to do one specific task extremely efficiently, and you can have nice margins as a hardware manufacturer. Before the cloud, hardware had to be generic, because a single enterprise server had to run a variety of different workloads. With cloud computing this is no longer the case: a vendor can build a dedicated hardware appliance optimized for one specific workload and serve the whole world with it, raising high barriers for competitors.
So despite popular belief, I think cloud computing presents unique opportunities to creative hardware engineers, though not in the premium enterprise-feature-set area as it used to be, but in the extreme-efficiency, acceleration, and specialization areas.

Two Envelopes Problem: Am I just dumb?

It seems the recent craze about statistician being the profession of choice of the future is gaining steam, in a future where we will be surrounded by quality big data, capable computers, and bug-free open-source software, including OpenDremel. Well, the last one I made up… but the rest seems to be the current situation. Acknowledging this, I was checking what the state of open-source statistics software is, who the people behind it are, and so on. But that is not my topic today; it is the topic of one of my next posts. Today I want to talk about one of the strangest problems/paradoxes on the internet I have ever seen. The story is that I just now encountered "the two envelopes problem", or "paradox" as some put it. Having worked for a long time with math guys (whose mastery I'll never reach in my life), I immediately recognized the problem, as I was teased by it a few times long ago. However, I was never told it was such a big deal of a problem. Wikipedia lists it under "Unsolved problems in statistics". Huh? I never understood what is so paradoxical, or hard, or even interesting in it; to me it seemed a high-school-grade problem at most. So I put "two envelopes problem" into Google and found tens of blogs trying to explain it, proposing over-engineered, over-complicated, and long solutions to such a simple problem. I have a very strange feeling that I'm either totally dumb or a genius, and I know I'm not a genius ;) Some sources mention that only Bayesian subjectivists suffer from this; however, the large majority of other sources present it as a universal problem… Well, enough talking; let's dive into the simplest solution on the internet (or I will be embarrassed by someone pointing out my mistake or a similar solution published elsewhere).

The problem description for those who never heard about it:

You are approached by a guy who shows you two identical envelopes. Both envelopes contain money. You are allowed to pick and keep either one for yourself. After you pick one, the guy offers to let you swap envelopes. The question is whether one should swap. For me it is as clear as a sunny day that it doesn't matter whether you swap, and this is easily provable by simple math. Somehow most folks (some at Ph.D. level!) get into very hairy calculations that suggest one should swap, and then even hairier ones about why one should not. Some mention subjectivity, but most don't.

The simplest solution on internet (joking… but seriously I haven’t found such simple one):

Let's denote the smaller sum in one of the envelopes as X; the larger sum in the other is therefore 2X. The expected value of the current envelope selection, before considering the swap, is then 1.5X. How did I get that? Very simply: we have probability 0.5 of holding the envelope with the larger sum, 2X, and probability 0.5 of holding the envelope with the smaller sum, X. So:

0.5*2X + 0.5*X = X + 0.5X = 1.5X

So far so good… let's now calculate the expected value if we swap. If we swap, we have the same 0.5 probability of holding the larger sum and the same 0.5 probability of holding the smaller sum. No need to repeat the calculation: you get exactly the same 1.5X expected value, meaning the swap doesn't matter. Or, if time has any value, it doesn't make sense to waste it swapping envelopes.

Do you see this as a hard problem? I bet a 10-year-old would do fine with it, especially if offered some reward.
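The argument above is also easy to check with a quick Monte Carlo simulation. This is my own sketch; the uniform distribution for X is an arbitrary choice, and any distribution would do:

```python
# Monte Carlo check: swapping envelopes does not change the expected payoff.
import random

random.seed(42)

def play(swap, trials=100000):
    """Average payoff over many rounds, either keeping or swapping."""
    total = 0.0
    for _ in range(trials):
        x = random.uniform(1, 100)   # smaller sum X; the other holds 2X
        envelopes = [x, 2 * x]
        random.shuffle(envelopes)    # we pick envelopes[0] at random
        total += envelopes[1] if swap else envelopes[0]
    return total / trials

keep, swap = play(swap=False), play(swap=True)
print(abs(keep - swap) / keep < 0.05)  # True: both averages are ~1.5 * E[X]
```

Both strategies converge to the same 1.5X average, exactly as the two-line calculation says.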

How come others get lost here?

The answer is that some try to apply Bayesian subjectivist probability theory, and then innocent folks follow along and get lost as well.

If you look at the Wikipedia article, for example, you will find a classic wrong solution that is allegedly "obvious", and then a link outside Wikipedia to a "correct" solution. The "correct" solution is a long post with a lot of formulae and usage of Bayes' theorem that, in the end, comes to the correct answer.

Well… I clearly see a flaw in the solution published on Wikipedia. That solution really looks artificial, but judging by the number of followers, it must seem obvious to many. The blunder is in the third line:

The other envelope may contain either 2A or A/2

By A they denote the sum in the envelope they are holding. The mistake is in "either 2A or A/2": it should be "either 2A or A". Then everything is fine and no "paradox" emerges in the end. The mistake stems from the fallacy of using the same name for two separate variables that are dependent but not equal, and then repeatedly confusing them because they share a name. Here is a "patch" to be applied to the reasoning published on Wikipedia:

1. I denote by A the amount in my selected envelope. => FINE
2. The probability that A is the smaller amount is 1/2, and that it is the larger amount is also 1/2. => FINE
3. The other envelope may contain either 2A or A/2. => INCORRECT: the variable A denotes different values in the two cases, so it is highly confusing to write it this way.

Let's explicitly consider the two cases instead of the implicit "either…or…":

In the first case, assume we are holding the smaller sum; then the other envelope contains 2A.
In the second case, assume we are holding the larger sum; then the other envelope contains A/2. However, this A is different from the A of the first case, so let's write it as _A/2.

Moreover, we know that _A is not just different from A but is exactly twice it, so
_A = 2A
Substituting, the expression "either 2A or A/2" must be written as "either 2A or _A/2", that is, "either 2A or A".

Then, when calculating the expected value, you substitute A instead of A/2: 0.5*2A + 0.5*A = 1.5A, the same expected value as before the swap.


That said, I have seen many people feel so "enlightened" by reading a complicated "correct" solution that they erroneously think, and argue, that one should not accept the following offer, believing it is equivalent to the above problem (well, not exactly this one, but I rephrased it for clarity):

A guy comes to you and says there are three envelopes; you are allowed to pick one and keep it. One envelope is red and two are white. All three contain money. One of the white envelopes contains twice as much as the red one; the other white one contains half of what the red one holds. The white envelopes are identical, and there is no way to know which contains double and which contains half. The question is which envelope you should choose: the red one or one of the white ones. And the answer is that you should pick one of the white envelopes! In fact, the calculation erroneously applied to the two-envelope problem is 100% correct for the three-envelope problem: on average you win by choosing a white envelope rather than the red one.
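For contrast, the same kind of simulation applied to this three-envelope variant (again my own sketch, with an arbitrary distribution for the red amount) confirms that here the white envelope really does win on average:

```python
# Three-envelope variant: the 0.5*2R + 0.5*R/2 = 1.25R calculation is
# correct here, so a white envelope beats the red one on average.
import random

random.seed(7)

def avg_payoff(pick_white, trials=100000):
    total = 0.0
    for _ in range(trials):
        r = random.uniform(1, 100)    # the red envelope's amount R
        whites = [2 * r, r / 2]       # one white doubles R, one halves it
        random.shuffle(whites)        # they are indistinguishable
        total += whites[0] if pick_white else r
    return total / trials

red, white = avg_payoff(False), avg_payoff(True)
print(white > red)  # True: white averages ~1.25R vs. R for red
```

This is exactly the calculation that is wrong for two envelopes (where the two "A"s are different variables) but right here, because the white amounts really are 2R and R/2 of the same known reference R.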

Debunking common misconceptions in SSD, particularly for analytics

1. SSD is NOT synonymous for flash memory.

First of all, let's settle on terms. SSD is best described as the concept of using semiconductor memory as a disk. There are two common cases: DRAM-as-disk and flash-as-disk. Flash memory is a semiconductor technology pretty similar to DRAM, just with a slightly different set of trade-offs.

Today there are few options for using flash memory in analytics beyond SSDs. Nevertheless, that should not suggest SSD is synonymous with flash memory: flash memory can be used in products other than SSDs, and an SSD can use non-flash technology, for example DRAM.

So the question is: do we have any option of using flash memory in a form other than flash-as-disk?

FusionIO is the only one, and it was always bold in claiming that its product is not an SSD but a totally new category of product, called ioMemory. Usually I dismiss such claims automatically, subconsciously, as the common practice of term obfuscation. However, in the case of FusionIO I found it to be a rare exception and technically true. At the hardware level there is no disk-related overhead in the FusionIO solution, and in my opinion FusionIO is the closest to the flash-on-motherboard vision among all the rest of the SSD manufacturers. That said, FusionIO succumbed to implementing a disk-oriented storage layer in software, because of the unavailability of any standards covering the flash-as-flash concept.

You can find more in-depth coverage of the new-dynasty SSD versus legacy SSD issue in a recent article by Zsolt Kerekes on StorageSearch.com, albeit I don't 100% agree with his categorization.

2. SSD DOESN’T provide more throughput than HDD.

The bragging about the performance density of SSDs can safely be dismissed. There is no problem in stacking HDDs together: as many as 48 of them can be put in a single 4U server box, providing an aggregate throughput of 4GB/sec for a fraction of the SSD price. The same goes for power, vibration, noise, etc. The extent to which these properties are superior on SSDs is uninteresting and does not justify the associated cost premium.

Further, for any amount of money, HDDs can provide significantly more IO throughput than the SSDs of any of today's vendors, on any workload: read, write, or combined. Not only that, but they will do so with an order of magnitude more capacity for your big data as an additional bonus. However, a few nuances are to be considered:

  • If data is accessed in random small chunks (say, 16KB), then SSD will provide significantly more throughput (maybe a factor of 100) than disk will. Reading in chunks of at least 1MB puts HDD back as the winner in the throughput game.
  • The flash memory itself has great potential to provide an order of magnitude more throughput than disks; the mechanical "gramophone" technology of disks cannot compete in agility with electrons. However, this potential throughput is hopelessly left unexploited by today's SSD controllers. How bad is it? Pretty bad: SSD controllers pass on less than 10% of the potential throughput. The reasons include flash-management complexity; cost constraints leading to small embedded DRAM buffers and computationally weak controllers; and, the main reason, that there are no standards for a 100x-faster disk, nor could legacy software keep up with multi-gigabyte throughputs. So SSD vendors don't bother, and are obsessed with the laughable idea of bankrupting HDD manufacturers, calling the technology disruptive, which it is not by definition. So we have a much more expensive disk replacement that is only barely more performant, throughput-wise, than a vanilla low-cost HDD array.
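The chunk-size claim in the first bullet follows from simple arithmetic. The figures below are my own assumed round numbers (roughly 8 ms per random access, 130 MB/s sustained transfer per HDD), not measurements:

```python
# Back-of-the-envelope model: effective HDD throughput when every chunk
# read costs one seek. Assumed figures, not measurements.
SEEK_S = 0.008      # average seek + rotational latency, seconds
HDD_RATE = 130e6    # sustained sequential transfer, bytes/sec

def hdd_throughput(chunk_bytes):
    """Effective bytes/sec: each chunk pays one seek plus transfer time."""
    return chunk_bytes / (SEEK_S + chunk_bytes / HDD_RATE)

for chunk in (16 * 1024, 1024 * 1024, 16 * 1024 * 1024):
    mb_s = hdd_throughput(chunk) / 1e6
    print("%6d KB chunks -> %6.1f MB/s per disk" % (chunk // 1024, mb_s))
```

With 16KB chunks the disk delivers only a couple of MB/s (seek time dominates), so an SSD wins by a wide margin; at 1MB and above, the disk is back near its sequential rate, and an array of them is hard to beat on cost per GB/s.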

3. An array of SSDs DOESN'T provide more useful IOPS than an array of disks.

While it is true that one SSD can easily match a disk array in IOPS, it should not suggest that an array of SSDs will provide a larger number of useful IOPS. The reason is prosaic: an array of disks already provides an abundance of IOPS, many times more than enough for any analytic application. Any additional IOPS are simply not needed, and the astronomical IOPS numbers of SSD arrays are a solution looking for a problem in the analytics industry.

4. SSDs are NOT disruptive to disks.

Well, if that is true, it is not according to Clayton Christensen's definition of "disruptiveness". As far as I remember, Christensen defines technology A as disruptive to technology B when all of the following hold true:

  • A is worse than B in quality and features
  • A is cheaper than B
  • A is affordable to a large number of new users to whom B is appealing but too costly.

None of the conditions above holds for the SSD-to-disk pair, so I'm puzzled how anyone can call SSDs disruptive to disks.

Again: I'm not claiming that SSDs or flash memory are not disruptive to any technology; I'm just claiming that SSDs are not disruptive to HDDs. In fact, I think flash memory IS disruptive to DRAM: all three conditions above hold for the flash-to-DRAM pair. Also, directly attached SSDs are highly disruptive to SAN.


Make no mistake, I'm a true believer in flash memory as a game-changer for analytics, just not in the form of a disk replacement. I'll explore in upcoming posts the ideas of where flash memory can make a change. I know I totally skipped quantitative proofs for all the claims above, but… well… let's leave that for the comment section.

Also, one of the best pieces of coverage of flash memory for analytics (and one not coming from a flash vendor) is by Curt Monash on the DBMS2 blog: http://www.dbms2.com/2010/01/31/flash-pcmsolid-state-memory-disk/

Google Percolator: MapReduce Demise?

Here are my early thoughts after quickly looking into Google Percolator and skimming the paper.

Major takeaway: massive transactional mutation of a tens-of-petabyte-scale dataset on a thousands-node cluster is possible!

MapReduce is still useful for distributed sorts of big data and a few other things; nevertheless, its "karma" has suffered a blow. Beforehand you could end any MapReduce dispute by saying "well… it works for Google"; nowadays, before you can say it, you will hear "well… it didn't work for Google". MapReduce is particularly criticized for 1) too-long latency, 2) being too wasteful, requiring a full rework of the whole tens-of-PB-scale dataset even if only a fraction of it has changed, and 3) its inability to support near-real-time data processing (meaning processing documents as they are crawled and updating the index appropriately). In short: welcome to the disillusionment stage of the MapReduce saga. Luckily, Hadoop is not only MapReduce. I'm convinced Hadoop will thrive and flourish beyond MapReduce, and that MapReduce, being an important big-data tool, will be widely used where it really makes sense rather than misused or abused in various ways. Aster Data and the remaining MPP startups can relax on the issue a bit.
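The wastefulness criticism (point 2) is easiest to see with a toy example. The sketch below is my own illustration, not Percolator's API: a tiny inverted index that can either be rebuilt from the whole corpus, MapReduce-style, or patched incrementally per crawled document. (A real system would also have to handle deletions and, as Percolator does, wrap updates in transactions.)

```python
# Toy contrast: full batch rebuild vs. incremental update of an inverted
# index. My own illustration, not Percolator's actual interface.
from collections import defaultdict

def build_index(corpus):
    """Batch, MapReduce-style: rework the entire corpus every time."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in text.split():
            index[word].add(doc_id)
    return index

def update_index(index, doc_id, text):
    """Incremental, Percolator-style: touch only the changed document."""
    for word in text.split():
        index[word].add(doc_id)

corpus = {1: "big data tools", 2: "map reduce tools"}
index = build_index(corpus)

# A new page is crawled: the incremental path touches three postings,
# while a batch rebuild would reprocess every document again.
corpus[3] = "percolator updates index"
update_index(index, 3, corpus[3])

print(index == build_index(corpus))  # True: same index, a fraction of the work
```

At web scale, the difference between reprocessing tens of petabytes per cycle and touching only the changed rows is exactly what closes the gap between weeks of indexing latency and continuous updating.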

Probably a topic for another post, but I think MapReduce is best leveraged as an ETL tool.

See also http://blog.tonybain.com/tony_bain/2010/09/was-stonebraker-right.html for another view on the issue. There are a few other posts already published on Percolator, but I haven't looked into them yet.

I'm very happy with my SVLC hypothesis. I think I have known it for a long time, but somehow only now, after putting it on paper, do I feel that reasoning about different analytics approaches has become easier; it is like having a map instead of visualizing one. So where is Percolator in the context of SVLC? If it is still considered analytics, Percolator is an SVC system, giving up latency for everything else, albeit to a much lesser degree than its predecessor MapReduce. That said, Percolator has a sizable part that is not analytics anymore but rather transaction processing, and transaction processing is not usefully modeled by my SVLC hypothesis. In summary: Percolator makes essentially the same trade-off as MapReduce, sacrificing latency for volume-cost-sophistication, but it is more temperate, more rounded, less radical.

Unfortunately, I haven't had enough time to enjoy the paper as it should be enjoyed, with easy weekend-style reading, so inaccuracies may have infiltrated the following:

  • Percolator is a big-data, ACID-compliant, transaction-processing, non-relational DBMS.
  • Percolator fits most NoSQL definitions and therefore is NoSQL.
  • Percolator continuously mutates the dataset (called the data corpus in the paper) with full transactional semantics, at sizes of tens of petabytes on thousands of nodes.
  • Percolator uses a message-queue-style approach for processing crawled data. Meaning, it processes crawled pages continuously as they arrive, updating the index database transactionally.
  • BEFORE Percolator: indexing was done in stages taking weeks. All crawled data was accumulated and staged first, then transformed pass after pass into an index; 100 passes were quoted in the paper, as I remember. When a cycle completed, a new one was initiated. A few weeks' latency between content being published and appearing in Google search results was considered too long in the Twitter age, so Google implemented some shortcuts allowing preliminary results to show in search before the cycle completed.
  • Percolator doesn't have a declarative query language.
  • No joins.
  • Self-stated ~3% single-node efficiency relative to a state-of-the-art DBMS on a single node. That's the price for handling (that is, transactionally mutating) a high-volume dataset… relatively cheaply. Kudos to Google for being so open about this and not exercising term obfuscation. On the other hand, they can afford it: they don't have to sell it tomorrow on the rather competitive NoSQL market ;)
  • A thread-per-transaction model, on heavily threaded many-core servers as I understand it.

Architecturally, it reminds me of MoM (Message-Oriented Middleware) with transactional queues and guaranteed delivery.

Definitely to be continued…

other Percolator blog posts: