Futility of “tooling” a proprietary cloud.

I’v been pitched by a lot of entrepreneurs trying to make a better-than-original “tooling” for a proprietary cloud, particularly for AWS. Ain’t the attempt futile from the beginning? Amazon is smart, innovative and working hard to make its cloud offering comprehensive and has much larger arsenal to overdo anyone who dare to compete on their own turf. That is their party, the invitation cannot be taken for granted.

Let’s take NoSQL data-stores and DBMS vendors as examples. There are VC-backed companies out-there which are exclusively focused on outdoing Amazon with running MySQL/NoSQL on their own cloud, Xeround comes to mind, but many others also hoping their product will catch fire on EC2.

Well, if just single branding and plain convenience is not enough , how about these two exclusive and “unfair” competative advatnages in Amazon arselnal:

  • [My unverified assumption is] that DynamoDB has storage integrated into its fabric whenever all the rest must use slower EBS.
  • Not it is just integrated, but as announced by Amazon, it uses SSD-backed storage. SSD-backed storage is not available, as of today, for DynamoDB competitors running on AWS. So competitors must continue to use ordinary EBS. That is in fact a double-kick, first the mere fact of using different hardware for a competitive advantage and second, the announcment itself as catalyst to trigger migration.

So, future EMR, may also have an integrated storage as well as other hardware optimization, making Hadoop more efficient on AWS and good if so.  Same goes to RDS and other current and future PaaS-related services.

Do I accuse Amazon on wrongdoing? Of course not! They brought the cloud to the main-street while others were only talking about it they and made large-scale computing affordable to all and continue dropping prices passing their economies-of-scale savings on customers and also keeps optimizing and enhancing their infrastructure constantly, and were good to their shareholders also. However, as any proprietary and monopolistic platform,  they do hinder some outside-of-Amazon innovation. No matter how good they are, we don’t want only one company in the world doing cloud-infrastructure stuff for the rest. That’s why, OpenStack are so extremely important for the industry. If OpenStack will be widely adopted then infrastructural and “tooling” kind of innovation could go directly into OpenStack for greater good and fairer monetization model for the author.









OpenDremel update and Dremel vs. Tenzing

I wasn’t blogged for whole 2011 year… I’m not dead, quite on contrary, we were pretty active with OpenDremel project in 2011. First, we are renaming it to Dazo to avoid using a trademarked name and second, we did a good job implementing a secure generic execution engine and integrating it into OpenStack Swift. It also came out, that the engine is actually quite useful virtualization technology in itself and it could potentially deserve a better fate than being buried as OpenDremel subcomponent. So, we do plan to release it as independent project and are quite busy with that now, so the work on OpenDremel is all but stalled unfortunately. As for storage infrastructure we settled with OpenStack Swift, we falled in love with Swift from the day it was released and now after we have integrated ZeroVM into it we even like it even more. So right now, we have fully salable storage backend with the unique capability to run securely any arbitrary native code inside, close to data. Now, what’s left is to take our old Metaxa Query Compiler and integrate it with that backend and then after many iterations it would bake into something pretty similar to Google Dremel and BigQuery. Even better, it will always process data locally (not sure BigQuery does it now) and it will not be limited to BQL on nested records, but for any query on any data and with full multi-tenant semantics. That’s how interesting 2011 was…

It was a preamble now back to the main feature:

Google released a paper on Tenzing last year on VLDB. Tenzing is an SQL query-system implemented on top of MapReduce infrastructure and it can be thought as Google-way to do Hive and as always full of juicy details. There is already a quality post on this published and another one here. On top of that my additional takeways are:

1. It is possible to build MPP-grade system on top of MapReduce with relatively low-latency (10 seconds). However, it would requires quite a number of patches to MapReduce. Hive and Hadoop has certainly a lot to learn from Tenzing.

2. Even with Google version of a patched and leaner-than-Hadoop implementation of MapReduce getting it to Dremel latencies was not achievable. On other hand 10 seconds as minimal latency is not that bad and in same ball park as Netezza/Greenplum/Aster and other MPP gear.

3. As general Sawzall vs. Dremel vs. Tenzing comparison there is an nice youtube-datawarehousing presentation published. In fact, Dremel beats both of them on latency and if not only for limited expressive power of its query language it would end up as complete winner on all metrics considered there. Sawzall having imperative query language scores highest on the power metric. I guess when OpenDremel will be released it will be a unique combination of low-latency querying with the full expressive power of imperatively-augmented SQL.

4. Tenzing can query MySQL databases as many other popular data formats. What we witnessing here is that query-engines is being decoupled from storage engines. 10 years ago it was only the case for MySQL ecosystem and anyone who tried Oracle external table interface knows how friendly past DBMSes were to external data sources.  Dremel columnar encoding component was released internally in Google as separate ColumnIO storage engine. Then Google open-sourced their key-value LevelDB engine a-la Hadoop’s RCFiles. So we can learn here of emergence of multiple storage-engines working with multiple query engines, quite interesting phenomenon.

5. The query is compiled into native code (with LLVM) and this gave significant acceleration by factor from six to twelve. This means that SQL to native code compilation is a must for high-performance BigData query engines.