OpenDremel update and Dremel vs. Tenzing

I wasn’t blogged for whole 2011 year… I’m not dead, quite on contrary, we were pretty active with OpenDremel project in 2011. First, we are renaming it to Dazo to avoid using a trademarked name and second, we did a good job implementing a secure generic execution engine and integrating it into OpenStack Swift. It also came out, that the engine is actually quite useful virtualization technology in itself and it could potentially deserve a better fate than being buried as OpenDremel subcomponent. So, we do plan to release it as independent project and are quite busy with that now, so the work on OpenDremel is all but stalled unfortunately. As for storage infrastructure we settled with OpenStack Swift, we falled in love with Swift from the day it was released and now after we have integrated ZeroVM into it we even like it even more. So right now, we have fully salable storage backend with the unique capability to run securely any arbitrary native code inside, close to data. Now, what’s left is to take our old Metaxa Query Compiler and integrate it with that backend and then after many iterations it would bake into something pretty similar to Google Dremel and BigQuery. Even better, it will always process data locally (not sure BigQuery does it now) and it will not be limited to BQL on nested records, but for any query on any data and with full multi-tenant semantics. That’s how interesting 2011 was…

It was a preamble now back to the main feature:

Google released a paper on Tenzing last year on VLDB. Tenzing is an SQL query-system implemented on top of MapReduce infrastructure and it can be thought as Google-way to do Hive and as always full of juicy details. There is already a quality post on this published and another one here. On top of that my additional takeways are:

1. It is possible to build MPP-grade system on top of MapReduce with relatively low-latency (10 seconds). However, it would requires quite a number of patches to MapReduce. Hive and Hadoop has certainly a lot to learn from Tenzing.

2. Even with Google version of a patched and leaner-than-Hadoop implementation of MapReduce getting it to Dremel latencies was not achievable. On other hand 10 seconds as minimal latency is not that bad and in same ball park as Netezza/Greenplum/Aster and other MPP gear.

3. As general Sawzall vs. Dremel vs. Tenzing comparison there is an nice youtube-datawarehousing presentation published. In fact, Dremel beats both of them on latency and if not only for limited expressive power of its query language it would end up as complete winner on all metrics considered there. Sawzall having imperative query language scores highest on the power metric. I guess when OpenDremel will be released it will be a unique combination of low-latency querying with the full expressive power of imperatively-augmented SQL.

4. Tenzing can query MySQL databases as many other popular data formats. What we witnessing here is that query-engines is being decoupled from storage engines. 10 years ago it was only the case for MySQL ecosystem and anyone who tried Oracle external table interface knows how friendly past DBMSes were to external data sources.  Dremel columnar encoding component was released internally in Google as separate ColumnIO storage engine. Then Google open-sourced their key-value LevelDB engine a-la Hadoop’s RCFiles. So we can learn here of emergence of multiple storage-engines working with multiple query engines, quite interesting phenomenon.

5. The query is compiled into native code (with LLVM) and this gave significant acceleration by factor from six to twelve. This means that SQL to native code compilation is a must for high-performance BigData query engines.