Apache Drill Design Meeting

The MapR folks invited me to participate in the Apache Drill design meeting. The Meetup site indicates that 60 people participated, which sounds about right.

Tomer Shiran started the meeting with an overview of the Apache Drill project. Then I (Camuel here) presented our team's view of the Apache Drill architecture. Jason Frantz of MapR continued, touching on technical aspects in the follow-on discussion. After a pizza break, Julian Hyde presented his view on logical/physical query plan separation and suggested using the optiq framework for the DrQL optimizer.

Overall, my takeaways are as follows:

  1. There is a very healthy interest in interactive querying for BigData.
  2. There was not a single voice calling for the use of vanilla Hadoop for this task.
  3. There is a general consensus on a plurality of query languages and a plurality of data formats.
  4. There is a general consensus that the user should always be given the freedom to supply a manually authored physical query plan for execution, bypassing the optimizer altogether, as opposed to hardcore hinting.
  5. Except for me, no one tried to challenge the “common logical query model” concept. Since there are no real joins in Dremel, no indexes, and only one data source with exactly one access path (a single full table scan), I cannot see the justification for the complexity of optimizers and the logical query model. Dremel is an antidote concept to all this.

Thank you, MapR, for the Drill initiative, the great design meeting, and the invitation.

Apache Drill Progress

We are continuing our efforts to contribute our OpenDremel code to the Apache Drill project and look forward to being active in it right after that.

Right now the effort is going into our ANTLR-based parser; we want to make it work with the new grammar of the BigQuery language. That should be done within a few days, and the parser will be committed to the new Drill repository as the first phase of the OpenDremel-Drill merge.

Next, we plan to refactor and contribute the Semantic Analyzer, which processes the output of the parser into an intermediate form, resolving references and rewriting (flattening) the query into a single full table scan operation. That is expected within a week or two, depending on when the Drill architecture doc is published. We still don’t know what the schema language/format will be. Will it be Protobuf? Avro? OpenDremel supports Avro right now and has initial support for Protobuf.

The final phase of the OpenDremel-Drill merge will be the contribution of the code generator based on Apache Velocity templates. We have two sets of templates for now: one is Java-based and is executed with the Janino executor, and the second uses C/asm and is executed with the ZeroVM executor.
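To make the template-driven approach concrete, here is a minimal sketch of the idea. It is not our actual generator: it swaps Velocity and Janino for Python's standard `string.Template` and `exec`, and every name in it (`SCAN_TEMPLATE`, `generate_source`, `compile_query`) is hypothetical.

```python
from string import Template

# Hypothetical template for a single full table scan with a filter and a
# projection. A real generator would use Velocity templates that emit Java or
# C; Python source is used here only to keep the sketch self-contained.
SCAN_TEMPLATE = Template('''\
def run_query(rows):
    out = []
    for row in rows:
        if $predicate:
            out.append(($projection,))
    return out
''')


def generate_source(columns, predicate):
    """Fill the template with a projection list and a predicate expression."""
    projection = ", ".join(f"row[{name!r}]" for name in columns)
    return SCAN_TEMPLATE.substitute(predicate=predicate, projection=projection)


def compile_query(columns, predicate):
    """Turn the generated source into a callable (the 'executor' step)."""
    namespace = {}
    # Assumption: the predicate text is trusted; exec runs it as-is.
    exec(generate_source(columns, predicate), namespace)
    return namespace["run_query"]


rows = [{"url": "a", "hits": 5}, {"url": "b", "hits": 20}]
query = compile_query(["url", "hits"], "row['hits'] > 10")
result = query(rows)
```

The point of the split is that the template fixes the shape of the scan loop, while the query only contributes the predicate and the projection, so a different executor needs only a different set of templates.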

Everyone who wishes to help is welcome. The OpenDremel code resides in its usual Google Code repo: http://code.google.com/p/dremel/. BE SURE TO LOCATE AND USE THE REPO COMBO BOX on the upper part of the page.

We will probably use the https://github.com/ApacheDrill repo as a staging area, or the Apache git repo directly; it all depends on what is proposed by Ted Dunning, the Apache Drill Champion.

We also continue work on our generic execution backend built on top of OpenStack Swift and integrated with ZeroVM. We are contributing to both projects here.

We look forward to an Apache Drill with pluggable frontends and pluggable backends, so that it would be able to run on top of a toy single-JVM Janino backend, or under YARN management on HDFS with a Janino or ZeroVM backend, or even on a Zwift backend (that’s how we codenamed the OpenStack Swift + ZeroVM combo).

On the other hand, the frontends will be pluggable too, so that in the future support for new languages such as Apache Pig or Apache Hive can be added easily. Another option would be to create a single frontend with pluggable language handlers, which would allow us to embed functionality from other projects such as Apache Mahout or R.
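A rough sketch of what pluggable frontends and backends could look like is below. It is only an illustration under assumptions: the registries, the `run` function, and the `"toy-select"` and `"in-memory"` plugins are all hypothetical names, not anything from Drill or OpenDremel, and the "plan" is reduced to a list of columns.

```python
from typing import Callable, Dict, List

# Hypothetical types: a "plan" is just the columns to read in one full scan.
Rows = List[dict]
Plan = List[str]

FRONTENDS: Dict[str, Callable[[str], Plan]] = {}
BACKENDS: Dict[str, Callable[[Plan, Rows], Rows]] = {}


def register(registry, name):
    """Decorator that adds a plugin to the given registry under `name`."""
    def decorator(fn):
        registry[name] = fn
        return fn
    return decorator


@register(FRONTENDS, "toy-select")
def parse_toy_select(text: str) -> Plan:
    # Toy language: accepts only "SELECT col1, col2, ...".
    keyword, _, columns = text.strip().partition(" ")
    if keyword.upper() != "SELECT" or not columns.strip():
        raise ValueError("expected: SELECT col1, col2, ...")
    return [name.strip() for name in columns.split(",")]


@register(BACKENDS, "in-memory")
def scan_in_memory(plan: Plan, rows: Rows) -> Rows:
    # Toy backend: one pass over the rows, keeping only the planned columns.
    return [{name: row[name] for name in plan} for row in rows]


def run(language: str, backend: str, text: str, rows: Rows) -> Rows:
    """Pick a frontend and a backend by name and run the query end to end."""
    plan = FRONTENDS[language](text)
    return BACKENDS[backend](plan, rows)


data = [{"url": "a", "hits": 1, "agent": "x"}]
answer = run("toy-select", "in-memory", "SELECT url, hits", data)
```

Adding a new language or a new execution target then means registering one more function; nothing in `run` changes.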

Apache Drill

We are no longer alone in implementing Google Dremel and BigQuery technology. A proposal was recently made to the Apache Foundation suggesting a similar project. Moreover, Ted Dunning kindly invited us to take part in the project.

The project is just starting now; there is no source code yet and not even a consensus design. So we sat together this evening and wrote a proposed design for Apache Drill. We have been working for about two years on a Dremel and BigQuery implementation. It has been a fascinating journey, we have learned quite a lot, and we would be more than happy to share our experience and accumulated knowledge.

All our code (OpenDremel/Dazo/ZeroVM) has been under the Apache License from the beginning and uses several Apache technologies, from Avro to Velocity. Apache seems to be the best home for the Drill project, and we are looking forward to contributing to it.

OpenDremel update and Dremel vs. Tenzing

I haven’t blogged for the whole of 2011… I’m not dead; quite the contrary, we were pretty active with the OpenDremel project in 2011. First, we are renaming it to Dazo to avoid using a trademarked name, and second, we did a good job implementing a secure generic execution engine and integrating it into OpenStack Swift. It also turned out that the engine is actually quite a useful virtualization technology in itself, and it could potentially deserve a better fate than being buried as an OpenDremel subcomponent. So we plan to release it as an independent project and are quite busy with that now, which means the work on OpenDremel is all but stalled, unfortunately. As for storage infrastructure, we settled on OpenStack Swift; we fell in love with Swift from the day it was released, and now that we have integrated ZeroVM into it we like it even more. So right now we have a fully scalable storage backend with the unique capability to securely run arbitrary native code inside, close to the data. Now, what’s left is to take our old Metaxa Query Compiler and integrate it with that backend, and then after many iterations it would bake into something pretty similar to Google Dremel and BigQuery. Even better, it will always process data locally (not sure BigQuery does that now), and it will not be limited to BQL on nested records but will handle any query on any data, with full multi-tenant semantics. That’s how interesting 2011 was…

That was the preamble; now back to the main feature:

Google released a paper on Tenzing last year at VLDB. Tenzing is an SQL query system implemented on top of the MapReduce infrastructure; it can be thought of as the Google way to do Hive and, as always, it is full of juicy details. There is already a quality post on this published and another one here. On top of that, my additional takeaways are:

1. It is possible to build an MPP-grade system on top of MapReduce with relatively low latency (10 seconds). However, it would require quite a number of patches to MapReduce. Hive and Hadoop certainly have a lot to learn from Tenzing.

2. Even with Google’s patched and leaner-than-Hadoop implementation of MapReduce, getting it to Dremel latencies was not achievable. On the other hand, 10 seconds as a minimal latency is not that bad and is in the same ballpark as Netezza/Greenplum/Aster and other MPP gear.

3. As a general Sawzall vs. Dremel vs. Tenzing comparison, there is a nice youtube-datawarehousing presentation published. In fact, Dremel beats both of the others on latency, and if not for the limited expressive power of its query language it would end up as the complete winner on all metrics considered there. Sawzall, having an imperative query language, scores highest on the power metric. I guess when OpenDremel is released it will be a unique combination of low-latency querying with the full expressive power of imperatively-augmented SQL.

4. Tenzing can query MySQL databases as well as many other popular data formats. What we are witnessing here is that query engines are being decoupled from storage engines. 10 years ago this was only the case for the MySQL ecosystem, and anyone who tried the Oracle external table interface knows how friendly past DBMSes were to external data sources. Dremel’s columnar encoding component was released internally at Google as the separate ColumnIO storage engine. Then Google open-sourced their key-value LevelDB engine a-la Hadoop’s RCFiles. So what we can observe here is the emergence of multiple storage engines working with multiple query engines, quite an interesting phenomenon.

5. The query is compiled into native code (with LLVM), and this gave a significant acceleration, by a factor of six to twelve. This means that SQL-to-native-code compilation is a must for high-performance BigData query engines.
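To illustrate the difference between interpreting a query expression and compiling it, here is a small sketch. It is only an analogy under stated assumptions: Tenzing uses LLVM to emit native code, while this example compiles to a Python function with `eval`, and the tuple-based expression format and the function names are made up for the illustration. It shows the structural difference only and makes no claim about speedups.

```python
import operator

# Hypothetical expression tree: ("col", name) | ("const", value) | (op, left, right)
OPS = {"+": operator.add, "*": operator.mul, ">": operator.gt}


def interpret(expr, row):
    """Walk the tree for every row: dispatch cost is paid on each evaluation."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    return OPS[kind](interpret(expr[1], row), interpret(expr[2], row))


def to_source(expr):
    """Flatten the tree into a single source expression, once per query."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def compile_expr(expr):
    """Compile the tree once; rows then run the generated function directly."""
    # Assumption: the tree comes from a trusted parser, so eval is acceptable.
    return eval(f"lambda row: {to_source(expr)}")


expr = (">", ("*", ("col", "price"), ("col", "qty")), ("const", 100))
row = {"price": 30, "qty": 4}
compiled = compile_expr(expr)
same = interpret(expr, row) == compiled(row)
```

In the interpreted version the tree is walked again for every row, while in the compiled version that walk happens once per query, which is the effect that native code generation exploits at much larger scale.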