Apache Drill Design Meeting

MapR folks invited me to participate in Apache Drill design meeting. Meetup site indicates that 60 people have been participated which sounds about right.

Tomer Shiran started the meeting with the overview of Apache Drill project. Then I (Camuel here) presented our team view for Apache Drill architecture. Jason Frantz of MapR continued touching technical aspects in follow on discussion. After a pizza break, Julian Hyde presented his view on logical/physical query plan separation and suggested using optiq framework for DrQL optimizer.

Overall my take away are as follows:

  1. There is very healthy interest in interactive querying for BigData.
  2. There were not even a single voice calling on making up vanilla Hadoop for this task.
  3. There is a general consensus on plurality of query languages and plurality of data formats.
  4. There is a general consensus that user always should be given freedom to supply manually authored physical query plan for execution, bypassing optimizer altogether and as opposed to hardcore hinting.
  5. Except me no one tried to challenge “common logical query model” concept. Since there are no real joins in Dremel and no indexes and only one data source with exactly one possibility – a single full table scan, I cannot see the justification for the complexity of optimizers and the logical query model. Dremel is an antidote concept to all this.

Thank you – MapR, for the Drill initiative, the great design meeting and the invitation.

Hadoop on OpenStack Swift: experiments

Some time has passed since our initial post on Hadoop over OpenStack Swift implementation. A couple of things have changed (Rackspace finally implemented range requests in their Cloudfiles library) others remained the same (still no built-in support for Hadoop in OpenStack / CloudFiles).

We got a lot of feedback and questions regarding the integration but not always had the time or patience to properly address them, sorry for that. But one of our readers, Zheng Xu, did a great job by putting together a slide deck on the exact procedure.

But there are still some points I need to address regarding the procedure he assembled there. It mostly boils down to Cloudfiles: although current Cloudfiles implementation has HTTP range support, our implementation uses our own code for the latter, therefore I really encourage ether using our Cloudfiles distribution (with patches) or changing our Hadoop code to use the new Rackspace one. Although the simple filesystem tasks will work as expected, any MapReduce job that works with big files will fail without correct HTTP range support.

I want to thank Zheng Xu for the effort and congratulate him on the success of his small experiment.

Apache Drill Progress

We are continuing our efforts in contributing our OpenDremel code to Apache Drill project and look forward to be active with it right after that.

Right now the efforts are being put into our ANTLR-based parser, we want to make it work with the new grammar of BigQuery language. That should be done within a few days, the parser will be committed to the new Drill repository as a first phase of the OpenDremel-Drill merge.

Next on, we plan to refactor and contribute the Semantic Analyzer, which processes the output of the parser into an intermediate form, resolving references and rewriting (flattening) the query into single full table scan operation. That is expected within a week or two, it would depend when the Drill architecture doc will be published. We still don’t know what will be the schema language/format. Will it be Protobuf? Avro? OpenDremel supports Avro right now and has an initial support for Protobuf.

The final phase of OpenDremel – Drill merge, will be the contribution of the code generator based on the Apache Velocity templates. We have two sets of templates for now: one is a Java-based and executed with Janino executor and second one uses C/asm and executed with ZeroVM executor.

Everyone who wishes to help is welcome. The OpenDremel code resides in its usual Google code repo – http://code.google.com/p/dremel/. BE SURE TO LOCATE AND USE REPO COMBO BOX on the upper part of the page.

We probably will use https://github.com/ApacheDrill repo as a staging area or the Apache git repo directly, it all depends on what will be proposed by Ted Dunning – the Apache Drill Champion.

We also continue work on our generic execution backend built on top of OpenStack Swift and integrated with ZeroVM. We are contributing to both projects here.

We look ahead to Apache Drill with pluggable frontends and pluggable backends. So it would be able to run on top of a toy single-JVM Janino backend, or under YARN management on HDFS with Janino or ZeroVM backend, or even on a Zwift backend (that’s how we codenamed OpenStack Swift + ZeroVM combo).

On other hand the frontends will be pluggable too, so, in the future, support for new languages such as Apache Pig or Apache Hive can be added easily. Another option would be to create single frontend with pluggable language handlers, that would allow us to embed functionality from other projects such as Apache Mahout or R.