Multi-engine data processing

There is a lot of criticism of HDFS – it is slow, it has SPOF, it is read only, etc. All of the above is true. Systems built on top of a local file system are more efficient than those built on top of HDFS (like Cassandra vs. HBase). That is also true.
However, HDFS has brought us a new opportunity. We have a platform to share data among different database engines. Let me elaborate.
Today there is a number of database engines built on top of HDFS. Such as Hive, Pig, Impala, Presto, etc. Many Hadoop users develop their own partially generic map-reduce jobs which could be called specialized database engines.
Before we had such a common platform, we had to select database engine for our analytic and ETL needs. It was a tough choice and it was lame. Selected engine owns data by storing it in its own format. We were forced to write some custom data operations not in the language of our choice, but
in database’s internal language – like PL-SQL, Transact-SQL, etc… When our processing became less relational, we wrote a lot of logic in database internal language and had to pay database licence in order to run our own code…
The first spring bird of change was Hive’s external tables. At first glance, it was a minor feature – instead of letting Hive select the place and the format for the data, we got the chance to do this job for it. Conceptually, it was a big change – we are beginning to own our data. Now it can be produced by our own MR jobs, it can be processed in any way we like aside of Hive. And at the same time we can use Hive to query these data.
Then came Impala, and instead of defining its own metadata, it reuses the data from Hive and thus inherits those wonderful external tables. So we have two engines capable of processing the same data. As far as I remember, for the first time.
Why is it so important? Because we are no longer restricted by the given engine problems and bugs. Let’s say Impala is perfect at running 5 out of our 6 queries, but fails at the last one. What do we do? Right, we can run 5 queries using Impala and the last one using Hive, which is slower, but is a much less demanding creature than Impala.
Then came Spark with its real time capabilities, Presto from the Facebook, and, I am sure, many more are to come. How to select the right one?
So, in my opinion, we should stop selecting, but rather combine them. We should shift to single data – multiple engines paradigm. Thanks to the gods of the free open source software it does not involve licensing costs.
What makes it possible: common storage, open data formats and shared metadata. HDFS is a storage, data formats are CSV, Avro, Parquet, ORC, etc., and shared metadata is Hive metastore and HCatalog as its successor.
We do not have to do a revolution, but I think we should prefer engines running on top of common storage (like HDFS) data formats, which are understandable to many engines and database engines that do their best at data processing and give us freedom of data placement and formats.

Hadoop on OpenStack Swift: experiments

Some time has passed since our initial post on Hadoop over OpenStack Swift implementation. A couple of things have changed (Rackspace finally implemented range requests in their Cloudfiles library) others remained the same (still no built-in support for Hadoop in OpenStack / CloudFiles).

We got a lot of feedback and questions regarding the integration but not always had the time or patience to properly address them, sorry for that. But one of our readers, Zheng Xu, did a great job by putting together a slide deck on the exact procedure.

But there are still some points I need to address regarding the procedure he assembled there. It mostly boils down to Cloudfiles: although current Cloudfiles implementation has HTTP range support, our implementation uses our own code for the latter, therefore I really encourage ether using our Cloudfiles distribution (with patches) or changing our Hadoop code to use the new Rackspace one. Although the simple filesystem tasks will work as expected, any MapReduce job that works with big files will fail without correct HTTP range support.

I want to thank Zheng Xu for the effort and congratulate him on the success of his small experiment.

Apache Drill Progress

We are continuing our efforts in contributing our OpenDremel code to Apache Drill project and look forward to be active with it right after that.

Right now the efforts are being put into our ANTLR-based parser, we want to make it work with the new grammar of BigQuery language. That should be done within a few days, the parser will be committed to the new Drill repository as a first phase of the OpenDremel-Drill merge.

Next on, we plan to refactor and contribute the Semantic Analyzer, which processes the output of the parser into an intermediate form, resolving references and rewriting (flattening) the query into single full table scan operation. That is expected within a week or two, it would depend when the Drill architecture doc will be published. We still don’t know what will be the schema language/format. Will it be Protobuf? Avro? OpenDremel supports Avro right now and has an initial support for Protobuf.

The final phase of OpenDremel – Drill merge, will be the contribution of the code generator based on the Apache Velocity templates. We have two sets of templates for now: one is a Java-based and executed with Janino executor and second one uses C/asm and executed with ZeroVM executor.

Everyone who wishes to help is welcome. The OpenDremel code resides in its usual Google code repo – BE SURE TO LOCATE AND USE REPO COMBO BOX on the upper part of the page.

We probably will use repo as a staging area or the Apache git repo directly, it all depends on what will be proposed by Ted Dunning – the Apache Drill Champion.

We also continue work on our generic execution backend built on top of OpenStack Swift and integrated with ZeroVM. We are contributing to both projects here.

We look ahead to Apache Drill with pluggable frontends and pluggable backends. So it would be able to run on top of a toy single-JVM Janino backend, or under YARN management on HDFS with Janino or ZeroVM backend, or even on a Zwift backend (that’s how we codenamed OpenStack Swift + ZeroVM combo).

On other hand the frontends will be pluggable too, so, in the future, support for new languages such as Apache Pig or Apache Hive can be added easily. Another option would be to create single frontend with pluggable language handlers, that would allow us to embed functionality from other projects such as Apache Mahout or R.

OpenDremel update and Dremel vs. Tenzing

I wasn’t blogged for whole 2011 year… I’m not dead, quite on contrary, we were pretty active with OpenDremel project in 2011. First, we are renaming it to Dazo to avoid using a trademarked name and second, we did a good job implementing a secure generic execution engine and integrating it into OpenStack Swift. It also came out, that the engine is actually quite useful virtualization technology in itself and it could potentially deserve a better fate than being buried as OpenDremel subcomponent. So, we do plan to release it as independent project and are quite busy with that now, so the work on OpenDremel is all but stalled unfortunately. As for storage infrastructure we settled with OpenStack Swift, we falled in love with Swift from the day it was released and now after we have integrated ZeroVM into it we even like it even more. So right now, we have fully salable storage backend with the unique capability to run securely any arbitrary native code inside, close to data. Now, what’s left is to take our old Metaxa Query Compiler and integrate it with that backend and then after many iterations it would bake into something pretty similar to Google Dremel and BigQuery. Even better, it will always process data locally (not sure BigQuery does it now) and it will not be limited to BQL on nested records, but for any query on any data and with full multi-tenant semantics. That’s how interesting 2011 was…

It was a preamble now back to the main feature:

Google released a paper on Tenzing last year on VLDB. Tenzing is an SQL query-system implemented on top of MapReduce infrastructure and it can be thought as Google-way to do Hive and as always full of juicy details. There is already a quality post on this published and another one here. On top of that my additional takeways are:

1. It is possible to build MPP-grade system on top of MapReduce with relatively low-latency (10 seconds). However, it would requires quite a number of patches to MapReduce. Hive and Hadoop has certainly a lot to learn from Tenzing.

2. Even with Google version of a patched and leaner-than-Hadoop implementation of MapReduce getting it to Dremel latencies was not achievable. On other hand 10 seconds as minimal latency is not that bad and in same ball park as Netezza/Greenplum/Aster and other MPP gear.

3. As general Sawzall vs. Dremel vs. Tenzing comparison there is an nice youtube-datawarehousing presentation published. In fact, Dremel beats both of them on latency and if not only for limited expressive power of its query language it would end up as complete winner on all metrics considered there. Sawzall having imperative query language scores highest on the power metric. I guess when OpenDremel will be released it will be a unique combination of low-latency querying with the full expressive power of imperatively-augmented SQL.

4. Tenzing can query MySQL databases as many other popular data formats. What we witnessing here is that query-engines is being decoupled from storage engines. 10 years ago it was only the case for MySQL ecosystem and anyone who tried Oracle external table interface knows how friendly past DBMSes were to external data sources.  Dremel columnar encoding component was released internally in Google as separate ColumnIO storage engine. Then Google open-sourced their key-value LevelDB engine a-la Hadoop’s RCFiles. So we can learn here of emergence of multiple storage-engines working with multiple query engines, quite interesting phenomenon.

5. The query is compiled into native code (with LLVM) and this gave significant acceleration by factor from six to twelve. This means that SQL to native code compilation is a must for high-performance BigData query engines.