Apache Drill Design Meeting

The MapR folks invited me to participate in the Apache Drill design meeting. The Meetup site indicates that 60 people participated, which sounds about right.

Tomer Shiran started the meeting with an overview of the Apache Drill project. Then I (Camuel here) presented our team's view of the Apache Drill architecture. Jason Frantz of MapR continued, touching on technical aspects in the follow-on discussion. After a pizza break, Julian Hyde presented his view on logical/physical query plan separation and suggested using the optiq framework for the DrQL optimizer.

Overall, my takeaways are as follows:

  1. There is a very healthy interest in interactive querying for BigData.
  2. There was not a single voice calling for making do with vanilla Hadoop for this task.
  3. There is a general consensus on a plurality of query languages and a plurality of data formats.
  4. There is a general consensus that the user should always be free to supply a manually authored physical query plan for execution, bypassing the optimizer altogether, as opposed to hardcore hinting.
  5. No one except me tried to challenge the “common logical query model” concept. Since Dremel has no real joins, no indexes, and only one data source with exactly one access path (a single full table scan), I cannot see the justification for the complexity of optimizers and the logical query model. Dremel is an antidote to all this.
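
To make points 4 and 5 concrete, here is a minimal sketch of what a manually authored physical plan over a single full table scan might look like. It is not Drill or Dremel code: the table `ROWS`, the operator names `scan`, `filter_rows` and `sum_by`, and the query itself are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory table; a real engine would read it from storage.
ROWS = [
    {"country": "US", "clicks": 3},
    {"country": "IL", "clicks": 5},
    {"country": "US", "clicks": 4},
    {"country": "DE", "clicks": 0},
]


def scan(rows):
    """Full table scan: yield every row exactly once, top to bottom."""
    for row in rows:
        yield row


def filter_rows(rows, predicate):
    """Pass through only the rows that satisfy the predicate."""
    for row in rows:
        if predicate(row):
            yield row


def sum_by(rows, key, value):
    """Group by `key` and sum the `value` column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def run_plan(rows):
    # The "physical plan" is just this fixed operator chain, written by hand.
    # No optimizer is consulted because there is nothing to choose between:
    # one data source, one access path.
    scanned = scan(rows)
    kept = filter_rows(scanned, lambda r: r["clicks"] > 0)
    return sum_by(kept, "country", "clicks")


if __name__ == "__main__":
    print(run_plan(ROWS))
```

The operator chain is the whole plan. With a single scan as the only access path, the ordering of operators is fixed by the query, which is the reason I question what an optimizer would add here.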

Thank you, MapR, for the Drill initiative, the great design meeting, and the invitation.

4 thoughts on “Apache Drill Design Meeting”

  1. No, BigQuery doesn’t support real joins. The joins that are supported are baby joins, where the recordset being joined must be small. In other words, such a query must be equivalent to first issuing all the select statements for the baby joins, then converting their results into look-up tables, and then scheduling the real grand full table scan.

  2. Hi Camuel
    When we say “a single full table scan”, do you mean all columns required in the query? Since Dremel uses columnar storage, it needs to fetch/scan only the desired columns. You may think it is obvious; I just want it to be explicit.


    1. Assuming it works on columnar storage, then of course the meaning of FTS is scanning only the projected columns. It also means not incurring any IO cost for non-projected columns.

      Dremel is designed to work only on a columnar store, but the Apache Drill backend is architected to be very flexible and allows scanning row stores too, and even any arbitrary data format, as long as you supply an appropriate parser or decoder for that format.

      The meaning of FTS is that the table is scanned from top to bottom. Here and there, optimizations are used to skip scanning obviously irrelevant parts, and skipping non-projected columns is one such optimization. From what I know of traditional RDBMSs, if you access more than ~5% of the table rows then FTS is usually the way to go.

      A non-FTS scan, which is not supported by Dremel/Drill, would access the index first and then make pinpoint random accesses to the main table. This is justified when only a tiny fraction of rows is touched (for example, one row out of millions, which is a typical case for OLTP).

      Hope this helps, and pardon me if I write obvious things here; maybe for some folks it will be helpful.
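
The difference between a full table scan over projected columns and an index-driven access, as described in the reply above, can be sketched in a few lines. This is a toy illustration under stated assumptions, not Dremel or Drill code: the column-per-list layout of `TABLE` and the function names `full_scan`, `build_index` and `index_lookup` are invented for the example.

```python
# Hypothetical columnar table: one list per column, all of equal length.
TABLE = {
    "user_id": [10, 11, 12, 13],
    "country": ["US", "IL", "US", "DE"],
    "clicks": [3, 5, 4, 0],
    "payload": ["blob-a", "blob-b", "blob-c", "blob-d"],  # wide column
}


def full_scan(table, projected, predicate):
    """Scan top to bottom, reading only the projected columns."""
    columns = [table[name] for name in projected]  # other columns untouched
    matches = []
    for values in zip(*columns):
        row = dict(zip(projected, values))
        if predicate(row):
            matches.append(row)
    return matches


def build_index(table, key):
    """Map each value of `key` to the row positions where it occurs."""
    index = {}
    for position, value in enumerate(table[key]):
        index.setdefault(value, []).append(position)
    return index


def index_lookup(table, index, value, projected):
    """Non-FTS access: consult the index, then fetch rows by position."""
    return [
        {name: table[name][position] for name in projected}
        for position in index.get(value, [])
    ]


if __name__ == "__main__":
    # Analytic query: touches many rows, never reads "payload".
    print(full_scan(TABLE, ["country", "clicks"], lambda r: r["clicks"] > 3))
    # OLTP-style query: one row located through the index.
    user_index = build_index(TABLE, "user_id")
    print(index_lookup(TABLE, user_index, 12, ["country", "clicks"]))
```

In `full_scan` the `payload` column is never read, which corresponds to paying no IO cost for non-projected columns. `index_lookup` is the pinpoint random-access path that pays off only when a tiny fraction of rows is touched.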
