ImpalaToGo announcement

During my work in I saw repeating problem our customers face. The problem is how to get efficient SQL on big data in the cloud.

Lets see a typical case.

First case – daily logs of some nature arrived and stored in the S3. There is a need to do a few dozens reports over new data each day.  Daily data size is from dozens to hundreds of gigabytes. There is also need to do ad-hoc reports on the older data.

What are our options?

  • to run Hive over S3. It will work, but slow. Transer from S3 will be big part of execution time. Hive itself is slow. Data will be transferred between S3 and EC2 instances for each query. We can do things a bit better by manually scripting to keep recent data in the local HDFS.

  • to run efficient engine – Cloudera Impala.  Problem that it does not work with S3 directly. To go this way we will need to script moving data to local HDFS, and cleaning it up when it outgrow the local space. In other words – we will need to treat local HDFS as a cache manually.

  • To use Redshift. It is good option, but will also require managing moving data from s3. it is also proprietary and will lock us in the AWS. And it is expensive if we do not commit for a several years reservations.

Second case is extension of first one. We have some data accumulated in the S3. And our data science team wants to slice and dice some data to find some interesting XYZ. What should they do?  Data scientists usually not so IT savvy to build and manage own hadoop cluster. Sometimes they just want to get SQLs on subset of data with interactive speeds.

So today they have the following options

  • To ask management to build Hadoop cluster with Impala , or Spark and ask DevOps to script data transfer , etc

  • They ask management for expensive MPP database and maintain modest amount of data inside, to avoid hefty per terabyte licensing

  • To master usage of EMR.

Just as a summary – it is very easy to store data in S3, but it is not easy to get efficient SQL over it. I

We decided to do live a bit easier for people who wants to analyze data living in the s3.

How we did it?

Conceptually we see the solution in organizing fast local drives and remote S3 into one storage. This blog post is devoted to this idea –

In order to field-test this concept we built ImpalaToGo – modification of Cloudera Impala which is capable of working directly with S3, while using local drives as a cache.

In practice it means the following advantages for the above cases:

  • We can define external tables over S3 data, in the same way we used to do it in Hive.

  • When we repeat query on the same data, it runs 3-10 times faster since data is accessed from local drives, which are much faster than s3.

  • When local drives are getting full – least used data will be deleted automatically, to spare you to manage it.

If you want to try it – we have pre built AMI and documentation how to organize them into cluster. It is completely free and open source (

If you have any questions – please write to


5 thoughts on “ImpalaToGo announcement”

  1. Hi David,
    Great job! sounds awesome.
    Do you have any benchmarks of : ImpalaToGo on S3 VS EMR Hive on S3(which does not require data transfer).
    Would help me decide if its worth checking out VS some “traditional” SQL workflows

  2. Hi Daniel,

    I will publish benchmark soon. I do expect benchmark to be in favor of ImaplaToGo, but it will be caused by two distinct factors:
    – Faster engine
    – Caching schema.
    And I am still thinking how to do the test so it will be clear which of them plays the role.

    Please drop me some info about your workload and we will see how to check it better.

  3. Hi David,
    Interesting topic, what type of caching-schema is implemented in ImpalaToGo vs HIVE ?
    and how does it differ from just indexing your columns properly ?

    For example in HIVE:
    CREATE INDEX employees_index
    ON TABLE employees (country)
    AS ‘org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler’
    IDXPROPERTIES (‘creator = ‘me’, ‘created_at’ = ‘some_time’)
    IN TABLE employees_index_table
    PARTITIONED BY (country, name)
    COMMENT ‘Employees indexed by country and name.';


  4. Hi Idan,
    Sorry for the delay in answer – i missed Your comment waiting approval…
    In high level – ImpalaToGo implemented LRU caching schema. So if you hot data set fits Your cluster local drives – you will get local performance. When you focus on data moves – cache will reflect it.
    In a bit more details ImpalaToGo use consistent hash to distribute files among servers. By doing so we can support re size relatively efficiently.
    Regarding the hive – it does not have any caching. Hive built under assumption that data is local, so when Hive works with S3 – it will download data each time You query it…
    As for indexing – Impala does not have indexes yet, but next release will support indexes in Parquet files, thus enabling efficient data skipping.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>