ImpalaToGo announcement

During my work at BigDataCraft.com I kept seeing the same problem our customers face: how to get efficient SQL on big data in the cloud.

Let's look at a typical case.

First case: daily logs of some kind arrive and are stored in S3. A few dozen reports have to be run over the new data each day. Daily data size ranges from dozens to hundreds of gigabytes. There is also a need to run ad-hoc reports on the older data.

What are our options?

  • To run Hive over S3. It will work, but slowly. Transfer from S3 will be a big part of the execution time, and Hive itself is slow. Data will be transferred between S3 and EC2 instances for each query. We can do a bit better by manually scripting to keep recent data in local HDFS.

  • To run an efficient engine – Cloudera Impala. The problem is that it does not work with S3 directly. To go this way we would need to script moving data to local HDFS and cleaning it up when it outgrows the local space. In other words, we would need to treat local HDFS as a cache manually.

  • To use Redshift. It is a good option, but it will also require managing data movement from S3. It is also proprietary and will lock us into AWS. And it is expensive unless we commit to multi-year reservations.

The second case is an extension of the first one. We have some data accumulated in S3, and our data science team wants to slice and dice it to find something interesting. What should they do? Data scientists are usually not IT-savvy enough to build and manage their own Hadoop cluster. Sometimes they just want to run SQL on a subset of the data at interactive speed.

So today they have the following options:

  • To ask management to build a Hadoop cluster with Impala or Spark, and ask DevOps to script data transfer, etc.

  • To ask management for an expensive MPP database and keep only a modest amount of data in it, to avoid hefty per-terabyte licensing.

  • To master usage of EMR.

To summarize: it is very easy to store data in S3, but it is not easy to get efficient SQL over it.

We decided to make life a bit easier for people who want to analyze data living in S3.

How did we do it?

Conceptually, we see the solution in organizing fast local drives and remote S3 into a single storage tier. This blog post is devoted to the idea: http://bigdatacraft.com/archives/470

To field-test this concept we built ImpalaToGo – a modification of Cloudera Impala that works directly with S3 while using local drives as a cache.

In practice this means the following advantages for the cases above:

  • We can define external tables over S3 data, in the same way we are used to doing it in Hive (see the sketch after this list).

  • When we repeat a query on the same data, it runs 3–10 times faster, since the data is read from local drives, which are much faster than S3.

  • When the local drives fill up, the least-used data is evicted automatically, so you don't have to manage it yourself.
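
To make this concrete, here is a rough sketch of what defining and querying such a table could look like from Python through the impyla DBAPI client. The host, schema, bucket and the exact LOCATION URI scheme are placeholders and assumptions for illustration, not ImpalaToGo-specific documentation.

    # Hypothetical sketch: an external table over S3 data, queried through a
    # DBAPI connection (impyla). Host, bucket, schema and LOCATION scheme are
    # assumptions for illustration only.
    from impala.dbapi import connect

    conn = connect(host='impala-coordinator.example.com', port=21050)
    cur = conn.cursor()

    # Hive-style external table DDL over data already sitting in S3.
    cur.execute("""
        CREATE EXTERNAL TABLE IF NOT EXISTS daily_logs (
            event_time STRING,
            user_id    BIGINT,
            url        STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
        LOCATION 's3n://my-logs-bucket/daily/'
    """)

    # The first run pulls the data from S3 and caches it on local drives;
    # repeated queries over the same data are served from the cache.
    cur.execute("SELECT url, COUNT(*) AS hits FROM daily_logs "
                "GROUP BY url ORDER BY hits DESC LIMIT 10")
    for row in cur.fetchall():
        print(row)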

If you want to try it, we have a pre-built AMI and documentation on how to organize instances into a cluster. It is completely free and open source (https://github.com/ImpalaToGo/ImpalaToGo/wiki/How-to-try).

If you have any questions, please write to david@bigdatacraft.com.

 

Network virtualization for the Cloud: Open vSwitch study

In the face of the current reality of ten-thousand-node data centers and all the BigData jazz, it seems the network guys were slightly forgotten. We have enough hardware virtualization solutions, but until now the network was left on the outskirts of the cloud hype. Let's see what we can use right now and whether it will get better in the future.

When people talk about network virtualization nowadays, one name immediately springs to mind: Nicira. They invented OpenFlow and Open vSwitch (OVS) and… were acquired by VMware.

Why Nicira? They essentially designed the current state of network virtualization. OpenFlow is implemented in physical hardware and OVS is used by a lot of people to drive the software network stack in virtualized environments.

But is it any good? Let's see. If you open the OpenFlow specification, it looks simple: cut the hardware's involvement at the Ethernet level and implement all other features in software. We essentially write a program (a handler) that matches some fields in a packet and acts according to simple rules: forward to a port, drop, or pass to another handler. But then, how do you install these handlers inside the switch? The solution is also not that complicated: you write another, more complex piece of software that runs on something generic (like a PC). It chooses handlers for particular flows by issuing commands to the switch; when the switch encounters something it has no handler for, it passes it to this PC (the controller), and the controller either chooses a new handler for the switch or processes the packet internally.
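
To make the match-action idea concrete, here is a tiny toy model in Python (not real OpenFlow code): a switch keeps a flow table of handlers, and on a table miss it punts the packet to a controller, which installs a new handler. The field names and the policy are made up for illustration.

    # Toy model of the OpenFlow idea: a switch matches packet fields against
    # installed handlers; on a miss it asks the controller, which installs a
    # new handler (flow rule). Field names and rules are illustrative only.

    def matches(rule, packet):
        """A rule matches if every field it specifies equals the packet's field."""
        return all(packet.get(k) == v for k, v in rule.items())

    class Switch:
        def __init__(self, controller):
            self.flow_table = []          # list of (match_fields, action) pairs
            self.controller = controller

        def install(self, match_fields, action):
            self.flow_table.append((match_fields, action))

        def receive(self, packet):
            for match_fields, action in self.flow_table:
                if matches(match_fields, packet):
                    return action(packet)
            # Table miss: punt the packet to the controller.
            return self.controller.packet_in(self, packet)

    class Controller:
        def packet_in(self, switch, packet):
            # Simplistic policy: drop telnet, forward everything else to port 1,
            # installing a handler so the switch handles similar packets itself.
            if packet.get('tcp_dst') == 23:
                switch.install({'tcp_dst': 23}, lambda p: 'drop')
                return 'drop'
            switch.install({'eth_dst': packet['eth_dst']}, lambda p: 'forward:1')
            return 'forward:1'

    sw = Switch(Controller())
    print(sw.receive({'eth_dst': 'aa:bb', 'tcp_dst': 80}))  # miss -> controller
    print(sw.receive({'eth_dst': 'aa:bb', 'tcp_dst': 80}))  # hit in flow table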

[Figure: OpenFlow switch]

What do we see here? It looks like there is an execution platform inside the hardware for running the network handlers, and a controller that chooses the handler for each state. It looks very promising and flexible, and it can be implemented not only in hardware but also purely in software. And the same guys implemented it in OVS, so shall we peek inside? Yes, I answered myself, and downloaded the OVS source.

When I looked inside the code I was a little… how should I put it… surprised? The OVS code has everything but the kitchen sink inside: a serializable hash table, a JSON parser, VLAN tagging, several QoS implementations, an STP implementation, a Unix socket server, an RPC-over-SSL server and, the icing on the cake, their own database with a key/value + columnar storage engine. And everything is implemented from scratch (or so it seems).

OK, enough shock value already; how does this thing work? It turns out that the operation is not that different from the hardware I described above. It just has a kernel module instead of actual hardware, and the flow handlers are simply functions inside that module. It looks like this.

[Figure: Open vSwitch architecture]

The daemon uses netlink to talk to the kernel module and install the handlers, the database stores the configuration, and the controller talks to the daemon via OpenFlow or plain JSON.

So, we have a software stack for networking; why is it good for virtualization?

Short answer: because everybody uses Linux. And when your hypervisor runs on Linux, why not use some of its capabilities for a nice network boost? But why OpenFlow/OVS?

The OVS docs describe it like this:

Open vSwitch is targeted at multi-server virtualization deployments, a landscape for which the previous stack is not well suited. These environments are often characterized by highly dynamic end-points, the maintenance of logical abstractions, and (sometimes) integration with or offloading to special purpose switching hardware.

But Linux has always had a good network stack, easily manageable and extensible. So what are the advantages? There are some.

  • OVS can migrate the network state together with the bridge port (connected to a VM instance, for example). You will lose some packets, but the connections may stay intact.
  • OVS can use metadata to tag frames (VM instance UUID, for example).
  • OVS can install event triggers inside the daemon to notify the controller on the state change (VM reboot, for example).

Does it justify a new kernel module + new DB engine + new RPC protocols? Maybe not. But you can’t get these capabilities from any other open source software anyway.

The only question I have right now is why Nicira did not implement STT in OVS, while it certainly did in its proprietary NVP software.

Start-Up Chile

I've been frequently asked about my experience in the Start-Up Chile program. Looking back at the half year I've been participating in it, I can say it has been an interesting and fulfilling experience.

On top of the provided seed capital you get a supporting framework of mentors and fellow founders. You can literally "feel" the surrounding entrepreneurial spirit. And although I was unlucky in finding peer support for my infrastructure BigData@Cloud idea (most folks were doing consumer-web kinds of startups), I did find the framework highly encouraging.

The provided capital is equity-free, which is especially nice and makes negotiating the next financing round easier. Getting the money is a paperwork-intensive process, but the staff are friendly and helpful.

I found Chileans hospitable and friendly to foreigners. Yet minimal Spanish seems to be mandatory. I found myself speaking Spanish after a few months in Santiago, and that was not planned initially.

Santiago is a nice, modern, mountain-surrounded city, and pretty safe I would say. I cannot count how many times locals warned me about how unsafe Santiago really is, but except for the near-permanent strikes/riots in the central part of the city I never experienced, witnessed or heard about any incident. And I usually work deep into the night and walk extensively before retiring to bed. I lived in Centro but especially enjoyed walking in the north-western part of town. Underground transportation is quite efficient for getting around, though a little hot at mid-day in February, as I remember. I was mostly fully consumed by my startup, so I haven't had enough time to tour the rest of the country, and I know even Santiago only from walking it, guided by the GPS in my Nokia. I really should rent a car one weekend and get out for a couple of days… In fact I did spend one weekend in Viña del Mar / Valparaíso and found it quite a nice and relaxing place.

The local entrepreneurship and geek community is also thriving, and that is not counting the very visible Start-Up Chile folks. Go to meetup.com, choose your favorite topic or technology, and I bet you will find a packed Santiago interest group there.

Apache Hadoop over OpenStack Swift

This is a post by Constantine Peresypkin and David Gruzman.
Lately we have been working on integrating Hadoop with OpenStack Swift. Hadoop doesn't need an introduction, and neither does OpenStack. Swift is an object-storage system and the technology behind Rackspace Cloud Files (and quite a few others, like Korea Telecom object storage, Internap, etc.).
Before we go into the details of Hadoop-Swift integration, let's get some relevant background:
  1. Hadoop already has integration with Amazon S3 and is widely used to crunch S3-stored data: http://wiki.apache.org/hadoop/AmazonS3
  2. The NameNode is a known SPOF in Hadoop. It would be nice if it could be avoided.
  3. The current S3 integration stages all data as temporary files on local disk before uploading to S3. That's because S3 needs to know the content length in advance; it is one of the required headers.
  4. The current S3 integration also suffers from the 5GB max file size limitation, which is slightly annoying.
  5. Hadoop requires seek support, which means that HTTP Range support is required if it runs over an object store. S3 supports it.
  6. File append support is optional for Hadoop, but it is required for HBase. S3 doesn't have any append support, so the native integration cannot run HBase over S3.
  7. While OpenStack Swift is S3-compatible, Rackspace Cloud Files is not, because Rackspace Cloud Files disables the S3-compatibility layer in Swift. This prevents existing Swift users from integrating with Hadoop.
  8. The only information available on the Internet about Hadoop-Swift integration is that it should work using Apache Whirr. But to the best of our knowledge this applies only to rolling out a block filesystem on top of Swift, not a native filesystem. In other words, we haven't found any solution for processing data that is already stored in Rackspace Cloud Files without costly re-importing.
So, armed with the above information, let's examine what we've got here:
  1. In general, we made Hadoop run over Swift natively, without resorting to the S3 compatibility layer. This means it works with Cloud Files, which lacks the S3-compatibility layer.
  2. The Cloud Files client SDK doesn't support HTTP Range requests. We hacked it to allow using HTTP ranges; this is a must for Hadoop to work.
  3. We removed the need for a NameNode, in a way similar to how it is removed in the Amazon S3 integration.
  4. As opposed to the S3 implementation, we avoided staging files on local disk on the way to and from Cloud Files/Swift. In other words, data is streamed directly between compute node RAM and Cloud Files/Swift (see the sketch after this list).
  5. The data is still processed remotely, though: extensive data shipping takes place between compute nodes and Cloud Files/Swift. As frequent readers of this blog know, we are working on technology that will allow running code snippets directly in Swift; look here for more details: http://www.zerovm.com. As a next step we plan to perform predicate-pushdown optimization to process most of the data completely locally, inside a ZeroVM-enabled object-storage system.
  6. Support for native Swift large objects is also planned (something that's absent in Amazon S3).
  7. We are also working on append support for Swift (this can be done easily through Swift's large object support, which uses versioning), so even HBase will work on top of Swift, which is not the case with S3 today.
  8. As with Hadoop over S3, storing BigData in native format on Swift provides options for multi-site replication and CDN delivery.
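
As a rough illustration of points 2 and 4 above (HTTP Range reads for seek support, and streamed writes with no local staging), here is a sketch in Python against a Swift-style HTTP endpoint. The storage URL, token, container and object names are placeholders, and a real deployment would go through the Hadoop FileSystem layer rather than raw HTTP.

    # Sketch of the two HTTP-level building blocks discussed above: a ranged GET
    # (what Hadoop's seek needs) and a streamed PUT without local staging
    # (a generator body makes `requests` use chunked transfer encoding).
    # The storage URL, auth token and object names are placeholders.
    import requests

    STORAGE_URL = "https://swift.example.com/v1/AUTH_account"
    HEADERS = {"X-Auth-Token": "PLACEHOLDER_TOKEN"}

    def ranged_read(container, obj, offset, length):
        """Read `length` bytes starting at `offset`, like a seek()+read()."""
        headers = dict(HEADERS, Range="bytes=%d-%d" % (offset, offset + length - 1))
        resp = requests.get("%s/%s/%s" % (STORAGE_URL, container, obj), headers=headers)
        resp.raise_for_status()
        return resp.content

    def streamed_write(container, obj, chunks):
        """Upload from an iterator of byte chunks straight from RAM,
        without knowing the total content length in advance."""
        resp = requests.put("%s/%s/%s" % (STORAGE_URL, container, obj),
                            headers=HEADERS, data=chunks)
        resp.raise_for_status()

    if __name__ == "__main__":
        # Stream generated data, then read a 1 KB slice from the middle of it.
        streamed_write("hadoop-out", "part-00000",
                       (("line %d\n" % i).encode() for i in range(1000000)))
        print(ranged_read("hadoop-out", "part-00000", offset=4096, length=1024))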

Futility of “tooling” a proprietary cloud.

I've been pitched by a lot of entrepreneurs trying to make better-than-original "tooling" for a proprietary cloud, particularly for AWS. Isn't the attempt futile from the start? Amazon is smart, innovative and working hard to make its cloud offering comprehensive, and it has a much larger arsenal with which to outdo anyone who dares to compete on its own turf. It is their party; the invitation cannot be taken for granted.

Let's take NoSQL data stores and DBMS vendors as examples. There are VC-backed companies out there exclusively focused on outdoing Amazon at running MySQL/NoSQL on Amazon's own cloud; Xeround comes to mind, but many others are also hoping their product will catch fire on EC2.

Well, if branding and plain convenience alone are not enough, how about these two exclusive and "unfair" competitive advantages in Amazon's arsenal:

  • [My unverified assumption is] that DynamoDB has storage integrated into its fabric, whereas everyone else must use slower EBS.
  • Not only is it integrated, but, as announced by Amazon, it uses SSD-backed storage. SSD-backed storage is not available, as of today, to DynamoDB competitors running on AWS, so they must continue to use ordinary EBS. That is in fact a double kick: first, the mere fact of using different hardware for a competitive advantage, and second, the announcement itself as a catalyst to trigger migration.

So a future EMR may also get integrated storage, as well as other hardware optimizations, making Hadoop more efficient on AWS; good if so. The same goes for RDS and other current and future PaaS-related services.

Do I accuse Amazon of wrongdoing? Of course not! They brought the cloud to the main street while others were only talking about it, made large-scale computing affordable to all, keep dropping prices by passing their economies-of-scale savings on to customers, keep optimizing and enhancing their infrastructure constantly, and have been good to their shareholders as well. However, like any proprietary and monopolistic platform, they do hinder some outside-of-Amazon innovation. No matter how good they are, we don't want only one company in the world doing cloud-infrastructure work for everyone else. That's why OpenStack is so extremely important for the industry. If OpenStack is widely adopted, then infrastructural and "tooling" kinds of innovation can go directly into OpenStack, for the greater good and with a fairer monetization model for the authors.


Upcoming hardware renaissance era: part #2.

Some examples of the upcoming hardware renaissance era:

1. Virtually all server vendors are pitching modularized data centers (MDCs) by now. MDCs are boxes resembling shipping containers that accommodate a complete virtualized data center inside. With an MDC you just connect power, network and chilled water and get access to a cloud in a box. Most MDCs can be deployed outdoors and have built-in protection against the elements. Of course, all current offerings are based on x86 commodity servers, but here is a hint: once competition moves to comparing whose shipping container can stuff more storage and computing power inside and who has better price/performance and energy efficiency, we will see innovation in hardware skyrocket.

2. On the processor front… the ARM architecture has all the ingredients to become the next Intel. If I am not mistaken, ARM processors outnumber x86 roughly 10 to 1, with tens of billions of processors shipped; 95% of cellphones and advanced gadgets use ARM, and ARM's power efficiency puts x86 to shame. However, until now ARM was focused on gadgets and dismissed the data-center market. Not anymore! With the newer Cortex-A15, ARM has taken aim at x86 on data-center territory. Calxeda has already raised ~$50M in venture money to commercialize ARM in the data center. ARM is not alone here, though: Tilera, with their server-vendor partner Quanta, is already shipping a 512-core server in a 2U form factor. Tilera took a lean MIPS processor core and put some 100 of them onto a single die, together with eight 10Gbit Ethernet channels and four DDR3 memory channels. Nvidia has also claimed that it is no longer a GPU-only vendor and is readying general-purpose processors based on the ARM architecture with an ample amount of GPU muscle inside. That said, Intel and AMD are also far from stagnating and are moving into heterogeneous many-core designs. I think we have never witnessed more innovation in the processor space than now.

3. On the memory front… Flash is making inroads to claim the space in the memory hierarchy between DRAM and HDD, disrupting the DRAM market and the high-performance 15K RPM HDD market. I think 15K RPM HDDs and DRAM-based SSD products can already safely be declared dead. The same goes for HDDs smaller than the 2.5-inch form factor, and I think even 2.5-inch HDDs are at risk; only capacity-optimized HDDs will survive. Even without flash, DRAM has reached such capacities that most datasets fit in RAM completely, and if not in the RAM of a single server, then surely in shared cluster RAM. These solid-state advances in DRAM and Flash disrupt the storage market, especially by making high-performance SANs redundant. The only storage tomorrow's server will need is capacity-optimized and energy-optimized. That fact, among others, forced EMC to move into computing… and to provide a complete cloud in a box instead of just storage in a box as it did in the past.

4. Networking… in my view, networking is the most stagnant hardware market here. InfiniBand is finally moving into the mainstream, which is good. Or is it? Will it succumb to 10GbE? That remains to be seen; my bet is on InfiniBand due to its architectural superiority. Network virtualization is still on whiteboards… unfortunately. So in networking there are no signs of a renaissance yet, but the potential is there.

Emerging Proprietary Hardware Renaissance

INTRO
I cannot count the number of times I have heard that cloud computing means innovation stagnation in the proprietary hardware business, and that with cloud computing hardware doesn't matter anymore and will sooner or later turn into a boring, razor-thin-margin, oligopolistic commodity industry.

GAME OVER FOR FAT MARGINS IN PROPRIETARY HARDWARE?
Why do folks think like that? Well… there is one reason that dominates their thinking: hardware products became components, and worse, they became well-standardized components. As such, certain low-wage countries can quickly master assembling them in large quantities and win the competition purely on cost; nothing else seriously matters in the component business. No one cares about a component's extended enterprise feature set, premium brand, long list of ISV partnerships, etc. What matters is very well-defined functionality and price, price, price. In fact this has already happened to low-end gear like entry-level basic servers and routers.

The situation is very different for enterprise IT products. I estimate that if IT costs tripled overnight for most enterprises, it would not matter to their bottom line. IT departments of most enterprises are therefore cost-insensitive regarding IT gear: they will not blindly overpay in most cases, but cost is not high on their priority list either. IT for most enterprises is a more or less fixed cost amortized over a very large number of products or services. I guess if Coca-Cola's IT costs tripled, the price of a single can of Coke would have to rise by less than a cent; unlikely to be life-threatening to their business. Therefore the game was, and largely still is, to market hardware products directly to enterprise IT departments, competing on enterprise feature set rather than price.

Now, with the emerging cloud computing paradigm, hardware products are components and are marketed to cloud operators, which are essentially server farmers. They are extremely cost-sensitive, and marketing premium computing gear to them will be about as successful as marketing premium booze as fuel to the owner of an alcohol-fueled car. If a cloud operator's server costs tripled overnight, the next morning it would be out of business. So without a doubt it is game over for fat margins in the hardware manufacturing/assembly business. What happened to enterprise software five years ago is happening to enterprise hardware right now: commoditization.

PROPRIETARY HARDWARE INNOVATION
So the common thinking goes: in a boring commodity business no one is going to invest in innovation, and no one is going to invest much in proprietary hardware, because no cloud vendor is going to buy premium hardware. The only hope is the private cloud, a freshly invented loophole for continuing to sell premium gear to enterprises.

Well, let's consider the following situation. Startup XYZ manages to build a proprietary appliance, essentially a cloud-in-the-box solution, that through tight internal integration and optimization for one particular task achieves an order-of-magnitude better price/performance. Let's say it is a KV-store appliance. Would cloud operators be interested? I bet they would. From the outside it doesn't matter whether the functionality is backed by Cassandra on generic hardware or by a custom hardware appliance. So a cloud provider quietly rolling out such appliances can compete well, both on price and on functionality (like latency), with other cloud providers running a software-based KV store on generic hardware. Another startup, ABC, may produce a computing appliance that runs Ruby on Rails, Java or Python applications an order of magnitude more efficiently than generic hardware, and a cloud provider deploying such appliances from ABC could serve RoR clients more competitively. Yet another startup may build a custom rack-sized box filled with Fermi chips specifically designed for video processing. I could bring more examples, but the trend is obvious: use a large chunk of dedicated hardware to do one specific task extremely efficiently, and you can have nice margins as a hardware manufacturer.

Before the cloud, hardware had to be generic, because a single enterprise server needed to be able to run a variety of different workloads. With cloud computing this is no longer the case. A vendor can build a dedicated hardware appliance optimized for one specific workload and serve the whole world with it, raising high barriers for competitors. So despite popular belief, I think cloud computing presents unique opportunities to creative hardware engineers, though not in the premium enterprise-feature-set area as it used to be, but in the areas of extreme efficiency, acceleration and specialization.

Two Envelopes Problem: Am I just dumb?

It seems the recent craze about statistician being the profession of choice for the future is gaining steam: a future in which we will be surrounded by quality BigData, capable computers and bug-free open-source software, including OpenDremel. Well, the last one I made up… but the rest seems to be the current situation. Acknowledging this, I was checking what the state of open-source statistics software is, who the guys behind it are, and so on. But that is not my topic today; it is the topic of one of my next posts. Today I want to talk about one of the strangest problems/paradoxes on the Internet I have ever seen. The story is that I just encountered "the two envelopes problem", or "paradox" as some put it. Having worked with math guys for a long time (guys with a math mastery I'll never reach in my life), I immediately recognized the problem, as I was teased with it a few times long ago. However, I was never told that it is such a big deal of a problem. Wikipedia lists it under "unsolved problems in statistics". Huh? I never understood what is so paradoxical, hard or even interesting about it; to me it seemed a high-school-grade problem at most. So I put "two envelopes problem" into Google and found tens of blogs trying to explain it, proposing over-engineered, over-complicated and long solutions to such a simple problem. I have a very strange feeling that I'm either totally dumb or a genius, and I know I'm not a genius ;) Some sources mention that only Bayesian subjectivists suffer from this; however, the large majority of other sources present it as a universal problem… Well, enough talking, let's dive into the simplest solution on the Internet (or I will be embarrassed by someone pointing out my mistake or a similar solution published elsewhere).

The problem description, for those who have never heard of it:

You are approached by a guy who shows you two identical envelopes. Both envelopes contain money, and one contains twice as much as the other. You are allowed to pick one and keep it for yourself. After you pick one, the guy offers you the chance to swap envelopes. The question is whether you should swap. To me it is as clear as a sunny day that it doesn't matter whether you swap, and this is easily provable with simple math. Somehow most folks (some at Ph.D. level!) get into very hairy calculations that suggest one should swap, and then even hairier ones showing why one should not. Some mention subjectivity, but most don't.

The simplest solution on the Internet (joking… but seriously, I haven't found one this simple):

Let's denote the smaller sum as X; the larger sum in the other envelope is therefore 2X. The expected value of the currently selected envelope, before considering the swap, is 1.5X. How did I get there? Very simply: we have a 0.5 probability of holding the envelope with the larger sum, which is 2X, and a 0.5 probability of holding the envelope with the smaller sum, which is X. So:

0.5*2X + 0.5*X = X + 0.5X = 1.5X

So far so good… now let's calculate the expected value if we swap. If we swap, we will have the same 0.5 probability of holding the larger sum and the same 0.5 probability of holding the smaller sum. No need to repeat the calculation: you will get exactly the same 1.5X expected value, meaning the swap doesn't matter. Or, if time has any value, it doesn't make sense to waste it swapping envelopes.
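
If you prefer checking such things empirically, here is a quick Monte Carlo sketch of the above; the distribution of the smaller amount is arbitrary and chosen just for illustration.

    # Monte Carlo check: swapping doesn't change the expected value.
    # The distribution of the smaller amount X is arbitrary (illustration only).
    import random

    N = 1_000_000
    keep_total = swap_total = 0.0
    for _ in range(N):
        x = random.uniform(1, 100)        # smaller amount X; the other holds 2X
        envelopes = [x, 2 * x]
        random.shuffle(envelopes)
        keep, other = envelopes           # value if we keep vs. if we swap
        keep_total += keep
        swap_total += other

    print("average if we keep:", keep_total / N)  # both hover around 1.5 * E[X]
    print("average if we swap:", swap_total / N)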

Do you see this as a hard problem? I bet a 10-year-old would do fine with it, especially if offered some reward.

How come others get lost here?

The answer is that some try to apply Bayesian subjectivist probability theory, and then innocent folks follow them and get lost as well.

If you look at the Wikipedia article, for example, you will find a classic wrong solution that is allegedly "obvious", and then a link outside Wikipedia to a "correct" solution. The correct solution is a long post with a lot of formulae and use of Bayes' theorem that in the end arrives at the correct answer.

Well… I clearly see the flaw in the solution published on Wikipedia. That solution really looks artificial, but judging by the number of followers it must seem obvious to many. The blunder is in the third line:

The other envelope may contain either 2A or A/2

By A they denote the sum in the envelope they are holding. The mistake is in "either 2A or A/2"; it should be "either 2A or A". Then everything works out and no "paradox" emerges in the end. The mistake stems from the fallacy of using the same name for two separate variables that are dependent but not equal, and then repeatedly confusing them because they share a name. Here is a "patch" to be applied to the reasoning published on Wikipedia:

1. I denote by A the amount in my selected envelope. => FINE
2. The probability that A is the smaller amount is 1/2, and that it is the larger amount is also 1/2. => FINE
3. The other envelope may contain either 2A or A/2. => INCORRECT: the variable A denotes different values in the two cases, so it is highly confusing to write it this way.

Let's explicitly consider the two cases here instead of the implicit "either…or…":

In the first case, assume we are holding the smaller sum; then the other envelope contains 2A.
In the second case, assume we are holding the larger sum; then the other envelope contains A/2. However, this A is different from the A of the first case, so let's write it as _A/2.

Moreover, we know that _A is not just different from A but is exactly twice it, so
_A = 2A.
So the expression "either 2A or A/2" must be written as "either 2A or _A/2", or, substituting _A = 2A, as "either 2A or A".

Then, when calculating the expected value, you likewise substitute A instead of A/2 and get the same expected value as before the swap.
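
Written out explicitly (a short derivation in LaTeX notation; X denotes the smaller amount, so the envelope in hand holds A = X in the first case and _A = 2X in the second):

    % Flawed calculation: the same symbol A stands for two different amounts
    % in the two branches, which is what manufactures the "paradox".
    E[\text{other}] \overset{?}{=} \tfrac{1}{2}\cdot 2A + \tfrac{1}{2}\cdot\tfrac{A}{2} = \tfrac{5}{4}A
    % Corrected calculation in terms of the smaller amount X:
    E[\text{other}] = \tfrac{1}{2}\cdot 2X + \tfrac{1}{2}\cdot X = \tfrac{3}{2}X
                    = \tfrac{1}{2}\cdot X + \tfrac{1}{2}\cdot 2X = E[\text{mine}]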

===============================

That said, I have seen many people feel so "enlightened" after reading a complicated "correct" solution that they erroneously think, and argue, that one should not accept the following offer, believing it to be equivalent to the problem above (well, not exactly this one, but I rephrased it for clarity):

A guy comes to you and says there are three envelopes. You are allowed to pick one and keep it. One envelope is red and two are white. All three contain money. One of the white envelopes contains twice as much as the red one; the other white one contains half of what the red one does. The white envelopes are identical and there is no way to know which contains double and which contains half. The question is which envelope you should choose: the red one or one of the white ones. And the answer is that you should pick one of the white envelopes! In fact, the calculation erroneously applied to the two-envelopes problem is 100% correct for the three-envelopes problem: on average you will win by choosing a white envelope rather than the red one, as a quick simulation below illustrates.
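
Here, with R the amount in the red envelope, the expected value of a random white envelope really is 0.5*2R + 0.5*R/2 = 1.25R > R. A minimal simulation sketch of that (the distribution of R is arbitrary, for illustration only):

    # Monte Carlo check for the three-envelope variant: picking a white envelope
    # beats picking the red one on average (about 1.25R vs R).
    import random

    N = 1_000_000
    red_total = white_total = 0.0
    for _ in range(N):
        r = random.uniform(1, 100)       # amount in the red envelope
        whites = [2 * r, r / 2]          # one white holds 2R, the other R/2
        red_total += r
        white_total += random.choice(whites)

    print("average red  :", red_total / N)    # ~ E[R]
    print("average white:", white_total / N)  # ~ 1.25 * E[R]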