Efficient usage of local drives in the cloud for big data processing

The cloud sounds like a perfect platform for big data processing: you get as much processing power as you need when you need it, and release it when you don’t. So why does a lot of big data processing still happen outside the cloud? Let’s try to find out.

The question arises from the following dilemma in cloud big data processing:
Store data in S3 and process it in EC2. This is elastic and economical per GB, and you can resize your cluster as you wish, but you are limited by S3 bandwidth. EMR running against S3 is a popular example of this approach.

Or build HDFS or another distributed storage layer on top of local (ephemeral) drives. The tradeoff is clear: you get good bandwidth, but storage is expensive and elasticity suffers, because the nodes hold the data and you cannot remove them when their processing power is not needed. Redshift, or Hadoop with HDFS on local drives, are typical examples of this approach.

Both solutions have drawbacks. Let’s take a closer look.
Cloud storage, like S3, is built to store a lot of data cheaply and reliably. It costs circa $30 per TB per month and is highly reliable: the SLA has a lot of nines…
Local drives should be fast, which today means SSD. This technology provides very good performance, but the price per GB is high.

HDD space costs 20-25 times less than SSD space. In the Amazon cloud the difference between local drive space and S3 space is even larger. For example, 1 TB of storage on c3.8xlarge instances costs around $2K per month. That is more than 60 times (sixty!) the cost of storing the same data in S3.
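
The arithmetic behind that comparison can be sketched in a few lines. This is only a back-of-the-envelope calculation using the rough figures quoted above, not current AWS pricing; the function names and the dataset sizes are made up for illustration.

```python
# Rough monthly storage cost comparison using this post's ballpark figures.
# These are NOT current AWS prices.
S3_USD_PER_TB_MONTH = 30.0            # "circa $30 per TB per month"
LOCAL_SSD_USD_PER_TB_MONTH = 2000.0   # "around $2K per month" on c3.8xlarge


def cost_ratio(local_usd_per_tb: float, s3_usd_per_tb: float) -> float:
    """Return how many times more expensive local storage is per TB."""
    return local_usd_per_tb / s3_usd_per_tb


def monthly_cost(dataset_tb: float, usd_per_tb_month: float) -> float:
    """Return the monthly cost in USD of keeping dataset_tb terabytes."""
    return dataset_tb * usd_per_tb_month


if __name__ == "__main__":
    ratio = cost_ratio(LOCAL_SSD_USD_PER_TB_MONTH, S3_USD_PER_TB_MONTH)
    print(f"local SSD costs about {ratio:.0f}x as much as S3 per TB")
    for tb in (10, 100):  # hypothetical dataset sizes
        s3 = monthly_cost(tb, S3_USD_PER_TB_MONTH)
        local = monthly_cost(tb, LOCAL_SSD_USD_PER_TB_MONTH)
        print(f"{tb} TB: S3 ${s3:,.0f}/month vs local SSD ${local:,.0f}/month")
```

With these inputs the ratio comes out in the sixties, which is the same ballpark as the figure quoted above.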

What about throughput? Access to local data is around 5 times faster than access to data on S3. Moreover, bandwidth to S3 can be throttled by Amazon depending on the current load and other factors, whereas access to local drives is always reliable.

There is a possible counter-argument: we do not need this storage bandwidth. If we process data at a speed that matches the storage bandwidth, more of it would not help. S3 can give us 50-100 MB/sec for a big instance like c3.8xlarge. If we process data using MapReduce or Hive, that is close to the processing speed, assuming MR processes about 5 MB/sec per core.
With more efficient engines, like Redshift or Impala, the speed is about 100 MB/sec per core or more…
So we do need faster access. As supporting evidence, note that Redshift nodes have 2.4 GB/sec of disk IO. I trust that AWS engineers know what they are doing.
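
To make the balance between storage bandwidth and processing speed concrete, here is a small sketch of the same reasoning. The per-core speeds and the S3 bandwidth are the estimates from this post; the core count of 32 is my assumption for a c3.8xlarge-class instance, and the helper names are hypothetical.

```python
# Compare what the engine could consume with what S3 can deliver.
# All inputs are rough estimates, not measurements.
S3_MB_PER_SEC = 100                    # upper end of the 50-100 MB/sec range
MR_MB_PER_SEC_PER_CORE = 5             # MapReduce / Hive estimate
FAST_ENGINE_MB_PER_SEC_PER_CORE = 100  # Redshift / Impala estimate
CORES = 32                             # ASSUMPTION: cores on the instance


def required_bandwidth(cores: int, mb_per_sec_per_core: float) -> float:
    """MB/sec the engine could consume if storage were not the limit."""
    return cores * mb_per_sec_per_core


def shortfall(required_mb_per_sec: float, available_mb_per_sec: float) -> float:
    """How many times the demand exceeds the available bandwidth."""
    return required_mb_per_sec / available_mb_per_sec


if __name__ == "__main__":
    engines = [
        ("MapReduce/Hive", MR_MB_PER_SEC_PER_CORE),
        ("Redshift/Impala", FAST_ENGINE_MB_PER_SEC_PER_CORE),
    ]
    for name, per_core in engines:
        need = required_bandwidth(CORES, per_core)
        gap = shortfall(need, S3_MB_PER_SEC)
        print(f"{name}: needs {need:.0f} MB/sec, {gap:.1f}x what S3 delivers")
```

Under these assumptions the slow engine asks for roughly what S3 provides, while the fast engine asks for a few GB/sec, which is the same order of magnitude as the Redshift disk IO figure above.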

Now let’s recall that big data is usually a huge pile of cheap data. By cheap I mean low value per GB. If the data were expensive (like financial transactions), it could happily live in Oracle plus enterprise storage.

So how do we utilize our resources more efficiently? On one hand we have a lot of slow, inexpensive storage, and on the other a little fast, expensive storage. The obvious solution is to use the fast storage as a cache. This is a common pattern: DRAM holds the disk cache, and SRAM inside the CPU is used as a cache for DRAM.

In this situation I suggest using local SSD drives as a cache for cloud storage (S3).
So what does that mean? An effective cache should meet the following requirements:
Store the hot data set: This is the main duty of the cache. The usual heuristic is LRU (least recently used): when space is needed we evict the data that was used least recently, on the assumption that recently used data has a good chance of being used again.

Prefetch: Predict what data will be needed and load it ahead of time. For a disk cache this is read-ahead (if we read the first block of a file, there’s a good chance we will need the next one). CPUs use very advanced prefetch algorithms. For a typical data warehouse we can assume that recently added data is more likely to be of interest than old data…
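
The two requirements above can be sketched as a tiny LRU cache with an explicit prefetch call. This is a minimal illustration, not a production design: an in-memory dictionary stands in for files on the local SSD, and `fetch` is a hypothetical callable standing in for a read from cloud storage. The class and method names are my own.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List


class LruBlockCache:
    """LRU cache of remote objects; a dict stands in for the local SSD."""

    def __init__(self, capacity_bytes: int, fetch: Callable[[str], bytes]):
        self.capacity_bytes = capacity_bytes
        self.fetch = fetch  # hypothetical stand-in for a cloud storage read
        self.used_bytes = 0
        self.hits = 0
        self.misses = 0
        self._blocks: "OrderedDict[str, bytes]" = OrderedDict()

    def _store(self, key: str, data: bytes) -> None:
        if len(data) > self.capacity_bytes:
            return  # too large to cache at all; serve it uncached
        # Evict least recently used entries until the new block fits.
        while self.used_bytes + len(data) > self.capacity_bytes:
            _, evicted = self._blocks.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._blocks[key] = data
        self.used_bytes += len(data)

    def get(self, key: str) -> bytes:
        """Return the object, reading from remote storage only on a miss."""
        if key in self._blocks:
            self.hits += 1
            self._blocks.move_to_end(key)  # mark as most recently used
            return self._blocks[key]
        self.misses += 1
        data = self.fetch(key)
        self._store(key, data)
        return data

    def prefetch(self, keys: Iterable[str]) -> None:
        """Warm the cache with keys we expect to be read soon."""
        for key in keys:
            if key not in self._blocks:
                self._store(key, self.fetch(key))

    def cached_keys(self) -> List[str]:
        """Keys currently cached, least recently used first."""
        return list(self._blocks)


if __name__ == "__main__":
    remote = {f"part-{i}": bytes(100) for i in range(5)}  # fake bucket
    cache = LruBlockCache(capacity_bytes=300, fetch=remote.__getitem__)
    cache.get("part-0")         # miss, loaded from "remote"
    cache.prefetch(["part-1"])  # read-ahead of the next part
    cache.get("part-1")         # hit, served locally
    cache.get("part-2")
    cache.get("part-3")         # cache is full, so part-0 is evicted
    print(cache.hits, cache.misses, cache.cached_keys())
```

A warehouse-style prefetch policy, such as loading the most recently added partitions first, would simply decide which keys to pass to `prefetch`.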

To summarize: I believe that to process big data efficiently in the cloud we need to use local drives as a cache for the dataset stored in cloud storage. I also believe that other approaches will not be efficient as long as cloud storage is much slower and cheaper than local drives.

Debunking common misconceptions about SSD, particularly for analytics

1. SSD is NOT synonymous with flash memory.

First of all, let’s settle on terms. SSD is best described as the concept of using semiconductor memory as a disk. There are two common cases: DRAM-as-disk and flash-as-disk. Flash memory is a semiconductor technology quite similar to DRAM, just with a slightly different set of trade-offs.

Today there are few options for using flash memory in analytics beyond SSD. Nevertheless, this does not make SSD synonymous with flash memory. Flash memory can be used in products other than SSD, and SSD can use non-flash technology, for example DRAM.

So the question is: do we have any option for using flash memory in a form other than flash-as-disk?

FusionIO is the only one, and it was always bold in claiming that its products are not SSD but a totally new product category called ioMemory. Usually I subconsciously dismiss such claims as the common practice of term obfuscation. In the case of FusionIO, however, I found it to be a rare exception and technically true. At the hardware level there is no disk-related overhead in the FusionIO solution, and in my opinion FusionIO is the closest of all SSD manufacturers to the flash-on-motherboard vision. That said, FusionIO succumbed to implementing a disk-oriented storage layer in software because no other standards cover the flash-as-flash concept.

You can find more in-depth coverage of the New-Dynasty SSD versus Legacy SSD issue in a recent article by Zsolt Kerekes on StorageSearch.com, although I don’t 100% agree with his categorization.

2. SSD DOESN’T provide more throughput than HDD.

The bragging about the performance density of SSD can safely be dismissed. There is no problem stacking HDDs together. As many as 48 of them can be put in a single 4U server box, providing an aggregate throughput of 4 GB/sec for a fraction of the SSD price. The same goes for power, vibration, noise, etc. The extent to which SSD is superior to disk in these properties is uninteresting and does not justify the associated cost premium.

Further, for any amount of money, HDD can provide significantly more IO throughput than SSD from any of today’s vendors, on any workload: read, write, or combined. Not only that, but it does so with an order of magnitude more capacity for your big data as a bonus. However, a few nuances need to be considered:

  • If data is accessed in small random chunks (say, 16 KB), SSD will provide significantly more throughput than disk (perhaps by a factor of 100). Reading in chunks of at least 1 MB makes HDD the winner of the throughput game again.
  • Flash memory itself has great potential to provide an order of magnitude more throughput than disks. The mechanical “gramophone” technology of disks cannot compete with electrons in agility. However, this potential throughput is hopelessly left unexploited by today’s SSD controllers. How bad is it? Pretty bad: SSD controllers pass on less than 10% of the potential throughput. The reasons include flash-management complexity and cost constraints that lead to small embedded DRAM buffers and computationally weak controllers. The main reason, though, is that there are no standards for a 100 times faster disk, nor could legacy software keep up with multi-gigabyte throughput. So SSD vendors don’t bother, and are instead obsessed with the laughable idea of bankrupting HDD manufacturers, calling the technology disruptive, which by definition it is not. As a result we have a much more expensive disk replacement that is only barely more performant, throughput-wise, than a vanilla low-cost HDD array.
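
The chunk-size nuance in the first point can be illustrated with a very simple model: every random read pays a fixed access latency and then transfers at the sequential rate. The latency and transfer figures below are illustrative assumptions of mine, not measurements of any particular device, and the model ignores queuing, caching, and controller effects.

```python
# Effective throughput of random reads as a function of chunk size.
# Model: time per chunk = fixed access latency + chunk size / sequential rate.
# ASSUMED device parameters, for illustration only.
HDD = {"access_latency_s": 0.010, "sequential_bytes_per_s": 100e6}
SSD = {"access_latency_s": 0.0001, "sequential_bytes_per_s": 250e6}


def effective_throughput(chunk_bytes: int,
                         access_latency_s: float,
                         sequential_bytes_per_s: float) -> float:
    """Bytes per second achieved when reading random chunks of this size."""
    seconds_per_chunk = access_latency_s + chunk_bytes / sequential_bytes_per_s
    return chunk_bytes / seconds_per_chunk


if __name__ == "__main__":
    for label, chunk in (("16 KB", 16 * 1024), ("1 MB", 1024 * 1024)):
        hdd = effective_throughput(chunk, **HDD)
        ssd = effective_throughput(chunk, **SSD)
        print(f"{label}: HDD {hdd / 1e6:.1f} MB/sec, SSD {ssd / 1e6:.1f} MB/sec, "
              f"SSD/HDD = {ssd / hdd:.1f}x")
```

With these assumed parameters the per-device advantage of SSD shrinks from tens of times at 16 KB to a single-digit factor at 1 MB, which is the range where the lower price of HDDs can win on throughput per dollar.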

3. An array of SSDs DOESN’T provide a larger number of useful IOPS than an array of disks.

While it is true that one SSD can easily match a disk array in IOPS, that does not mean an array of SSDs will provide a larger number of useful IOPS. The reason is prosaic: an array of disks provides an abundance of IOPS, many times more than enough for any analytic application. Any additional IOPS are not needed, so the astronomical number of IOPS in SSD arrays is a solution looking for a problem in the analytics industry.

4. SSDs are NOT disruptive to disks.

Well, if they are, it is not according to Clayton Christensen’s definition of “disruptiveness”. As far as I remember, Christensen defines technology A as disruptive to technology B when all of the following hold true:

  • A is worse than technology B in quality and features
  • A is cheaper than technology B
  • A is affordable to a large number of new users to whom technology B is appealing but too costly.

Clearly none of the conditions above holds for the SSD-to-disk pair, so I’m puzzled how one can call SSD disruptive to disks.

Again, I’m not claiming that SSD or flash memory is not disruptive to any technology; I’m just claiming that SSDs are not disruptive to HDDs. In fact, I think flash memory IS disruptive to DRAM: all three conditions above hold for the flash-to-DRAM pair. Also, a pair of directly attached SSDs is highly disruptive to SAN.


Make no mistake: I’m a true believer in flash memory as a game-changer for analytics, just not in the form of a disk replacement. In upcoming posts I’ll explore the ideas where flash memory can make a difference. I know I have not provided any quantitative proof for the claims above, but… well… let’s leave that for the comment section.

Also, one of the best treatments of flash memory for analytics (and not coming from a flash vendor) is by Curt Monash on the DBMS2 blog: http://www.dbms2.com/2010/01/31/flash-pcmsolid-state-memory-disk/