Efficient usage of local drives in the cloud for big data processing

The cloud sounds like a perfect platform for big data processing – you get as much processing power as you need and release it when you don't. So why does so much big data processing happen outside the cloud? Let's try to find out.

The question stems from the following dilemma in cloud big data processing:
You can store data in S3 and process it in EC2. This is elastic and economical per GB – you can resize your cluster as you wish – but you are limited by S3 bandwidth. EMR running against S3 is a popular example of this approach.

Alternatively, you can build HDFS or another distributed storage on top of local (ephemeral) drives. There is a clear tradeoff: good bandwidth is available, but storage is expensive and elasticity suffers, because you cannot remove nodes when their processing power is not needed. Redshift, or Hadoop with HDFS on local drives, are perfect examples of this approach.

Both solutions have drawbacks. Let's take a closer look.
Cloud storage, like S3, is built to store a lot of data in a cheap and reliable way – circa $30 per TB per month, with an SLA boasting a lot of nines…
Local drives should be fast, which today means SSD. This technology provides very good performance, but the price per GB is high.

HDD space costs 20-25 times less than SSD space. In the Amazon cloud, the cost difference between local drive space and S3 space is even higher. For example, 1TB of storage on c3.8xlarge instances costs around $2K per month – sixty times more expensive than storing the same data in S3.

What about throughput? Access to local data is around 5 times faster than access to data on S3. Moreover, bandwidth to S3 can be throttled by Amazon depending on the current load and other factors, as opposed to the always-predictable access to local drives.

There is a possible counter-argument that we do not need this storage bandwidth: if we process data at a speed matching the storage bandwidth, we do not need more of it. S3 can give us 50-100 MB/sec for a big instance like c3.8xlarge. If we process data using MapReduce or Hive, this is close to the processing speed, assuming MapReduce processes about 5 MB/sec per core.
With more efficient engines, like Redshift or Impala, the speed is about 100 MB/sec per core or more…
So we do need faster access. To support this point, note that Redshift nodes have 2.4 GB/sec of disk I/O. I trust that AWS engineers know what they are doing.
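A quick back-of-envelope sketch of this bandwidth gap (the per-core speeds and the S3 figure are the rough estimates quoted above, not benchmarks):

```python
# Back-of-envelope: does S3 bandwidth keep up with processing speed?
# All figures are the rough per-instance estimates quoted in the text.

S3_BANDWIDTH_MB_S = 100       # optimistic S3 throughput for a big instance
CORES = 32                    # c3.8xlarge vCPU count
MR_MB_S_PER_CORE = 5          # MapReduce/Hive processing speed per core
FAST_MB_S_PER_CORE = 100      # Redshift/Impala-class speed per core

mr_demand = CORES * MR_MB_S_PER_CORE       # what MapReduce can consume
fast_demand = CORES * FAST_MB_S_PER_CORE   # what a fast engine can consume

print(f"MapReduce demand:   {mr_demand} MB/sec vs ~{S3_BANDWIDTH_MB_S} from S3")
print(f"Fast-engine demand: {fast_demand} MB/sec, "
      f"{fast_demand // S3_BANDWIDTH_MB_S}x what S3 can deliver")
```

Even plain MapReduce on a 32-core box can outrun S3; an efficient engine outruns it by more than an order of magnitude.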

Now let's recall that big data is usually a huge pile of cheap data. By cheap I mean low value per GB. Should the data be expensive (like financial transactions), it could happily live in Oracle plus enterprise storage.

So, how do we utilize our resources more efficiently? On one hand we have a lot of slow, inexpensive storage, and on the other a bit of fast, expensive storage. The obvious solution is to use the fast storage as a cache. These days this is rather common: DRAM holds the disk cache, and SRAM inside the CPU is used as a cache for DRAM.

In this situation I suggest using local SSD drives as a cache for cloud storage (S3).
So, what does that mean? An effective cache should meet the following requirements:
Store the hot data set. This is the main duty of a cache. The usual heuristic is LRU – least recently used. We assume that data used recently has a good chance of being used again.

Prefetch: predict what data will be needed and load it ahead of time. For a disk cache this is read-ahead (if we read the first block of a file, there is a good chance we will need the next one). For CPUs there are very advanced prefetch algorithms. For a typical data warehouse we can assume that recently added data has a better chance of being of interest than old data…
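A minimal sketch of such a read-through LRU cache (here `fetch_from_s3` is a stand-in for the real cloud-storage read, not an actual API):

```python
# A read-through LRU cache: local SSD as a cache for objects living
# in cloud storage. `fetch_from_s3` is a placeholder, not a real API.
from collections import OrderedDict

def fetch_from_s3(key):
    # stand-in for the (slow) cloud-storage read
    return f"data-for-{key}"

class LocalDriveCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # key -> data, ordered by recency

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        data = fetch_from_s3(key)            # cache miss: go to S3
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
        return data

cache = LocalDriveCache(capacity=2)
cache.read("a"); cache.read("b")
cache.read("a")            # "a" is now most recently used
cache.read("c")            # evicts "b", the least recently used
print("b" in cache.cache)  # → False
```

A real implementation would of course store the cached objects on the SSD itself and add the prefetch heuristics described above.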

To summarize: I believe that to efficiently process big data in the cloud, we need to use local drives as a cache of the dataset stored in cloud storage. I also believe that other approaches will not be efficient as long as cloud storage remains much slower and cheaper than local drives.

Multi-engine data processing

There is a lot of criticism of HDFS – it is slow, it has a SPOF, its files are read-only once written, etc. All of the above is true. Systems built on top of a local file system are more efficient than those built on top of HDFS (like Cassandra vs. HBase). That is also true.
However, HDFS has brought us a new opportunity. We have a platform to share data among different database engines. Let me elaborate.
Today there are a number of database engines built on top of HDFS, such as Hive, Pig, Impala, Presto, etc. Many Hadoop users also develop their own partially generic MapReduce jobs, which could be called specialized database engines.
Before we had such a common platform, we had to select a single database engine for our analytics and ETL needs. It was a tough choice, and it was lame. The selected engine owned the data by storing it in its own format. We were forced to write custom data operations not in the language of our choice, but in the database's internal language – PL/SQL, Transact-SQL, etc. When our processing became less relational, we wrote a lot of logic in the database's internal language and had to pay for a database licence just to run our own code…
The first spring bird of change was Hive's external tables. At first glance it was a minor feature – instead of letting Hive choose the place and format for the data, we got the chance to do that job for it. Conceptually, it was a big change: we began to own our data. Now it can be produced by our own MR jobs and processed any way we like outside of Hive, and at the same time we can use Hive to query it.
Then came Impala, which, instead of defining its own metadata, reuses Hive's and thus inherits those wonderful external tables. So we have two engines capable of processing the same data – for the first time, as far as I remember.
Why is this so important? Because we are no longer held hostage by a given engine's problems and bugs. Let's say Impala is perfect at running 5 out of our 6 queries but fails at the last one. What do we do? Right: we run 5 queries using Impala and the last one using Hive, which is slower but a much less demanding creature than Impala.
Then came Spark with its real-time capabilities, Presto from Facebook, and, I am sure, many more are to come. How do we select the right one?
In my opinion, we should stop selecting and instead combine them. We should shift to a single-data, multiple-engines paradigm. Thanks to the gods of free open source software, this does not involve licensing costs.
What makes it possible: common storage, open data formats and shared metadata. HDFS is the storage; the data formats are CSV, Avro, Parquet, ORC, etc.; and the shared metadata is the Hive metastore, with HCatalog as its successor.
We do not have to stage a revolution, but I think we should prefer engines that run on top of common storage (like HDFS), data formats that are understood by many engines, and database engines that do their best at data processing while giving us freedom of data placement and formats.

Apache Drill Design Meeting

MapR folks invited me to participate in the Apache Drill design meeting. The meetup site indicates that 60 people participated, which sounds about right.

Tomer Shiran started the meeting with an overview of the Apache Drill project. Then I (Camuel here) presented our team's view of the Apache Drill architecture. Jason Frantz of MapR continued, touching on technical aspects in the follow-on discussion. After a pizza break, Julian Hyde presented his view on logical/physical query plan separation and suggested using the Optiq framework for the DrQL optimizer.

Overall, my takeaways are as follows:

  1. There is very healthy interest in interactive querying of BigData.
  2. There was not a single voice calling for making vanilla Hadoop fit this task.
  3. There is general consensus on a plurality of query languages and a plurality of data formats.
  4. There is general consensus that the user should always be free to supply a manually authored physical query plan for execution, bypassing the optimizer altogether, as opposed to hardcore hinting.
  5. Except for me, no one tried to challenge the “common logical query model” concept. Since there are no real joins in Dremel, no indexes, and only one data source with exactly one access method – a single full table scan – I cannot see the justification for the complexity of optimizers and the logical query model. Dremel is an antidote concept to all this.

Thank you, MapR, for the Drill initiative, the great design meeting and the invitation.

Debunking common misconceptions in SSD, particularly for analytics

1. SSD is NOT a synonym for flash memory.

First of all, let's settle on terms. SSD is best described as the concept of using semiconductor memory as a disk. There are two common cases: DRAM-as-disk and flash-as-disk. Flash memory is a semiconductor technology pretty similar to DRAM, just with a slightly different set of trade-offs.

Today there are few options for using flash memory in analytics beyond SSD. Nevertheless, this should not suggest that SSD is a synonym for flash memory. Flash memory can be used in products beyond SSD, and an SSD can use non-flash technology, for example DRAM.

So the question is: do we have any option of using flash memory in a form other than flash-as-disk?

FusionIO is the only one, and it was always bold in claiming that its product is not an SSD but a totally new category of product, called ioMemory. Usually I dismiss such claims automatically, subconsciously, as the common practice of term obfuscation. However, in the case of FusionIO I found it to be a rare exception and technically true. At the hardware level there is no disk-related overhead in the FusionIO solution, and in my opinion FusionIO is closest to the flash-on-motherboard vision among all SSD manufacturers. That said, FusionIO succumbed to implementing a disk-oriented storage layer in software because of the unavailability of any other standard covering the flash-as-flash concept.

You can find more in-depth coverage of the New-Dynasty SSD versus Legacy SSD issue in a recent article by Zsolt Kerekes on StorageSearch.com, albeit I don't 100% agree with his categorization.

2. SSD DOESN’T provide more throughput than HDD.

The bragging about the performance density of SSD can safely be dismissed. There is no problem stacking HDDs together: as many as 48 of them can be put in a single 4U server box, providing an aggregate throughput of 4 GB/sec for a fraction of the SSD price. The same goes for power, vibration, noise, etc. The extent to which these properties are superior on SSD is uninteresting and does not justify the associated cost premium.

Further, for any amount of money, HDD can provide significantly more I/O throughput than SSD from any of today's vendors – on any workload: read, write or combined. Not only that, but it will do so with an order of magnitude more capacity for your big data as an additional bonus. However, a few nuances must be considered:

  • If data is accessed in random small chunks (say, 16KB), then SSD will provide significantly more throughput (a factor of x100, maybe) than disk. Reading in chunks of at least 1MB puts HDD back in the lead in the throughput game.
  • The flash memory itself has great potential to provide an order of magnitude more throughput than disks. The mechanical “gramophone” technology of disks cannot compete in agility with electrons. However, this potential throughput is hopelessly left unexploited by today's SSD controllers. How bad is it? Pretty bad: SSD controllers pass on less than 10% of the potential throughput. The reasons include flash-management complexity; cost constraints leading to small embedded DRAM buffers and computationally weak controllers; and, the main reason, the fact that there is no standard for a 100x-faster disk, and legacy software couldn't keep up with multi-gigabyte throughputs anyway. So SSD vendors don't bother, and are obsessed with the laughable idea of bankrupting HDD manufacturers, calling the technology disruptive, which it is not by definition. We end up with a much more expensive disk replacement that is only barely more performant, throughput-wise, than a vanilla low-cost HDD array.
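The chunk-size effect in the first bullet is easy to model: each random access pays a seek, so small chunks cripple HDD throughput while large chunks amortize the seek away. The ~10 ms seek and ~150 MB/sec sequential rate below are illustrative assumptions, not measurements:

```python
# Rough model of effective HDD throughput vs. access chunk size.
# Illustrative assumptions: ~10 ms per random access, ~150 MB/sec
# sequential transfer rate.
SEEK_S = 0.010          # seconds per random access
SEQ_MB_S = 150.0        # sequential transfer rate, MB/sec

def hdd_throughput(chunk_mb):
    # time per access = seek + transfer; throughput = data moved / time
    return chunk_mb / (SEEK_S + chunk_mb / SEQ_MB_S)

for chunk_mb in (16 / 1024.0, 1.0, 16.0):
    print(f"{chunk_mb * 1024:7.0f} KB chunks -> "
          f"{hdd_throughput(chunk_mb):6.1f} MB/sec")
```

With 16KB chunks the model gives under 2 MB/sec per spindle (hence the ~x100 SSD advantage), while at 1MB and above the drive recovers most of its sequential rate.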

3. An array of SSDs DOESN'T provide a larger number of useful IOPS than an array of disks.

While it is true that one SSD can easily match a disk array in IOPS, it should not suggest that an array of SSDs will provide a larger number of useful IOPS. The reason is prosaic: an array of disks already provides an abundance of IOPS, many times more than enough for any analytic application. Any additional IOPS are simply not needed, and the astronomical IOPS numbers of SSD arrays are a solution looking for a problem in the analytics industry.

4. SSDs are NOT disruptive to disks.

Well, if it is true, it is not according to Clayton Christensen's definition of “disruptiveness”. As far as I remember, Christensen defines technology A as disruptive to technology B when all of the following hold true:

  • A is worse than technology B in quality and features
  • A is cheaper than technology B
  • A is affordable to a large number of new users for whom technology B is appealing but too costly.

The SSD-to-disk pair clearly satisfies none of the conditions above, so I'm puzzled how one can call it disruptive to disks.

Again: I'm not claiming that SSD or flash memory is not disruptive to any technology; I'm just claiming that SSD is not disruptive to HDD. In fact, I think flash memory IS disruptive to DRAM – all three conditions above hold for the flash-to-DRAM pair. Also, directly attached SSDs are highly disruptive to SAN.

—————

Make no mistake, I'm a true believer in flash memory as a game-changer for analytics – just not in the form of a disk replacement. I'll explore in upcoming posts the ideas where flash memory can make a difference. I know I totally skipped quantitative proof for all the claims above, but… well… let's leave that for the comment section.

Also, one of the best pieces of coverage of flash memory for analytics (and not coming from a flash vendor) is by Curt Monash on the DBMS2 blog: http://www.dbms2.com/2010/01/31/flash-pcmsolid-state-memory-disk/


Google Percolator: MapReduce Demise?

Here are my early thoughts after quickly looking into Google Percolator and skimming the paper.

Major takeaway: massive transactional mutation of a tens-of-petabytes-scale dataset on a thousands-node cluster is possible!

MapReduce is still useful for distributed sorts of big data and a few other things; nevertheless, its “karma” has suffered a blow. Before, you could end any MapReduce dispute by saying “well… it works for Google”; nowadays, before you can say that, you will hear “well… it didn't work for Google”. MapReduce is particularly criticized for 1) too-long latency, 2) being too wasteful, requiring a full rework of the whole tens-of-PB-scale dataset even if only a fraction of it has changed, and 3) an inability to support kinda-real-time data processing (meaning processing documents as they are crawled and updating the index appropriately). In short: welcome to the disillusionment stage of the MapReduce saga. Luckily, Hadoop is not only MapReduce. I'm convinced Hadoop will thrive and flourish beyond MapReduce, and that MapReduce, being an important big data tool, will be widely used where it really makes sense rather than misused or abused in various ways. Aster Data and the remaining MPP startups can relax on the issue a bit.

Probably a topic for another post, but I think MapReduce is best leveraged as an ETL tool.

See also http://blog.tonybain.com/tony_bain/2010/09/was-stonebraker-right.html for another view on the issue. There are a few other posts already published on Percolator, but I haven't yet looked into them.

I'm very happy with my SVLC hypothesis. I think I have known it for a long time, but somehow only now, after putting it on paper, do I feel that reasoning about different analytics approaches has become easier. It is like having a map instead of visualizing one. So where is Percolator in the context of SVLC? If it is still considered analytics, Percolator is an SVC system – giving up latency for everything else, albeit to a much lesser degree than its predecessor MapReduce. That said, Percolator has a sizable part that is not analytics anymore but rather transaction processing, and transaction processing is not usefully modeled by my SVLC hypothesis. In summary: Percolator makes essentially the same trade-off as MapReduce – sacrificing latency for volume-cost-sophistication – but is more temperate, more rounded, less radical.

Unfortunately, I haven't had enough time to enjoy the paper as it should be enjoyed, with easy weekend-style reading, so inaccuracies may have infiltrated the following:

  • Percolator is a big-data, ACID-compliant, transaction-processing, non-relational DBMS.
  • Percolator fits most NoSQL definitions and therefore is NoSQL.
  • Percolator continuously mutates its dataset (called the data corpus in the paper) with full transactional semantics, at sizes of tens of petabytes on thousands of nodes.
  • Percolator uses a message-queue-style approach for processing crawled data. Meaning, it processes crawled pages continuously as they arrive, updating the index database transactionally.
  • BEFORE Percolator: indexing was done in stages taking weeks. All crawled data was accumulated and staged first, then transformed pass after pass into the index; 100 passes were quoted in the paper, as I remember. When a cycle was completed, a new one was initiated. A few weeks of latency between content being published and appearing in Google search results was considered too long in the Twitter age, so Google implemented some shortcuts allowing preliminary results to show in search before the cycle completed.
  • Percolator doesn't have a declarative query language.
  • No joins.
  • Self-stated ~3% single-node efficiency relative to a state-of-the-art DBMS on a single node. That's the price for handling (that is, transactionally mutating) a high-volume dataset… relatively cheaply. Kudos to Google for being so open about this and not exercising term obfuscation. On the other hand, they can afford it… they don't have to sell it tomorrow on the rather competitive NoSQL market ;)
  • Thread-per-transaction model. Heavily threaded many-core servers, as I understand.
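To make the batch-versus-incremental contrast concrete, here is a toy sketch of the two styles (the names and structure are my own illustration, not Percolator's API):

```python
# Toy illustration (my own, not Percolator's API): batch reindexing
# vs. observer-style incremental updates to an inverted index.

def batch_reindex(corpus):
    # MapReduce style: rebuild the whole index from scratch every cycle
    index = {}
    for doc_id, text in corpus.items():
        for word in text.split():
            index.setdefault(word, set()).add(doc_id)
    return index

class IncrementalIndexer:
    # Percolator style: an "observer" fires per arriving document,
    # touching only the rows that document affects
    def __init__(self):
        self.index = {}

    def on_document_arrived(self, doc_id, text):
        for word in text.split():
            self.index.setdefault(word, set)().add(doc_id) if False else \
                self.index.setdefault(word, set()).add(doc_id)

corpus = {1: "big data", 2: "big cloud"}
inc = IncrementalIndexer()
for doc_id, text in corpus.items():
    inc.on_document_arrived(doc_id, text)
inc.on_document_arrived(3, "new page")     # touches only one document
print(inc.index["big"])                    # → {1, 2}
```

Both styles end up with the same index; the difference is that the incremental one does work proportional to the new document, not to the whole corpus.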

Architecturally, it reminds me of MoM (Message-Oriented Middleware) with transactional queues and guaranteed delivery.

Definitely to be continued…

Other Percolator blog posts:

http://www.infoq.com/news/2010/10/google-percolator
http://www.theregister.co.uk/2010/09/24/google_percolator
http://coolthingoftheday.blogspot.com/2010/10/percolator-question-is-how-does-google.html
http://www.quora.com/What-is-Google-Percolator

CAP equivalent for analytics?

The CAP theorem deals with trade-offs in transactional systems. It doesn't need an introduction – unless, of course, you have been busy on the moon for the last couple of years, in which case you can easily Google for good intros. Here is the Wikipedia entry on the subject.

I was thinking about how I would build an ideal analytics system. The realization quickly came that all the “care-abouts” cannot be satisfied simultaneously, even assuming enough time for development. Some desirable properties must be sacrificed in favor of others; hence architectural trade-offs are unavoidable in principle. I immediately had déjà vu regarding CAP. So the following is my take on the subject:

SVLC hypothesis regarding architectural trade-offs in analytics

I haven't come to a rigorous definition yet; here is an intuitive one: current technology doesn't allow implementation of a single analytics system that is SVLC, i.e. simultaneously sophisticated, high-volume, low-latency and low-cost. One of these four properties must be sacrificed, and the extent to which it is sacrificed determines the extent to which the other properties can be implemented.

Deep dive for the brave souls

Let’s reiterate the desired system properties first (see ideal analytics system):

  1. Deep Sophistication => …free-form SQL:2008 with multi-way joins of 2 or more big tables, sorts of big tables, and all the rest of the data heavy lifting.
  2. High Volume => …handling big data volumes. Let's cap it at 1PB for now, for easier thinking.
  3. Low Latency => …subsecond response time per query on average. More concretely, latency must be low enough to let an analyst work interactively, in a conversational manner, with the system.
  4. Low Cost => …I'll define it as commodity hardware, with software costs not exceeding hardware costs. More rigorously? $1/GB/month for actively queried data is my very rough estimate of low-cost.
  5. Multi-form => …any data: relational, serialized objects, text, etc.
  6. Security => …speaks for itself.

I found that multi-formness and security don't interfere with implementing the rest of the properties and can in principle always be implemented satisfactorily, without major compromises. Some nuances exist, though, but I'll ignore them for clarity. Removing those two, we get the following list:

1. Sophistication (deep) => S

2. Volume (high) => V

3. Latency (low) => L

4. Cost (low) => C

These four are highly inter-related and form a constraint system: implementing one to the full extent hampers the rest. Let's see what trade-offs we have here. Four properties give 6 potential simple two-extreme trade-offs. Let's settle on a geometric tetrahedron to model the architectural trade-off space: the four properties correspond to the four vertices, and the six trade-offs correspond to the six edges. We then model a particular trade-off by putting a point on the corresponding edge. So we get something like this:

Okay, so far so good. Now I'll try to be a devil's advocate and challenge my own point that any trade-offs are necessary in the first place. So let's review the system denoted as

SVLC=> high-volume, low-latency, deep analytics, low cost

Because it is low-latency, it needs I/O throughput adequate to scan the whole dataset quickly, and since it is high-volume (see above for the quantitative definition), meaning the dataset is big, it needs a large number of individual nodes in the cluster to provide the required aggregate I/O throughput. The number of machines is further increased by the low-cost requirement, which means simpler servers in the mainstream sweet spot must be purchased. The system therefore becomes extremely distributed, with data dispersed all over it. Low-cost networking usually means TCP/IP, which is high-overhead, high-latency and low-throughput. Deep Sophistication requires performing complex data-intensive operations – full sorts of big datasets, joins of big tables, or even a simple SELECT DISTINCT over big data – that will inevitably have long latencies. Once latency is long enough, the probability of a node failing mid-query becomes non-negligible. The latency increase then becomes self-perpetuating, because a finer grain of intermediate result materialization is required to prevent never-ending query restarts and provide a kind of resumable query. No solutions to resumable queries are documented other than MapReduce-style intermediate result materialization. This ultimately makes latency batch-class long, violating the low-latency requirement.

I guess my proof misses the rigor required to be considered seriously by academics – I'm just an engineer :) I'd love to see it reworked into something more serious, though. I just hope to get the point across and to be of value to engineers and practicing architects.

Anyhow, this is the basis of my hypothesis: it is impossible to achieve full SVLC using today's technology.

Let's consider the other cases, where we give up something. It is easy to visualize such a trade-off as a 2D plane dissecting the tetrahedron; the three points where the three edges are cut correspond to three trade-offs. For simplicity, I'll elaborate only on radical trade-offs in this post. Radical trade-offs are those where, on all six trade-off edges, one extreme is selected; this corresponds to putting the trade-off point on one of the vertices. Most real-world systems make temperate trade-offs, which correspond to a plane dissecting the tetrahedron into sizable parts. Moreover, real-world systems, especially those available from commercial vendors, are toolbox systems: the system consists of a set of tools, each making a different set of trade-offs, and it is up to the engineer to choose the right tool for the job from the toolbox. However, the toolbox approach is not a loophole for this hypothesis, because the properties of the different tools don't add up in the desired way. For example, the simultaneous use of an expensive tool and a low-cost tool is still expensive; the simultaneous use of low-latency and high-latency is high-latency. Nevertheless, the toolbox approach is the best one for real-world problems, because real-world problems are usually decomposed into a number of sub-problems, each of which may require a different tool.
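The "properties don't add up" point can be sketched as a toy scoring rule (the scores and the combination rule are my own illustration, not a rigorous model):

```python
# Toy model of combining two tools in a toolbox. Each tool is scored
# 0-3 on the four SVLC axes (higher = better on that axis); the scores
# below are made-up illustrations.
def combine(tool_a, tool_b):
    # Capabilities union: sophistication and volume take the better score.
    # Burdens accumulate: an end-to-end pipeline is as slow as its slowest
    # stage, and you pay for both tools, so latency and cost take the
    # worse score.
    return {
        "S": max(tool_a["S"], tool_b["S"]),
        "V": max(tool_a["V"], tool_b["V"]),
        "L": min(tool_a["L"], tool_b["L"]),   # worse latency wins
        "C": min(tool_a["C"], tool_b["C"]),   # worse cost wins
    }

mpp = {"S": 3, "V": 3, "L": 3, "C": 0}        # SVL: gives up Cost
hadoop = {"S": 3, "V": 3, "L": 0, "C": 3}     # SVC: gives up Latency

print(combine(mpp, hadoop))   # still neither low-cost nor low-latency
```

Pairing an SVL system with an SVC system does not yield SVLC: the combination inherits the high cost of the first and the high latency of the second.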

Well… back to the radical systems. Let's consider all four cases where we completely give up one property to max out the other three:


SVL => high-volume, low-latency, deep analytics… giving up Cost. Seems implementable. In its pure form it resembles a classic national-security-style system: subsecond querying of a petabyte-scale dataset with arbitrary joins. A heavily over-provisioned Netezza / Exadata / Greenplum / Aster or other MPP system could do it, I believe. Data is kept in huge RAM or on flash; huge I/O is available to scan the whole dataset in a matter of seconds; high-speed, low-overhead networking with huge bi-section bandwidth is available, capable of shuffling the whole dataset in a matter of seconds – Infiniband/RDMA are probably the best fit. How bad can Cost be here? Well… unhealthy to imagine. Throw some numbers in anyway? I will do some back-of-envelope calculations in future posts.

SVC => high-volume, low-cost, deep analytics… giving up Latency. Seems implementable; in fact this is MapReduce territory, Hadoop's natural habitat. Are ETL systems SVC? I think not, because while they give up Latency, they don't keep up on Volume. How bad is Latency? Well… forget interactivity: create a queuing system and get notified when the job is done; if it's too slow, add servers. If some interactive experimentation is needed, use VLC first to develop and prove your hypothesis, and only then crunch the data with SVC. Since cost is involved, I guess Hadoop MapReduce is really the king here. Though if Aster licenses, for example, are comparable to the overall cost of a commodity cluster, and not many multiples of it, then it could fit the category nicely; otherwise it makes a suboptimal (in the context of my model, not in a wider sense!) but great SV system. The great MapReduce debate is not for nothing!

SLC => low-latency, low-cost, deep analytics… giving up Volume. Seems implementable in a minute: just start your favorite spreadsheet application ;) You will be shocked how much data Excel crunches in just a few seconds nowadays. Most traditional BI tools are in this category too. Heck, if not for BigData, the analytics industry would have become as exciting as enterprise payroll systems – though innovation is possible even there. Heck, 99% of BI is fully feasible to do completely in-memory, often on a single server, and the deployment should be really low-risk, low-cost and very rapid if done correctly. Most cloud BI vendors are also in this category; “R-project” is here too. This was Kickfire's beloved spot, as it is now for QlikTech & GoodData, PivotLink, etc. So pretty much all BI vendors are here, except the MPP heavy-lifters. How badly is Volume limited? Well, with CPU-DRAM bandwidth being 50 GB/sec and DRAM sizes of 64 GB on common commodity servers, I think crunching a few tens of GB should be well possible in a matter of seconds, if not for implementation sloppiness, and with literally pocket money (an average enterprise's pocket, not mine… yet).

VLC => high-volume, low-latency, low-cost… giving up Sophistication. Seems implementable: do a simple scan and give up Sophistication, particularly joins. Dremel and BigQuery seem to follow this approach. How bad is giving up Sophistication? Well, it all depends on how pre-joined/nested the dataset is. With normalized schemas, well… the unavailability of joins makes it pretty much impractical to implement any usable analytics. However, with a star schema, and particularly with nested data (with some extensive pre-joins, even if they mean some redundancy), this can work wonders for the vast majority of queries, completing them in seconds even on large datasets. However, no pre-join strategy will work for 100% of queries, and functions like COUNT DISTINCT must be approximated when run over a big dataset, as described in the Dremel paper. I'll also assign the sampling strategy to this category, because sophistication here also means accuracy. One clarification: only joins of several big tables are sacrificed here; joins of a big table with even a large number of small tables are perfectly okay and done on the fly during the scan. Sorts of a big table before it has been reduced to a manageable size are also sacrificed in this approach, though approximation algorithms can be used here too.
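As an illustration of the kind of approximation involved, here is a minimal sketch of approximate COUNT DISTINCT using the k-minimum-values idea – one common family of techniques, not necessarily what Dremel itself uses:

```python
# A minimal k-minimum-values (KMV) sketch for approximate COUNT DISTINCT:
# keep the k smallest distinct hash values seen; if the k-th smallest
# hash, normalized to [0, 1], is h, the distinct count is roughly (k-1)/h.
import hashlib
import heapq

def approx_count_distinct(values, k=256):
    max_hash = float(2 ** 64)
    kept = []          # max-heap (via negation) of the k smallest hashes
    kept_set = set()   # hashes currently kept, for dedup
    for v in values:
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        if h in kept_set:
            continue
        if len(kept) < k:
            heapq.heappush(kept, -h)
            kept_set.add(h)
        elif h < -kept[0]:                 # smaller than current k-th min
            evicted = -heapq.heappushpop(kept, -h)
            kept_set.discard(evicted)
            kept_set.add(h)
    if len(kept) < k:
        return len(kept)                   # fewer than k distinct: exact
    return int((k - 1) * max_hash / -kept[0])

print(approx_count_distinct(["a", "b", "a", "c"]))   # → 3
est = approx_count_distinct(range(50000))
print(f"estimated ~{est} distinct values (true count 50000)")
```

The sketch keeps only k values regardless of input size, which is exactly why such functions scale to a full-scan engine; production systems typically use more refined sketches such as HyperLogLog.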

Hence the conclusion: only 3 of the 4 SVLC properties can be implemented to their full extent in a single analytics system. The hypothesis goes that in any attempt that allegedly violates it, either it is not in fact a single system, or the impairment is latent in one or more properties.

[TODO: rewrite] The extended hypothesis for fractional cases:

  • Systems/trade-offs may be radical or temperate. A radical trade-off completely gives up one of the four properties of the system. A temperate trade-off gives up a property only fractionally, at the expense of giving up other properties also fractionally.
  • Most real-world systems are complex. They are a set of tools, where each separate tool is a concrete trade-off. The user of such a system can then apply different tools with different trade-offs sequentially or simultaneously. This may seem like a way out of the constraint; however, it is not, because the properties of separate tools don't add up in the desired way. For example, the simultaneous use of an expensive tool and a low-cost tool is still expensive; the simultaneous use of low-latency and high-latency is high-latency. Nevertheless, the toolbox approach is the best one for real-world problems, because real-world problems are usually decomposed into a number of sub-problems, each of which may require a different tool.
  • Most often, the trade-offs of real-world systems are temperate.


Analytics Patterns

Unsatisfied with my previous post's definition of Advanced Analytics, and giving some thought to what “advanced methods” in analytics are, I realized that the analytics industry misses a good catalog of analytics patterns: a list of common problems followed by a list of common, industry-consensus solutions to them – an equivalent of the GoF design patterns for analytics. In such a list, each item would start with a brief description of a common recurring analytics problem, followed by an elaboration of the commonly accepted solutions to it, followed by a mandatory example section illustrating the solution using widely available tools.

Software engineers stole this idea from real architects (those dealing with concrete structures, not abstract ones ;) ) 15 years ago. They didn't avoid the initial short period of mass obsession and abuse of the concept… who does? But eventually it worked out quite well for them. I wonder if the analytics industry could leverage this experience and create a catalog of some 25-50 of the most common patterns – pattern descriptions not exceeding a few pages and the number of patterns limited to a few tens, making wide industry adoption feasible.

What do you think? Any ideas? I’ll try to make a first step by dumping patterns from my head right now (it is by no means a finished work):

I’ll call them analytics patterns:

1. Predictive Analytics. That was the easiest for me. I was involved in it for the first time some 12 years ago, developing what is now http://www.oracle.com/demantra/index.html. The system was used mostly to forecast sales, taking into account an array of causal factors like seasonality, marketing campaigns, historical growth rates, etc. The problem is that a lot of time-based historic data is available, and it is required to forecast future values in the context of the given historic data. The basic mechanism of implementing Predictive Analytics is to find, or less preferably to develop, a suitable mathematical model that closely models the existing data (but be cautious about overfitting), usually time-series data, and then use the model to induce forecasted values. In simple terms, it is a case of extrapolation. Correct me if I’m wrong. As was the case in the 90s, and I’m pretty sure is the case now, exotic hardcore AI approaches like neural networks and genetic programming are best kept exclusively for moonlighting experiments and as material for water-cooler conversation the next morning. With deadlines defined and a limited budget, it is best to stick to proven techniques to achieve quick wins. I think the value of working forecasting is self-evident.
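To make the extrapolation point concrete, here is a minimal sketch in Python: a least-squares linear trend fitted to a hypothetical monthly sales series and projected forward. Real forecasting engines layer seasonality, causal factors and overfitting controls on top of this; the numbers and function names below are purely illustrative.

```python
# Minimal sketch of Predictive Analytics as extrapolation: fit a linear
# trend y = a + b*t to historic monthly sales and project it forward.

def fit_linear_trend(history):
    """Least-squares fit of y = a + b*t over t = 0..n-1."""
    n = len(history)
    t_mean = (n - 1) / 2
    y_mean = sum(history) / n
    cov = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(history))
    var = sum((t - t_mean) ** 2 for t in range(n))
    b = cov / var
    a = y_mean - b * t_mean
    return a, b

def forecast(history, periods):
    """Extrapolate the fitted trend for the next `periods` steps."""
    a, b = fit_linear_trend(history)
    n = len(history)
    return [a + b * (n + k) for k in range(periods)]

sales = [100, 110, 121, 128, 140, 152]   # hypothetical monthly sales
print(forecast(sales, 3))                # three months ahead
```

A real model would also report confidence intervals; a straight line that fits history perfectly can still forecast badly, which is the overfitting caveat above.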

2. Clustering. Well, not the heavy noisy one in a cold hall :) but the statistics sub-discipline better called cluster analysis. The problem here is that a lot of high-dimensionality data is available, and it is required to discover groups of similar observations, in other words to automatically classify them. It is implemented by searching for correlations and grouping the records according to the discovered correlations. What is it good for? Well, in simple terms it helps to discriminate between different kinds of objects and to observe the specific properties of each kind. Without such grouping, one would only be able to observe properties that all objects exhibit, or alternatively go object by object and observe each in isolation.
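As an illustration of the grouping mechanism, here is a toy k-means sketch in plain Python. The data and the naive initialization are made up; in practice one would reach for a library (scikit-learn, for example) and choose the number of clusters carefully.

```python
# Toy sketch of cluster analysis: plain k-means on 2-D points.
import math

def kmeans(points, k, iters=20):
    centers = points[:k]                      # naive deterministic init
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[i].append(p)
        # update step: each center moves to the mean of its group
        centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers, groups

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(pts, 2)
print(centers)   # two centers, one near the origin, one near (10, 10)
```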

3. Risk Analysis, particularly through Monte-Carlo simulation. It is not called Monte-Carlo because it was invented there :) but because of its reliance on random numbers, akin to the Monte-Carlo casinos. Random numbers have proved to be the most effective way to simulate a mathematical model with a large number of free variables. With the advent of computers it became a whole lot easier than using a book of random-number tables.
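A minimal sketch of the idea: estimate the risk that a small project with three uncertain task durations overruns its deadline. The triangular distributions and all the numbers are hypothetical, chosen only to show the mechanics.

```python
# Sketch of Monte-Carlo risk analysis: estimate the chance that a project
# with three uncertain task durations blows past a 30-day deadline.
import random

random.seed(42)                           # reproducible run

def one_trial():
    # (min, most likely, max) durations in days -- hypothetical numbers
    tasks = [(3, 5, 10), (8, 10, 15), (7, 9, 14)]
    return sum(random.triangular(lo, hi, mode) for lo, mode, hi in tasks)

trials = 100_000
overruns = sum(one_trial() > 30 for _ in range(trials))
print(f"P(total > 30 days) ~ {overruns / trials:.3f}")
```

The same loop works unchanged for models with hundreds of free variables, which is exactly where analytic approaches become intractable and simulation shines.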

4. Given a telecom event stream, run the events through a rules engine to detect and prevent telecom fraud in real time. This is essentially a CEP engine, and it is usually implemented by creating a state machine per rule and running the events through it. A special version of streaming SQL is used. A similar scheme can be used for real-time click-fraud prevention.
5. Given serialized object data or nested data, allow running ad-hoc interactive queries over it, BigQuery-fashion.
6. Given a normalized relational model, allow running any ad-hoc queries. For common joins, create a materialized view to speed them up.
7. Canned reports. I guess they are good for some cases too……
8. OLAP/star schema: when to use it? ……
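The "state machine per rule" idea from pattern 4 can be sketched roughly like this: a single fraud rule keeping per-subscriber state as a sliding window of call timestamps, firing when too many premium-rate calls arrive within a minute. The event shape and the thresholds are invented for illustration; a real CEP engine would run many such rules compiled from streaming SQL.

```python
# Sketch of one CEP rule as a stateful machine: flag a subscriber who
# makes more than `limit` premium-rate calls within a `window` of seconds.
from collections import defaultdict, deque

class BurstRule:
    def __init__(self, limit=3, window=60):
        self.limit, self.window = limit, window
        self.state = defaultdict(deque)          # subscriber -> timestamps

    def on_event(self, ts, subscriber, call_type):
        if call_type != "premium":
            return None
        w = self.state[subscriber]
        w.append(ts)
        while w and ts - w[0] > self.window:     # slide the window forward
            w.popleft()
        if len(w) > self.limit:
            return f"ALERT: {subscriber} made {len(w)} premium calls in {self.window}s"
        return None

rule = BurstRule()
for ts, sub, kind in [(t, "alice", "premium") for t in (0, 10, 20, 30)]:
    alert = rule.on_event(ts, sub, kind)
    if alert:
        print(alert)                             # fires on the fourth call
```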

What else?

Of course this is just a first step; to do it correctly would be a project in itself, most probably in the form of a book. However, as the Chinese proverb goes, “A journey of a thousand miles begins with a single step”.

Feature list of ultimate BigData analytics

  • Volume Scalability => the solution must handle high volumes of data, meaning the cost must scale linearly in the range of 10GB – 10PB.
  • Latency Scalability => the solution must support both interactive and batch use, and the cost must scale linearly in the range of 1 msec – 1 week.
  • Sophistication Scalability => the solution must support anything from simple summing scans to complex multi-way joins and statistics functionality, and the cost must scale linearly in the range from simplistic scans to full-blown SQL:2008/MDX/imperative in-database analytics/MapReduce. Report/index viewing is not considered analytics at all, and in particular is not low-sophistication analytics. Report/index creation is analytics and can be of a varied sophistication degree. ETL systems are considered independent analytic systems.
  • Security => any unauthorized access to the data must be prevented while, at the same time, in-place data analysis (like predicate evaluation) must be possible and resource-efficient.
    • Keeping the data always encrypted and keeping the keys always on the client will not work: it would require shipping all the data to the client, which is a non-starter for big data analytics. So compromises must be made. The issue is especially contentious in a public cloud setting.
    • If the data is stored encrypted and is continuously decrypted in place, for predicate evaluation for example, then the keys must be kept in the same place (at least temporarily), which compromises the whole scheme altogether, flooring its cost-benefit factor. The cost of decryption is also pretty high.
    • De-identification of all fields may work; random scaling may be applied to numeric fields, with subsequent query/result rewriting.
    • Security-by-obscurity methods and a defense-in-depth approach may have good cost-benefit factors, matching or exceeding the overall security of an in-house approach.
  • Cost => must have a low TCO that scales linearly with the dataset size and with the load caused by submitted queries. The breakdown (assuming cloud):
    • A storage component linear in the dataset size. Economies of scale must bring this cost down significantly; eventually it must be cheaper than on-site storage.
    • A computing component linear in the load, with infinite intra-query automatic elasticity. Guaranteed elasticity may bear a fixed premium proportional to the guaranteed capacity. Minor failures of a cloud component must not restart long-running queries.
    • A bandwidth component. FedExing hard drives is by far the cheapest way to upload data, and query results are really small; how much information can a human comprehend instantly, after all?
  • Multi-form => the following data forms must be equally well supported and cross-queried:
    • normalized relational
    • star schema
    • cubes
    • serialized objects / nested data
    • text
    • media
    • spatial
    • bio / scientific
    • topographical
    • and other data forms.
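The “random scaling with query/result rewrite” idea from the Security bullet above can be sketched as follows. This is toy code showing only the mechanics, not a vetted security scheme, and every name in it is made up: numeric fields are stored multiplied by a secret per-column factor, the untrusted server evaluates range predicates on the scaled values because the client rewrites the constants, and results are descaled on the way back.

```python
# Toy sketch of de-identification by random scaling with query rewrite.
import random

SCALE = random.uniform(0.5, 2.0)      # secret factor, kept on the client

def deidentify(rows):
    """Scale the numeric field before shipping rows to the server."""
    return [(name, salary * SCALE) for name, salary in rows]

def server_filter(scaled_rows, scaled_threshold):
    # Runs on the untrusted server: it sees only scaled values.
    return [r for r in scaled_rows if r[1] > scaled_threshold]

def client_query(scaled_rows, threshold):
    hits = server_filter(scaled_rows, threshold * SCALE)   # query rewrite
    return [(name, s / SCALE) for name, s in hits]         # result rewrite

data = [("alice", 90_000), ("bob", 120_000), ("carol", 150_000)]
print(client_query(deidentify(data), 100_000))   # salaries above 100k
```

A plain positive multiplicative factor preserves ordering, which is exactly what lets the server evaluate range predicates in place; it still leaks relative magnitudes, which is part of the compromise the Security bullet describes.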