Debunking common misconceptions about SSD, particularly for analytics

1. SSD is NOT synonymous with flash memory.

First of all, let’s settle on terms. SSD is best described as the concept of using semiconductor memory as a disk. There are two common cases: DRAM-as-disk and flash-as-disk. Flash memory is a semiconductor technology quite similar to DRAM, just with a slightly different set of trade-offs.

Today there are few options for using flash memory in analytics beyond SSD. Nevertheless, this should not suggest that SSD is synonymous with flash memory. Flash memory can be used in products other than SSDs, and an SSD can use non-flash technology, for example DRAM.

So the question is: do we have any option for using flash memory in a form other than flash-as-disk?

FusionIO is the only one, and it was always bold in claiming that its products are not SSDs but a totally new product category called ioMemory. Usually I dismiss such claims subconsciously as the common practice of term obfuscation. However, in the case of FusionIO I found it to be a rare exception and technically true. At the hardware level there is no disk-related overhead in the FusionIO solution, and in my opinion FusionIO is the closest to the flash-on-motherboard vision among all SSD manufacturers. That said, FusionIO succumbed to implementing a disk-oriented storage layer in software because no other standard covers the flash-as-flash concept.

You can find more in-depth coverage of the New-Dynasty SSD versus Legacy SSD issue in a recent article by Zsolt Kerekes on StorageSearch.com, although I don’t agree 100% with his categorization.

2. SSD DOESN’T provide more throughput than HDD.

The bragging about the performance density of SSD can safely be dismissed. There is no problem in stacking HDDs together. As many as 48 of them can be put in a single 4U server box, providing an aggregate throughput of 4GB/sec for a fraction of the SSD price. The same goes for power, vibration, noise, etc. The extent to which SSD is superior to disk in these properties is uninteresting and does not justify the associated cost premium.
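As a back-of-the-envelope check of the figures above, here is a minimal Python sketch. The function name is made up for illustration and 1GB is assumed to mean 1000MB; only the 48 drives and the 4GB/sec come from the text.

```python
def per_drive_throughput_mb(aggregate_gb_per_sec: float, drive_count: int) -> float:
    """Average throughput each drive must sustain, in MB/sec (assuming 1GB = 1000MB)."""
    return aggregate_gb_per_sec * 1000 / drive_count


# Figures from the text: 48 HDDs in one 4U box, 4GB/sec aggregate.
required = per_drive_throughput_mb(4, 48)
print(f"Each drive must sustain about {required:.0f} MB/sec")
```

That works out to roughly 83MB/sec per drive, a rate I would only expect from a disk doing large sequential reads.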

Further, for any amount of money, HDDs can provide significantly more IO throughput than SSDs from any of today’s vendors, on any workload: read, write or combined. Not only that, but they will do so with an order of magnitude more capacity for your big data as an additional bonus. However, a few nuances should be considered:

  • If data is accessed in small random chunks (let’s say 16KB chunks), then SSD will provide significantly more throughput (maybe by a factor of 100) than disk. Reading in chunks of at least 1MB makes HDD the winner in the throughput game again.
  • Flash memory itself has great potential to provide an order of magnitude more throughput than disks. The mechanical “gramophone” technology of disks cannot compete in agility with electrons. However, this potential throughput is hopelessly left unexploited by today’s SSD controllers. How bad is it? Pretty bad: SSD controllers pass on less than 10% of the potential throughput. The reasons include flash-management complexity and cost constraints that lead to small embedded DRAM buffers and computationally weak controllers. The main reason, though, is that there are no standards for a 100 times faster disk, nor could legacy software keep up with multi-gigabyte throughput. So SSD vendors don’t bother, and are obsessed with the laughable idea of bankrupting HDD manufacturers, calling the technology disruptive, which it is not by definition. So we have a much more expensive disk replacement that is only barely more performant, throughput-wise, than a vanilla low-cost HDD array.
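The chunk-size nuance above can be illustrated with a crude model of a single disk: every random chunk costs one positioning delay plus its transfer time. The 8 ms positioning time and the 100MB/sec sequential rate below are assumed round numbers, not measurements, and the function name is hypothetical.

```python
def hdd_effective_mb_per_sec(chunk_kb: float,
                             positioning_ms: float = 8.0,            # assumed seek + rotational delay
                             sequential_mb_per_sec: float = 100.0    # assumed sequential media rate
                             ) -> float:
    """Effective throughput of one disk reading random chunks of the given size."""
    chunk_mb = chunk_kb / 1000                       # decimal units, for simplicity
    transfer_sec = chunk_mb / sequential_mb_per_sec  # time spent actually moving data
    total_sec = positioning_ms / 1000 + transfer_sec
    return chunk_mb / total_sec


for chunk_kb in (16, 1000, 16000):
    rate = hdd_effective_mb_per_sec(chunk_kb)
    print(f"{chunk_kb:>6} KB chunks: {rate:6.1f} MB/sec")
```

Under these assumptions the disk delivers only about 2MB/sec with 16KB chunks, but more than half of its sequential rate once the chunks reach 1MB.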

3. An array of SSDs DOESN’T provide a larger number of useful IOPS than an array of disks.

While it is true that one SSD can easily match a disk array in IOPS, this should not suggest that an array of SSDs will provide a larger number of useful IOPS. The reason is prosaic: an array of disks already provides an abundance of IOPS, many times more than enough for any analytic application. So any additional IOPS are not needed, and the astronomical number of IOPS in SSD arrays is a solution looking for a problem in the analytics industry.

4. SSDs are NOT disruptive to disks.

Well, if they are, it is not according to Clayton Christensen’s definition of “disruptiveness”. As far as I remember, Christensen defines technology A as disruptive to technology B when all of the following hold true:

  • A is worse than technology B in quality and features
  • A is cheaper than technology B
  • A is affordable to a large number of new users to whom technology B is appealing but too costly.

None of the conditions above holds for the SSD-to-disk pair, so I’m puzzled how one can call SSD disruptive to disks.

Again, I’m not claiming that SSD or flash memory is not disruptive to any technology; I’m just claiming that SSDs are not disruptive to HDDs. In fact, I think flash memory IS disruptive to DRAM. All three conditions above hold for the flash-to-DRAM pair. Also, a pair of directly attached SSDs is highly disruptive to SAN.

—————

Make no mistake, I’m a true believer in flash memory as a game-changer for analytics, just not in the form of a disk replacement. In my upcoming posts I’ll explore ideas where flash memory can make a difference. I know I totally skipped any quantitative proof for all the claims above, but… well… let’s leave it for the comment section.

Also, one of the best treatments of flash memory for analytics (and not coming from a flash vendor) is by Curt Monash on the DBMS2 blog: http://www.dbms2.com/2010/01/31/flash-pcmsolid-state-memory-disk/


Terminology: Analysis vs. analytics and more…

I see a lot of confusion in the usage of newer terms in analytics. I confuse them myself occasionally. I find it funny that an industry as serious as analytics tolerates constant renewal of its basic terminology. Yet, I confess, I’m very guilty of it myself. I do enjoy the freshness and novelty of newer terms, even while being fully aware that it is fake to a large extent.

In this post I’ll take a step toward clearing up the confusion around a few of the most basic terms: analysis vs. analytics vs. BI and all their common derivatives.

The Spoiler (the quick answer):

Analysis is the examination process itself, whereas analytics is the supporting technology and associated tools. BI is quite synonymous with analytics in the IT context. Advanced Analytics, Business Analytics, Data Analytics, Analytics Software and Analytics Technology are almost always marketing pleonasms (redundant expressions) and can be safely replaced by just ‘analytics’. ‘Data analysis’ is yet another pleonasm. Compound expressions of these words, such as ‘BI Analytic Technology’, are yet again pleonasms, albeit of a higher degree. Some nuances exist, though, and are elaborated in this post.

The deep dive for the brave souls:

Let’s attempt to properly define the terms and then carefully examine the alleged differences.

Before we dive in, a word of caution: definition by synonyms is wrong. It causes a stack overflow in the mind of a programmer. For example “analysis” => “critical examination” => “examination” => “critical inspection” => “inspection” => “critical examination” => “f…” => “why don’t I just make myself a cup of coffee?”.

You can check what makes a good definition, and the common mistakes, in the following… Well, apparently I couldn’t find good material on proper definitions in a quick look, but for the fallacies there is a nice Wikipedia article. If you find a good article on what makes a good definition, drop me a note or a comment; presumably it would include a definition of ‘definition’.

Let’s start….

What is analysis?

Analysis is a pretty old, well-understood term and essentially means “breaking down” or “decomposition”. More accurately: “the process of decomposing a complex entity into simpler components for easier comprehension of the aforementioned entity”. As a child I did a lot of it to the toys and electronic appliances around me. I challenge you to find a better and more concise definition than mine above (it is a matter of taste, but anyway). Here are some links to save you time:

http://www.google.com/search?q=define:+analysis

http://en.wikipedia.org/wiki/Analysis

http://en.wiktionary.org/wiki/analysis

http://thesaurus.com/browse/analysis

What is analytics?

Analytics is a newer term related to analysis, and looking it up will usually only add to the confusion, since definitions vary, are fuzzy and seem to be context-dependent. Focusing on the IT context, I went through many usage examples and definitions. My verdict is that analytics just means: the technology and the associated tools for data analysis.

If so, then ‘data analytics technology’ is a doubly redundant (or, more accurately, pleonastic) term, because analytics is a technology by itself and it is obvious that in the IT context only data can be analyzed. Hence the above phrase can be shortened to ‘analytics’ without any loss of meaning. The same goes for ‘data analytics tools’. However, when the IT context is not implied, something like ‘data analytics software’ could be appropriate. In this case ‘data’ links it to IT and ‘software’ further narrows its meaning.

Incorrect usage (according to my interpretation):

A software company most probably doesn’t develop “next-gen data-analysis” but “next-gen data-analytics”. By the same token, “cloud computing analysis” means examining the cloud computing concept, not using cloud computing as a tool for doing analysis. In the latter case “cloud analytics” must be used.

An analyst performs in-database analysis or applies in-database analytics to calculate something. However, an analyst doesn’t perform in-database analytics.

If you look at the terms used by the QlikView folks, you will find pretty much all of the above terms used interchangeably, including the statement that they “provide fast, powerful and visual in-memory business analysis”. One might think that they provide business advice for companies in the memory business. Terminology aside, no bashing of QlikView: it is excellent analytics software and one of the very few that just work out of the box.

What is analytical?

In regard to data, it means that the data was compiled using analysis. In regard to a tool, it means that the tool is intended for analysis.

Data Analysis and Data Analytics

As already mentioned, in the IT context both are pleonasms, and non-data analysis and non-data analytics are both oxymorons. So why stress data anyway? Mostly there is no reason, and in other cases it is there to hint at the IT context. For example, for bankers it is ‘financial analytics’ but for the IT folks in the bank it is ‘data analytics’.

What does ‘advanced analytics’ hint at then?

Well, I guess it is a way for a vendor to indicate that their analytics is less stagnant than that of their competitors :) Seriously though, I guess it means, where it is really meant to mean anything, that statistical methods such as predictive modeling and clustering are implemented. It also has strong connotations with the Gartner press release naming it the second most promising technology for 2010.

What is wrong with just sticking with the older BI term?

It is a fashion thing, I guess… who said IT is boring? We could easily challenge the Parisian fashion industry on that. Seriously though, BI is considered a more comprehensive approach that encompasses many aspects, is usually cross-departmental, and is notorious for a high project failure rate. At least that is the way younger startups portray it. On the other hand, ‘data analytics’ is portrayed as something simpler, more of a ‘quick wins’ departmental solution. Something akin to a ‘data mart’. And don’t ask me what the difference from data marts is. Have I mentioned the fashion thing?

Well, fashion aside, there are more rational reasons too, of course. A startup pitching BI sounds boring at best, with Microsoft, IBM and Oracle dominating the field. It must define a new disruptive category and then dominate it. Those who have read Christensen may remember that no new terms are necessary for disruption. Somehow it is easier to communicate using new terms. I would love to believe that it is not deceptive. In fact, masquerading advanced analytics as something completely distinct may work all the way from investors to the customer’s CIO, who may find it suspicious that he is purchasing too many BI solutions, whereas purchasing a first “advanced analytics” solution, and early enough, may seem quite smart and a sign that his organization is far from stagnating, especially just after reading the Gartner press release.

UPDATE:

Another view on the subject: http://www.b-eye-network.com/view/13797

Yet another one: http://blogs.forrester.com/boris_evelson/10-06-07-bi_vs_analytics