Analytics Patterns

Unsatisfied with the Advanced Analytics definition in my previous post, and after giving some thought to what the advanced methods in analytics really are, I realized that the analytics industry lacks a good pattern catalog: a list of common problems, each followed by the common industry-consensus solutions to it. An equivalent of the GoF design patterns for analytics. Each item in the list would start with a brief description of a common, recurring analytics problem, continue with an elaboration of the commonly accepted solutions to that problem, and end with a mandatory example section illustrating the solution using widely available tools.

Software engineers stole this idea from the real architects (those dealing with concrete structures, not abstract ones ;) ) 15 years ago. They did not avoid an initial short period of mass obsession and abuse of the concept… who does? But eventually it worked out quite well for them (us). I wonder if the analytics industry could leverage this experience and create a catalog of some 25-50 most common patterns. Each pattern description should not exceed a few pages, and the number of patterns should be limited to a few tens, making wide industry adoption feasible.

What do you think? Any ideas? I’ll try to take a first step by dumping patterns from my head right now (it is by no means a finished work).

I’ll call them analytics patterns:

1. Predictive Analytics. That was the easiest one for me. I first got involved in it some 12 years ago, developing what is now http://www.oracle.com/demantra/index.html. The system was used mostly to forecast sales, taking into account an array of causal factors such as seasonality, marketing campaigns, historical growth rates, etc. The problem is that a lot of time-based historic data is available and future values must be forecast in the context of that historic data. The basic mechanism for implementing Predictive Analytics is to find (or, less preferably, to develop) a suitable mathematical model that closely fits the existing data, usually time-series data (but be cautious about overfitting), and then use the model to produce forecasted values. In simple terms, it is a case of extrapolation. Correct me if I’m wrong. As was the case in the ’90s, I’m pretty sure it is still the case now that exotic hardcore AI approaches like neural networks and genetic programming are best kept exclusively for moonlighting experiments and as material for water-cooler conversation the next morning. With fixed deadlines and a limited budget, it is best to stick to proven techniques to achieve quick wins. I think the value of a working forecast is self-evident.
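To make the extrapolation idea concrete, here is a minimal sketch in Python. It is not the method used by any particular product: it assumes a deliberately simple model (a straight-line trend plus an additive seasonal offset), and the names fit_trend, forecast, period and horizon are made up for this illustration.

```python
from statistics import mean


def fit_trend(values):
    """Fit a least-squares line through the points (0, v0), (1, v1), ..."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    sxx = sum((x - x_bar) ** 2 for x in range(n))
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def forecast(values, period, horizon):
    """Extrapolate `horizon` future values from the history in `values`.

    Assumed model (illustrative only): linear trend + additive seasonality
    that repeats every `period` observations.
    """
    slope, intercept = fit_trend(values)
    # What the trend line fails to explain at each historic point.
    residuals = [y - (intercept + slope * x) for x, y in enumerate(values)]
    # Average leftover per position in the season (e.g. per quarter).
    seasonal = [mean(residuals[s::period]) for s in range(period)]
    n = len(values)
    return [
        intercept + slope * t + seasonal[t % period]
        for t in range(n, n + horizon)
    ]


if __name__ == "__main__":
    # Hypothetical quarterly sales: growing, with a strong first quarter.
    history = [120, 95, 90, 100, 132, 104, 101, 111, 141, 116, 110, 123]
    print(forecast(history, period=4, horizon=4))
```

A model this small is easy to sanity-check against held-out history before trusting it, which is one practical guard against the overfitting mentioned above.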

2. Clustering. Well, not the heavy, noisy one in a cold hall :) but the statistics sub-discipline better called cluster analysis. The problem here is that a lot of high-dimensional data is available and groups of similar observations must be discovered; in other words, the observations must be grouped automatically. It is implemented by searching for similarities (correlations) among the records and grouping the records accordingly. What is it good for? Well, in simple terms it helps to discriminate between different kinds of objects and observe the specific properties of each kind. Without such grouping, one would only be able to observe properties that all objects exhibit, or alternatively go object by object and observe each in isolation.
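As an illustration of such automatic grouping, here is a minimal sketch of k-means, one widely used clustering technique (the post does not prescribe a specific algorithm, so this choice is an assumption). It groups points by distance to a cluster centre; the function names and the sample data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, cluster index per point)."""
    rng = random.Random(seed)
    # Start from k distinct observations picked at random.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    # Hypothetical customers described by (visits per month, basket size).
    customers = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    centres, labels = kmeans(customers, k=2)
    print(centres, labels)
```

Once every record carries a cluster label, the properties of each group can be examined separately, which is exactly the benefit described above.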

3. Risk Analysis, particularly through Monte-Carlo simulation. It is not called Monte-Carlo because it was invented there :) it is called so because of its reliance on random numbers, akin to the Monte-Carlo casinos. Random sampling has proved to be an effective way to simulate a mathematical model with a large number of free variables. With the advent of computers it became a whole lot easier than using a printed book of random numbers.
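A minimal sketch of the idea in Python follows. The cost model is entirely hypothetical (the cost items, their ranges and the 20% penalty chance are invented for the illustration); the point is only to show how repeated random sampling turns a model with several uncertain inputs into a risk estimate.

```python
import random


def simulate_total_cost(rng):
    """Draw one random scenario of a made-up project cost model."""
    # random.triangular(low, high, mode): values cluster around `mode`.
    labour = rng.triangular(80, 150, 100)
    materials = rng.triangular(40, 90, 50)
    # Assumed 20% chance of a fixed late-delivery penalty.
    delay_penalty = 30 if rng.random() < 0.2 else 0
    return labour + materials + delay_penalty


def overrun_probability(budget, trials=100_000, seed=42):
    """Estimate P(total cost > budget) by sampling many scenarios."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    overruns = sum(simulate_total_cost(rng) > budget for _ in range(trials))
    return overruns / trials


if __name__ == "__main__":
    for budget in (160, 180, 200, 220):
        print(budget, overrun_probability(budget))
```

Adding another uncertain input means adding one more line to simulate_total_cost, which is why the approach scales well to models with many free variables.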

4. Given a telecom event stream, run the events through a rules engine to detect and prevent telecom fraud in real time. This is essentially a CEP (complex event processing) engine, and it is usually implemented by creating a state machine per rule and running the events through it. A special streaming dialect of SQL is often used to express the rules. A similar scheme can be used for real-time click-fraud prevention.
5. Given serialized object data or nested data, allow running ad-hoc interactive queries over it, BigQuery-style.
6. Given a normalized relational model, allow running arbitrary ad-hoc queries. For common joins, create a materialized view to speed them up.
7. Canned reports. I guess they are also good for some cases…….
8. OLAP/star schema: when to use it? ……
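The state-machine-per-rule idea from pattern 4 can be sketched in a few lines of Python. This is an assumption-laden toy, not how any real CEP engine is built: the Event fields, the SequenceRule class and the "sim_swap followed by intl_call" rule are all hypothetical. Each rule keeps a tiny two-state machine (idle or armed) per subscriber and fires when the second event follows the first within a time window.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    subscriber: str
    kind: str         # hypothetical event types, e.g. "sim_swap"
    timestamp: float  # seconds


@dataclass
class SequenceRule:
    """Fires when `second` follows `first` within `window` seconds."""
    first: str
    second: str
    window: float
    # Per-subscriber state: absent = idle, present = armed since timestamp.
    armed_at: dict = field(default_factory=dict)

    def feed(self, event):
        """Advance the state machine; return True if the rule fires."""
        if event.kind == self.first:
            self.armed_at[event.subscriber] = event.timestamp
            return False
        if event.kind == self.second:
            # Going back to idle whether or not the rule fires.
            started = self.armed_at.pop(event.subscriber, None)
            return (
                started is not None
                and event.timestamp - started <= self.window
            )
        return False


def detect(events, rules):
    """Run every event through every rule; collect (rule, event) alerts."""
    return [(r, e) for e in events for r in rules if r.feed(e)]


if __name__ == "__main__":
    rule = SequenceRule(first="sim_swap", second="intl_call", window=60)
    stream = [
        Event("alice", "sim_swap", 0.0),
        Event("bob", "intl_call", 5.0),
        Event("alice", "intl_call", 30.0),
    ]
    for _, suspicious in detect(stream, [rule]):
        print("possible fraud:", suspicious)
```

Keeping the state keyed by subscriber is the design choice that lets a single rule object watch the whole stream; a streaming SQL dialect would typically express the same thing declaratively.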

What else?

Of course, this is just a first step; doing it properly would be a project in itself, most probably in the form of a book. However, as the Chinese proverb goes, “A journey of a thousand miles begins with a single step”.

Terminology: Analysis vs. analytics and more…

I see a lot of confusion in the usage of the newer terms in analytics. I confuse them myself occasionally. I find it funny that an industry as serious as analytics tolerates constant renewal of its basic terminology. Yet, I confess, I’m very guilty of it myself. I enjoy the freshness and novelty of newer terms, even while fully aware that it is fake to a large extent.

In this post I’ll take a step toward clearing up the confusion around a few of the most basic terms: analysis vs. analytics vs. BI, and all their common derivatives.

The Spoiler (the quick answer):

Analysis is the examination process itself, whereas analytics is the supporting technology and associated tools. BI is largely synonymous with analytics in the IT context. Advanced Analytics, Business Analytics, Data Analytics, Analytics Software and Analytics Technology are almost always marketing pleonasms (redundant expressions) and can be safely replaced by just ‘analytics’. ‘Data analysis’ is yet another pleonasm. Compound expressions of these words, such as ‘BI Analytic Technology’, are again pleonasms, albeit of a higher degree. Some nuances exist, though, and are elaborated in this post.

The deep dive for the brave souls:

Let’s attempt to properly define the terms and then carefully examine the alleged differences.

Before we dive in, a word of caution: definition by synonyms is wrong. It causes a stack overflow in the minds of programmers. For example: “analysis” => “critical examination” => “examination” => “critical inspection” => “inspection” => “critical examination” => “f…” => “why don’t I just make myself a cup of coffee?”.

You can check what makes a good definition and the common mistakes by following……. Well, apparently a quick look did not turn up good material on proper definitions, but for the fallacies there is a nice Wikipedia article. If you find a good article on what makes a good definition, drop me a note or a comment; if so, it should include a definition of “definition”.

Let’s start….

What is analysis?

Analysis is a pretty old, well-understood term that essentially means “breaking down” or “decomposition”. More accurately: “the process of decomposing a complex entity into simpler components for easier comprehension of that entity”. As a child I did a lot of it to the toys and electronic appliances around me. I challenge you to find a better and more concise definition than mine above (it is a matter of taste, but anyway). Here are some links to save you time:

http://www.google.com/search?q=define:+analysis

http://en.wikipedia.org/wiki/Analysis

http://en.wiktionary.org/wiki/analysis

http://thesaurus.com/browse/analysis

What is analytics?

Analytics is a newer term related to analysis, and looking it up will usually only add to the confusion, since the definitions vary, are fuzzy and seem to be context-dependent. Focusing on the IT context, I went through many usage examples and definitions. My verdict is that analytics just means: the technology and the associated tools for data analysis.

If so, then ‘data analytics technology’ is a doubly redundant (or, more accurately, pleonastic) term, because analytics is a technology in itself and it is obvious that in the IT context only data can be analyzed. Hence the above phrase can be shortened to ‘analytics’ without any loss of meaning. The same goes for ‘data analytics tools’. However, when the IT context is not implied, something like ‘data analytics software’ could be appropriate. In this case ‘data’ links it to IT and ‘software’ further narrows its meaning.

Incorrect usage (according to my interpretation):

A software company most probably doesn’t develop “next-gen data-analysis” but “next-gen data-analytics”. By the same token, “cloud computing analysis” means examining the cloud computing concept, not using cloud computing as a tool for doing analysis. In the latter case “cloud analytics” must be used.

An analyst performs in-database analysis or applies in-database analytics to calculate something. However, an analyst doesn’t perform in-database analytics.

If you look at the terms used by the QlikView folks, you will find pretty much all of the above terms used interchangeably, including the statement that they “provide fast, powerful and visual in-memory business analysis”. One may think that they provide business advice for companies in the memory business. Terminology aside, no bashing of QlikView: it is excellent analytics software and one of the very few that just works out of the box.

What is analytical?

With regard to data, it means that the data was compiled using analysis. With regard to a tool, it means that the tool is intended for analysis.

Data Analysis and Data Analytics

As already mentioned, in the IT context both are pleonasms, and non-data analysis and non-data analytics are both oxymorons. So why stress ‘data’ anyway? Mostly there is no reason; in other cases it is there to hint at the IT context. For example, for bankers it is ‘financial analytics’, but for the IT folks in the bank it is ‘data analytics’.

What does ‘advanced analytics’ hint at, then?

Well, I guess it is a way for a vendor to indicate that their analytics is less stagnant than that of their competitors :) Seriously though, I guess that where it really means anything, it means that statistical methods such as predictive modeling and clustering are implemented. It also has strong connotations with the Gartner press release naming it the second most promising technology for 2010.

What is wrong with just sticking with the older BI term?

It is a fashion thing, I guess…. who said IT is boring? We could easily challenge the Parisian fashion industry on that. Seriously though, BI is considered a more comprehensive approach, encompassing many aspects and usually cross-departmental, and it is notorious for a high project failure rate. At least that is the way younger startups portray it. ‘Data analytics’, on the other hand, is portrayed as something simpler, more of a ‘quick wins’ departmental solution. Something akin to a ‘data mart’. And don’t ask me what the difference from data marts is. Have I mentioned the fashion thing?

Well, aside from fashion, there are more rational reasons too, of course. A startup pitching BI sounds boring at best, with Microsoft, IBM and Oracle dominating the field. It must define a new disruptive category and then dominate it. Anyone who has read Christensen may remember that no new terms are necessary for disruption. Somehow, though, it is easier to communicate using new terms. I would love to believe that it is not deceptive. In fact, masquerading advanced analytics as something completely distinct may work all the way from the investors to the customer’s CIO. He may find it suspicious that he is purchasing too many BI solutions, whereas purchasing a first “advanced analytics” solution, and early enough, may seem quite smart and a sign that his organization is far from stagnating, especially just after reading the Gartner press release.

UPDATE:

Another view on the subject: http://www.b-eye-network.com/view/13797

Yet another one: http://blogs.forrester.com/boris_evelson/10-06-07-bi_vs_analytics