In the old world of analytics, you move data from the DSR to your analytic server, build models, then write the results (sometimes the models too) back out for integration into the DSR.
Now, consider this:
- DSR datasets are often enormous. (Two years of data from a DSR I worked with recently, used as input to a model, came to approximately 270 GB.)
- Analytic tools are small. (The base R software, all 150 packages I have installed and the development environment together come to 625 MB.)
- Analytic models are tiny. (A 10-component regression model expressed in SQL is just 288 bytes, and most of that is variable names.)
Let's try that visually.
The input data is huge, everything needed to run R (my analytics tool of choice) is barely a blip on the scale, and the resulting model can't be seen at this scale at all. And yet today we move the DSR data to the analytic server to run the analytics... anyone else having an issue with this?
Where the data is small enough that we can pull what we need via a query over an ODBC connection and hold it in memory to run the analytics, perhaps you can live with the network overhead.
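That pull-and-hold pattern is simple enough in R. Here is a minimal sketch using the RODBC package; the DSN name ("DSR"), the table and the column names are my own placeholders, not anything from a real system.

    library(RODBC)

    ch <- odbcConnect("DSR")                     # connect to the DSR over ODBC
    sales <- sqlQuery(ch, "SELECT store_id, week, units_sold, price, promo_flag
                           FROM sales_history
                           WHERE week >= '2014-01-01'")   # pull only what we need
    odbcClose(ch)

    # sales is now an ordinary data frame held in memory, ready for analytics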
Similarly, if the DSR and analytic servers are co-located with a big fat data pipe connecting them, it doesn't matter so much. It's not necessarily the same machine I'm after, but the same rack would be nice.
What happens, though, when the data is too big and the connection too slow (think wide area network) for that to be feasible? Now we need to build database structures on the analytic server, load the data (taking a copy) and, if we are to re-run the analytics routinely, keep it in sync with the source on an ongoing basis. That is a lot of (non-analytic) maintenance work before we can even get started on the analytics.
So why do we do this?
"The analytic server is a high power, high memory machine great for analytics!" That's true but chances are your database servers have the same thing.
There are also valid concerns around how an analytic tool connecting directly to a database may impact other users. I do have a little sympathy for this, certainly much more than I used to, but think on this: a DSR is not a mission-critical system. The failure of a mission-critical system stops your business. If the DSR stops (and the chances are very good that you will have no issue at all), your reports are a bit late. Relax!
I have a suspicion that some of this is related to licensing. If you pay a small fortune for your analytic tools and they are priced per server, per CPU or per core, I can see why you would not want to install that software everywhere you might want to use it. Cheaper, perhaps, to bring the data to the software. Working with free open-source tools, it has not been an issue for me to install them co-located with, or even on, the database machine as needed.
Recently a number of database and BI vendors have moved to integrate analytic tools (often R, sometimes SAS) into their offerings, trying to deliver real in-database analytics. I do think this is a great direction to move in, though I have some concerns about the level of integration currently available. See my post on Analytic Power! for more details.
Even if you can't execute true in-database analytics (which should be a Next Generation feature), there are still things you should be able to do to bring the analytics to the data.
First, let's make a distinction between model-building (the act of creating new models from data) and model-scoring (running existing models against new data to make new predictions). Every predictive analytic model I can think of has this same split. (Descriptive and prescriptive analytics do not.)
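To make the split concrete, here is a minimal sketch in R. The data frames (history, new_data) and the column names are invented for illustration; the point is simply that the expensive step and the cheap step are different calls.

    # model-building: the heavy lifting, run against historical data
    model <- lm(units_sold ~ price + promo_flag + temperature, data = history)

    # model-scoring: a set of simple calculations, run against new data as it arrives
    new_data$prediction <- predict(model, newdata = new_data)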
Model-building is an intensive task; this is where all the heavy lifting happens in analytic work, so processing and memory needs can be substantial, though this varies widely depending on the analytic method and, to some extent, the implementation. If you have installed analytic tools directly on your database servers, this may be enough to cause something of a slow-down. OK, try to co-locate instead. If you absolutely must replicate data to an analytic server on the other side of the world and try to keep your data in sync, I pity you.
Model-scoring is fast. A model is just a set of simple calculations. Deciding exactly which simple calculations you needed was the job of model-building, but now that is done, scoring new data against the model is quick.
This is what the result of a simple regression model looks like (in SQL):
[Variable_1] *-49.8916 + [Variable_2] *-24.2773 + [Variable_3] *-48.1305 +
[Variable_4] *-253.7238 + [Variable_5] *-20.7173 + [Variable_6] *17.722 +
[Variable_7] *12.9865 + [Variable_8] *-17.4036 + [Variable_9] *2.2738 +
[Variable_10] *-7.9186 + 6.668 AS Prediction
If you think it looks complex, look again: it's just a set of input variables multiplied by specific weights (as found by model-building) and then added together. This is easy work for the database. More complicated models will have more complex expressions; you may see logs, exponents, trigonometric functions, perhaps an if..then..else statement. Nothing the database will find difficult to execute if it's expressed in the right language.
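Expressions like the one above don't even need to be written by hand. As a sketch of what I mean (not a production tool), the coefficients of an R linear model can be pasted straight into a SQL fragment; the model object is the one from the earlier lm() sketch and the rounding is purely cosmetic.

    coefs <- coef(model)                 # named vector: intercept first, then the weights
    terms <- paste0("[", names(coefs)[-1], "] *", round(coefs[-1], 4), collapse = " + ")
    sql   <- paste0(terms, " + ", round(coefs[1], 4), " AS Prediction")
    cat(sql)                             # paste the output into a view, stored procedure or query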
Unless models change with every input of new data (and so need re-building), there is no excuse not to score the model directly against the data. How you execute the model scoring is a different question, and you have some options:
- You may load the model and the new data into your analytic tools and score there. This is using a sledgehammer to crack a nut, but it's easy to do, if a little heavyweight and slow.
- For simpler models, converting the model into SQL is not that difficult, as sketched above (though you do need to know SQL pretty well, and permission to build it into the database as a view, stored procedure or user-defined function). This is probably the most difficult option, but the fastest to execute.
- Try converting the model to PMML (Predictive Model Markup Language) and use a server-based tool designed to execute PMML against your database. Many analytic tools have an option to export models as PMML; see the sketch below. A PMML-enabled DSR would be a great enhancement for the Next Generation.
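For the PMML route in R, a minimal sketch, assuming the open-source pmml and XML packages are available and using the regression model from the earlier sketch (the file name is illustrative):

    library(pmml)   # converts fitted R models to PMML
    library(XML)    # provides saveXML()

    saveXML(pmml(model), file = "regression_model.pmml")

The resulting XML file describes the model's inputs and weights in a tool-neutral way, which is exactly what a PMML-enabled scoring engine (or a PMML-enabled DSR) would consume.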
Bring your analytics to the data, spend more time doing analytics and less time wrangling data.