In the old world of analytics, you move data from the DSR to your analytic server, build models, then write the results (sometimes the models too) back out for integration into the DSR.
Now, consider this:
- DSR datasets are often enormous. (Two years of data from a DSR I worked with recently, used as input to a model, came to approximately 270 GB.)
- Analytic tools are small. (The base R software, all 150 packages I have installed and the development environment together come to 625 MB.)
- Analytic models are tiny. (A 10-component regression model expressed in SQL is just 288 bytes, and most of that is variable names.)
Let's try that visually.
The input data is huge, everything needed to run R (my analytics tool of choice) is barely a blip on the scale, and the resulting model can't be seen at this scale at all. And yet today we move the DSR data to the analytic server to run the analytics... anyone else having an issue with this?
Where the data is small enough that we can pull what we need via a query over an ODBC connection and hold it in memory to run the analytics, perhaps you can live with the network overhead.
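That pull-and-hold pattern is simple enough in R. Here is a minimal sketch using the RODBC package; the DSN name ("DSR"), the table and the column names are my own placeholders, not anything from a real system.

    library(RODBC)

    ch <- odbcConnect("DSR")                     # connect to the DSR over ODBC
    sales <- sqlQuery(ch, "SELECT store_id, week, units_sold, price, promo_flag
                           FROM sales_history
                           WHERE week >= '2014-01-01'")   # pull only what we need
    odbcClose(ch)

    # sales is now an ordinary data frame held in memory, ready for analytics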
Similarly, if the DSR and analytic servers are co-located with a big fat data pipe connecting them, it doesn't matter so much. It's not necessarily the same machine I'm after, but the same rack would be nice.
What happens, though, when the data is too big and the connection too slow (think wide area network) for that to be feasible? Now we need to build database structures on the analytic server, load the data (taking a copy) and, if we are to re-run the analytics routinely, keep it in sync with the source on an ongoing basis. That is a lot of (non-analytic) maintenance work before we can even get started on the analytics.
So why do we do this?
"The analytic server is a high power, high memory machine great for analytics!" That's true but chances are your database servers have the same thing.
There are also valid concerns around how an analytic tool connecting directly to a database may impact other users. I do have a little sympathy for this, certainly much more than I used to, but think on this: a DSR is not a mission-critical system. The failure of a mission-critical system stops your business. If the DSR stops (and the chances are very good that you will have no issue at all), your reports are a bit late. Relax!
I have a suspicion that some of this is related to licensing. If you pay a small fortune for your analytic tools and they are priced per server, per CPU or per core, I can see why you would not want to install that software everywhere you might want to use it. Cheaper, perhaps, to bring the data to the software. Working with free open-source tools, it has not been an issue for me to install them co-located with, or even on, the database machine as needed.
Recently a number of database and BI vendors have moved to integrate analytic tools (often R, sometimes SAS) into their offerings, trying to deliver real in-database analytics. I do think this is a great direction to move in, though I have some concerns about the level of integration currently available. See my post on Analytic Power! for more details.
Even if you can't execute true in-database analytics (which should be a Next Generation feature), there are still things you should be able to do to bring the analytics to the data.
First, let's make a distinction between model-building (the act of creating new models from data) and model-scoring (running existing models against new data to make new predictions). Every predictive analytic model I can think of has this same split. (Descriptive and prescriptive analytics do not.)
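To make the split concrete, here is a minimal sketch in R. The data frames (history, new_data) and the column names are invented for illustration; the point is simply that the expensive step and the cheap step are different calls.

    # model-building: the heavy lifting, run against historical data
    model <- lm(units_sold ~ price + promo_flag + temperature, data = history)

    # model-scoring: a set of simple calculations, run against new data as it arrives
    new_data$prediction <- predict(model, newdata = new_data)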
Model-building is an intensive task; this is where all the heavy lifting happens in analytic work, so processing and memory needs can be substantial, though this varies widely depending on the analytic method and, to some extent, the implementation. If you have installed analytic tools directly on your database servers, this may be enough to cause something of a slow-down. OK, try to co-locate instead. If you absolutely must replicate data to an analytic server on the other side of the world and try to keep your data in sync, I pity you.
Model-scoring is fast. A model is just a set of simple calculations. Deciding exactly which simple calculations you needed was the job of model-building, but now that is done, scoring new data against the model is quick.
This is what the result of a simple regression model looks like (in SQL):
[Variable_1] *-49.8916 + [Variable_2] *-24.2773 + [Variable_3] *-48.1305 +
[Variable_4] *-253.7238 + [Variable_5] *-20.7173 + [Variable_6] *17.722 +
[Variable_7] *12.9865 + [Variable_8] *-17.4036 + [Variable_9] *2.2738 +
[Variable_10] *-7.9186 + 6.668 AS Prediction
If you think it looks complex, look again: it's just a set of input variables multiplied by specific weights (as found by model-building) and then added together. This is easy work for the database. More complicated models will have more complex expressions; you may see logs, exponents, trigonometric functions, perhaps an if..then..else statement. Nothing the database will find difficult to execute if it's expressed in the right language.
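Expressions like the one above don't even need to be written by hand. As a sketch of what I mean (not a production tool), the coefficients of an R linear model can be pasted straight into a SQL fragment; the model object is the one from the earlier lm() sketch and the rounding is purely cosmetic.

    coefs <- coef(model)                 # named vector: intercept first, then the weights
    terms <- paste0("[", names(coefs)[-1], "] *", round(coefs[-1], 4), collapse = " + ")
    sql   <- paste0(terms, " + ", round(coefs[1], 4), " AS Prediction")
    cat(sql)                             # paste the output into a view, stored procedure or query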
Unless models change with every input of new data (and so need re-building), there is no excuse not to score the model directly against the data. How you execute the model scoring is a different question, and you have some options:
- You may load the model and the new data into your analytic tools and score there. This is using a sledgehammer to crack a nut, but it's easy to do, if a little heavyweight and slow.
- For simpler models, converting the model into SQL is not that difficult, as sketched above (though you do need to know SQL pretty well, and permission to build it into the database as a view, stored procedure or user-defined function). This is probably the most difficult option, but the fastest to execute.
- Try converting the model to PMML (Predictive Model Markup Language) and use a server-based tool designed to execute PMML against your database. Many analytic tools have an option to export models as PMML; see the sketch below. A PMML-enabled DSR would be a great enhancement for the Next Generation.
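For the PMML route in R, a minimal sketch, assuming the open-source pmml and XML packages are available and using the regression model from the earlier sketch (the file name is illustrative):

    library(pmml)   # converts fitted R models to PMML
    library(XML)    # provides saveXML()

    saveXML(pmml(model), file = "regression_model.pmml")

The resulting XML file describes the model's inputs and weights in a tool-neutral way, which is exactly what a PMML-enabled scoring engine (or a PMML-enabled DSR) would consume.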
Bring your analytics to the data, spend more time doing analytics and less time wrangling data.