Let’s assume we have a few terabytes of data sitting in an industrial-strength
database (Oracle, SQL Server, MySQL, DB2, …) - one that can handle the data
volume without choking. Each of these databases has its own dialect of the query
language, SQL, and while you can do a lot of sophisticated data manipulation with it, even a simple analytical routine like calculating a correlation is a chore.
Here's an example:
SELECT
  (COUNT(*)*SUM(x.Sales*y.Sales) - SUM(x.Sales)*SUM(y.Sales)) /
  (SQRT(COUNT(*)*SUM(SQUARE(x.Sales)) - SQUARE(SUM(x.Sales))) *
   SQRT(COUNT(*)*SUM(SQUARE(y.Sales)) - SQUARE(SUM(y.Sales)))) AS correlation
FROM BulbSales x
JOIN BulbSales y ON x.month = y.month
WHERE x.Year = 1997 AND y.Year = 1998
(extracted from the O'Reilly Transact-SQL Cookbook)
This calculates just one correlation coefficient, between two years of sales. If you want a correlogram showing the correlation coefficients across every pair of fields in a table, the coding takes a while, because you are re-typing the maths for each pair with a very real chance of human error. It can be done, but it’s neither pretty nor simple. Something only slightly more complex, like regression analysis, is seriously beyond the capability of SQL.
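To see why the correlogram gets painful, here is a sketch of the same calculation for a second pair of columns - Sales against a Price column in the same BulbSales table. The Price column is my invention, purely for illustration; the point is that every extra pair means repeating the same algebra by hand.

-- Hypothetical Price column, used only to show the repetition:
SELECT
  (COUNT(*)*SUM(Sales*Price) - SUM(Sales)*SUM(Price)) /
  (SQRT(COUNT(*)*SUM(SQUARE(Sales)) - SQUARE(SUM(Sales))) *
   SQRT(COUNT(*)*SUM(SQUARE(Price)) - SQUARE(SUM(Price)))) AS correlation
FROM BulbSales
WHERE Year = 1997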
Currently, the approach would be to pull the data we need into an analytic package (like SAS or R) and run the analysis there, with the help of a statistician. As the data gets bigger, the overhead and delay of moving it into another package becomes a more significant part of your project, particularly if you do not want to do that much with it once it gets there. It also limits what you can do on demand with your end-user reporting tools.
So, how can you bring better analytics to bear on your data in situ? This is the developing area of in-database analytics: extending the analytical capability of the SQL language so that analytics can be executed quickly, within the database. I think it’s fair to say it’s still early days, but there are some exciting opportunities:
- SAS, the gold standard for analytical software, has developed some capability (SAS in-database processing) but, so far, only for databases I'm not using (Teradata, Netezza, Greenplum, DB2).
- Oracle recently announced the capability to embed R (an open-source tool with a broad range of statistical functionality) in the database, which sounds interesting, but I have yet to see it in action (Oracle in-database announcement).
- It’s possible to build some capability into Microsoft’s SQL Server using .NET/CLR, and I have had some direct (and quite successful) experience doing this for simpler analytics. Some companies seem to be pushing it further still, and I look forward to testing out their offerings (Fuzzy Logix, XLeratorDB). There's a sketch of what this looks like from the query side just after this list.
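To make the CLR route concrete, here is a minimal sketch of what the correlogram could look like once a user-defined aggregate has been deployed. The dbo.Correlation aggregate is hypothetical - my own name for a two-argument CLR aggregate registered with CREATE ASSEMBLY and CREATE AGGREGATE - not something shipped with SQL Server or by the vendors above.

-- Hypothetical: dbo.Correlation is a two-argument CLR user-defined aggregate
-- (written in C#; multi-parameter aggregates need SQL Server 2008 or later).
-- With it in place, the cookbook query collapses to one readable expression,
-- and covering every pair of years is just a self-join and a GROUP BY.
SELECT x.Year AS year_a,
       y.Year AS year_b,
       dbo.Correlation(x.Sales, y.Sales) AS correlation
FROM BulbSales x
JOIN BulbSales y ON x.month = y.month AND x.Year < y.Year
GROUP BY x.Year, y.Year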
For complex modeling tasks, I am certain we will need dedicated, offline analytic tools for a very long time. For big data, that will mean similarly big application servers for your statistical tools and fast connections to your data mart.
For simpler analysis, in-database analytics looks like a great step forward, but I wonder what it means for the skills you need in your analysts: when analysis is done in a sophisticated statistics package, it tends to be done by trained statisticians who know what they are doing and make good choices about which tools to deploy and how. Making it easier to apply analytical tools to your data is very definitely a good thing. Applying those tools badly, because you do not have the skills or knowledge to apply them effectively, could be a developing problem.