But scale and performance needs aside, why have most (though not all) CPGs chosen to buy rather than build the capability? After all, it is primarily a business-intelligence/database application and most businesses run a number of them. One key reason is that it's challenging to get business reporting performance at this data scale from existing technology.
This post looks at how this problem gets solved today and how newer database technology can change that landscape.
Handling the data volume (1) Cube it
One existing approach to the problem is to use two databases: one to store the detailed granular data in relational form, and another holding data "cubes" containing pre-calculated summaries (aggregations) of the relational data. Most uses of the data involve working with summaries, so you can save users a lot of time by pre-calculating them.
Once a cube is built, analyzing data within it is fast, but you do have to decide a number of things up-front to populate it:
- What aggregation levels do you need? e.g.:
  - county, state, and region for store locations
  - brand, pack type, and category for products
- What facts do you want included? e.g.:
  - POS dollars and units for the sales cube
  - on-hand and in-transit inventory and forecasts for the supply chain cube.
The more data (and aggregation levels) you add to the cube the longer it will take to build; to take hours is normal, days is not unknown. Additionally, once a cube is built, it is essentially disconnected from any changes in the underlying database until it is next rebuilt. If your master data is assigning the wrong category to a product, fixing it won't help your reports until you rebuild that cube.
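To make the idea concrete, here's a minimal sketch of what a cube build amounts to, using SQLite and invented table and column names (a real DSR would use a dedicated OLAP engine, of course):

```python
import sqlite3

# Toy illustration of cube-style pre-aggregation; table and column
# names are invented for the example, not taken from any particular DSR.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE pos_fact (product_id INT, store_id INT, week INT,
                       pos_dollars REAL, pos_units INT);
CREATE TABLE product  (product_id INT, brand TEXT, category TEXT);
CREATE TABLE location (store_id INT, state TEXT, region TEXT);

-- The "cube build": pre-compute the brand x region x week summary
-- so reports never have to touch the detailed rows.
CREATE TABLE sales_cube AS
SELECT p.brand, l.region, f.week,
       SUM(f.pos_dollars) AS pos_dollars,
       SUM(f.pos_units)   AS pos_units
FROM pos_fact f
JOIN product  p ON p.product_id = f.product_id
JOIN location l ON l.store_id   = f.store_id
GROUP BY p.brand, l.region, f.week;
""")
# Reports read sales_cube directly -- fast, but stale until the next rebuild.
```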
Handling the data volume (2) Hyper-complex data models
Logically we can do everything we want in a standard relational database like SQL Server or Oracle; the data structures are not actually that complex. We need master/lookup tables for product, location and time, and one fact table to store all the information collected for each product, location and time bucket (POS sales, inventory, store receipts, etc.). That's just four tables. Yes, we could get more complex by adding other data sources with additional dimensions, but it would still be a simple structure. Build this in your favorite relational SQL database and it will work, but it is most definitely not fast.
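For the skeptical, here is roughly what that four-table model looks like, again sketched in SQLite with invented column names. The report query runs straight against the detail, no cube required, which is exactly where the performance pain shows up at real data volumes:

```python
import sqlite3

# The simple four-table model described above, with illustrative column names.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product  (product_id INT PRIMARY KEY, brand TEXT, pack_type TEXT, category TEXT);
CREATE TABLE location (store_id   INT PRIMARY KEY, county TEXT, state TEXT, region TEXT);
CREATE TABLE calendar (week_id    INT PRIMARY KEY, week_ending TEXT);
CREATE TABLE fact     (product_id INT, store_id INT, week_id INT,
                       pos_dollars REAL, pos_units INT,
                       on_hand INT, in_transit INT, forecast_units INT);
""")

# An ad-hoc report straight off the detail -- logically simple,
# but painfully slow at billions of fact rows in a generic row-store.
report = con.execute("""
    SELECT p.category, l.region, SUM(f.pos_dollars)
    FROM fact f
    JOIN product  p ON p.product_id = f.product_id
    JOIN location l ON l.store_id   = f.store_id
    GROUP BY p.category, l.region
""").fetchall()
```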
To get speed in these systems, developers have created some very complex, novel, but nonetheless effective data models. (Complex enough that an unwary developer taking their first look inside could be forgiven a little bad language.)
These data structures enable rapid reporting with no intermediate steps, no aggregations, no cubes. Once the data is loaded it is ready to go. Re-load some POS data or change a product category and it is immediately reflected in the next report. Now that is very cool, and for analytic or reporting projects where you need ad-hoc aggregation against product groups that did not exist this morning, and were not 100% correct until the 5th iteration sometime this afternoon, it is a very important feature.
The complexity of the data model comes at a price though.
- You will probably only ever use the Business Intelligence tool supplied with the DSR. This tool has been extensively configured, customized, or even written from scratch, to handle the complex nature of the data structure it sits upon. Putting another tool on top is a huge investment and would most likely need an additional, simpler database, either (slow) relational or (slow to build) cubes, that would be populated from the main DSR occasionally but otherwise disconnected from the data source and subsequent changes. That rather defeats the point, doesn't it?
- These models spread data across a multitude of tables in the database. That's not a big problem for most reporting, which aggregates each fact table to the desired level (e.g. brand by country) then stitches together the relatively small result sets for a human-readable report. For predictive analytics, however, we want the lowest level of data and need all of the facts in the same table before we can start modeling. Sadly, the database just doesn't store it that way, so every analytic project starts with a complex data-manipulation project, as sketched below.
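To illustrate that data-manipulation chore (with purely hypothetical table and column names): if each fact lives in its own table, the first step of any modeling project is stitching them back together at the lowest grain, something like this:

```python
import pandas as pd

# Hypothetical example: in the "hyper-complex" model each fact lives in its
# own table, so an analytics project starts by re-assembling them at the
# product/store/week grain. Names here are illustrative only.
keys = ["product_id", "store_id", "week_id"]

pos       = pd.DataFrame(columns=keys + ["pos_dollars", "pos_units"])
inventory = pd.DataFrame(columns=keys + ["on_hand", "in_transit"])
forecast  = pd.DataFrame(columns=keys + ["forecast_units"])

# One row per product/store/week with every fact as a column --
# the shape a predictive model actually wants as its input.
modeling_table = (pos.merge(inventory, on=keys, how="outer")
                     .merge(forecast,  on=keys, how="outer"))
```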
Handling the data volume (3) Next Generation
Database technology is evolving rapidly, and I believe we are now at the point where it can provide good performance with no pre-aggregation of data, no cubes, and a data model that is easily understood, so you can bring your own Business Intelligence tools or analytic apps to bear on it.
I'm an analyst not a database expert so I would not want to put too much money on which of the competing approaches will win out longer term but I think the key words to follow here are "columnar", "massively parallel", "in memory" and maybe, perhaps, possibly..."Hadoop".
Columnar databases change the way that data is stored in the database. This makes them relatively slow for transaction updates but dramatically faster for report-style aggregations, even with a simple data model. (See my previous post here for an example.)
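A toy illustration of the idea (pure Python, not a real database): summing one measure in a column store touches only that measure's values, while a row store has to drag every field of every record along for the ride.

```python
# Toy illustration of row vs. column orientation -- not a real database.

# Row-oriented: each record is stored together.
rows = [
    {"product_id": 1, "store_id": 10, "pos_dollars": 4.99, "pos_units": 2},
    {"product_id": 2, "store_id": 10, "pos_dollars": 7.50, "pos_units": 3},
]
total_row_store = sum(r["pos_dollars"] for r in rows)  # reads whole records

# Column-oriented: each field is stored as its own contiguous array.
columns = {
    "product_id":  [1, 2],
    "store_id":    [10, 10],
    "pos_dollars": [4.99, 7.50],
    "pos_units":   [2, 3],
}
total_col_store = sum(columns["pos_dollars"])  # reads just one array
```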
Existing systems typically run on a single server. Want it to run faster? Then you need to buy a bigger server. MPP (Massively Parallel Processing) systems use clusters of hardware, dividing up the work across multiple, relatively cheap, servers (nodes). If you need more performance add more nodes to the cluster. Do this with a cloud-based service and you can flex the number of nodes in your cluster to meet processing demand: double-up as needed for your data-load or your predictive-model run.
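Conceptually it's a scatter-gather pattern. Here's a toy sketch (the "nodes" are just a Python loop; a real MPP engine distributes the shards across machines and runs them in parallel):

```python
from collections import Counter

# Toy sketch of the MPP idea: the fact rows are split into shards (one per
# node), each node aggregates its own shard, and a coordinator merges the
# small partial results.

def node_aggregate(shard):
    """Each node sums pos_dollars by brand for the rows it owns."""
    partial = Counter()
    for brand, pos_dollars in shard:
        partial[brand] += pos_dollars
    return partial

shards = [
    [("BrandA", 4.99), ("BrandB", 7.50)],   # rows held by node 1
    [("BrandA", 3.25), ("BrandC", 1.99)],   # rows held by node 2
]

partials = [node_aggregate(s) for s in shards]  # parallel across nodes in a real cluster
result = sum(partials, Counter())               # coordinator merges the partial sums
```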
In-memory databases deliver speed increases by pulling the data off disk storage and loading the whole thing into memory (and accessing data in memory is certainly much faster than reading it off disk). I've not tried one of these yet and I would be interested to hear comments from those that have. It sounds good, but I don't think the price point is yet where I could justify the use. 10TB of RAM is certainly much cheaper than it was 10 years ago, but my gut feel is that the economics will suggest a hybrid RAM and disk/SSD model for some time to come. There is a thoughtful blog post on SQL Server's new in-memory offering, including a few limitations, here.
Finally, let's talk Hadoop. I know it's "sexy" and often appears in the same sentence as "Big Data", but I'm not yet convinced that it's appropriate for this use, where we want rapid response on a very large number of typically small and ad-hoc queries. I could be wrong though: a friend and colleague whom I respect has recently moved to Cloudera after a lifetime of SQL/Oracle and is very excited about the possibilities using Hadoop/HBase/Impala. Looking at these benchmark results comparing a number of Hadoop-based systems to Redshift (columnar, MPP), he may well have a point. I will try to keep an open mind.
Are there other options? You bet! A number I have deliberately ignored and, I'm certain, plenty out there I have not heard of, but this set will do the job, and if another technology will do it even better now or in 5 years' time, great! The bottom line is that database speed and storage capability is growing faster than the amount of data you want in your DSR. We need to take advantage of it.
So what does this get us?
Using database technology to increase speed and to get a simpler data structure is a big win. Simpler, faster systems come with less maintenance, lower learning curves, more productivity and, I strongly believe, the capability for better insights. Slow response times are an "insight killer" (more on this in an upcoming blog post).
The simpler data structure means that it's relatively easy to swap out the front-end for the BI, analytics or visualization tool of your choice. Want that data in Business Objects or Tableau? No problem! Connect from R/SPSS/SAS/RapidMiner? Absolutely!
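Assuming the DSR sits on a plain SQL database, the hookup is about as unexciting as it should be. A hypothetical sketch (connection string, table and column names are all invented for the example):

```python
import pandas as pd
import sqlalchemy

# Hypothetical sketch: if the DSR exposes a standard SQL interface, any tool
# that speaks ODBC/JDBC or DB-API can pull from it directly.
engine = sqlalchemy.create_engine("postgresql://analyst@dsr-host/dsr")

weekly_sales = pd.read_sql(
    """
    SELECT p.brand, l.region, f.week_id, SUM(f.pos_dollars) AS pos_dollars
    FROM fact f
    JOIN product  p ON p.product_id = f.product_id
    JOIN location l ON l.store_id   = f.store_id
    GROUP BY p.brand, l.region, f.week_id
    """,
    engine,
)
# From here the data frame goes straight into whatever modeling or
# visualization library you prefer.
```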
What does this mean for DSR vendors?
- The ability to handle DSR-sized data volume is no longer a competitive advantage.
- If it's easy to set up any new BI, visualization or analytic tool against the database, providing the "best" user interface is of limited value.
- Rapidly loading new data is important.
- Providing clean data is important (and often overlooked).
- Helping users navigate the data ocean to find the things that must be done, via process-specific exceptions and workflow, is important.
- Helping users drive better decisions by embedding the analytics against the right data in real-time... now that's really important.
Great post! I wanted to mention a hybrid architecture I have been hearing about lately in discussions with Microsoft and certain other vendors. I am hearing about other shops using MS Analysis Services cubes running in full ROLAP mode on top of a massively parallel rdbms such as SQL Server 2012 PDW or HP Vertica. The idea here is that you get the benefits of a cube, especially the custom analytic functionality and custom rollup capabilities supported by MDX and the semi-additive aggregation methods of SSAS without having to process the cubes and build aggregates in the cubes. The key is to have a fast MPP relational database underneath the cube. I haven’t seen it first-hand but it has come up in conversations with technology vendors several times lately as an approach being used right now for data warehousing and by at least one DSR vendor that I am aware of.
Thanks for the comment Brian. That's a very interesting idea and I will be looking into it.