Next Generation DSRs - data blending

Over the last few months I've written a series of posts on Demand Signal Repositories.  These are the specialized database and reporting tools primarily used by CPGs for reporting against retail Point of Sale data.  
There are a number of good tools in the marketplace and you can derive substantial value from them today, but the competitive landscape is changing. Existing tools found a market because they are capable of sourcing, loading and reporting against vast amounts of data quickly.  To do so they have employed a variety of complicated architectures that are now largely obsolete, given recent advances in technology that can make solutions faster, cheaper and more flexible.
Cheaper alone may be a win in the market today, but if all we do with this new power is report on "what I sold last week" more quickly and at a lower price-point I think we are missing the point.  
The promise of a DSR has always been to explain what happened but, much more importantly, why it happened, and existing tools struggle with this:
  • they do not hold a rich enough repository of data to test out hypotheses.
  • their primary analytic tools are report-writers and pivot-tables (by which I mean that they really don't have any)
We'll come to analytics in a later post, but for now let's think data because without that there isn't very much to analyze.
Imagine that I've spent a few hundred thousand acquiring point of sale data into my own DSR and now I want to really figure out what it is that drives my sales.  
How about weather?  Ignore for the moment whether or not a future forecast is useful, but how about using weather data to explain some of the strange sales in history so that I don't trend them forwards into the coming year?  I can get very detailed weather data from a number of sources, but can I, a system user, get that data into my DSR to start reporting against it and, better yet, modeling?  Probably not.
How about SNAP, the US government's benefit program that funds grocery purchases for roughly 1 in 6 US households?  SNAP can drive huge spikes in demand for key products and I can easily find out exactly when SNAP dollars are dropped into the marketplace by day of the month and by state.  With a little time on Google I can even see when this schedule has changed in the past few years.  Can I, a system user, get this data into the DSR for reporting/modeling?  Nope.
The same is true for many additional data sources you may wish to work with (promotional records, Twitter feeds, sentiment analysis, Google Trends, shipment history, master data, geographic features, proximity to competitor stores, demographic profiles, economic time series, exchange rates etc.).
These are all relatively easy datasets to source, but if the DSR vendor has not set them up as part of the standard product, you are out of luck: the technical sophistication necessary to source, load and, especially, match data on key fields is beyond what a super-user, and in many cases a system administrator, can handle.  Can it be done?  Maybe, depending on your system, skill-level and security-access, but it's going to cost you in time and money.

Matching data in particular can be a real bear: it will be rare that you are matching records at the same level of granularity (item, location, date) and with the exact same key fields.  It is far more common to be matching weekly or monthly data to daily, state or county data to zip-codes, and product groups to shoppable items.  And you need to do it without losing any data, while sensibly handling missing data and flagging suspect data for manual follow-up.
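To make that matching problem concrete, here is a minimal sketch in Python of blending daily, zip-level POS rows with a state-level SNAP schedule keyed by day of the month. Everything in it is an assumption for illustration: the function name blend_snap, the lookup tables, the field names and all of the values are made up, and the issue days shown are not a real SNAP schedule. The point is only the shape of the problem: translate one key (zip) to another (state), derive a second key (day of month) from the date, and keep every unmatched row for follow-up rather than silently dropping it.

```python
from datetime import date

# Hypothetical lookup tables; every value here is made up for illustration.
ZIP_TO_STATE = {"72712": "AR", "75001": "TX"}
# Days of the month on which benefits are issued, by state (NOT a real schedule).
SNAP_ISSUE_DAYS = {"AR": {4, 5, 8}, "TX": {1, 3, 5}}


def blend_snap(pos_rows):
    """Attach a SNAP-issuance flag to daily POS rows.

    Returns (matched, suspect). No input row is dropped: rows whose zip
    or state cannot be matched go to `suspect` for manual follow-up.
    """
    matched, suspect = [], []
    for row in pos_rows:
        # Coarsen the location key: zip-code -> state.
        state = ZIP_TO_STATE.get(row["zip"])
        issue_days = SNAP_ISSUE_DAYS.get(state)
        if issue_days is None:
            suspect.append({**row, "reason": "no state/SNAP match"})
            continue
        # Derive the schedule key (day of month) from the daily date.
        snap_day = row["date"].day in issue_days
        matched.append({**row, "state": state, "snap_day": snap_day})
    return matched, suspect


# Made-up POS rows: two for a known zip, one for an unknown zip.
pos = [
    {"zip": "72712", "date": date(2014, 3, 4), "units": 120},
    {"zip": "72712", "date": date(2014, 3, 6), "units": 80},
    {"zip": "99999", "date": date(2014, 3, 4), "units": 15},
]
matched, suspect = blend_snap(pos)
```

Even in this toy version, most of the code is about keys and exceptions rather than analysis, which is roughly where the effort goes in a real match too.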
So if you really want to do some analysis against, say, SNAP, what must you do?  Download a small ocean of detailed POS data so you can (carefully) join it to your few hundred records of SNAP release data in a custom database or analytic app, build the models and then (because you can't write the results back out to the DSR) build a custom reporting engine against these results.  This makes no sense to me.
The solution is something called data-blending, which tries to reduce the pain of integrating multiple data sources to a level where you could contemplate doing it in near real-time.  While I have not yet seen a solution I would call perfect, the contrast with the standard, locked-down DSR scenario is impressive.
Much of what I have seen so far happens at the level of the individual user, where you are doing the match in-memory and without impacting the underlying database or fellow users in any way.  In many cases, particularly for exploratory work, this is preferable, but it's far from an ideal solution if you need to process against the detail of the entire database or have multiple needs for the same data.
The future, I think, will include such ad-hoc capability, but I suspect it also includes a more flexible data model that lets an administrator rapidly integrate new data sources into the standard offering.

Averages work! (At least for ensemble methods)

After an early start, I was sitting at breakfast downtown enjoying a burrito and an excellent book on "ensemble methods".  (Yes, I do that sometimes... don't judge)
