Next Generation DSRs - data blending

Over the last few months I've written a series of posts on Demand Signal Repositories.  These are the specialized database and reporting tools primarily used by CPGs for reporting against retail Point of Sale data.  
There are a number of good tools in the marketplace and you can derive substantial value from them today, but the competitive landscape is changing...fast. Existing tools found a market because they are capable of sourcing, loading and reporting against vast amounts of data quickly.  To do so they employed a variety of complicated architectures that recent advances in technology have made largely obsolete: solutions can now be faster, cheaper and more flexible.
Cheaper alone may be a win in the market today, but if all we do with this new power is report on "what I sold last week" more quickly and at a lower price-point, I think we are missing the point.  
The promise of a DSR has always been to explain not just what happened but, much more importantly, why, and existing tools struggle with this:
  • they do not hold a rich enough repository of data to test out hypotheses;
  • their primary analytic tools are report-writers and pivot-tables (by which I mean that they really don't have any).
We'll come to analytics in a later post, but for now let's think data because without that there isn't very much to analyze.
Imagine that I've spent a few hundred thousand acquiring point of sale data and loading it into my own DSR, and now I want to really figure out what it is that drives my sales.  
How about weather?  Ignore for the moment whether or not a future forecast is useful, but how about using weather data to explain some of the strange sales in history so that I don't trend them forward into the coming year?  I can get very detailed weather data from a number of sources, but can I, a system user, get that data into my DSR to start reporting against it and, better yet, modeling?  Probably not.
How about SNAP, the US government's benefit program that funds grocery purchases for roughly 1 in 6 US households?  SNAP can drive huge spikes in demand for key products, and I can easily go to usda.gov and find out exactly when SNAP dollars are dropped into the marketplace, by day of the month and by state.  With a little time on Google I can even see when this schedule has changed over the past few years.  Can I, a system user, get this data into the DSR for reporting/modeling?  Nope.
The same is true for many additional data sources you might wish to work with (promotional records, Twitter feeds, sentiment analysis, Google Trends, shipment history, master data, geographic features, proximity to competitor stores, demographic profiles, economic time series, exchange rates, etc.).  
These are all relatively easy datasets to source, but if the DSR vendor has not set them up as part of the standard product, you are out of luck: the technical sophistication necessary to source, load and, especially, match key fields in the data is beyond what a super-user, and in many cases a system administrator, can handle.  Can it be done?  Maybe, depending on your system, skill level and security access, but it's going to cost you in time and money.

Matching data in particular can be a real bear - it will be rare that you are matching at the same level of granularity (item, location, date) and with the exact same key fields.  Far more common is matching weekly or monthly data to daily, state or county data to zip codes, and product groups to shoppable items.  And you have to do it without losing any data, while sensibly handling missing data and flagging suspect data for manual follow-up.
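To make that concrete, here is a minimal sketch in Python/pandas of the kind of match described above, blending a state-level SNAP issuance calendar onto daily, store-level POS data.  The column names and toy data are invented for illustration; this is not any particular DSR's schema.

```python
import pandas as pd

# Daily, store-level sales (what a small POS extract might look like)
pos = pd.DataFrame({
    "store_id": [101, 101, 205, 205],
    "date": pd.to_datetime(["2014-03-01", "2014-03-02", "2014-03-01", "2014-03-02"]),
    "units": [120, 95, 300, 410],
})

# Store master data: maps each store to a state (the key we need for the blend)
stores = pd.DataFrame({"store_id": [101, 205], "state": ["TX", "GA"]})

# Hypothetical SNAP issuance schedule: days of the month benefits drop, by state
snap = pd.DataFrame({"state": ["TX", "TX", "GA"], "issue_day": [1, 3, 1]})

# Step 1: bring the POS data to the grain of the reference data (state, day-of-month),
# using a left join so no sales rows are lost
blended = pos.merge(stores, on="store_id", how="left")
blended["day_of_month"] = blended["date"].dt.day

# Step 2: flag the days on which SNAP dollars were issued in that store's state
blended = blended.merge(
    snap, left_on=["state", "day_of_month"], right_on=["state", "issue_day"], how="left"
)
blended["snap_issue"] = blended["issue_day"].notna()

# Rows that failed to pick up a state are suspect - flag them for manual follow-up
suspect = blended[blended["state"].isna()]

print(blended[["store_id", "date", "units", "state", "snap_issue"]])
```

At toy scale the merge is trivial; the hard part in real life is doing exactly this against the full POS detail without losing or double-counting rows, which is precisely what most DSRs won't let a user do in place.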
So if you really want to do some analysis against, say, SNAP, what must you do?  Download a small ocean of detailed POS data so you can (carefully) join it to your few hundred records of SNAP release data in a custom database or analytic app, build the models, and then (because you can't write the results back out to the DSR) build a custom reporting engine against those results.  This makes no sense to me.  
The solution is something called data blending, which tries to reduce the pain of integrating multiple data sources to the point that you could contemplate doing it in near real time.  While I have not yet seen a solution I would call perfect, the contrast with the standard, locked-down DSR scenario is impressive.  
Much of what I have seen so far happens at the individual user's level: you do the match in memory, without impacting the underlying database or fellow users in any way.  In many cases, particularly for exploratory work, this is preferable, but it's far from an ideal solution if you need to process against the detail of the entire database or have multiple needs for the same data.
The future, I think, will include such ad-hoc capability, but I suspect it also includes a more flexible data model that lets an administrator rapidly integrate new data sources into the standard offering.

Averages work! (At least for ensemble methods)

After an early start, I was sitting at breakfast downtown enjoying a burrito and an excellent book on "ensemble methods".  (Yes, I do that sometimes... don't judge)

The right tools for (structured) BIG DATA handling - more Redshift

In my recent post on The right tools for (structured) BIG DATA handling, I looked at using AWS Redshift to generate summaries from a large fact table and compared it to previous benchmark results using a columnar database on a fast SSD drive.

Redshift performed very well indeed, especially as the number of facts returned by the queries increased.  In that initial testing I aggregated the entire fact table to get results comparable to the previous benchmark, but that's typically not how a reporting (or analytic) system would access the data.  In this follow-up post, then, let's look at how Redshift performs when we want to aggregate across only a particular subset of records.
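For concreteness, here is a hedged sketch of the two access patterns, queried from Python.  The table and column names (pos_fact, item_dim and friends) are invented stand-ins, not the actual benchmark schema.

```python
import psycopg2  # Redshift speaks the PostgreSQL wire protocol

# Pattern 1: aggregate the whole fact table (the earlier benchmark)
FULL_AGGREGATE = """
    SELECT item_id, SUM(units) AS units, SUM(sales) AS sales
    FROM pos_fact
    GROUP BY item_id;
"""

# Pattern 2: aggregate only the records a typical report would touch,
# e.g. one category over a recent date range, via a dimension join
FILTERED_AGGREGATE = """
    SELECT f.item_id, SUM(f.units) AS units, SUM(f.sales) AS sales
    FROM pos_fact f
    JOIN item_dim i ON i.item_id = f.item_id
    WHERE i.category = %s
      AND f.sale_date BETWEEN %s AND %s
    GROUP BY f.item_id;
"""

conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                        port=5439, dbname="dsr", user="analyst", password="...")
with conn, conn.cursor() as cur:
    cur.execute(FILTERED_AGGREGATE, ("Frozen Pizza", "2014-01-01", "2014-03-31"))
    for row in cur.fetchall():
        print(row)
```

With a sensible sort key on the date column, Redshift should be able to skip most blocks for the filtered query, which is exactly why this access pattern deserves its own benchmark.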

Next Generation DSRs - it's all about speed!

Recently, I have been working with a new-to-me BI tool that has reminded me just how much speed matters.  I'm not mentioning any names here, and it's not a truly bad tool; it's just too slow, and that's an insight killer!

Continuing my series on Next Generation DSRs, let's look at how speed impacts the exploratory process and the ability to generate insight and, more importantly, value.

Many existing DSRs do little more than spit out standard reports on a schedule, and if that's all you want, it doesn't matter too much if it takes a while to build the 8 standard reports you need.  Pass off the build to the cheapest resource capable of building them and let them suffer.  Once built, if a report takes 30 minutes to run when the scheduler kicks it off, nobody is going to notice.

Exploratory, ad-hoc work is a different animal, and one that can generate much more value than standard reports.  It's a very iterative, interactive process: define a query, see what results you get back, and kick off 2-3 more queries to explain the anomalies you've discovered; filter it, order it, plot it, slice it, summarize it, mash it up with data from other sources, correlate, ..., model.  This needs speed.
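To make that loop concrete, here is a toy sketch of the same sequence in Python/pandas.  The extract file and column names are invented for illustration; in a DSR this would be a series of queries, but the shape of the process is identical.

```python
import pandas as pd

# Hypothetical extract of weekly, store-level POS data
sales = pd.read_csv("weekly_pos_extract.csv", parse_dates=["week"])

by_store = sales.groupby("store_id")["units"].sum()            # summarize it
outliers = by_store[by_store > by_store.quantile(0.99)]        # filter to the anomalies
detail = sales[sales["store_id"].isin(outliers.index)]         # slice back to the detail
print(detail.sort_values("units", ascending=False).head(20))   # order it
print(detail.groupby("week")["units"].sum())                   # trend it (or .plot() it)
print(sales[["units", "price"]].corr())                        # correlate, ..., model
```

Each line begs the next question; if every round trip takes minutes instead of seconds, the chain of thought dies long before you get to the model.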

Visualizing Forecast Accuracy. When not to use the "start at zero" rule?

I recently joined a discussion on Kaiser Fung's blog Junk Charts, When to use the start-at-zero rule, concerning when charts should force a 0 into the Y-axis.  BTW - if you have not done so, add his blog to your RSS feed; it's superb and I have become a frequent visitor.

On this particular post, I would completely agree with his thoughts were it not for the one metric I have problems visualizing: Forecast Accuracy.
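For context, here is a minimal sketch of how forecast accuracy is often computed in demand planning, assuming the common 1 minus weighted-MAPE definition (other definitions exist, and the numbers here are invented for illustration).

```python
def forecast_accuracy(actuals, forecasts):
    """1 - weighted MAPE: total absolute error scaled by total actual volume."""
    total_abs_error = sum(abs(a - f) for a, f in zip(actuals, forecasts))
    return 1 - total_abs_error / sum(actuals)

weekly_actuals = [120, 95, 300, 410, 250]
weekly_forecasts = [110, 100, 280, 430, 240]
print(f"{forecast_accuracy(weekly_actuals, weekly_forecasts):.1%}")  # prints 94.5%
```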

Recommended Reading: The Definitive Guide To Inventory Management

A little over 15 years ago now, I was set the task of modeling how much inventory was needed for all of our 3,000 or so products at every distribution center.  Prior to that point, inventory targets had been set at an aggregate level based on experience, and my management felt it was likely we had too much inventory in total and that what we did have was probably not where it was most needed. (BTW - they were absolutely right, and we were ultimately able to make substantial cuts in inventory while raising service levels.)

I came to the project with a math degree, some programming expertise, practical experience simulating production lines, optimizing distribution networks and analyzing investments, and no real idea of how to get the job done.  The books I managed to get my hands on gave you some idea of how to use such a system but no real idea of how to build one; they left out all the hard/useful bits, I think.  So I set about working it out for myself, with a lot of simulation models to validate that the outputs made sense.
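For a flavor of the kind of calculation involved, here is a minimal sketch of the textbook reorder-point model, not my original implementation: safety stock from a target service level, demand variability and lead time, under a normal-demand assumption.

```python
from statistics import NormalDist

def reorder_point(mean_daily_demand, sd_daily_demand, lead_time_days, service_level):
    """Cycle stock over the lead time plus safety stock for the target in-stock probability."""
    z = NormalDist().inv_cdf(service_level)               # e.g. 0.98 -> z of about 2.05
    cycle_stock = mean_daily_demand * lead_time_days
    safety_stock = z * sd_daily_demand * lead_time_days ** 0.5
    return cycle_stock + safety_stock

# An item at one DC selling 40 +/- 15 units/day, 7-day lead time, 98% target service level
print(round(reorder_point(40, 15, 7, 0.98)))              # roughly 280 cycle + 82 safety
```

The formula is the easy part; simulation is what tells you whether assumptions like normally distributed, independent daily demand hold up against real data.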
I still work occasionally in inventory modeling and I'll be teaching some components this fall, so I have been eagerly awaiting this new book: The Definitive Guide to Inventory Management: Principles and Strategies for the Efficient Flow of Inventory across... by CSCMP, Matthew A. Waller and Terry L. Esper (Mar 19, 2014).