
Next Generation Point of Sale Analytics

Over the last few months I've been exploring the features I want to see in a next-generation platform for point of sale analytics: it's simpler, faster and cheaper, supports rapid blending of new data sources, and is powered up with real analytic capability. Looking back, there are a lot of posts on this topic, so here is a quick summary with links back to the detail.


Note
  • I have no immediate plans to build such a system for sale, but I do use systems with many of these features for ad-hoc analytics as they are flexible yet relatively easy to set up and tear down without incurring substantial overheads. Consider this series more of a manifesto/buyer's guide.
  • I do see changes in the marketplace suggesting that a number of DSR vendors are at least considering a move in this direction. As to which one will get there first, I think it will be whoever feels least weighed down by their existing architecture.
Database technology has moved on dramatically over the last few years. For this scale of data, analytic solutions should be columnar, parallel and (possibly) in memory. This enables speed, scalability and a simple data structure that makes it easy to hook up whatever analytic or BI tools you wish.
If the only data you have in the system is POS sales for a single retailer, you can build a reporting system ("what sold well last week") but you will struggle to understand why sales change. Bringing in other data sources (multi-retailer, demographics, weather information, promotional calendars, competitor activity, socio-economic trends, Google trends, social media, etc.) allows for much more insightful analytics. It's not easy to do this, though, particularly if your source database is locked down so that it takes a software engineer to add tables.
The term "Analytics" in general use covers a lot of activities most of which involve little more than reporting. In some instances you can slice and dice your way through a dataset to find insight, reporting is not without value but it's not analytics. Not even close.
Can you buy good analytics? Yes, but there are also a number of pseudo-analytic solutions in the market that have little to no analytic power - caveat emptor!
To get to real, deep insights you need real analytic tools. Depending on the taxonomy you are used to, we are talking about predictive and prescriptive analytics, machine learning, statistics, optimization or data science. Most of these tools are not new, but they are not generally found in standard BI offerings, and even when they are (e.g. reporting-level R integration) you may struggle to apply the analytic tools at scale.
Finally, whether you build your own analytic tools or buy them in to run on your platform, clever math is not enough. If a user cannot comprehend the tool or its suggestions due to poor user interface design and/or bad visualization choices, it's worth precisely... squat.

Next Generation DSRs - Bring the Analytics to the data


Under old world analytics, you move data from the DSR to your analytic server, build models, then write results (sometimes models too) back out for integration into the DSR.
Now, consider this:
  • DSR datasets are often enormous. (Two years of data from a DSR I worked with recently, used as model input, came to approx. 270 GB.)
  • Analytic tools are small. (The R base software, all 150 packages I have installed, and the development environment total 625 MB.)
  • Analytic models are tiny. (Expressing a 10-component regression model in SQL takes just 288 bytes, and most of that is down to variable names.)
Let's try that visually.
The input data is huge, everything needed to run R (my analytics tool of choice) is barely a blip on the scale, and the resulting model can't be seen at this scale at all. And yet today we move the DSR data to the analytic server to run the analytics... anyone else having an issue with this?
Where the data is small enough that we can pull what we need via a query over an ODBC connection and hold it in memory to run the analytics, perhaps you can live with the network overhead.
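For what it's worth, this pull-into-memory pattern is only a few lines in R. Here is a minimal sketch using the DBI and odbc packages; the DSN, table and column names are all invented for illustration:

  # Minimal sketch: pull just the slice we need over ODBC and hold it in
  # memory for analysis. "dsr_dsn", "pos_sales" and the column names are
  # hypothetical - substitute your own connection details and schema.
  library(DBI)

  con <- dbConnect(odbc::odbc(), dsn = "dsr_dsn")

  pos <- dbGetQuery(con, "
    SELECT store_id, product_id, sales_date, sales_units, sales_revenue
    FROM   pos_sales
    WHERE  sales_date >= '2014-01-01'")

  dbDisconnect(con)

  # pos is now an ordinary in-memory data.frame, ready for modelling
  str(pos)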
Similarly, if the DSR and analytic servers are co-located with a big, fat data pipe connecting them, it doesn't matter so much. It's not necessarily the same machine I'm after, but the same rack would be nice.
What happens though, when the data is too big and the connection too slow (think wide area network) to be feasible? Now we need to build database structures on the analytic server, load the data (taking a copy), and if we are to re-run the analytics routinely, keep it in sync with the source on an ongoing basis. This is a lot of (non-analytic) maintenance work before we can even get started on the analytics.
So why do we do this?
"The analytic server is a high power, high memory machine great for analytics!" That's true but chances are your database servers have the same thing.
There are also valid concerns around how an analytic tool connecting directly to a database may impact other users. I do have a little sympathy for this, certainly much more than I used to, but think on this: a DSR is not a mission-critical system. The failure of a mission-critical system stops your business. If the DSR stops (and the chances are very good that you will have no issue at all), your reports are a bit late. Relax!
I have a suspicion that some of this is related to licensing. If you pay a small fortune for your analytic tools and they are priced per server, per CPU or per core, I can see why you would not want to go installing that software everywhere you might want to use it. Cheaper, perhaps, to bring the data to the software. Working with free open-source tools, it's not been an issue for me to install them co-located, or even on the same machine, as needed.
Recently a number of database and BI vendors have moved to integrate analytic tools (often R, sometimes SAS) into their offerings, trying to deliver real in-database analytics. I do think this is a great direction to move in, though I have some concerns about the level of integration currently available. See my post on Analytic Power! for more details.
Even if you can't execute true in-database analytics (which should be a Next Generation feature) there are still things you should be able to do to bring the analytics to the data.
First, let's make a distinction between model-building (the act of creating new models from data) and model-scoring (running existing models against new data to make new predictions). All predictive analytic models I can think of have this same split. (Descriptive and prescriptive analytics do not.)
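To make the split concrete, here is a minimal sketch in R, using the built-in mtcars data purely as a stand-in for POS history (the real thing would of course use your own sales data):

  # Model-building: the expensive, do-it-once step (mtcars is just a
  # stand-in dataset so the sketch runs anywhere)
  fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

  # Model-scoring: cheap and fast - apply the fitted weights to new rows
  new_rows <- head(mtcars, 3)   # pretend these are newly arrived records
  predict(fit, newdata = new_rows)

  # The fitted model itself is tiny: just a handful of coefficients
  coef(fit)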
Model-building is an intensive task; this is where all the heavy lifting happens in analytic work, so processing and memory needs can be substantial, though this varies widely depending on the analytic method and, to some extent, the implementation. If you have installed analytic tools directly on your database servers, this may be enough to cause something of a slow-down. OK - try to co-locate instead. If you absolutely must replicate data to an analytic server on the other side of the world and try to keep your data in sync, I pity you.
Model-scoring is fast. A model is just a set of simple calculations. Deciding exactly which simple calculations you needed was the job of model-building, but now that's done, scoring new data against the model is quick.
This is what the result of a simple regression model looks like (in SQL):
[Variable_1] *-49.8916 + [Variable_2] *-24.2773 + [Variable_3] *-48.1305 + [Variable_4] *-253.7238 + [Variable_5] *-20.7173 + [Variable_6] *17.722 + [Variable_7] *12.9865 + [Variable_8] *-17.4036 + [Variable_9] *2.2738 + [Variable_10] *-7.9186 + 6.668 AS Prediction
If you think it looks complex, look again: it's just a set of input variables multiplied by specific weights (as found by model-building) and then added together. This is easy work for the database. More complicated models will have more complex expressions; you may see logs, exponents, trig., perhaps an if..then..else statement. Nothing the database will find difficult to execute if it's expressed in the right language.
Unless models change with every input of new data (and so need re-building), there is no excuse not to score the model directly against the data. How you execute the model scoring is a different question, and you have some options:
  • You may load the model and the new data into your analytic tools and score there. This is using a sledgehammer to crack a nut, but it's easy to do, if a little heavyweight/slow.
  • For simpler models, converting the model into SQL is not that difficult (though you do need to know SQL pretty well and have permission to build it into the database as a view, stored procedure or user-defined function). This is probably the most difficult option but the fastest to execute.
  • Try converting the model to PMML (Predictive Model Markup Language) and use a server-based tool designed to execute PMML against your database. Many analytic tools have an option to export models as PMML; see the sketch below. A PMML-enabled DSR would be a great enhancement for the Next Generation.
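For the PMML route, the export step from R is typically a couple of lines. A sketch using the pmml and XML packages, with a stand-in model and an arbitrary file name:

  # Sketch of exporting a fitted model as PMML from R so that a
  # PMML-aware engine can score it next to the data. The model and the
  # file name are placeholders.
  library(pmml)
  library(XML)

  fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)   # stand-in model
  saveXML(pmml(fit), file = "regression_model.pmml")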
Bring your analytics to the data, spend more time doing analytics and less time wrangling data.

The right tools for (structured) BIG DATA handling - more Redshift

In my recent post on The right tools for (structured) BIG DATA handling, I looked at using AWS Redshift to generate summaries from a large fact table and compared it to previous benchmark results using a columnar database on a fast SSD drive.

Redshift performed very well indeed, especially as the number of facts returned by the queries increased. In this initial testing I was aggregating the entire fact table to get results comparable to the previous benchmark, but that's typically not how a reporting (or analytic) system would access the data. In this follow-up post, then, let's look at how Redshift performs when we want to aggregate across particular records.
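To give a flavour of what that looks like, here is a sketch of the kind of filtered aggregation being tested, issued from R over an ODBC connection to the Redshift cluster; the DSN, table and column names are invented:

  # Sketch of a filtered aggregation against Redshift, issued from R.
  # DSN, table and column names are invented for illustration.
  library(DBI)
  con <- dbConnect(odbc::odbc(), dsn = "redshift_dsn")

  result <- dbGetQuery(con, "
    SELECT product_id,
           SUM(sales_units)   AS units,
           SUM(sales_revenue) AS revenue
    FROM   pos_fact
    WHERE  sales_date BETWEEN '2013-01-01' AND '2013-03-31'
    GROUP  BY product_id")

  dbDisconnect(con)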

The right tools for (structured) BIG DATA handling - columnar, mpp and cloud - AWS Redshift

Today, I'm coming back a little closer to the series of promised posts on the Next Generation DSR to look at some benchmark results for the Amazon Redshift database. Some time ago I wrote a couple of quite popular posts on using columnar databases and faster (solid-state) storage to dramatically (4100%) improve the speed of aggregation queries against large data sets. As data volumes even for ad-hoc analyses continue to grow, though, I'm looking at other options.
Here's the scenario I've been working with: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle - and you need to do it without the combined resources of your IT department being placed at your disposal.

Next-Generation DSRs - data handling

This post continues my look at the Next Generation DSR.  A DSR (Demand Signal Repository) holds data, typically Point of Sale data,  and that data volume is big,  not Google-search-engine big, but compared to a CPG's transaction systems, it's huge.  Furthermore, the system is required to rapidly load large quantities of new data, clean it, tie it into known data dimensions and report against it in very limited time-frames.

But scale and performance needs aside, why have most (though not all) CPGs chosen to buy rather than build the capability?  After all, it is primarily a business-intelligence/database application and most businesses run a number of them. One key reason is that it's challenging to get business reporting performance at this data scale from existing technology.

This post looks at how this problem gets solved today and how newer database technology can change that landscape.

The Next-Generation DSR

CPGs have had access to Point of Sale (POS) data for many years now, and many of them use a Demand Signal Repository (DSR) to gather, clean and report on this data. (Actually, most of them use a number of DSRs, and even when they do have just one, they still can't handle truly cross-retailer analytics.)

I've been involved with a number of these systems as a software buyer, a system administrator, a consultant and, most recently, leading the analytic development at Orchestro.

There are some excellent tools available and, in their current form, they can help you drive both additional revenue and reduced costs when used well.  However, in my experience many of these tools have been sold in under the guise of "saving time" through reporting automation.  That's valuable, but it's not "finding a new sales opportunity" valuable.

I think we are still in the infancy of DSR development: systems are operating at the limits of the technology they were built on and necessary trade-offs mean that being good at one thing (e.g. speed) makes it more challenging to be good at others (e.g. analytics).

The next generation of DSR can be dramatically more effective.  In particular, it will be:

The right tools for (structured) BIG DATA handling (update)

A couple of weeks ago, in The right tools for (structured) BIG DATA handling, I ran a somewhat rough benchmark to show just how much faster large database queries can run if you use better tools.
Here's the scenario: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle - and you need to do it without the combined resources of your IT department being placed at your disposal. Sound familiar?
I looked at the value of upgrading hard drives (to make sure the CPU is actually kept busy) and the benefit of using columnar storage, which lets the database pull back data in larger chunks and with fewer trips to the hard drive. The results were... staggering: a combined 4100% increase in processing speed, so that I could read and aggregate 10 facts from a base table with over 40 million records on my laptop in just 37 seconds.
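For context, the benchmark query itself is nothing exotic. Here is a sketch of its shape, timed from R; the DSN, table and column names are invented, and the real tests aggregate ten distinct fact columns like this:

  # Sketch of the benchmark's shape: aggregate ten fact columns over the
  # whole base table and time it from R. All names are invented.
  library(DBI)
  con <- dbConnect(odbc::odbc(), dsn = "columnar_dsn")

  system.time(
    totals <- dbGetQuery(con, "
      SELECT product_id,
             SUM(fact_01) AS f01, SUM(fact_02) AS f02, SUM(fact_03) AS f03,
             SUM(fact_04) AS f04, SUM(fact_05) AS f05, SUM(fact_06) AS f06,
             SUM(fact_07) AS f07, SUM(fact_08) AS f08, SUM(fact_09) AS f09,
             SUM(fact_10) AS f10
      FROM   pos_fact
      GROUP  BY product_id")
  )

  dbDisconnect(con)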

At the time I promised an update on a significantly larger data-set to see whether the original results scaled well. I also wanted to see whether query times scaled well to fewer facts; ideally, querying against 5 facts should take about 50% of the time of the original 10-fact aggregation queries.

The right tools for (structured) BIG DATA handling

Here's the scenario: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle - and you need to do it without the combined resources of your IT department being placed at your disposal. Sound familiar?

Let's use Point of Sale data as an example, as POS data can easily generate more data volume than the ERP system. The data is simple and easily organized in conventional relational database tables: you have a number of "facts" (sales revenue, sales units, inventory, etc.) defined by product, store and day going back a few years, plus some additional information about products, stores and time stored in master ("dimension") tables.
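As a sketch of that layout (all names invented), the whole thing boils down to one tall fact table keyed by product, store and day, plus small dimension tables; issued here from R for consistency with the rest of this series:

  # Sketch of the layout described above; table and column names are
  # invented for illustration.
  library(DBI)
  con <- dbConnect(odbc::odbc(), dsn = "pos_dsn")

  dbExecute(con, "
    CREATE TABLE pos_fact (
      product_id      INT,
      store_id        INT,
      sales_date      DATE,
      sales_units     INT,
      sales_revenue   DECIMAL(12,2),
      inventory_units INT)")

  dbExecute(con, "
    CREATE TABLE dim_product (
      product_id  INT,
      description VARCHAR(100),
      category    VARCHAR(50))")

  dbDisconnect(con)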

The problem is that you have thousands of stores, thousands of products and hundreds (if not thousands) of days - this can very quickly feel like "big data". Use the right tools, and my rough benchmarks suggest you can not only handle the data but also see a huge increase in speed.

Ignore SNAP and your product may not be on the shelf when it's most needed - and that means lost sales.


SNAP is the “Supplemental Nutrition Assistance Program” (formerly known as “Food Stamps”) in the United States which puts food on the table for 46 million people every month. 

SNAP can drive big spikes in sales at the store. These spikes are large but short-lived and often pass undetected by reporting and forecasting systems.    

Our whitepaper covers the causes of SNAP spikes, why they vary so much across regions and products, how to identify sales spikes and what you should be doing to maximize sales.

Download it now or visit our website for more information.


Better Business Analytics - Christmas list

It's that time of year again: my kids have written, re-written and re-re-written their Christmas lists now so we all hope Santa will read them carefully and take notice.

With just a few days left before the holiday season hits I wanted to do something a little more fun, so I've pulled together a list of things that I think every Business Analyst should want.  Some are free to acquire, just costing your time to learn, others you may wish to ask Santa (or your CFO) to provide.





SNAP Analytics (2) - Purchase Patterns

Roughly 15% of the United States population receives SNAP funding to help pay for food and beverage items. We know that when SNAP (food stamp) funding is released in each state (see SNAP Analytics (1) - Funding and spikes), it is accompanied by significant sales spikes on some products.

If 15% of all shoppers visit your store within a 2-3 day period, you should see a sales spike on everything they buy, SNAP-funded or not. So why do we not see a spike on everything? Why are some spikes so much bigger than others?


Do you need daily Point of Sale data? Do you like selling more product?

Most people report on their Point of Sale data in weekly or perhaps even monthly buckets. If you are interested in seeing a long-term trend or annual seasonality, that's OK, but if you really want to know what's going on - to ensure you have product on the shelf and promotions running when your target shoppers are in store - you need daily POS data. Don't believe me? Let's look at an example...

Point of Sale Data – Supply Chain Analytics


I've spent a large part of my career working in Analytics for Supply Chain. It's an area blessed with a lot of data, and I've been able to use predictive analytics and optimization very successfully to drive cost out of the system. Much of what I learned in managing CPG supply chains translates directly to retailer supply chains; it's just that there is much more data to deal with.

Point of Sale Data – Sales Analytics


I'm assuming that you now have a DSR (see [Point Of Sale Data - Basic Analytics]), so you can manipulate the large quantities of data necessary to do this work, you have your routine reports automated, and you use the DSR for ad-hoc queries against the POS data.

The DSR provides a great foundation for analytic work: use it to integrate multiple data sources, clean the data, and handle very large data volumes as though they were all sat on your desktop, and it will help you build reports that summarize that history with ease. Typically, though, the DSR does not provide much help with predictive analytics.

Let's look at an example related to what really drives sales. Do you know? Can you quantify it? Knowing these answers in quantified detail can help you better explain your sales history and plan for the future: better promotions, better pricing, and supply chains that anticipate peaks in demand and make sure the product is on the shelf when it's needed. Here are some of the things that could drive your sales:

Point of Sale Data – Basic Analytics


You've got access to Point of Sale Data…now, what are you going to do with it?

For the purpose of this blog entry, I’m assuming that we have daily aggregated data by product and by store.  We will certainly get measures of sales (both units sold and currency received).  We may also get other useful measures like inventory on-hand, inventory in-transit, store-receipts, mark-downs taken at the store and perhaps some data around warehouse activity too.

Bringing your analytical guns to bear on Big Data – in-database analytics

I've blogged before about the need to use the right tools to hold and manipulate data as data quantity increases (Data handling - the right tool for the job). But I really want to get to some value-enhancing analytics, and as data grows it becomes increasingly hard to apply analytical tools.

Let's assume that we have a few terabytes of data and that it's sat in an industrial-strength database (Oracle, SQL Server, MySQL, DB2, ...) - one that can handle the data volume without choking. Each of these databases has its own dialect of the querying language (SQL), and while you can do a lot of sophisticated data manipulation, even a simple analytical routine like calculating correlations is a chore.
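To show what I mean by "a chore", here is a Pearson correlation between two fact columns written out in raw SQL, versus a one-liner once the data is somewhere R can see it. This is a sketch: the DSN, table and column names are invented, and some databases do offer a CORR() aggregate that shortens the SQL considerably.

  # The "chore": Pearson correlation spelled out in raw SQL, versus a
  # one-liner in R. DSN, table and column names are invented.
  library(DBI)
  con <- dbConnect(odbc::odbc(), dsn = "warehouse_dsn")

  # In SQL: write out the correlation formula by hand
  dbGetQuery(con, "
    SELECT (COUNT(*) * SUM(sales_units * price)
              - SUM(sales_units) * SUM(price))
           / (SQRT(COUNT(*) * SUM(sales_units * sales_units)
                     - SUM(sales_units) * SUM(sales_units))
            * SQRT(COUNT(*) * SUM(price * price)
                     - SUM(price) * SUM(price))) AS correlation
    FROM   pos_fact")

  # In R: pull the two columns and let cor() do the work
  d <- dbGetQuery(con, "SELECT sales_units, price FROM pos_fact")
  cor(d$sales_units, d$price)

  dbDisconnect(con)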

Reporting is NOT Analytics

Reporting is about what happened; Analytics is about answering "what if" and "what's best" questions.  Most of the materials that land on a VP/Director’s desk (or inbox) are examples of reporting with no analytical value added.

Data handling - the right tool for the job


Microsoft Excel must be the most widely used analytic tool available today, but it's not the only tool in my tool-bag and it should not be the only one in yours either. It's perfect for building small models, has broad embedded functionality, good charting capabilities and pivot-tables (a superb tool worthy of its own post), comes with a programming language so you can extend its capability even further, and almost everybody has a copy already. It's awesome and I use it every day.

But... the data we analyze is getting much bigger and a lot more complex. Even with newer versions of Excel allowing over 1 million records in a sheet, what can you usefully do in Excel with 1 million records? You certainly don't want to print it, and it's near impossible to chart or model against; bottom line, you are using the wrong tool for the job. To do the job well you need to find, learn and use the right tool. Don't believe me? Try chopping down a tree with a hammer!