
Next Generation Point of Sale Analytics

Over the last few months I've been exploring the features I want to see in a next-generation platform for point of sale analytics: it's simpler, faster and cheaper; supports rapid blending of new data sources; and is powered up with real analytic capability. Looking back, there are a lot of posts on this topic, so here is a quick summary with links back to the detail.


Note
  • I have no immediate plans to build such a system for sale, but I do use systems with many of these features for ad-hoc analytics as they are flexible yet relatively easy to set up and tear down without incurring substantial overheads. Consider this series more of a manifesto/buyer's guide.
  • I do see changes in the marketplace suggesting that a number of DSR vendors are at least considering a move in this direction. As to which one will get there first, I think it will be whoever feels least weighed down by their existing architecture.
Database technology has moved on dramatically over the last few years. For this scale of data, analytic solutions should be columnar, parallel and (possibly) in memory. This enables speed, scalability and a simple data structure that makes it easy to hook up whatever analytic or BI tools you wish.
If the only data you have in the system is POS sales for a single retailer, you can build a reporting system ("what sold well last week") but you will struggle to understand why sales change. Bringing in other data sources (multi-retailer, demographics, weather information, promotional calendars, competitor activity, socio-economic trends, Google trends, social media, etc.) allows for much more insightful analytics. It's not easy to do this though, particularly if your source database is locked down so that it takes a software engineer to add tables.
The term "Analytics" in general use covers a lot of activities, most of which involve little more than reporting. In some instances you can slice and dice your way through a dataset to find insight, so reporting is not without value, but it's not analytics. Not even close.
Can you buy good analytics? Yes, but there are also a number of pseudo-analytic solutions in the market that have little to no analytic power - caveat emptor!
To get to real, deep insights you need real analytic tools. Depending on the taxonomy you are used to, we are talking about predictive and prescriptive analytics, machine learning, statistics, optimization or data science. Most of these tools are not new, but they are not generally found in standard BI offerings, and even when they are (e.g. reporting-level R integration) you may struggle to apply the analytic tools at scale.
Finally, whether you build your own analytic tools or buy them in to run on your platform, clever math is not enough. If a user cannot comprehend the tool or its suggestions due to poor user interface design and/or bad visualization choices, it's worth precisely... squat.

Next Generation DSRs - Bring the Analytics to the data


Under old world analytics, you move data from the DSR to your analytic server, build models, then write results (sometimes models too) back out for integration into the DSR.
Now, consider this:
  • DSR datasets are often enormous. (Two years of data for a DSR I worked with recently, used as model input, was approx. 270 GB.)
  • Analytic tools are small. (The R base software, all 150 packages I have installed and the development environment come to 625 MB.)
  • Analytic models are tiny. (Expressing a 10-component regression model in SQL takes just 288 bytes, and most of that is down to variable names.)
Let's try that visually.
The input data is huge, everything needed to run R (my analytics tool of choice) is barely a blip on the scale, and the resulting model can't be seen at this scale at all. And yet today we move the DSR data to the analytic server to run the analytics... anyone else having an issue with this?
Where the data is small enough that we can pull what we need via query over an ODBC connection and hold it in memory to run the analytics, perhaps you can live with the network overhead.
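For that small-data case, here is a minimal sketch of the pull-into-memory approach in R, assuming the DBI and odbc packages; the DSN, table and column names are hypothetical stand-ins for whatever your DSR actually exposes.

library(DBI)

# Connect over ODBC to the DSR; "DSR_DSN" is a placeholder DSN
con <- dbConnect(odbc::odbc(), dsn = "DSR_DSN")

# Pull only the slice of POS data we need into local memory
pos <- dbGetQuery(con, "
  SELECT store_id, product_id, week, sales_units, avg_price
  FROM   pos_fact
  WHERE  week >= '2014-01-01'
")

# ...and run the analytics locally, e.g. a simple price-response model
fit <- lm(sales_units ~ avg_price, data = pos)
summary(fit)

dbDisconnect(con)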
Similarly, if the DSR and analytic servers are co-located with a big fat data pipe connecting them, it doesn't matter so much. It's not the same machine I'm after necessarily, but the same rack would be nice.
What happens though, when the data is too big and the connection too slow (think wide area network) to be feasible? Now we need to build database structures on the analytic server, load the data (taking a copy), and if we are to re-run the analytics routinely, keep it in sync with the source on an ongoing basis. This is a lot of (non-analytic) maintenance work before we can even get started on the analytics.
So why do we do this?
"The analytic server is a high power, high memory machine great for analytics!" That's true but chances are your database servers have the same thing.
There are also valid concerns around how an analytic tool connecting directly to a database may impact other users. I do have a little sympathy for this, certainly much more than I used to, but think on this: a DSR is not a mission-critical system. The failure of a mission-critical system stops your business. If the DSR stops (and the chances are very good that you will have no issue at all), your reports are a bit late. Relax!
I have a suspicion that some of this is related to licensing. If you pay a small fortune for your analytic tools and they are priced per server, per CPU or per core, I can see why you would not want to go installing that software everywhere you might want to use it. Cheaper, perhaps, to bring the data to the software. Working with free open-source tools, it's not been an issue for me to install them co-located with, or even on, the database machine as needed.
Recently a number of database and BI vendors have moved to integrate analytic tools (often R, sometimes SAS) into their offerings, trying to deliver real in-database analytics. I do think this is a great direction to move in, though I have some concerns about the level of integration currently available. See my post on Analytic Power! for more details.
Even if you can't execute true in-database analytics (which should be a Next Generation feature) there are still things you should be able to do to bring the analytics to the data.
First let's make a distinction between model-building (the act of creating new models from data) and model-scoring (running existing models against new data to make new predictions). All predictive analytic models I can think of can have this same split. (Descriptive and prescriptive analytics do not.)
Model-building is an intensive task; this is where all the heavy lifting happens in analytic work, so processing and memory needs can be substantial, though this varies widely depending on the analytic method and, to some extent, the implementation. If you have installed analytic tools directly on your database servers this may be enough to cause something of a slow-down. OK - try to co-locate instead. If you absolutely must replicate data to an analytic server on the other side of the world and try to keep your data in sync, I pity you.
Model-scoring is fast. A model is just a set of simple calculations. Deciding exactly which simple calculations you needed was the job of model-building, but now that you have done that, scoring new data against that model is quick.
This is what the result of a simple regression model looks like (in SQL):
[Variable_1] *-49.8916 + [Variable_2] *-24.2773 + [Variable_3] *-48.1305 + [Variable_4] *-253.7238 + [Variable_5] *-20.7173 + [Variable_6] *17.722 + [Variable_7] *12.9865 + [Variable_8] *-17.4036 + [Variable_9] *2.2738 + [Variable_10] *-7.9186 + 6.668 AS Prediction
If you think it looks complex, look again: it's just a set of input variables multiplied by specific weights (as found by model-building) and then added together. This is easy work for the database. More complicated models will have more complex expressions; you may see logs, exponents, trig., perhaps an if..then..else statement. Nothing the database will find difficult to execute if it's expressed in the right language.
Unless models change with every input of new data (and so need re-building) there is no excuse not to score the model directly against the data. How you execute the model scoring is a different question and you have some options (a rough sketch of each follows this list):
  • You may load the model and the new data into your analytic tools and score it there. This is using a sledgehammer to crack a nut, but it's easy to do, if a little heavyweight/slow.
  • For simpler models, converting the model into SQL is not that difficult (though you do need to know SQL pretty well and have permission to build it into the database as a view, stored procedure or user-defined function). This is probably the most difficult option but the fastest to execute.
  • Try converting the model to PMML (Predictive Model Markup Language) and use a server-based tool designed to execute PMML against your database. (Many analytic tools have an option to export models as PMML.) A PMML-enabled DSR would be a great enhancement for the Next Generation.
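Here is the rough sketch promised above, in R. The fitted model, the connection (con), the training data and the table/column names are all hypothetical stand-ins, and the PMML step assumes the pmml package is available.

library(DBI)

# training_data and con: a hypothetical training set and an open DBI connection
# A toy model standing in for whatever model-building produced
fit <- lm(sales_units ~ avg_price + promo_flag, data = training_data)

# Option 1: pull the new data back and score it in the analytic tool
new_data <- dbGetQuery(con, "SELECT * FROM pos_fact WHERE week = '2015-06-01'")
new_data$prediction <- predict(fit, newdata = new_data)

# Option 2: convert the model to SQL and let the database do the scoring
b   <- coef(fit)
sql <- paste0("SELECT *, ", b["(Intercept)"], " + ",
              paste0("[", names(b)[-1], "] * ", b[-1], collapse = " + "),
              " AS Prediction FROM pos_fact")
# dbGetQuery(con, sql)   # or build the expression into a view / stored procedure

# Option 3: export the model as PMML for a PMML-aware scoring engine
# library(pmml); XML::saveXML(pmml::pmml(fit), "sales_model.pmml")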
Bring your analytics to the data: spend more time doing analytics and less time wrangling data.

Next Generation DSRs - An Analytic name is not enough

You need not always build your analytic tools; sometimes you should buy them in. If the chosen application does what you need, that often makes good economic sense... as long as you know what you are buying.

Let's be clear, an Analytic name does NOT mean there are any real Analytics under the hood.

For many managers, Analytics is akin to magic. They do not know, in any meaningful way, how an analytics application works and have no real interest in knowing. At the same time, there is no business standard for what makes up "forecasting", "inventory optimization", "cluster analysis", "pricing analysis", "shopper analytics", "like products" or even (my favorite) "optimization".  Don't buy a lemon!


Next Generation DSRs - Analytic power !

To handle real Analytics (see my recent post Reporting is NOT Analytics) you need real Analytic power. BI tools are built on the language they use to interrogate the database (typically SQL), and with no library of analytic routines behind it, that's not nearly enough.
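To make that concrete, here is the sort of thing an analytic library gives you in a couple of lines of R; the data frames and column names below are purely illustrative.

# Which drivers actually move sales, and by how much? One line to fit, one to inspect.
fit <- lm(sales_units ~ avg_price + promo_flag + temperature, data = weekly_sales)
summary(fit)

# Group similar stores from their descriptive measures - again, a single call
clusters <- kmeans(scale(store_profile[, c("sales_rate", "median_income")]), centers = 5)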


The right tools for (structured) BIG DATA handling - more Redshift

In my recent post on The right tools for (structured) BIG DATA handling, I looked at using AWS Redshift to generate summaries from a large fact table and compared it to previous benchmark results using a columnar database on a fast SSD drive.

Redshift performed very well indeed, especially so as the number of facts returned by the queries increased.  In that initial testing I was aggregating the entire fact table to get results comparable to the previous benchmark, but that's typically not how a reporting (or analytic) system would access the data.  In this follow-up post, then, let's look at how Redshift performs when we want to aggregate across only particular subsets of records.
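For the sake of illustration, this is the kind of targeted aggregation I mean, as opposed to rolling up the whole fact table; the table and column names are invented, and con is assumed to be an open DBI connection to the Redshift cluster.

library(DBI)

# Aggregate a single category over recent weeks rather than the whole fact table
dbGetQuery(con, "
  SELECT   f.store_id, f.week,
           SUM(f.sales_units)   AS units,
           SUM(f.sales_revenue) AS revenue
  FROM     pos_fact f
  JOIN     product_dim p ON p.product_id = f.product_id
  WHERE    p.category = 'Salty Snacks'
    AND    f.week >= '2015-01-01'
  GROUP BY f.store_id, f.week
")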

Next Generation DSRs - it's all about speed !

Recently, I have been working with a new-to-me BI tool that has reminded me just how much speed matters.  I'm not mentioning any names here, and it's not a truly bad tool, it's just too slow and that's an insight killer!

Continuing my series on Next Generation DSRs, let's look at how speed impacts the exploratory process and the ability to generate insight and, more importantly, value.

Many existing DSRs do little more than spit out standard reports on a schedule and if that's all you want, it doesn't matter too much if it takes a while to build the 8 standard reports you need.  Pass off the build to the cheapest resource capable of building them and let them suffer.  Once built, if it takes 30 minutes to run when the scheduler kicks it off, nobody is going to notice.

Exploratory, ad-hoc, work is a different animal and one that can generate much more value than standard reports.  It's a very iterative/interactive process.  Define a query, see what results you get back and kick off 2-3 more queries to explain the anomalies you've discovered: filter it, order it, plot it, slice it, summarize it, mash it up with data from other sources, correlate, .., model.  This needs speed.

Next-Generation DSRs (multi-retailer)

This post continues my look at the Next Generation DSR.  Demand Signal Repositories collect, clean, report on and analyze Point of Sale data to help CPGs drive increased revenues and reduce costs.

Most CPG implementations of a DSR support just one retailer's POS data.  OK, before someone gets back to me with "but we have multiple retailers' POS data in our system", I'll clarify:
  • Having Walmart and Sam's Club data in the same DSR does not count (as the data comes from the same single source, RetailLink) and I bet you are still limited as to what you can report on across them.
  • If you have multiple retailers' POS data set up in isolated databases using the same front-end... it does not count.
  • If you have the data in the same database but without common data standards ... it does not count.
  • If you have the data in the same database but with no way to run analysis or reports across multiple retailers at once... it does not count.
So, yes, a number of CPGs have DSRs that support multi-retailer POS data sources, but very, very few (if any?) have integrated that data into a single database with common data standards so they can report and analyze across multiple POS sources at the same time.

Does it matter?  I think so: multi-retailer capability opens up big opportunities around promotional effectiveness, assortment planning, supply-chain forecasting (demand sensing) and ease of use.
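As a hypothetical example of why it matters: once the data really does sit in one database with common standards, a cross-retailer question becomes a single ordinary query. Names below are invented and con is an assumed open DBI connection.

library(DBI)

# Promotional lift by retailer for one item, in a single pass -
# only possible when retailer is just another column in a harmonized fact table
dbGetQuery(con, "
  SELECT   retailer, promo_flag,
           AVG(sales_units) AS avg_weekly_units
  FROM     pos_fact
  WHERE    product_id = 'ITEM-123'
  GROUP BY retailer, promo_flag
")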

The right tools for (structured) BIG DATA handling - columnar, mpp and cloud - AWS Redshift

Today, I'm coming back a little closer to the series of promised posts on the Next Generation DSR to look at some benchmark results for the Amazon Redshift database.   Some time ago I wrote a couple of quite popular posts on using columnar databases and faster (solid state) storage to dramatically (4100%) improve the speed of aggregation queries against large data sets.  As data volumes even for ad-hoc analyses continue to grow though, I'm looking at other options.
Here's the scenario I've been working with: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle - and you need to do it without the combined resources of your IT department being placed at your disposal.

Next Generation DSRs - Scale Out !!

Last week, I posted my thoughts on how new technology enables a simpler and faster database to support your DSR applications: Next-Generation DSRs (data handling).  Over the next few posts I'll extend that idea to show how speed and simplicity are essential to your personal productivity, user experience and the ability to apply powerful analytic tools to your data.

In the meantime though I came across a great post in Rob Klopp's "Database Fog Blog" regarding Redshift. Redshift is Amazon's cloud-based, columnar, parallel database.

Remember that my interest in database technology is all about feeding my insatiable desire for data to drive value-added analytics, my own area of expertise. To that end, I have become adept in a number of programming languages and relational database systems, and while I'm a lot better than "competent" I am not "expert". Rob clearly is an expert in this field and I will be following his posts carefully.

Here's a highlight from his post Thoughts on AWS Redshift:
... if you can add nodes and scale out to improve query response then why not throw hardware at performance problems rather than build a fragile infrastructure of aggregate tables, cubes, pre-joined/de-normalized marts, materialized views, indexes, etc. Each of these performance workarounds are both expensive to build and expensive to operate.
He goes on to talk about why scale-out has not been generally adopted and how Amazon Redshift changes the game by making it easy to acquire and release processing power on demand.

The answer does not have to be Redshift; perhaps it's Impala or Hekaton or... whatever.  The bottom line for me is that new technology enables DSRs that are simpler and faster, and that creates a fundamental shift in system capability.

FYI - I have done some DSR-scale testing with Redshift and the results were very impressive.  More on that soon.


Next-Generation DSRs - data handling

This post continues my look at the Next Generation DSR.  A DSR (Demand Signal Repository) holds data, typically Point of Sale data,  and that data volume is big,  not Google-search-engine big, but compared to a CPG's transaction systems, it's huge.  Furthermore, the system is required to rapidly load large quantities of new data, clean it, tie it into known data dimensions and report against it in very limited time-frames.

But scale and performance needs aside, why have most (though not all) CPGs chosen to buy rather than build the capability?  After all, it is primarily a business-intelligence/database application and most businesses run a number of them. One key reason is that it's challenging to get business reporting performance at this data scale from existing technology.

This post looks at how this problem gets solved today and how newer database technology can change that landscape.

The Next-Generation DSR

CPGs have had access to Point of Sale (POS) data now for many years and many of them use a Demand Signal Repository (DSR) to gather, clean and report on this data.  (Actually, most of them use a number of DSRs and, even when they do have just one, still can't handle truly cross-retailer analytics.)

I've been involved with a number of these systems as a software buyer, a system administrator, a consultant and, most recently, as the leader of analytic development at Orchestro.

There are some excellent tools available and, in their current form, they can help you drive both additional revenue and reduced costs when used well.  However, in my experience many of these tools have been sold in under the guise of "saving time" through reporting automation.  That's valuable, but it's not "finding a new sales opportunity" valuable.

I think we are still in the infancy of DSR development: systems are operating at the limits of the technology they were built on and necessary trade-offs mean that being good at one thing (e.g. speed) makes it more challenging to be good at others (e.g. analytics).

The next generation of DSR can be dramatically more effective.  In particular, it will be:

The right tools for (structured) BIG DATA handling (update)

A couple of weeks ago, I ran a somewhat rough benchmark to show just how much faster large database queries can run if you use better tools.
The right tools for (structured) BIG DATA handling.  Here's the scenario: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle - and you need to do it without the combined resources of your IT department being placed at your disposal.  Sounds familiar?
I looked at the value of upgrading hard drives (to make sure the CPU is actually kept busy) and the benefit of using columnar storage, which lets the database pull back data in larger chunks and with fewer trips to the hard drive.  The results were... staggering: a combined 4100% increase in processing speed, so that I could read and aggregate 10 facts from a base table with over 40 million records on my laptop in just 37 seconds.

At the time I promised an update on a significantly larger data-set to see whether the original results scaled well.  I also wanted to see whether query times scaled well to fewer facts: ideally, querying against 5 facts should take about 50% of the time of the original 10-fact aggregation queries.

The right tools for (structured) BIG DATA handling

Here's the scenario: you are a business analyst charged with providing reporting and basic analytics on more data than you know how to handle - and you need to do it without the combined resources of your IT department being placed at your disposal.  Sounds familiar?

Let's use Point of Sale data as an example, as POS data can easily generate more data volume than the ERP system.  The data is simple and easily organized in conventional relational database tables: you have a number of "facts" (sales-revenue, sales-units, inventory, etc.) defined by product, store and day going back a few years, and then some additional information about products, stores and time stored in master ("dimension") tables.

The problem is that you have thousands of stores, thousands of products and hundreds (if not thousands) of days - this can very quickly feel like "big data".  Use the right tools, and my rough benchmarks suggest you can not only handle the data but see a huge increase in speed.
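For illustration, here is a minimal sketch of that star-schema layout built from R through DBI; every table and column name is made up, and con is an assumed open database connection.

library(DBI)

# One large fact table keyed by product, store and day...
dbExecute(con, "
  CREATE TABLE pos_fact (
    product_id    INT,
    store_id      INT,
    day           DATE,
    sales_revenue DECIMAL(12,2),
    sales_units   INT,
    inventory     INT
  )")

# ...plus small master ('dimension') tables describing products, stores and time
dbExecute(con, "CREATE TABLE product_dim (product_id INT, brand VARCHAR(50), category VARCHAR(50))")
dbExecute(con, "CREATE TABLE store_dim   (store_id INT, retailer VARCHAR(50), region VARCHAR(50))")
dbExecute(con, "CREATE TABLE date_dim    (day DATE, week_ending DATE, fiscal_period VARCHAR(10))")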

Business Analytics - The Right Tools For The Job

Whether your analytic tool of choice is Excel or R or Access or SQL Server or... whatever, if you've worked a reasonable range of analytic problems I will guarantee that at some point you have tried to make your preferred tool do a job it is not intended or ill-suited for.  The end result is an error-prone maintenance nightmare, and there is a better way.

Better Business Analytics - Christmas list

It's that time of year again: my kids have written, re-written and re-re-written their Christmas lists now so we all hope Santa will read them carefully and take notice.

With just a few days left before the holiday season hits I wanted to do something a little more fun, so I've pulled together a list of things that I think every Business Analyst should want.  Some are free to acquire, just costing your time to learn, others you may wish to ask Santa (or your CFO) to provide.





Better Business Reporting in Excel

What do you think is the most-used reporting tool in the world?  There are a lot of them available - at times it can be hard to move without running into a new Business Intelligence (BI) tool - but I'm going to hazard a guess, with a high degree of confidence, that the most-used tool is still Excel.

Excel may not be where the data originated from and there may well have been a database involved to crunch numbers and aggregate to the point that the data would fit comfortably in Excel, but I do believe that it's still the most widely used tool for final 'analysis' and presentation of data.

This is not because Excel is the best tool for the job; almost every feature of Excel is handled better by another tool.  Joining separate tables of data with lookup functions is a very pale shadow of using a properly defined database and the SQL language.  There are substantially better graphics, charting and visualization tools available.  The analysis tools (correlation, regression, ANOVA, linear programming) are the poor relation to more industrial-strength packages.
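For example, the sheet full of lookups that stitches product and store attributes onto sales collapses into one declarative join once the data sits in a database. A hypothetical sketch via R and DBI (names invented, con an open connection):

library(DBI)

# The database equivalent of a wall of VLOOKUPs: one join, one aggregation
dbGetQuery(con, "
  SELECT   p.category, s.region, SUM(f.sales_revenue) AS revenue
  FROM     pos_fact f
  JOIN     product_dim p ON p.product_id = f.product_id
  JOIN     store_dim   s ON s.store_id   = f.store_id
  GROUP BY p.category, s.region
")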

Bringing your analytical guns to bear on Big Data – in-database analytics

I've blogged before about the need to use the right tools to hold and manipulate data as data quantity increases (Data handling - the right tool for the job).  But I really want to get to some value-enhancing analytics, and as data grows it becomes increasingly hard to apply analytical tools.

Let’s assume that we have a few terabytes of data and that it's sat in an industrial-strength database (Oracle, SQL Server, MySQL, DB2, …) - one that can handle the data volume without choking.  Each of these databases has its own dialect of the querying language (SQL), and while you can do a lot of sophisticated data manipulation, even a simple analytical routine like calculating correlations is a chore.
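As a small illustration of that gap: a Pearson correlation is a single function call in R, while hand-rolling it in the database means writing out the formula yourself. The table and column names below are hypothetical and con is an assumed open DBI connection.

# In R: one call against an in-memory data frame
cor(weekly_sales$sales_units, weekly_sales$avg_price)

# Pushed into SQL: workable, but already a chore for something this simple
library(DBI)
dbGetQuery(con, "
  SELECT (COUNT(*) * SUM(sales_units * avg_price) - SUM(sales_units) * SUM(avg_price)) /
         (SQRT(COUNT(*) * SUM(sales_units * sales_units) - SUM(sales_units) * SUM(sales_units)) *
          SQRT(COUNT(*) * SUM(avg_price  * avg_price)  - SUM(avg_price)  * SUM(avg_price)))
         AS pearson_r
  FROM   weekly_sales
")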

Data handling - the right tool for the job


Microsoft Excel must be the most widely used analytic tool available today, but it’s not the only tool in my tool-bag and it should not be the only one in yours either.  It’s perfect for building small models, has broad embedded functionality, good charting capabilities, pivot-tables (a superb tool worthy of its own post), comes with a programming language so you can extend its capability even further, and almost everybody has a copy already.  It’s awesome and I use it every day.

But... the data we analyze is getting much bigger and a lot more complex.  Even with newer versions of Excel allowing over 1 million records in a sheet, what can you do usefully in Excel with 1 million records?  Certainly you don’t want to print it, it’s near impossible to chart or model against; bottom line, you are using the wrong tool for the job.  To do the job well you need to find, learn and use the right tool.  Don’t believe me?  Try chopping down a tree with a hammer!

Data Cleansing: boring, painful, tedious and very, very important

I've been working recently on a category management project and I'm reminded of just how essential clean, well-organized data is.  We are working to group stores into 'clusters' of similar stores; later we will see what geographic and demographic data best helps us to predict cluster membership and optimize product assortment by cluster.
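For context, here is a minimal sketch of that kind of store clustering in R; the store_profile data frame and its columns are placeholders for whatever demographic and sales measures are actually available.

# store_profile: one row per store with hypothetical descriptive measures
features <- scale(store_profile[, c("sales_rate", "median_income",
                                    "store_sqft", "urban_density")])

set.seed(42)                                   # keep the clustering reproducible
clusters <- kmeans(features, centers = 6, nstart = 25)

store_profile$cluster <- clusters$cluster      # attach cluster membership
table(store_profile$cluster)                   # how many stores in each cluster?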

As a first pass, and under a severe time crunch, we took the data available and ran it through the model; while it processed fine, I was unhappy with the predictive power we found.  Of course, this approach was ridiculously optimistic, so it was back to looking at the product characteristics we were using.