As a first pass, and under a severe time crunch, we ran the available data through the model as-is, and I was unhappy with the predictive power we found. Of course, that approach was ridiculously optimistic, so it was back to the product characteristics we were using.
While the data were cleaner than expected, they still suffered from a range of problems visible even to someone who does not know the products that well:
- missing, invalid, and inconsistent values
- inconsistency across related products (e.g. flavor variations with different weights and pricing)
- product characteristics that should really be split into multiple characteristics, because the options are not mutually exclusive
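Checks for the first two problem types can be scripted early in a project. The sketch below uses pandas on a hypothetical product table (the column names, the product family grouping, and the "weight should match across flavor variants" rule are all illustrative assumptions, not the actual dataset):

```python
import pandas as pd

# Hypothetical product data illustrating the problem types above.
products = pd.DataFrame({
    "sku":      ["A1", "A2", "B1", "B2"],
    "family":   ["A", "A", "B", "B"],
    "flavor":   ["vanilla", "chocolate", "vanilla", "chocolate"],
    "weight_g": [500, None, 330, 340],    # one missing value
    "price":    [4.99, 4.99, -1.0, 3.49], # one invalid (negative) price
})

# Missing values, per column.
missing = products.isna().sum()

# Invalid values: prices must be positive.
invalid_price = products[products["price"] <= 0]

# Inconsistency across related products: flavor variants within a
# family are expected to share one weight, so more than one distinct
# weight per family flags a problem.
weight_counts = products.groupby("family")["weight_g"].nunique(dropna=True)
inconsistent_families = weight_counts[weight_counts > 1]
```

Even crude rules like these surface a list of suspect rows to review with someone who does know the products well.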
Clustering is a relatively simple statistical process: once it is set up, I can teach someone with limited predictive-modeling skills to re-run the models with sensible defaults and to interpret the outputs. Cleaning the data and presenting it correctly to the modeling tools, so you get useful answers, takes more skill.
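The "sensible defaults" re-run is compact enough to hand off. A minimal sketch, assuming scikit-learn and made-up numeric product characteristics (the source does not name the tools or features): scale the features so no one characteristic dominates, then cluster with a fixed seed so re-runs are repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical, already-cleaned product characteristics:
# columns are weight (g), price, and units per case.
X = np.array([
    [500, 4.99, 12],
    [510, 5.49, 12],
    [330, 3.49, 24],
    [340, 3.29, 24],
])

# Scale first: raw weights would otherwise swamp price differences.
X_scaled = StandardScaler().fit_transform(X)

# Sensible defaults: a small k, several restarts, and a fixed
# random_state so a re-run by someone else gives the same clusters.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_scaled)
```

The hard part, as argued above, is everything before `X`: deciding which characteristics go in and making sure their values are trustworthy.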
So, if you are knee-deep in a modeling project and have not paused to check your data quality, perhaps now is the time…