Better Business Analytics: Business Analytics - finding the balance between complexity and readability

In this blog I try to present analytic material for a non-analytic audience. I focus on point of sale and supply chain analytics. It's a complex area and, frankly, it's far too easy, whether writing for a blog or presenting to a management team, to slip into the same language I would use with an expert.

So, I was inspired by a recent post on Nathan Yau's excellent blog FlowingData to look at the "readability" of my own posts and apply some simple analytics to the results.

I've followed Nathan's blog for a couple of years now for the many and varied examples of data visualization he builds and gathers from other sources. One that particularly caught my eye was this one, published by the Guardian just before the recent State of the Union address in the United States.

The Guardian plotted the Flesch-Kincaid grade levels for past addresses. Each circle represents a state of the union and is sized by the number of words used. Color is used to provide separation between presidents. For example, Obama's state of the union last year was around the eighth-grade level, and in contrast, James Madison's 1815 address had a reading level of 25.3.

Neither the original post nor Nathan's goes into much detail about why the linguistic standard has declined. Within this period, the nature of the address and the intended audience have certainly changed. Frankly, having scanned a few of the earlier addresses, I think we can all be grateful not to be on the receiving end of one of them.

So, I was inspired to find out the reading level of my own blog. It's intended to present analytic concepts to a non-analytic audience. I can probably go a little higher than recent presidential addresses (8th-10th grades, roughly ages 13-15) but I don't want to be writing college-level material either.

All the books my kids read are graded in this (or a very similar) way but I had never thought about how such a grading system could be constructed. The Flesch-Kincaid grade level estimate is based on a simple formula:

$0.39 \left ( \frac{\mbox{total words}}{\mbox{total sentences}} \right ) + 11.8 \left ( \frac{\mbox{total syllables}}{\mbox{total words}} \right ) - 15.59$

That's just a linear combination of:

average words per sentence;
average syllables per word;
a constant term.

In fact (though I have not yet found details of how it was constructed) it looks to be the result of a regression model. (Simple) data science in action from the 1970s.
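As a minimal sketch, the formula above translates directly into a few lines of Python. The function name and argument names are illustrative choices, not taken from any particular library, and the function assumes the three counts have already been obtained somehow:

```python
def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade level computed from three raw counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    # The same linear combination as the formula above.
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Example: 100 words in 5 sentences with 150 syllables.
print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

The hard part in practice is the counting, especially syllables, which is why different tools can report slightly different grades for the same text.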

Note that Flesch-Kincaid says nothing about the length of the book or the nature of the vocabulary; it's all down to long sentences and the presence of multi-syllabic words.

(BTW, the preceding sentence has a Flesch-Kincaid grade score of 13.63, calculated with this online utility.) Now that's pretty high, worthy of an early-1900s president and (supposedly) understandable by young college students. The sentence is longer than typical, 31 words vs. my average of 18 (see below), and words like "vocabulary", "sentences" and "multi-syllabic" are not helping me either.
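As a rough check, that score can be worked by hand. The 31 words and the single sentence come from the text above. The 45 syllables are a hand count (treating "Flesch-Kincaid" and "multi-syllabic" as two words each), so take that figure as an assumption rather than something the online utility reported:

$0.39 \left ( \frac{31}{1} \right ) + 11.8 \left ( \frac{45}{31} \right ) - 15.59 \approx 12.09 + 17.13 - 15.59 = 13.63$

With only one sentence, the whole word count lands in the first term, which is why a single long sentence pushes the grade up so quickly.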

Approach

I could have copied and pasted each post into the online utility I used above, recorded the results in a spreadsheet and pulled some stats from that. That would work, but if I ever wanted to repeat the exercise or modify it, perhaps to use a different readability index, I would have to do all that work again. At the time of writing, there are 44 published posts on this blog - there must be a better way.

Actually, there are probably many better ways, but as I also wanted to flex some R-programming muscle, I built a web-scraper in R to do the work for me and analyze the results (more on this in a later post).
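To give a flavor of what such a scraper has to do once it has a page's HTML, here is a small sketch in Python using only the standard library. It is not the R scraper described above. The class and function names are made up for illustration, fetching the page is left out, and the word and sentence rules are deliberately naive:

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace into single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def count_words_and_sentences(text):
    """Naive counts: words are runs of letters, digits or apostrophes;
    sentences are non-empty chunks between '.', '!' or '?'."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(words), len(sentences)


sample = "<p>Hello world. This is a test!</p><script>var x = 1;</script>"
print(count_words_and_sentences(html_to_text(sample)))
```

Rules this simple will miscount abbreviations, decimals and hyphenated words, so a real tool would need more care; the point is only that the whole pipeline can be repeated at the push of a button.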

Results

Let's start with some simple summaries of the results I collected.

Histograms showing the % of posts from this blog (prior to 2/14/13), with the average (mean) value shown in red.

There is some variety in the grade reading level indicated by Flesch-Kincaid for my blog posts, averaging around 10 but ranging from 7 through 14. I average about 750 words per post, but occasionally go much longer and have a number of very short "announcement" style posts. My sentences average 18 words.

OK, so now I know, but is that good? I don't have a definitive benchmark, but according to at least one source the target range on Flesch-Kincaid for Technical or Industry readers is 7-12, so I'm feeling pretty good about that.

I did wonder whether there was any other, hidden, structure to the data though. I know the equation is based on words per sentence and syllables per word, so there is no point looking at those; obviously I'll find a relationship. But is my writing style influenced by anything else?

Flesch-Kincaid grade level vs. the number of words by post on this blog.

Other than a handful of long posts that rate lower, in the range 8-10, I don't see much going on here.

Flesch-Kincaid grade level vs. the publication date by post on this blog. The size of each post (in words) is shown by the area of each point; color is used purely to help visually differentiate the points.

Apart from a couple of recent "complex" posts, this does seem to be showing a trend, so I added a regression line and labeled the more extreme posts. Point (b) is a very short "announcement" style post (you can hardly see the point at all) and I could probably ignore it completely. Point (e) is a more fun piece I did on pie charts that's probably not very representative of the general material either.
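For readers curious what a regression line involves, here is a minimal ordinary least-squares sketch in Python. It is not the code behind the chart, and the function name is made up for illustration. Here x might be days since the first post and y the grade level of each post:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A negative slope would correspond to grade levels falling over time. Because every post carries equal weight in this fit, a tiny announcement post can pull the line as much as a long article, which is one reason to consider dropping points like (b).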

If you want to compare readability for yourself, here are the top (and bottom) 10 posts ranked by Flesch-Kincaid grade level:

Rank	Post	Flesch-Kincaid grade level	words	sentences
1	Analytic tools "so easy a 10 year-old can use it"	13.3	784	33
2	Point of Sale Analytics - newsletter released	13.1	82	4
3	Point of Sale Data – Category Analytics	12.8	676	29
4	How to save real money in truckload freight (Part I)	12.8	723	31
5	The Primary Analytics Practitioner	12.7	541	29
6	Reporting is NOT Analytics	12.4	891	43
7	Point of Sale Data – Sales Analytics	12.1	478	24
8	Data handling - the right tool for the job	11.9	762	38
9	Data Cleansing: boring, painful, tedious and very, very important	11.8	297	16
10	Point of Sale Data – Supply Chain Analytics	11.6	958	41

35	The right tools for (structured) BIG DATA handling	9.0	1878	114
36	Better Point of Sale Reports with "Variance Analysis": Velocity...	8.9	1264	78
37	Better Point of Sale Reports with Variance Analysis (update)	8.5	177	10
38	Better Business Reporting in Excel - XLReportGrids 1.0 released	8.4	70	5
39	What's driving your Sales? SNAP?	8.3	651	42
40	Do you need daily Point of Sale data?...	8.2	1395	83
41	SNAP Analytics (1) - Funding and spikes.	8.1	531	32
42	SNAP Analytics (2) - Purchase Patterns	7.9	773	44
43	Business Analytics - The Right Tool For The Job	7.6	483	36
44	Are pie charts truly evil or just misunderstood ?	7.1	1097	70
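The words and sentences columns make it possible to see which half of the formula is doing the work. The sketch below, in Python, takes a few rows copied from the table, computes words per sentence, and then solves the grade formula backwards for syllables per word. The function names are illustrative, and because the grades in the table are rounded to one decimal, the implied syllable figures are only approximate:

```python
# (rank, grade level, words, sentences), copied from the table above
rows = [
    (1, 13.3, 784, 33),
    (2, 13.1, 82, 4),
    (3, 12.8, 676, 29),
    (42, 7.9, 773, 44),
    (43, 7.6, 483, 36),
    (44, 7.1, 1097, 70),
]


def words_per_sentence(words, sentences):
    return words / sentences


def implied_syllables_per_word(grade, words, sentences):
    """Rearrange the grade formula to solve for syllables per word."""
    return (grade + 15.59 - 0.39 * words / sentences) / 11.8


for rank, grade, words, sentences in rows:
    wps = words_per_sentence(words, sentences)
    spw = implied_syllables_per_word(grade, words, sentences)
    print(f"rank {rank:>2}: {wps:5.1f} words/sentence, ~{spw:.2f} syllables/word")
```

On these rows, the highest-graded posts come out with both longer sentences and more syllables per word than the lowest-graded ones.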

Conclusions

It appears that my material is (largely) written at a level that should be accessible to the reader. And I am using more readable language in recent posts, which sounds like a good thing.

But there remains a key question for me that these stats can't really answer. Am I getting better at explaining the complex (my goal) or just explaining simpler things? What do you think?

In case you are wondering, this post has a Flesch-Kincaid grade level of about 8. So if you can follow the "State of the Union" address you should have been just fine with this.
