In this blog I try to present analytic material for a non-analytic audience. I focus on point-of-sale and supply chain analytics. It's a complex area and, frankly, it's far too easy, whether writing for a blog or presenting to a management team, to slip into the same language I would use with an expert.
So, I was inspired by a recent post on Nathan Yau's excellent blog FlowingData to look at the "readability" of my own posts and apply some simple analytics to the results.
I've followed Nathan's blog for a couple of years now for the many and varied examples of data visualization he builds and gathers from other sources. One that particularly caught my eye was this one, published by the Guardian just before the recent State of the Union address in the United States.
Neither the original post nor Nathan's goes into much detail about why the linguistic standard has declined. Within this period, the nature of the address and the intended audience have certainly changed. Frankly, having scanned a few of the earlier addresses, I think we can all be grateful not to be on the receiving end of one of them.
The Guardian plotted the Flesch-Kincaid grade levels for past addresses. Each circle represents a state of the union and is sized by the number of words used. Color is used to provide separation between presidents. For example, Obama's state of the union last year was around the eighth-grade level, and in contrast, James Madison's 1815 address had a reading level of 25.3. 
So, I was inspired to find out the reading level of my own blog. It's intended to present analytic concepts to a non-analytic audience. I can probably go a little higher than recent presidential addresses (8th-10th grades, roughly ages 13-15) but I don't want to be writing college-level material either.
All the books my kids read are graded in this (or a very similar) way, but I had never thought about how such a grading system could be constructed. The Flesch-Kincaid grade level estimate is based on a simple formula:

grade level = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) - 15.59
That's just a linear combination of:
- average words per sentence;
- average syllables per word;
- a constant term.
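To make the three ingredients concrete, here is a minimal Python sketch of the calculation. The coefficients are the standard published Flesch-Kincaid grade-level values; the function names, the sentence splitter and the vowel-group syllable counter are my own simplifying assumptions, so its scores will not exactly match any particular online utility.

```python
import re

def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not a dictionary lookup):
    count runs of vowels, then discount a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # "made" has one syllable, but "table" keeps its final "-le" syllable.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level from words/sentence and syllables/word."""
    # Naive sentence split on ., ! and ? (abbreviations will be miscounted).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters, optionally with one apostrophe ("it's").
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59

print(round(flesch_kincaid_grade("The cat sat on the mat."), 2))
```

Very short, simple text can score below zero, which is a reminder that the result is a regression-style estimate rather than a literal school grade.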
In fact (though I have not yet found details of how it was constructed) it looks to be the result of a regression model. (Simple) data science in action from the 1970s.
Note that Flesch-Kincaid says nothing about the length of the book or the nature of the vocabulary; it's all down to long sentences and the presence of multi-syllabic words.
(BTW, the preceding sentence has a Flesch-Kincaid grade score of 13.63, calculated with this online utility.) Now that's pretty high: worthy of an early-1900s president and (supposedly) understandable by young college students. The sentence is longer than typical, at 31 words vs. my average of 18 (see below), and words like "vocabulary", "sentences" and "multi-syllabic" are not helping me either.
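As a quick sanity check on those numbers, the grade formula can be turned around to see what syllable count the score implies. This is only a sketch: it assumes the utility uses the standard coefficients (0.39, 11.8, 15.59) and treats the text as a single 31-word sentence, and the function name is mine.

```python
def implied_syllables_per_word(grade: float, words: int, sentences: int) -> float:
    """Solve the Flesch-Kincaid grade formula for average syllables per word."""
    return (grade + 15.59 - 0.39 * words / sentences) / 11.8

per_word = implied_syllables_per_word(13.63, words=31, sentences=1)
total_syllables = per_word * 31

print(round(per_word, 2), round(total_syllables))  # prints: 1.45 45
```

So, under those assumptions, the score corresponds to about 45 syllables across the 31 words: the sentence length alone contributes 0.39 × 31 ≈ 12.1 grade points before the constant is subtracted.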
Approach
I could have copied and pasted each post into the online utility I used above, recorded the results in a spreadsheet and pulled some stats from that. That would work, but if I ever wanted to repeat the exercise or modify it, perhaps to use a different readability index, I would have to do all that work again. At the time of writing there are 44 published posts on this blog, so there must be a better way.
Actually, there are probably many better ways, but as I also wanted to flex some R-programming muscle, I built a web scraper in R to do the work for me and analyze the results (more on this in a later post).
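The R scraper itself is not shown in this post. Purely as an illustration of the general idea, here is a small Python sketch using only the standard library: it pulls the visible text out of a page's HTML so that it can be handed to a readability function. The class and function names are hypothetical, the list of block-level tags is an assumption, and a real blog page would also need its navigation and comments stripped out.

```python
import re
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""
    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "br", "td", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            # Keep words from adjacent blocks from running together.
            self.chunks.append(" ")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)

def visible_text(html: str) -> str:
    """Return the page text with tags removed and whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()

def fetch_post_text(url: str) -> str:
    """Download one page (URL supplied by the caller) and extract its text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return visible_text(response.read().decode(charset, errors="replace"))

sample = "<h1>Title</h1><p>First sentence. Second <b>one</b>.</p><script>var x = 1;</script>"
print(visible_text(sample))  # prints: Title First sentence. Second one.
```

Looping `fetch_post_text` over a list of post URLs and scoring each result would give one row per post, which is the kind of table summarized in the results.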
Results
Let's start with some simple summaries of the results I collected.
OK, so now I know, but is that good? I don't have a definitive source, but according to at least one source the target range on Flesch-Kincaid for technical or industry readers is 7-12, so I'm feeling pretty good about that.
I did wonder whether there was any other, hidden, structure to the data though. I know the equation is based on words per sentence and syllables per word, so there is no point looking at those: obviously I'll find a relationship. But is my writing style influenced by anything else?
Figure: Flesch-Kincaid grade level vs. the number of words by post on this blog. Other than a handful of long posts that rate lower, in the range 8-10, I don't see much going on here.
If you want to compare readability for yourself, here are the top (and bottom) posts ranked by Flesch-Kincaid grade level:
| Rank | Post | Flesch-Kincaid grade level | Words | Sentences |
|---|---|---|---|---|
| 1 |  | 13.3 | 784 | 33 |
| 2 |  | 13.1 | 82 | 4 |
| 3 |  | 12.8 | 676 | 29 |
| 4 |  | 12.8 | 723 | 31 |
| 5 |  | 12.7 | 541 | 29 |
| 6 |  | 12.4 | 891 | 43 |
| 7 |  | 12.1 | 478 | 24 |
| 8 |  | 11.9 | 762 | 38 |
| 9 |  | 11.8 | 297 | 16 |
| 10 |  | 11.6 | 958 | 41 |
| 35 |  | 9.0 | 1878 | 114 |
| 36 |  | 8.9 | 1264 | 78 |
| 37 |  | 8.5 | 177 | 10 |
| 38 |  | 8.4 | 70 | 5 |
| 39 |  | 8.3 | 651 | 42 |
| 40 |  | 8.2 | 1395 | 83 |
| 41 |  | 8.1 | 531 | 32 |
| 42 |  | 7.9 | 773 | 44 |
| 43 |  | 7.6 | 483 | 36 |
| 44 |  | 7.1 | 1097 | 70 |
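For anyone who wants to poke at these numbers, here is a small Python sketch that computes Pearson correlations from the rows listed above. It uses only the 20 posts shown (the top 10 and bottom 10 by grade), so it is not a representative sample of all 44 posts and should be read as a rough illustration rather than a result. The variable and function names are my own.

```python
from math import sqrt

# (grade level, words, sentences) for the 20 posts listed in the table.
POSTS = [
    (13.3, 784, 33), (13.1, 82, 4), (12.8, 676, 29), (12.8, 723, 31),
    (12.7, 541, 29), (12.4, 891, 43), (12.1, 478, 24), (11.9, 762, 38),
    (11.8, 297, 16), (11.6, 958, 41), (9.0, 1878, 114), (8.9, 1264, 78),
    (8.5, 177, 10), (8.4, 70, 5), (8.3, 651, 42), (8.2, 1395, 83),
    (8.1, 531, 32), (7.9, 773, 44), (7.6, 483, 36), (7.1, 1097, 70),
]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

grades = [g for g, _, _ in POSTS]
words = [w for _, w, _ in POSTS]
words_per_sentence = [w / s for _, w, s in POSTS]

# Post length is not part of the formula, so any link here is about style.
print("grade vs. words:", round(pearson(grades, words), 2))
# Words per sentence is an input to the formula, so a link is expected.
print("grade vs. words per sentence:", round(pearson(grades, words_per_sentence), 2))
```

The second correlation is the "obvious" relationship mentioned earlier, since sentence length feeds directly into the grade; the first is the numeric counterpart of the scatter plot of grade against post length.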
Conclusions
It appears that my material is (largely) written at a level that should be accessible to the reader. And I am using more readable language in recent posts, which sounds like a good thing. But there remains a key question for me that these stats can't really answer: am I getting better at explaining the complex (my goal), or just explaining simpler things? What do you think?
In case you are wondering, this post has a Flesch-Kincaid grade level of about 8. So if you can follow the "State of the Union" address you should have been just fine with this.
 



 