Tuesday, July 04, 2017

The ‘i Before e, Except After c’ Rule Is a Giant Lie

Wonkblog explains in "The ‘i before e, except after c’ rule is a giant lie": "‘This addendum to the rule is completely useless,’ Cunningham writes. ‘You still have roughly three to one odds that the ‘i’ goes first.’"


Karl said...

The first thing that jumped out at me when reading the article was that simple word count is not the best metric to measure the efficacy of a spelling rule. Weighting the ratio based on how likely it is to encounter the words in question seems more useful. This will also normalize word variations being counted as multiple instances.

Let’s assume the weighted word list from http://www.wordfrequency.info/ for the top 5000 American English words is representative. A simple word count shows 72% ie:ei, which is pretty close to the original article. If instead we do a ratio of the total frequency of ie words to ei words, it actually gets worse, falling to only 57%. The biggest impact is from "their", a very common word which accounts for 65% of the weighted violations. But there/their/they're usage is one of the most commonly pointed out issues, so let's assume someone who cares about spelling knows "their" outside the i before e rule. If I go through the list and eliminate words which are known exceptions or have otherwise overriding rules (e.g. prefix/suffix usage), the ratio goes up to 85%. That's not too bad as a guideline.
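For anyone who wants to try the same comparison, here is a minimal sketch of the two metrics described above: a plain word count versus a frequency-weighted share, with an optional exclusion list for known exceptions such as "their". The word list and frequencies in `TOY_FREQ` are invented for illustration and are not taken from wordfrequency.info; the function names are hypothetical too, and this is not necessarily how the numbers in this comment were produced.

```python
from typing import Dict, Iterable, Optional

# Hypothetical toy data: word -> made-up relative frequency.
# These values are invented for illustration only.
TOY_FREQ: Dict[str, int] = {
    "their": 500,
    "friend": 120,
    "believe": 100,
    "field": 60,
    "weight": 40,
    "piece": 30,
    "time": 900,  # contains neither "ie" nor "ei", so it is ignored
}


def pattern(word: str) -> Optional[str]:
    """Return "ie" or "ei" for a word, or None if it has neither or both."""
    has_ie = "ie" in word
    has_ei = "ei" in word
    if has_ie == has_ei:
        return None
    return "ie" if has_ie else "ei"


def ie_share(freqs: Dict[str, int], weighted: bool,
             exclude: Iterable[str] = ()) -> float:
    """Share of "ie" among all ie/ei words, by count or by total frequency."""
    skip = set(exclude)
    ie_total = 0
    ei_total = 0
    for word, freq in freqs.items():
        if word in skip:
            continue
        kind = pattern(word)
        if kind is None:
            continue
        amount = freq if weighted else 1  # weight by frequency or count once
        if kind == "ie":
            ie_total += amount
        else:
            ei_total += amount
    total = ie_total + ei_total
    if total == 0:
        raise ValueError("no ie/ei words left to measure")
    return ie_total / total


if __name__ == "__main__":
    print("count share:   ", ie_share(TOY_FREQ, weighted=False))
    print("weighted share:", ie_share(TOY_FREQ, weighted=True))
    print("minus 'their': ", ie_share(TOY_FREQ, weighted=True, exclude=["their"]))
```

With a real frequency list loaded into the same dictionary shape, the exclusion list is where the hand-picked exceptions and prefix/suffix cases would go.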

Howard said...

I appreciate the analytical approach and didn't know about wordfrequency.info.

1. I get that "their" accounts for a huge percentage of uses, but as you point out it is one of the most common problems. And I'm not sure it's fair to discount it because people who care about spelling will know it. I remember being in elementary school and spelling badly and not particularly caring about it except when grades came back. Such a common exception made it harder for me to remember the other uses.

2. I know the list of top 5000 words is free, but limiting the analysis to them is also problematic. Given the commonness of these words, if I cared about spelling :), I might see these often enough to know how to spell them. The next few thousand words, which I might use only occasionally and still spell wrong, are perhaps more problematic.

karl said...

I did what I could with what I could find. Starting from scratch, I think it's a poor guideline. The manipulations were along the lines of: if we use the guideline, how can we make the most of it? Consider that even a rule with even odds will get you the right answer half the time, while randomly choosing has a much larger variance.

As far as the short list goes, with the weighted metric, the later words will have increasingly smaller impact (at most 1/2 the median short list, or 1/200 "their"). If many of the words above 5000 are impacted by the rule it can add up, but there wasn't a clear grouping of good/bad v. word frequency. I'm not sure there is a reason to suspect a significantly different distribution for later words. Throw that in with the assumption that the corpus is representative.

Your point that common words are less likely to benefit from the rule, since they are known, is well taken. It highlights the difficulty in selecting representative metrics.