Chi-Squared Scoring

Chi-Squared ($\chi^2$) Scoring

As we've seen previously, frequency analysis can be helpful with some ciphers, especially the Caesar cipher. However, trying to arrange characters in frequency order to make a correspondence between plaintext and ciphertext letters is not easy, and we're no closer to finding a method better that's than brute force and visual inspection to crack the Affine and Keyword ciphers.

The barchart was helpful when cracking Caesar cipher messages because it allowed us to quickly observe that the ciphertext was in fact enciphered using Caesar, since we could see the A/E, H/I, etc., spikes, and quickly determine what the best possible key would be based on how closely the ciphertext distribution matched the English language distribution. If the ciphertext distribution was shifted by 12, then the key was 12. We'll build on this idea, comparing the ciphertext to what we'd expect English to look like, but instead of visually inspecting the two distributions, we'll instead create a scoring method to rate each key used to decipher a ciphertext message. We can then use the score to rank the possible keys and choose the key with the best score.

Developing the Score

To create a score for each possible key we'll use a common method in mathematics and statistics, summing squared errors. In statistics, we call the difference between the expected result and the actual result an error. Errors are positive when the actual value is larger than the predicted value, and negative when the actual value is less than the predicted value. In the case of our use, we'll be using the count of each character in the message as our value. The actual value will be how many times the character appears in our deciphered ciphertext, and the expected value would be how many times we would have expected the character to appear in the deciphered ciphertext based on known letter frequencies.

For example, using the same ciphertext from the Frequency Analysis section:

KRETI JUKRP TUCHI GRDPT UHUJK XUDET IVVKP RIPER EYPWD KHPWO UPTIJ ULKJJ UDOKP TPTUU RDYIL OIHAY ERDER IINWY AUJJR IHWUP EDHWV EHUYE RDWTI JUOKP TRIPT KRCKR KPPIY KPDIO RIRIH PIUEP KPOEY ETIVV KPTIJ UERDP TEPAU ERYMI ALIHP KPTED EZUHL UMPJW HIGRD DIIHJ KSUEZ IHPTI JUZEK RPUDC HUURO KPTEY TKRWW UJJIO VHEYY SRIVK RPTUU FEMPA KDDJU PTUDI IHIZU RUDIR PIEPG VUYTE ZUDTE JJJKS UEPGR RUJEX UHWMI ALIHP EVJUP GRRUJ OKPTI GPYAI SUOKP TZERU JJUDO EJJYE RDLJI IHYPK JUDER DMEHZ UPUDZ HIXKD UDOKP TZIJK YTUDM TEKHY ERDJI PYERD JIPYI LZUCY LIHTE PYERD MIEPY PTUTI VVKPO EYLIR DILXK YKPIH YPTUP GRRUJ OIGRD IRERD IRCIK RCLEK HJWVG PRIPQ GKPUY PHEKC TPKRP IPTUY KDUIL PTUTK JJPTU TKJJE YEJJP TUZUI ZJULI HAERW AKJUY HIGRD MEJJU DKPER DAERW JKPPJ UHIGR DDIIH YIZUR UDIGP ILKPL KHYPI RIRUY KDUER DPTUR IRERI PTUHR ICIKR CGZYP EKHYL IHPTU TIVVK PVUDH IIAYV EPTHI IAYMU JJEHY ZERPH KUYJI PYILP TUYUO EHDHI VUYTU TEDOT IJUHI IAYDU XIPUD PIMJI PTUYS KPMTU RYDKR KRCHI IAYEJ JOUHU IRPTU YEAUL JIIHE RDKRD UUDIR PTUYE AUZEY YECUP TUVUY PHIIA YOUHU EJJIR PTUJU LPTER DYKDU CIKRC KRLIH PTUYU OUHUP TUIRJ WIRUY PITEX UOKRD IOYDU UZYUP HIGRD OKRDI OYJII SKRCI XUHTK YCEHD URERD AUEDI OYVUW IRDYJ IZKRC DIORP IPTUH KXUH

This message is $994$ characters long. Since we know that $8.167%$ of English characters are A's, we would expect around $994 \cdot 0.08167 = 81.17998$ A's in the plaintext. Likewise we would expect, $994 \cdot 0.01492 = 14.83048$ B's in the plaintext. Continuing these calculations we can generate a table:

Letter Expected
A 81.17998
B 14.83048
C 27.11631
D 42.27482
... ...
Y 19.62156
Z 0.73556

If we attempted to decipher this message using akey = 10 and mkey = 7 and counted the number of each character, we obtain the following actual values:

Letter Expected Actual
A 81.17998 18
B 14.83048 0
C 27.11631 16
D 42.27482 62
... ... ...
Y 19.62156 60
Z 0.73556 18

We can see that there's a difference between what we expected and what actually happened when we deciphered the message with our guessed key values. Our job is to quantify how different these two distributions are. We can start by calculating the error, or difference between actual and expected values. We will subtract expected from actual.

Letter Expected Actual Error
A 81.17998 18 -63.17998
B 14.83048 0 -14.83048
C 27.11631 16 -11.11631
D 42.27482 62 19.72518
... ... ... ...
Y 19.62156 60 41.37844
Z 0.73556 18 17.25444

That's still a lot to keep track of, so it may be tempting to sum all the individual error values to get a total error value. However, the sum of the individual errors will always be 0. This is because if there are $994$ letters in the message, and there are 11 less C's than expected, those 11 characters are going to appear as "extra" in the count for other letters. An error in one character will always get balanced out by errors in other letters. In the end the total positive errors will completely balance out the total negative errors. One way we can avoid this is by ensuring all errors are positive, so when they are summed the total is guaranteed to be positive. A common way to make values positive is to square them. So for A, the squared error would be $\left(-63.17998\right)^2 = 3991.7098728$.

Letter Expected Actual Error Squared Error
A 81.17998 18 -63.17998 3991.7098728
B 14.83048 0 -14.83048 219.94313703
C 27.11631 16 -11.11631 123.572348016
D 42.27482 62 19.72518 389.082726032
... ... ... ... ...
Y 19.62156 60 41.37844 1712.17529683
Z 0.73556 18 17.25444 297.715699714

You now have error values for each character that are positive, but comparing between different letters wouldn't make much sense since the size of the squared error should be considered in context to the expected number of characters.

After all, having a squared error of 4 for the letter A (actual was 2 off from the expected) means very different things if you were expecting 5 A's (pretty bad) or 5,000 'A's (pretty good) in your plaintext. One way to account for this is to divide the squared error by the expected count. This will help normalize the squared error so you can compare the value between different letters even though the letters have different expected counts. So for A, the normalized squared error would be $3991.7098728 / 81.17998 = 49.1711118037$

Letter Expected Actual Error Squared Error Normalized Squared Error
A 81.17998 18 -63.17998 3991.7098728 49.1711118037
B 14.83048 0 -14.83048 219.94313703 14.83048
C 27.11631 16 -11.11631 123.572348016 4.55712255893
D 42.27482 62 19.72518 389.082726032 9.20365186728
... ... ... ... ...
Y 19.62156 60 41.37844 1712.17529683 87.2598966051
Z 0.73556 18 17.25444 297.715699714 404.746995098

Using the normalized squared error scores you can see that the error for A is actually less significant than the error we had for Y, even though A had a higher squared error than Y. We were expecting there to be a lot more A's than Y's, so the error of $\approx 63$ is less significant than the error of $\approx 41$. The normalized squared error gives an easy way to compare the error for each letter.

Summing all the normalized squared errors yields a single number known as a chi-squared statistic. For this passage of text the chi-squared score would be $3894.0128$.

More generally written, the formula to calculate this score would be:

$$\chi^2 = \sum_{i=A}^Z \frac{\left(A_i - E_i\right)^2}{E_i}$$

Where $A_i$ represents the actual number of character $i$ was in the message and $E_i$ represents the expected number of character $i$ in the message.

When this number is very low, it means that there was very little difference between the actual count of each letter and the expected count of each letter. When this number is very high, it means there were many differences. We'll know that we used a good key value if our deciphered message has a small chi-squared value, since that means there was little difference between the actual and expected values.

Using the Chi-Squared Statistic

You just went through calculating the chi-squared score for the ciphertext. However, in practice you won't be concerned with the score for the ciphertext, since we already know the ciphertext doesn't have the characteristics of English. Instead, you'll want to attempt to decipher the ciphertext with all possible keys and score the "deciphered" text. Almost all of these decipherments will not be the plaintext message, and as a result will have a relatively high chi-squared score. However, one attempted decipherment should actually be the plaintext and as a result have a very low chi-squared score. By scoring the output from each possible key you can rank the keys by their corresponding chi-squared scores. The key with the lowest chi-squared score produces a deciphered text that most closely matches the English language, and is therefore most likely the correct key to return the plaintext message.

Scoring all 312 possible key combinations for the Affine cipher yields the following top 5 choices, sorted from lowest chi-squared score to highest:

a-key:  4   m-key: 17   decipherment: inaholeinthegroundtherelivedah   chi-squared score:   66.868
a-key: 20   m-key:  7   decipherment: ghulcraghdlaqncyhfdlanargtaful   chi-squared score: 1312.935
a-key: 20   m-key: 15   decipherment: ifstubaifrtaenugflrtanabivalst   chi-squared score: 1423.628
a-key:  7   m-key: 15   decipherment: vsfghonvsegnrahtsyegnanovinyfg   chi-squared score: 1447.688
a-key: 22   m-key: 25   decipherment: mfsdoncmfhdcupoqfthdcpcnmzctsd   chi-squared score: 1449.188


You can see that the correct key values produced the lowest chi-squared score by far. This technique works for any mono-alphabetic cipher (Caesar, Affine, Keyword, etc.) as long as trying all possible key values is feasible. As a result, this works very well for Caesar and Affine ciphers, but using this for a keyword cipher is particularly challenging since, as previously mentioned, there are $26!$ possible keys for a Keyword cipher, which would take even a modern computer some time to try them all.