Search
Automating Vigenère Cipher Cracking

Automating Vigenère Cipher Cracking

Using the Kasiski examination technique works particularly well when messages contain repeating ciphertext fragments. However, if the author of a message is particularly careful they may be able to avoid these repeating fragments. At this point it would be difficult to determine the keylength manually. Fortunately, with the Index of Coincidence and Chi-Squared scoring, you can almost completely automate the cracking of Vigenère ciphers.

Determining the Key Length using Index of Coincidence

While the Kasiski method elegantly uses repeated ciphertext fragments and the distance between them to logically deduce the length of the keyword used in a Vigenère cipher, this method is a brute force attack to determine key length.

We know from earlier in this chapter that a monoalphabetic cipher has an index of coincidence near $0.066$ while a polyalphabetic cipher has an index of coincidence near $0.038$. We also know that if a Vigenère enciphered text is split into $n$ groups, where $n$ is the length of the keyword, that every letter in each group was enciphered with the same key letter, resulting in a monoalphabetic distribution of letters within each group.

We can use Python to guess key lengths, split the text into groups based on the guesses, and check to see if the groups of letters are monoalphabetic or polyalphabetic. If the groups appear to be monoalphabetic then we'll know we've guessed the correct key length.

Suppose you have the following ciphertext, stored to the string ciphertext:

ZQQTK PQUWD PGMWD BQTXY LFQWL SHAJB UCIPV KUQEJ RBAAC LRSIZ ZCRWT LDFMT PGYXF ISOSE ASZXN PHTAY HHIIR ADDIJ LBFOE VKUWW VFFLV TCEXG HFFXF ZVGXF BFQEI ZOSEZ UGFGF UJUGK PCZWZ UQQJI VAFLV CSDCX YOPYR SQTEI HQFII VTAYI LRGGR AWARN LAGWK JCZXZ UIMPC FTAVX LHMRU LAMRT PDMXV VIDWV SJQWW YCYOE VKXIU NSBVV CWAYJ SMMGH BWDIU DSYYJ AGQXR ZWPIF SRZSK PCZWR URQQS YOOIW YSELF USEEE KOEAV SSMVE DSYYJ APQHR PZKYE SSMVE PBSWF TSFLZ UUILZ JVUXY HGOSJ AIERF ZAMPC SONSL YOZHR ULUIK FHAET XIUVV HBPXY PGPMW MWOYC AMMXK HQTIJ PHEIC MAAVV JZAWV SMFSR UOSIZ UKTMT ODDSX YSEWY HGSEZ USPEJ AFARX HGOIE KSZGP VJQVG YSVYU PQQEE KWZAY PQTTV YGARJ HBPXY PBSWR YSPEP IMPEP MWZHZ UUFLV PFDIR SZQZV SWZPZ LIAJK OSUVT VBHIE AWARR SJMPL LHTIJ HAQTI PBOMG SSEAY PQTLR CSEAV WHMAR FHDEU PHUSE HZMFL ZSEEE KKTMT OODID HYURX YOBMU OOHST HAARX AVQVV CSZYV ZCRWZ USOYI PGFWR UREXI PDBME NHTIK OWZXR DRDCM LWXJI VAMXK YOOXZ CSEYG LFEXZ AWARJ HFQAF YYURX HGMGK PJQPP PBXMK LFMXL YSMWZ UGAGZ LHKXY LQDIU BZUXP VTARV DFUXV YCDXY LDMVK POXMK FCREE VHTII MWZHJ HGBSN LFRYC HHAYT OGFSE LOZHR ZKTSC LGAQV HQTEJ AWEID LBFME AVQLV HZFLP ZQQTK PQUWD VTMXV TDQVR ASOPR ZGAJR UHMKF UWEXJ HGFLV KFQED ZCRGF UGQVM HHUWD VFFLV PABSJ AIDIJ VTBPL YOXMJ AGURV JIDIJ PBFLV JVGVT OVUWK VFKEE KHDEU PHUSE DVQXY LFAJR UQUIE ACDGF TDMVR AWHIC FFQGV UHFMD LGMVV ZINNV JHQHK VJQVP KWRJV YSZXY HBPPZ UURVF THTEK DVUGY AVQME KIXKV UQQSI JFQHL SWFCF MTAVD LFMKV ZQAYC KOXPF DAQVV ZHMXV TSZXJ HFQNV HZAYJ SMIEK JVQHR URFLV TCFMM LGAJK OSIVZ ASDJF YAMWZ TDAVK HBFEE PBSVV KWQRK PBFLV HBMPP ZWESW OWELZ ZHAVP HGFLV MOOXJ OSDIT VFPWG YCNES PZUXP PGMTF DSDJL SOZHK YCGFC LGAQV ASEXR URUXZ ZPKXY PGFVF BPXIJ VAQWK HBPEI KHTEK HZMVX LDAVK PCZSW OWEXF YWOEC LJUHV UQQMJ ZWRXV KQARJ PGFIE JMUWE VZQWJ WSDXZ UOOMF BGMRU LLMGK PBSME PHEHV TOZHJ PBNVZ LTFSN YWFIR OWEXF YMIID BGFOE VKYSI LHTEE TSDIW HQFWY BAMRE HHGVV CWQAV KIZHV YOZME KIOXZ VBAJV EHQRU LRQBG LFUIE JSUWK OSNIJ AVQPG ACFLV JFUXZ JWEQF MVGQR UVUWK VFKLZ ZHAVZ JOXGY HFMGK LFEGR UCZPP ISQWK PAMXV KPKXY LGFEE KODHN OWOLY BAMRV EDQVZ LBOIN OSFLV YOOXL HZAVK YOPMK PCZEI FVMWW BFZMJ OSPXF MCDQT VFDIT AJUIN ZCRME KWHMU BOXWN LAGWK YSSEI KHTID HGRSI TWZKG HFFWF MOSVV HHILF SSIID BGFQV HGGVV AVQQS FHTIZ YFQPR AWARK VHTID HGESW ISURX ZPKAY VAFLV FODIJ BFDSL URQHR URURT VBFID WZMXZ UUFLV PBOMU LBFWZ UHTIZ YZUZV ZCDGF URUXZ VBILZ JVFVR KWFMF UVMWY HBPIU KCIRK VIEAV TIEXI HHTII JCZWZ KSDXY LUQRV YOXFV HFURX VTFLV DVAPV UODVR AWHIK OOZXY LFQWG LQFMM LDDSS HPUPZ AMAJZ AGPIK HWXW

Trying the key lengths between $1$ and $8$, and averaging the index of coincidence values for all groups of letters generated.

Key Length Guess: 1	 Average Index of Coincidence: 0.0414
Key Length Guess: 2	 Average Index of Coincidence: 0.0411
Key Length Guess: 3	 Average Index of Coincidence: 0.0412
Key Length Guess: 4	 Average Index of Coincidence: 0.0409
Key Length Guess: 5	 Average Index of Coincidence: 0.0658
Key Length Guess: 6	 Average Index of Coincidence: 0.0410
Key Length Guess: 7	 Average Index of Coincidence: 0.0417
Key Length Guess: 8	 Average Index of Coincidence: 0.0406

it can be seen that key length of $5$ is the only guess so far that results in groups of letters that are monoalphabetic. As a result, this is our best best for the correct key length.

Multiples of the Key Length

It's worth pointing out that multiples of the key length will also be likely to have index of coincidence values that are close to $0.066$ as well. To see why that would be the case, take a key length guess of $10$, even though the actual key is $5$. The letters in groups 1 and 6 would have all been enciphered by the first letter in the keyword, since the keyword repeats after its first use. Groups 2 and 7 would have had their letters enciphered with the same second letter in the keyword, and so on. As long as there are a sufficient amount of letters in each group to accurately calculate the index of coincidence, every multiple of $5$ will calculate an index of coincidence close to $0.066$

Key Length Guess: 5	 Average Index of Coincidence: 0.0658
Key Length Guess: 10	 Average Index of Coincidence: 0.0648
Key Length Guess: 15	 Average Index of Coincidence: 0.0647
Key Length Guess: 20	 Average Index of Coincidence: 0.0647
Key Length Guess: 25	 Average Index of Coincidence: 0.0652
Key Length Guess: 30	 Average Index of Coincidence: 0.0639
Key Length Guess: 35	 Average Index of Coincidence: 0.0674
Key Length Guess: 40	 Average Index of Coincidence: 0.0645
Key Length Guess: 45	 Average Index of Coincidence: 0.0643
Key Length Guess: 50	 Average Index of Coincidence: 0.0632
Key Length Guess: 55	 Average Index of Coincidence: 0.0647
Key Length Guess: 60	 Average Index of Coincidence: 0.0645
Key Length Guess: 65	 Average Index of Coincidence: 0.0685
Key Length Guess: 70	 Average Index of Coincidence: 0.0681
Key Length Guess: 75	 Average Index of Coincidence: 0.0645
Key Length Guess: 80	 Average Index of Coincidence: 0.0634
Key Length Guess: 85	 Average Index of Coincidence: 0.0667
Key Length Guess: 90	 Average Index of Coincidence: 0.0649
Key Length Guess: 95	 Average Index of Coincidence: 0.0675
Key Length Guess: 100	 Average Index of Coincidence: 0.0621

This ciphertext has approximately $2,000$ characters, and even splitting it into $100$ groups of $20$ characters is enough to see a monoalphabetic distribution of characters.

Using Chi-Squared to Determine the Keyword

Once you've determined the most likely length of the keyword, you can determine which key value (0-26) will result in the minimum chi-squared score for each group. This key value is equivalent to the key letter in the keyword.

Group: 1	 Min Score: 36.90910	 Key value: 7	 Key Letter: H
Group: 2	 Min Score: 41.31034	 Key value: 14	 Key Letter: O
Group: 3	 Min Score: 19.01105	 Key value: 12	 Key Letter: M
Group: 4	 Min Score: 36.14494	 Key value: 4	 Key Letter: E
Group: 5	 Min Score: 19.28371	 Key value: 17	 Key Letter: R
Keyword: HOMER

Now that the keyword is known, you can attempt to decipher the message to determine if the analysis was correct.

print( vigenereDecipher( ciphertext, 'HOMER' ) )
scepticismisasmuchtheresultofknowledgeasknowledgeisofscepticismtobecontentwithwhatweatpresentknowisforthemostparttoshutourearsagainstconvictionsincefromtheverygradualcharacterofoureducationwemustcontinuallyforgetandemancipateourselvesfromknowledgepreviouslyacquiredwemustsetasideoldnotionsandembracefreshonesandaswelearnwemustbedailyunlearningsomethingwhichithascostusnosmalllabourandanxietytoacquireandthisdifficultyattachesitselfmorecloselytoanageinwhichprogresshasgainedastrongascendencyoverprejudiceandinwhichpersonsandthingsaredaybydayfindingtheirreallevelinlieuoftheirconventionalvaluethesameprincipleswhichhavesweptawaytraditionalabusesandwhicharemakingrapidhavocamongtherevenuesofsinecuristsandstrippingthethintawdryveilfromattractivesuperstitionsareworkingasactivelyinliteratureasinsocietythecredulityofonewriterorthepartialityofanotherfindsaspowerfulatouchstoneandaswholesomeachastisementinthehealthyscepticismofatemperateclassofantagonistsasthedreamsofconservatismortheimposturesofpluralistsinecuresinthechurchhistoryandtraditionwhetherofancientorcomparativelyrecenttimesaresubjectedtoverydifferenthandlingfromthatwhichtheindulgenceorcredulityofformeragescouldallowmerestatementsarejealouslywatchedandthemotivesofthewriterformasimportantaningredientintheanalysisofhishistoryasthefactsherecordsprobabilityisapowerfulandtroublesometestanditisbythistroublesomestandardthatalargeportionofhistoricalevidenceissiftedconsistencyisnolesspertinaciousandexactinginitsdemandsinbrieftowriteahistorywemustknowmorethanmerefactshumannatureviewedunderaninductionofextendedexperienceisthebesthelptothecriticismofhumanhistoryhistoricalcharacterscanonlybeestimatedbythestandardwhichhumanexperiencewhetheractualortraditionaryhasfurnishedtoformcorrectviewsofindividualswemustregardthemasformingpartsofagreatwholewemustmeasurethembytheirrelationtothemassofbeingsbywhomtheyaresurroundedandincontemplatingtheincidentsintheirlivesorconditionwhichtraditionhashandeddowntouswemustratherconsiderthegeneralbearingofthewholenarrativethantherespectiveprobabilityofitsdetails

And it can be seen that this is a passage from Homer's The Illiad.

Exercises for the Reader

Can you write Python functions to:

  • Split ciphertext into a larger and larger number groups until the average index of coincidence is greater than $0.060$?
  • Use chi-squared scoring and a determined keyword length to determine the keyword letters?