Contents

Caesar Cipher Breaking with Least Squares

Get notified about new problems by e-mail

Overview

The Caesar Shift Cipher encrypts the text message in such a way that each letter is substituted by another, next in the alphabet.

For example, if the alphabet shift value, or key is 2 then A is substituted with C, E with G etc. So the word CROW will be encrypted as ETQY.

Though simple, this method could be easily extended to create more complicated systems. For example Vigenere cipher is really just a set of several Caesar's ciphers with different keys - so that with set of 4 ciphers the first one is applied, for example to 1-st, 5-th, 9-th, ... letters of the original message, another cipher is applied to 2-nd, 6-th, 10-th, ... letters, third is applied to 3-rd, 7-th... and the last is applied to 4-th, 8-th....

This article discusses breaking of such encryption system. Thought cipher itself is not practical nowadays due to its simplicity, understanding of its automated decryption provides several ideas important in wide range of code-breaking tasks and other problems.

Code-breaking brute-force approach

Suppose we intercept an encrypted message, like the following:

IBBIKS IB BPM LIEV EQBP IUJCAP QV BPM KIVGWV

But of course no one told us the key which should be used for deciphering.

Surely, since only 25 values for key are possible (from 1 to 25), we can try each of them and check whether the result looks like proper phrase.

However, it becomes too complicated if we need to process more messages. For example, if each sentence of the message is encrypted with different key, or with Vigenere cipher described above - the task will require too much efforts.

The other problem is that we are going to check result manually to see whether it is normal text or not. Of course using computers we need some more clever idea.

Letter frequency naive method

If you read The Gold-Bug short story by Edgar Allan Poe, you remember the "single-substitution" cipher was broken starting with the fact that E is the most frequent letter in English. Searching for the most frequent symbol in the code gives the hint about which one was used for E.

However, naive implementation of this approach encounters the significant trouble, especially with short messages - letter frequency distribution may not be idea. For example encrypted message above was created with phrase:

ATTACK AT THE DAWN WITH AMBUSH IN THE CANYON

Where we have six A-s, and six T-s, four H-s and only two E-s. Obviously trying different letters to be E is not better than simply trying all 25 keys one by one.

Letter frequency with Least Squares

Luckily, letter-frequency based approach could be significantly improved for Caesar's Cipher, since we can use not only most frequent letters in our calculations.

How it works?

Suppose we write down an array of average letter frequencies in English text (in percents):

# average letter frequencies in English

[8.1, 1.5, 2.8, 4.3, 13.0, 2.2, 2.0, 6.1, 7.0, 0.15, 0.77, 7.0, 2.4,
        6.8, 7.5, 1.9, 0.095, 6.0, 6.3, 9.1, 2.8, 0.98, 2.4, 0.15, 2.0, 0.074]

Here the first value is frequency of letter A, the fifth one - of the letter E and so on.

Of course in a given message (plain-text, not encrypted) letter distribution will not be exactly the same, but it will anyway be similar. For example let us calculate frequencies in the phrase above (Attack at dawn...):

# letter frequencies in real message

[16.7, 2.8, 5.6, 2.8, 5.6, 0, 0, 11.1, 5.6, 0, 2.8, 0, 2.8,
        8.3, 2.8, 0, 0, 0, 2.8, 16.7, 2.8, 0, 2.8, 0, 2.8, 0]

Wee see, that for some letters difference is significant - about 8% for A and E, however, for most of letters discrepancy is not that large. Let us subtract ideal frequencies from real ones to see this difference:

# difference between real and ideal frequencies:

[8.6, 1.3, 2.8, -1.5, -7.4, -2.2, -2.0, 5.0, -1.4, -0.15, 2.0, -7.0, 0.4, 1.5,
        -4.7, -1.9, -0.095, -6.0, -3.5, 7.6, 0.0, -0.98, 0.4, -0.15, 0.8, -0.074]

Wee see that for many letters difference is small enough (by absolute value):

0 - 1 % for 9 letters;
1 - 2 % for 7 letters;
2 - 5 % for 5 letters.

I.e. for 21 letters of 26 the difference is less or equal to 5%.

Now, what happens if we encode some message with Caesar's Cipher and calculate letter frequencies?

Obviously, the array of frequencies will be alike one mentioned above, but shifted by several positions (according to cipher key). For example, taking encrypted phrase above (Ibbiks ib bpm...) we get the following distribution:

# letters distribution in the encrypted message

[2.8, 16.7, 2.8, 0, 2.8, 0, 2.8, 0, 16.7, 2.8, 5.6, 2.8, 5.6,
        0, 0, 11.1, 5.6, 0, 2.8, 0, 2.8, 8.3, 2.8, 0, 0, 0]

You see, now most frequent letters are B, I, P etc... What useful info can we get out of this distribution?

Let us calculate difference with ideal frequency distribution, like we did it before for plain-text message:

# difference of letter frequencies between encrypted message and ideal distribution

[-5.3, 15.2, 0.0, -4.3, -10.2, -2.2, 0.8, -6.1, 9.7, 2.7, 4.8, -4.2, 3.2,
        -6.8, -7.5, 9.2, 5.5, -6.0, -3.5, -9.1, 0.0, 7.32, 0.4, -0.15, -2.0, -0.074]

Look, there are very big discrepancies (up to 15%). And if we summarize we get the following:

0 - 1 % for 6 letters;
1 - 2 % for 1 letters;
2 - 5 % for 7 letters.

And now in total only 14 letters hit the difference of 5% or lower.

Core idea

Now we can invent the straightforward algorithm:

Calculate array of letter frequencies for encrypted message
Check the difference with ideal letter frequency distribution
Rotate the array by 1 position (remove value from end and add to beginning)
Repeat from 2-nd step for 25 times
Of all rotations of an array choose one for which difference is minimal
The amount of steps by which original array was rotated is, really, the value of the cipher key

The only question is - how to calculate the difference between arrays?

We can use very popular and simple method, similar to root mean square:

calculate differences for each value of array (as we done before);
raise each value to power 2;
get average of all resulting values.

For example, calculating such metric for letter distribution in plain-text message above we get 365, while with the encrypted message above we get much higher value of 982. And if we will rotate the array of frequencies for the encrypted message we'll find that minimal value of 365 is achieved after 8 steps, which means that encryption was performed with shifting each letter by 8 positions.

Raising to power 2 on second step has two goals:

it allows to treat negative and positive differences in the same way;
larger discrepances add extremely big values to resulting difference (i.e. discrepancy of 15 for letter B will add 225 points to resulting difference), so larger errors are "fined severely".

Conclusion

Obviously the discussed algorithm has O(N) time complexity - the time is mosty spent on calculating frequencies, while analysis time does not depend on message size.

If you want to have a practice on this algorithm, see the task Caesar Cipher Cracker there you will find that the method allows to decipher even very short messages.

Of course this approach could be extended to work with Vigenere Cipher and also could be used to check the type of encryption.