Problem #229
Tags: data-compression, popular-algorithm, c-1, c-0
Among the techniques used in data compression there are interesting algorithms which do not compress data themselves, but transform it so that it becomes more compressible.
One such algorithm was proposed in 1994 (well after popular compression algorithms like Lempel-Ziv or Huffman) and named, as usual, after its inventors - Michael Burrows and David Wheeler.
This algorithm rearranges the bytes of the input message in such a way that the result has many chains of repeated bytes. At the same time the transformation can be undone (without keeping any additional tables etc.).
Let's look at an example (skipping the description of the transformation itself for now):
original: tiukka.paikka.peikko.paukku.putki.poikki
transformed: aoaiukkpppkaoetkkktkuiiiukkkp.....|ukiapi
Here we use dots instead of spaces for better visibility. Also, before the transformation it is necessary to add some end-of-message character - we use "|" (pipe) for convenience. The original phrase is a "tongue-twister" in Finnish, I believe.
Thus the rearranged message has the same "byte content" as the original - so it is of no help to compression algorithms like Shannon-Fano or Huffman, which profit from unequal character frequencies (i.e. the character entropy obviously remains the same).
At the same time, algorithms which utilize repeating subsequences in the text (e.g. LZ-based or run-length encoding) may now work better - even at first glance we see groups of similar letters like kkk and iii.
The transformation itself is very simple to describe, though it could be tough to implement efficiently: write out all cyclic rotations of the message (for mamma| it would be amma|m, mma|ma and so on, including the original), sort these rotations alphabetically (amma|m becomes first, a|mamm next etc.), and collect the last character of every rotation in the sorted order. Let's complete the mamma example:
    variations:    after sort:
    mamma|         amma|m
    amma|m         a|mamm
    mma|ma         mamma|
    ma|mam         ma|mam
    a|mamm         mma|ma
    |mamma         |mamma
Taking the last character of each rotation in the "after sort" column, top to bottom, gives mm|maa - that is the result we want.
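As a quick illustration, here is a minimal sketch of this naive construction (assuming plain ASCII ordering, in which "|" sorts after the lowercase letters - exactly as in the table above):

    def bwt(message, eom='|'):
        s = message + eom
        # all cyclic rotations of the message, sorted alphabetically
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        # the last character of each rotation, in sorted order
        return ''.join(r[-1] for r in rotations)

    print(bwt('mamma'))  # -> mm|maa

Of course, building all rotations explicitly takes quadratic memory, which is exactly the "tough to implement efficiently" part.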
The inverse transformation is also easily described, though perhaps even tougher to implement efficiently. We skip its explanation here (feel free to check the Wikipedia article for details).
While the overall character entropy (as we defined it before) remains the same for the whole message (not counting the end-of-message symbol), it turns out that parts of the message now have smaller entropy.
For example, if we apply our entropy formula to the tongue-twister given above, we get 2.91 for the original message and 2.99 after BWT (a slight increase because of the added end-of-message symbol). But split the original message in two (at position 20 - about the middle - for example). The two halves of the original have entropies of 2.87 and 2.79, while splitting the transformation result at the same point gives us two substrings with entropies of 2.66 and 2.58.
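Assuming "our entropy formula" is the usual Shannon entropy H = -sum(p_i * log2(p_i)) over character frequencies, these figures can be checked with a few lines (the last digit may differ depending on rounding):

    from collections import Counter
    from math import log2

    def entropy(s):
        # Shannon entropy in bits per character
        n = len(s)
        return -sum(c / n * log2(c / n) for c in Counter(s).values())

    text = 'tiukka.paikka.peikko.paukku.putki.poikki'
    print(entropy(text))                           # whole original message
    print(entropy(text[:20]), entropy(text[20:]))  # the two halves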
In practice this means that even algorithms like Huffman can profit from BWT, if they encode the two halves (sufficiently long ones) separately, or if an "adaptive" version of the algorithm is used (one which adapts to a distribution changing over time).
We'll use a fragment of a book as input. It is converted to lowercase for simplicity, with all non-letter characters removed and spaces replaced with dots for visibility. For convenience the input is split into several lines, but please concatenate them together before processing. All line feeds are inserted immediately after dots, so it is easy to detect the last line if needed - it is the only one with no dot at the end.
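For instance, the lines can be glued back together like this (a minimal sketch, assuming the input arrives on standard input):

    import sys

    # strip the line feeds and join the fragment into one string
    text = ''.join(line.strip() for line in sys.stdin)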
Run BWT over the input, as described above.
Then calculate number of repeating characters in sub-sequences made of the same character. In other words,
count every character which is the same as preceding one. E.g. fragment abbbbc
contains 3 repeating characters
(the first b
is not considered "repeating").
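In Python this comparison of each character with its predecessor fits in a couple of lines:

    def count_repeats(s):
        return sum(1 for a, b in zip(s, s[1:]) if a == b)

    print(count_repeats('abbbbc'))  # -> 3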
As the answer, please print (separated with spaces):
- the length of the transformed string;
- the number of repeating characters, as defined above;
- the number of groups (runs) of repeated characters.
The first value, of course, is the length of the input plus 1, so it is just for cross-check.
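Putting the pieces together (a sketch only - the naive bwt() above will be slow on a long input):

    import sys
    from itertools import groupby

    def bwt(message, eom='|'):
        s = message + eom
        return ''.join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

    text = ''.join(line.strip() for line in sys.stdin)
    result = bwt(text)

    runs = [len(list(g)) for _, g in groupby(result)]  # lengths of maximal runs
    repeats = sum(n - 1 for n in runs)                 # chars equal to their predecessor
    groups = sum(1 for n in runs if n > 1)             # runs of length 2 or more
    print(len(result), repeats, groups)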
Example
Input:
tiukka.paikka.peikko.paukku.putki.poikki
Answer:
41 13 6
Explanation: in the result there are six repetitive groups - kk, ppp, kkk, iii, kkk and ..... (five dots).