Problem #342
Tags:
popular-algorithm
data-compression
c-1
In the Huffman Coding task we have met a data compression algorithm which allows to encode more frequent charcters (or bytes) with fewer amount of bits, while rare characters on contrary may be encoded with more than usual amount of bits. Such algorithm belongs to group of "Entropy Encoding" algorithms.
However both Huffman and Shannon-Fano approaches can only encode characters with whole amount of bits -
there are no ways to encode something with, say, 1.3
bit or 0.7
bit per character. You may wonder - how at
all such things are possible? Fractional bits! What a nonsence!
That's the feature of Arithmetic Coding - algorithm which also uses character frequencies, but in different manner. We are going to study it now. Your goal would be to decode the text compressed (slightly) with it.
Suppose we want to encode short string of text, for examble BAD DAD ADDED
. To simplify the matter, assume every
character (byte) may have not 256
values but only 10
: A
, B
, C
... I
and <space>
for example.
In this case it is easy to represent the string as decimal fraction, replacing A
with 0
, B
with 1
and so on. This results in a fraction like this:
.1039303903343
Let's think how this fraction is constructed. Until we haven't started encoding, and haven't taken any letters
yet, we know the result will be between 0.0000000...
and 0.9999999...
(the latter value really equals to 1
).
So initially the range of the possible result is 0 ... 1
. But after we take the first character and discover it is
B
(corresponding to digit 1
), we understand the range is narrowed, it is between .100000...
and .199999...
(actually 0.2
).
Next charcters A
(corresponding to 0
) further narrows the range to between .100000...
and .1099999...
.
Look at the picture below to visualize this process:
Here the black range corresponds to empty string (in beginning), red to string B
, green to BA
and blue to
BAD
. The range has meaning of possible space into which result can fit if the string starts with the given
prefix.
Any fraction within this interval represents the same prefix. That's how we encode text with fraction.
Now, how to implement compression?
We assigned equal "sub-range" to every possible letter (of 0.1
size). But what if instead we choose some more frequent letter
to use wider sub-range (and others narrower). For example, if the letter A
is more frequent, let's give
the half of the current range to it. That means the single letter A
is represented by any value between
0.0
and 0.5
. Prefix AA
is encoded by any value between 0.0
and 0.25
. At last the sequence
AAA
selects the range between 0.0
and 0.125
.
That is interesting. If we take result of encoding, say, 0.1
, we clearly understand it encodes AAA
despite
it took just a single decimal digit! Of course encoding of other, more rare letters will took more digits than
previously.
The main idea is, however, to dedicate to each character such a sub-range which is proportional to its frequency. Let's clarify this in simplified code.
Suppose we precalculated letter frequencies and give them in two first lines as constants.
frqs = [5, 7, 2, 1, 1, 1, 2, 3, 3, 1] # letter counts for A, B, C, D... I, <space>
total = sum(frqs) # total amount of letters
low = 0 # lower end of range
high = 1 # higher end of range
for letter in string_to_encode:
code = char_to_code(letter) # convert A to 0, B to 1, C to 2 etc
a = 0
for i in range from 0 to code (non-inclusive)
a = a + frqs[i] # a = all counts before this letter in frqs array
b = a + frqs[code] # b = all counts before and including this letter
end for
# a and b will help us to calculate new range:
current_range = high - low
high = low + current_range * (b / total)
low = low + current_range * (a / total)
# that's it! now the range become only a fraction of previous range, according to [a..b] interval
end for
# string encoding is completed, let's take mean of low and high as output
result = (low + high) / 2
print(result) # prints encoded value as fraction
You can try this code with some amends and you'll see it works for short strings (say 5-7 characters).
However there is obvious problem: fractional (real) values in computers are of limited precision. Very soon
low
and high
become very close and at last will stop changing.
We overcome the said issue by "shifting" the front parts of values low
and high
to the output, when they
become equal. This means we send the digit (not necessarily decimal) to the output, then multiply low
and
high
so that they are always sufficiently apart.
Moreover we can use integer values instead of reals here. Let's use integers in base base
composed of digits
digits. In real life base
could be 2
or 256
, but for our problem let it be 27
, allowing to use separate
values for letters A ... Z
and the last one for <space> = 26
. Supposing every value starts with implicit
decimal dot we assign:
low = [0, 0, 0, 0] # e.g. 0 * 27^3 + 0 * 27^2 + 0 * 27^1 + 0
high = [26, 26, 26, 26] # e.g. 26 * 27^3 + 26 * 27^2 + 26 * 27^1 + 26 = 27^4 - 1
More correctly, value high = 1.0
is formed by endless chain of 0.(26)(26)(26)(26)(26)...
but we shall
keep it in mind and amend when "shifting" the value, as we'll see below.
Now range update fragment will look very similarly:
# a and b calculated as before
range = high - low + 1
high = low + range * b / total
low = low + range * a / total
Integer division a/total
and b/total
may truncate result slightly but it doesn't spoil calculations.
Now about "shifting" the high
and low
. After every newly-encoded character it may happen that highest
digits of low
and high
become equal. This mean that digit could be added to result. At the same time
new digit should be added to both values "from the right". Zero to low
and 26
to high
:
if first_digit(low) == first_digit(high) then
send first_digit(low) to output
remove first_digit from both values
multiply both values by 27
append 0 as last digit to low
append 26 as last digit to high
repeat if necessary
Think of this as we operate on endless numbers, but at any given time regard only short fragment of them, sliding more and more to the right as the prefix on the left becomes fixed.
Below we have complete encoder source in Python
implementing this idea.
Usually with algorithms like Huffman we need to provide encoded text along with frequency table which
decoder can use. In this case however we'll start with table consisting of all ones
and will update it
after every next letter, adding 1
to corresponding cell. This means frequencies will be changing over time
but it is not a problem, we'll be using updated frequencies for every next character.
Decoder in turn can also start building frequency table in the same manner, adding 1
after every next
decoded letter - so it will "keep track" of frequencies exactly matching the table used by encoder.
Obvious drawback is, however, that in the beginning of encoding, while the table has not yet settled on "proper" frequencies, little or no actual compression happens.
For better visibility let's use dot (.
) instead of space. The code starts with defining values as described
above.
base = 27
digits = 6
maxv = base - 1 # maximum value of digit in our 27-base system
frqs = [1] * base # counts of letters, actually
total = base # total count of letters in frqs
low = 0
high = 1 # we'll build up value for `high` iteratively
for i in range(digits):
high *= base
high -= 1
tail = (high+1) // base # auxiliary value used in "shifting" process, 27^(digits - 1)
output = ''
def encode(text):
global output
for c in text:
encode_char(c)
encode_char('.') # after all chars are encoded, let's add double space as a sign of end
encode_char('.')
# now we need to take care enough remaining characters are shifted to output
cur_out_len = len(output)
while len(output) < cur_out_len + digits:
encode_char(code_to_char(base // 2)) # let's encode some middle letter until enough chars are shifted
return output
def encode_char(c):
global low, high, frqs, total
c = char_to_code(c)
a, b = calc_probabilities(c)
rangee = high - low + 1
if high - low + 1 < total: # see notes below
raise Exception('range exhausted')
high = low + rangee * b // total
low = low + rangee * a // total
perhaps_output_digits() # here we shift digits and update low and high
frqs[c] += 1 # adaptiveness implemented
total += 1
def char_to_code(c):
if c >= 'A' and c <= 'Z':
return ord(c) - ord('A')
return maxv
def code_to_char(c):
if c >= 0 and c < maxv:
return chr(ord('A') + c)
return '.'
def calc_probabilities(c):
a = sum(frqs[0:c])
return a, a + frqs[c]
def perhaps_output_digits():
global low, high, output
while first_digit(high) == first_digit(low):
low, d = remove_first_digit(low)
low *= base
high, d = remove_first_digit(high)
high = high * base + maxv
output += code_to_char(d)
def first_digit(v):
return v // tail
def remove_first_digit(v): # returns truncated value and digit which was truncated
d = v // tail
v %= tail
return v, d
# main part at last
text = input().upper()
encoded = encode(text)
print(encoded)
Small thing to note is this simplified (yet) implementation allows very rare case when range
becomes too
narrow but no digits could be shifted out yet. In real life we'll need to take care of such situation. However
for the problem you may safely assume input data will never include such case.
Please write decoder which will can decode strings encoded with the encoder shown above. Really it could be much smaller as you can reuse some functions.
Input: the first line contains number of testcases (usually 3
).
Next lines contain one testcase each.
Answer: should give decoded text for every testcase, separated with spaces.
Example
input:
3
SUIV.TEGTPRXAHUABTUC.QDAAXSQKQLLASVYIYDFQYSVXGDNMXRDA
ATHBOBVBPDFHJWFOT.ICNOKUHBAVKGGXPFXLNO.FDBKVMBQNNZNRBEETJ
TGYTPYYDKBHGGOAWRA..LGYHEGOGEDA..QCV.ZFMWBGTGLHYDVRXGXPQZEIO.XQSYMCIDIXMPHJLWVNVGRHLWNOJUEQUZJMCYGUWAGYFJAIESS
answer (visually broken into lines):
SUCH.MONSTROUS.THREATS.TO.THE.PEACE.AND.SANITY.OF
AS.HADN.T.ORTER.BE.ABROAD.AN.I.FER.ONE.THINK.THAT.BLACK
THE.PEAK.ALL.MOVING.THEIR.ARMS.FURIOUSLY.IN.STRANGE.GESTURES.AS.THEIR.INCANTATION.DREW.NEAR.ITS.CULMINATION.FROM
Example 2 (also to show there is actual reduction in text length on longer samples)
input:
1
COZUKSHZMPADYYAG.QKLHBGGYCVDTISKNMDMLBQREDM.YRWJZUYAJZGMBJHZYPDM.BCKSYYZRPTBYVZKXLGIE.PRNPIDBTTFEEJDHHVL.
YWXCIACOSCYRBXCSIEFRXNAEOXPYTNTJZMWGG.QRPXTQWPNNKSHHYEDDCZTG.UNCCSEZBLZJLEHMYHGFPHTHEEVVIJBUUQNRBM.
TXTHQMFLXWVXRHQHH.GOYGCSQNFGQSFFCNRJCNQIJXBCDCZHBSKQWPGWPSHRNS.FFPXPXQUKNHW.O.EIRBBBISADOJWSPDGTYYRNNXOKMC
PEUPO.TKOOCXRADKZYPINYBCRCYFFJZCUUXPT.XX.UPFJZHQDSDMUTUFILCCUEOWTPIXQOEEWPJZDMMAUTINSZRLNPCCAWZBYRYLNXNCAG
MIQGOLKWHMHFMFUZDS.FCQCZLCDOGPNYOZRVMHJHDYZHWHVMTJUAINHRYVVXAMKMIRHCLFLAQ.OELJPDCFIBVUCSJXTHTV.XSETUQLHOAGNI.
WJTP.WLEQXXVZBFRJJWBTMXSMIH.LLGGEPMBZDKHGXXMAKMVGUJVELXJGDUAHLODIURVWHFNTRYSDWZFBNUDRQJLRXOQHZRTYOUFPKXTYU
DUUDTXKOBRQCNDGMV.ZGBIHKYQEIBZYNTYMDRUYYWBDMWENAXRBWNLFABLCOXRALFJUTGTHCZURHBSCILIJECDLGSZEONKLIMRQ.HJRMCT
HKLUVSDELIPJCYXUQMEJEUUZTBKDQTCRBSXJLTBTHAGMWRWKTCRVUSMTJAYDYFZDXLUHYXOCRDWUEWDJSFUEOTPTPRYASSOOBLMFIRZGYIL
FUOPWEQZR.ZSBXGRUDAMKWISHENICF.ZAGPVSTYC.DPVPDYMLXXQUQTGDYEFUBINXLP.CUKVBIVVSWTTJWCILMUJMC.EFAYPIAOCMOILCUMGSGFFCNIL
answer:
CONSTABLE.WAS.IN.BED.AND.WE.WERE.SHOWN.INTO.A.LITTLE.FRONT.PARLOUR.TO.AWAIT.HIS.COMING.HE.APPEARED.PRESENTLY.
LOOKING.A.LITTLE.IRRITABLE.AT.BEING.DISTURBED.IN.HIS.SLUMBERS.I.MADE.MY.REPORT.AT.THE.OFFICE.HE.SAID.HOLMES.
TOOK.A.HALFSOVEREIGN.FROM.HIS.POCKET.AND.PLAYED.WITH.IT.PENSIVELY.WE.THOUGHT.THAT.WE.SHOULD.LIKE.TO.HEAR.IT.
ALL.FROM.YOUR.OWN.LIPS.HE.SAID.I.SHALL.BE.MOST.HAPPY.TO.TELL.YOU.ANYTHING.I.CAN.THE.CONSTABLE.ANSWERED.WITH.
HIS.EYES.UPON.THE.LITTLE.GOLDEN.DISK.JUST.LET.US.HEAR.IT.ALL.IN.YOUR.OWN.WAY.AS.IT.OCCURRED.RANCE.SAT.DOWN.
ON.THE.HORSEHAIR.SOFA.AND.KNITTED.HIS.BROWS.AS.THOUGH.DETERMINED.NOT.TO.OMIT.ANYTHING.IN.HIS.NARRATIVE.ILL.
TELL.IT.YE.FROM.THE.BEGINNING.HE.SAID.MY.TIME.IS.FROM.TEN.AT.NIGHT.TO.SIX.IN.THE.MORNING.AT.ELEVEN.THERE.
lWAS.A.FIGHT.AT.THE.WHITE.HART.BUT.BAR.THAT.ALL.WAS.QUIET.ENOUGH.ON.THE.BEAT.AT.ONE.OCLOCK.IT.BEGAN.TO.RAIN.
AND.I.MET.HARRY.MURCHER.HIM.WHO.HAS.THE.HOLLAND.GROVE.BEAT.AND.WE.STOOD.TOGETHER.AT.THE.CORNER.OF.HENRIETTA.
STREET.ATALKIN.PRESENTLY.MAYBE.ABOUT.TWO.OR.A.LITTLE.AFTER.I.THOUGHT.I.WOULD.TAKE.A.LOOK.ROUND.AND.SEE.THAT.
ALL.WAS.RIGHT.DOWN.THE.BRIXTON