While we have earlier studied three slow ("quadratic") sorting algorithms (Bubble Sort, Insertion Sort and Selection Sort) along with much superior "logarithmic" Quick Sort, there are other curious approaches with some interesting properties.
Merge Sort is something you may rarely hear nowadays, despite it is still widely used in some forms, generally in "big data" fields. Suppose the amount of data to be sorted is so large, that they don't fit computer memory. Like petabytes of log files of requests to server. And we want to sort them by the user id mentioned in each log line.
You can't apply most of well known algorithms (including those mentioned above) here for couple of obvious reasons:
fseek
, generally it is much more slow operation compared to sequental reading of the fileSo file really should be read and written sequentially. As a matter of fact this still suits well with Bubble Sort! Really we can read file line by line, on each pass keeping the "higher" element while remaining are written to the output file in the order they are encountered in the input.
Regretfully this will make N
passes over the file of N
elements (lines), which could be extremely long time
(petabyte
of log lines averaging to 200
characters contains about 5e12
lines - making one pass per second
- incredible for petabyte though - this will take over 150 thousand years).
Merge sort makes only Log2(N)
passes - just about 42
in our case! One minute instead of 150 thousand years!
Regard the images above and below. The Merge Sort was invented when typical file storage was on magnetic tapes - if you still remember tape recorders / players, you know that "rewinding" the tape really took time. Also amount of data on the tapes of course much exceeded typical computer memory - but this sort really needs only few bytes of RAM!
Each pass consists of two steps - splitting the input sequence (say, tape A
) into chunks, putting odd and
even chunks on two auxiliary sequences (say, tape B
and tape C
) - and then merging chunks back, each couple
separately, onto the first tape.
Merging happens in such a way, that every resulting double-chunk remains sorted if two input chunks were sorted themselves.
So on the first pass we split into chunks of size 1
and merge them in pairs into chunks of size 2
. On the
next pass we split into the very same chunks of size 2
and merge them into chunks of size 4
and in general,
on pass K
we split into chunks of size 2^(K - 1)
and merge into chunks of size 2^K
.
How merging happens? Very easy! We have two sorted chunks - let's repeatedly compare their first elements and pull
out the smallest every time. E.g. suppose we have chunks 1 5 10 11
and 2 3 8 12
. We compare 1
and 2
at
their "heads", take 1
. Then compare 5
with 2
, this time picking 2
. Next on comparing 5
and 3
we
again pick from the second chunk and so on. When one of the chunks comes to end, we simply flush the tail
of another one.
Some last notes:
1
but such as roughly fit into memory
and sort them with any suitable in-memory sortYou are given the sequence of values to be sorted. The goal is to apply the merge-sort algorithm described above and as a "proof" to produce the recorded sequence of operations.
This recorded sequence of operations really describes how merging happens on every pass. If the value
is taken from the tape B
, add the letter b
to the output. If the value is taken from the tape C
, add
the letter c
. At last, when one of the chunks becomes empty, put the length of the other chunk's tail to be
"flushed". Separate passes with spaces.
Let's look at some example. Input sequence is 6 2 10 5 4 1 3 9 7 11 8
. First two chunks are 6
on the tape B
and 2
on the tape C
. We compare their heads and find that the second should be taken. We put c
into
"operations" sequence and then find that second chunk becomes empty. We flush 1
element from the first chunk
and add this number to operations sequence.
Thus continued, the first pass ends with c1c1c1b1b11
- note the trailing 1
- due to odd number of elements
the last split produced an empty chunk. The sequence after the first pass is 2 6 5 10 1 4 3 9 7 11 8
.
Now we got chunks 2 6
and 5 10
, this adds bcb1
to output sequence and so on.
Input contains the length of input sequence N
in the first line.
Next line contains N
elements, all different to prevent ambiguity.
Answer contains the operations sequence.
Example
input:
13
10 9 3 11 5 12 2 6 8 13 7 1 4
answer:
c1b1b1b1b1c11 cbb1cbc1cc21 cbccbbb1bc3 cbbcbbccbbbb1