Merge Sort - CodeAbbey

Seemingly 3-tape or 4-tape sorting implementation

While we have earlier studied three slow ("quadratic") sorting algorithms (Bubble Sort, Insertion Sort and Selection Sort) along with much superior "logarithmic" Quick Sort, there are other curious approaches with some interesting properties.

Merge Sort is something you may rarely hear nowadays, despite it is still widely used in some forms, generally in "big data" fields. Suppose the amount of data to be sorted is so large, that they don't fit computer memory. Like petabytes of log files of requests to server. And we want to sort them by the user id mentioned in each log line.

You can't apply most of well known algorithms (including those mentioned above) here for couple of obvious reasons:

even though programming languages allow reading data from arbitrary place in a file with the help of functions like fseek, generally it is much more slow operation compared to sequental reading of the file
element sizes are all of different sizes (log lines usually could be shorter or longer) - so it is really difficult where to "seek" in the file in case of "random access".

So file really should be read and written sequentially. As a matter of fact this still suits well with Bubble Sort! Really we can read file line by line, on each pass keeping the "higher" element while remaining are written to the output file in the order they are encountered in the input.

Regretfully this will make N passes over the file of N elements (lines), which could be extremely long time (petabyte of log lines averaging to 200 characters contains about 5e12 lines - making one pass per second - incredible for petabyte though - this will take over 150 thousand years).

Merge sort makes only Log2(N) passes - just about 42 in our case! One minute instead of 150 thousand years!

How it works

Regard the images above and below. The Merge Sort was invented when typical file storage was on magnetic tapes - if you still remember tape recorders / players, you know that "rewinding" the tape really took time. Also amount of data on the tapes of course much exceeded typical computer memory - but this sort really needs only few bytes of RAM!

Each pass consists of two steps - splitting the input sequence (say, tape A) into chunks, putting odd and even chunks on two auxiliary sequences (say, tape B and tape C) - and then merging chunks back, each couple separately, onto the first tape.

Merging happens in such a way, that every resulting double-chunk remains sorted if two input chunks were sorted themselves.

So on the first pass we split into chunks of size 1 and merge them in pairs into chunks of size 2. On the next pass we split into the very same chunks of size 2 and merge them into chunks of size 4 and in general, on pass K we split into chunks of size 2^(K - 1) and merge into chunks of size 2^K.

How merging happens? Very easy! We have two sorted chunks - let's repeatedly compare their first elements and pull out the smallest every time. E.g. suppose we have chunks 1 5 10 11 and 2 3 8 12. We compare 1 and 2 at their "heads", take 1. Then compare 5 with 2, this time picking 2. Next on comparing 5 and 3 we again pick from the second chunk and so on. When one of the chunks comes to end, we simply flush the tail of another one.

Some last notes:

sorting stops when output chunk contains the whole sequence (nothing to split further)
modern implementation start with splitting into chunks not of size 1 but such as roughly fit into memory and sort them with any suitable in-memory sort
historical "tape" sorting may include more than 3 tapes to speed up the process

Problem Statement

You are given the sequence of values to be sorted. The goal is to apply the merge-sort algorithm described above and as a "proof" to produce the recorded sequence of operations.

This recorded sequence of operations really describes how merging happens on every pass. If the value is taken from the tape B, add the letter b to the output. If the value is taken from the tape C, add the letter c. At last, when one of the chunks becomes empty, put the length of the other chunk's tail to be "flushed". Separate passes with spaces.

Let's look at some example. Input sequence is 6 2 10 5 4 1 3 9 7 11 8. First two chunks are 6 on the tape B and 2 on the tape C. We compare their heads and find that the second should be taken. We put c into "operations" sequence and then find that second chunk becomes empty. We flush 1 element from the first chunk and add this number to operations sequence.

Thus continued, the first pass ends with c1c1c1b1b11 - note the trailing 1 - due to odd number of elements the last split produced an empty chunk. The sequence after the first pass is 2 6 5 10 1 4 3 9 7 11 8.

Now we got chunks 2 6 and 5 10, this adds bcb1 to output sequence and so on.

Input contains the length of input sequence N in the first line.
Next line contains N elements, all different to prevent ambiguity.

Answer contains the operations sequence.

Example

input:
13
10 9 3 11 5 12 2 6 8 13 7 1 4

answer:
c1b1b1b1b1c11 cbb1cbc1cc21 cbccbbb1bc3 cbbcbbccbbbb1