This problem is to be solved in Scheme
(tinyscheme interpreter) -
so please check the first problem of this genre (if you haven't yet)
for more details.
Suppose the stream of words come to us - e.g. as if they are read from some file or web resource - and we want to count occurrences of every word. This is used, for example, to figure out "most important" words of the document (or collection of texts).
We usually want to use Hashtable data structure for this. The
Matching Words problem was similar (though simpler). So it is dict
in Python and HashMap
in Java, and std::unordered_map
in C++; in PHP
and JavaScript
it is built-in
into normal "arrays".
The main idea is that Hashtable allows quick (almost immediate) look-up feature - so programmer doesn't need to browse all the collected words to find one of them (which is time-consuming).
Now, some languages historically had not this advanced data structure from beginning. Among them Pascal
,
plain old C
and Scheme
.
This gives us a wonderful opportunity: to try implementing Hashtable ourselves!
To simplify, we shall work with large integers rather than words. It doesn't matter, since hashtable could store anything - one only need to create suitable hash-function. The function should have properties:
E.g. for string it could be the code of the first (or last?) character. For float we can take, for example, sum of two digits around decimal point. For integer the hash could be this value itself.
Now consider the function was-seen?
which "remembers" argument on every call (in a variable *seen-list*
)
and tells if such value have been already passed before:
(define *seen-list* '())
(define (was-seen? x)
(if (memv x *seen-list*)
#t
(begin
(set! *seen-list* (cons x *seen-list*))
#f)))
Try to run the definition in emulator and then call it few times, like this:
(was-seen? 13) ; says #f
(was-seen? 17) ; says #f
(was-seen? 119) ; says #f
(was-seen? 13) ; says #t
To reset the remembered list one may run (set! *seen-list* '())
of course.
Let's improve this function in two ways! Firstly, make it return not true / false
but how many times
the given value was seen.
Besides this, improve its performance by creating hashtable. One of possible implementations (call it "array of lists") is below:
K
lists instead of just single list, (call them "buckets" in hashtable terminology)vector
of size K
hash
modulo K
to decide into which
bucket it belongsRead more in Wikipedia, if this sounds too vague (Hash table). Note there is also slightly different approach, which doesn't need lists (rather puts values directly into outer vector) - called Open Addressing - provided that its size is large enough.
For this problem it is not important how exactly you implement hashtable - but it should work fast!
You also can solve this problem building some kind of Tree instead, but supposedly it won't be much easier. Hash-table itself is a kind of tree too, with branches (lists) growing out of common trunk (main vector). I believe here is even somewhat cheating approach, but hopefully no one will be interested to try.
Your code should have improved version of was-seen?
function as requested above. It is going to be tested
with about 10000
calls and it shouldn't work too long.
Specifically, you may want to check your function in the following ways:
; manually
(was-seen? 13) ; says 0
(was-seen? 17) ; says 0
(was-seen? 13) ; says 1
(was-seen? 119) ; says 0
(was-seen? 13) ; says 2
; automatically
(let loop ((n 0))
(if (> (was-seen? (modulo (* n 17) 10000)) 0)
(begin
(for-each display (list "Repeats after " n " steps"))
(newline))
(loop (+ 1 n))))
(display (eval-count)) (newline)
Note the last line: it returns the current value of internal operation counter. Interpreter on server-side
will break execution when this counter reaches 3 mln
. So the function is guaranteed not to pass with
simple single list.