Maps, Hash Tables, Skip Lists and Sets
Maps and Dictionaries
Where unique keys are mapped to associated vaLues, and the relationship they express are commonly known as associative arrays or maps. Maps or dictionaries are implemented so that a search for a key, and its associated value, can be performed very efficiently.
In Python the dict
class provides this functionality.
However Python also provides two abstract base class in the collections
module that are relevant: Mapping
and MutableMapping
. Mapping
includes all nonmutating methods supported by Python's dict
class, and MutableMapping
class extends that to include the mutating methods.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 |
|
Hash Tables
Python's implementation of the dict
class is a hash table - it uses a hash function to map general keys to corresponding indices in a table. Ideally the keys will be well distributed in the range from 0 to N-1 by a hash function, but in practice there may be two or more distinct keys that get mapped to the same index.
Thus we will conceptualize the hash table as a bucket array in which each bucket may manage a collection of items that are sent to a specific index by the hash function.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 |
|
Hash Functions
A hash function maps a key k
to an integer in the range [0,N-1]
, where N
is the capacity of the bucket array for a hash table. Such a hash function is used to index into a bucket array.
If there are two or more keys with the same hash value, then two different items will be mapped to the same bucket in A. This is a collision - and there are strategies to deal with them. But it is best to avoid or minimize them.
The hash function also needs to be fast and easy to compute.
The hash function is also thought of being split into two components: 1) hash code that maps a key k
to an integer and 2) compression function tat maps the hash code to an integer within a range of indices.
This separation allows hash coding to happen independently of the specific hash table size - so the hash table may be dynamically resized easily.
Hash Codes
The resulting integers may even be negative.
Polynomial Hash Code
-
Useful for string typed keys. A string is a sequence of characters encoded using the ASCII integer code. Example: "H e l l o" is
72 101 108 108 111
.So given a string \(s=x_{k-1}x_{k-2}...x_{0}\), the hash code is \(h(s)=x_{k-1}a^{k-1}+x_{k-1}a^{k-2}+...+x_0a^{0}\).
The choice of the value of
a
is important for the spread of the hash value. Emperically "good" values are33, 37, 39, 41
. Cyclic Shift Hash Code
- Same as the polynomial hash coed but using the
shift operation
to perform multiplication.
In Python the built-in function for hashing is hash(x)
that returns an integer value that serves as the hash code for object x
. However only immutable data types are deemed hashable in Python. This is to ensure that a particular object's hash code remains constant during the object's lifespan.
Compression Functions
A good compression function minimizes the number of collisions for a given set of distinct hash codes.
Division Method
-
Maps an integer
i
toi mod N
Here
N
is the size of the bucket array - a fixed positive integer.If
N
is a prime number then this compression function helps "spread out" the distribution of hashed values. Actually ifN
is not prime, then there is greater risk that patterns in the distribution of hash codes will be repeated in the distribution of hash values - causing collisions. The MAD Method
-
Helps eliminate repeated patterns in a set of integer keys - Multiply-Add-and-Divide - where an integer
i
is mapped to:[(ai+b) mod p] mod N
Where
N
is the size of the bucket array,p
i the prime number larger thanN
, anda
andb
are integers chosen at random from interval[0, p-1]
witha>0
.
Collision Handling
Separate Chaining
Have each bucket A[j]
store its own secondary container, holding items k,v
such that h(k)=j
.
The secondary container can be a small map instance implemented using a list. Worst case - operations on an individual bucket take time proportional to the size of the bucket.
Assuming we use a good hash function to index the n
items of our map in a bucket array of capacity N
, the expected size of a bucket i n/N
. Therefore the core map operations run in O[n/N]
time - the ration lambda = n/N
is called the load factor. This usually below 1 - and so the core operations on the has table run in O(1)
expected time.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 |
|
Open Addressing
TODO: Add more notes here to describe this clearly
Once a collision takes place, open addressing (also known as closed hashing) computes new positions using a probe sequence and the next record is stored in that position. There are some well-known probe sequences:
- Linear Probing: The interval between the probes is fixed to 1. This means that the very next available position in the table would be tried.
- Quadratic Probing: The interval between the probes increases quadratically. This means that the next available position that would be tried would increase quadratically.
- Double Hashing: The interval between probes is fixed for each record but the hash is computed again by double hashing.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 |
|
Sorted Maps
A map does not provide any way to get a list of all keys ordered by some comparison - in fact the hash-based implementation relies on the intentionally scattered keys to make them more uniformly dstirbuted in a hash table.
Data structures that implement a sorted map
are also called Sorted Search Tables
.
TODO: Add notes here - leaving blank for now.
Applications of Sorted Maps
TODO: Add notes here - leaving blank for now.
Skip Lists
TODO: Add notes here - leaving blank for now.
Sets, Multisets, and Multimaps
Sets
Simply a map in which the keys do not have associated values. In python base classes MutableSet
is available from the collections
module.
Multisets
The same element may occur several times in a multiset. One way to implement this is to have a count of the number of occurrences of an element within a multiset.
In Python the Counter
class in collections
offers the capability.
Multimaps
Similar to a traditional map, in that it associates values with keys, however in a multimap the same key can be mapped to multiple values.
Python does not have a native implementation of this.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 |
|
References
-
Skiena, Steven S.. (2008). The Algorithm Design Manual, 2nd ed. (2). : Springer Publishing Company. ↩
-
Goodrich, M. T., Tamassia, R., & Goldwasser, M. H. (2013). Data Structures and Algorithms in Python (1st ed.). Wiley Publishing. ↩
-
Cormen, T. H., Leiserson, C. E., & Rivest, R. L. (1990). Introduction to algorithms. Cambridge, Mass. : New York, MIT Press. ↩