20. Hashing
In the previous lecture, we introduced two new ADTs, Sets and Maps. We compared three implementations of these types backed by different data structures: unsorted lists, sorted lists, and (balanced) BSTs. We saw that balanced BSTs provided the best worst-case performance since they enabled adding, querying, and removing elements in \(O(\log N)\) time, where \(N\) represents the number of elements in the collection. Is it possible to do better?
As a thought experiment, suppose we wanted to store a set of ints. To make things simpler, let’s further suppose that all of these ints are between 0 and 9999. Can you think of a way to model this set that allows for \(O(1)\) (truly constant) add(), contains() and remove() methods?
Modeling a set of ints
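One possible approach is sketched below: since every element is an int between 0 and 9999, we can keep a boolean[] of length 10000 and use each element as an index into it. (This is a minimal sketch; the class name SmallIntSet is just illustrative.)

```java
/** A set of ints in the range [0, 9999] with O(1) add, contains, and remove. */
public class SmallIntSet {
    // present[i] is true exactly when i is in the set.
    private boolean[] present = new boolean[10000];

    public void add(int x) { present[x] = true; }

    public boolean contains(int x) { return present[x]; }

    public void remove(int x) { present[x] = false; }
}
```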
This example is a bit contrived, as it fundamentally relies on the fact that there are only 10000 possible elements in this set. More typically, the number of possible elements will be untenably large. If we remove the bounds on our set of ints, we will need a 4GB array to use this strategy. Moreover, this approach is completely infeasible for representing a set of doubles, longs, or Strings. Nonetheless, its core idea helps inform the design of a more practical data structure: use the value of an object to compute the index where it would be stored in a (somewhat large) array. This process of converting an object to an index is known as hashing, and it forms the basis for a new data structure called a hash table. In this lecture, we’ll give an overview of designing performant hash tables and some complications therein. Hashing is a nuanced and well-studied subject that we will only scratch the surface of in CS 2110; you’ll learn more about it in your later computer science courses.
Hash Tables
A hash table storing elements of type T is a data structure that is backed by an array.
We refer to the entries of the array backing a hash table as its buckets. The current capacity of a hash table is the number of buckets in its backing array.
At any point, each bucket may be either empty, storing no elements, or non-empty, storing one element (or possibly more, in the case of a chaining hash table). There is no requirement that the non-empty buckets occupy the leftmost slots in the array. Rather, the bucket where a particular element is stored is determined using the “value” of that element. We obtain this value through a process called hashing, which we will discuss in detail shortly.
Given a hash table with a capacity \(C\) storing entries of type T, hashing associates an index in \(\{0, 1, \dots, C-1 \}\) with each element of type T.
For now, we can view the hashing process abstractly as a function \(h_C \:\colon\:\) T \(\to \{0, 1, \dots, C-1\}\) that computes these indices. With this function, we can describe the “invariant” of a hash table.
An element \(x\) of type T will be nominally stored in the bucket with index \(h_C(x)\).
Here, we use the word nominally to hedge against the possibility that an element cannot be stored at its hashed index (which will be true for the probing hash tables that we will discuss later). However, it is good intuition to think of an element’s hash value as giving a good hint to where the element should be in the hash table, allowing us to shortcut the search for it.
Just as we have for other arrays, we’ll visualize a hash table as a row of boxes (usually omitting the outer rounded rectangle of the object for simplicity). Since the elements stored in a hash table will have a reference type, the empty buckets are visualized with null slashes, and the non-empty boxes contain arrows pointing to objects. The following animation steps through some basic operations on a hash table.
This animation illustrates one of the key features common to all operations on hash tables:
To perform an operation involving an element x, we compute the hash value of x and focus our attention on the bucket at that index.
Later in the lecture, we will consider some circumstances that can complicate these operations; however, this principle still applies. As such, this hashing step is critical to the development of a good hash table.
Hashing
As we noted above, hashing is the process of using the value of an object to compute its nominal index in a hash table. Now, we’ll further break down this computation into two steps.
1. Pre-Hashing: The object returns a representative int value (its hash code) using its hashCode() method.
2. Indexing: This hash code, which may be any int value, is converted to an array index. For our purposes, we will only consider one possible indexing, which is Math.abs(hashCode() % C) for a hash table of capacity \(C\).
It is important that we are clear with our terminology when discussing these two hashing steps. Many people use hashing to refer to the first (pre-hashing) step, whereas it actually refers to the combination of both steps. We'll use the term "hash code" to describe the int value that is produced by the first step and the word "hash value" to describe the index computed in the second step. Note that the hash value is a function of not only the object that is hashed but also the (current) capacity of the hash table.
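Putting the two steps together, the full hashing computation might look like the following sketch (the method name hashValue is just illustrative):

```java
/** Returns the hash value of `x` in a hash table of capacity `C`. */
static int hashValue(Object x, int C) {
    int hashCode = x.hashCode();      // step 1: pre-hashing (any int value)
    return Math.abs(hashCode % C);    // step 2: indexing (in {0, ..., C-1})
}
```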
hashCode() Method
In Java, the pre-hashing step is the responsibility of the object being hashed, and it is implemented by the hashCode() method. This is a method of the Object class that can be overridden by any class to offer a bespoke hash code calculation. There are only two requirements for hashCode() dictated by the Java specification.
1. Determinism: When the hashCode() method is invoked multiple times on the same object without modifying that object, then it must always return the same result.
2. Consistency with equals(): If two different objects x and y are deemed to be equal (i.e., x.equals(y) is true), then x.hashCode() == y.hashCode() should be true.
When we override equals(), we weaken its notion of equality. Recall that the default equals() implementation in Object is reference equality. Therefore, more pairs of objects will make equals() true after it is overridden, imposing a strong requirement on the hashCode() method. Java's default hashCode() implementation is only guaranteed to be consistent with reference equality. In practice, this means that you should override hashCode() whenever you override equals().
There are many more possible objects than there are possible int values. By the pigeonhole principle, this means that different (non-equal) objects must sometimes have the same hashCode(). We will soon see that this causes a hash collision, which degrades the performance of our hash table. As a result, we should try to limit hash collisions as much as we can. We prefer hashCode() implementations that “spread out” their values to the greatest extent possible. In practice, this means adhering to the following principles:
- Make sure that the possible return values of your hashCode() method span the entire int range. Don’t limit the return values to some range of small values. While this is not as crucial when using modular indexing (as we will do), it is a good practice and can make a difference when the hash table gets large.
- Incorporate all of an object’s state into its hashCode(). Similar objects that differ in a few attributes should not have the same hashCode(), so we want a pre-hashing recipe that depends on these differing attributes. Don’t, for example, hash Strings using their length or their first few characters. Instead, use a function that incorporates all of the characters in a non-commutative way, e.g., multiplying each by a different power of a large prime and adding these results (see the sketch below). If you can easily think of many pairs of objects that will have the same hashCode(), it is likely not a good candidate.
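As an illustration of that last recipe, the following sketch computes a polynomial hash over all of a String’s characters; this happens to be the same scheme used by Java’s own String.hashCode(), with 31 as the multiplier.

```java
/** Computes a polynomial hash of `s` over all of its characters. */
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        // Each character is weighted by a different power of 31, so permuting
        // the characters changes the result; int overflow harmlessly wraps around.
        h = 31 * h + s.charAt(i);
    }
    return h;
}
```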
The Objects.hash() and Arrays.hashCode() methods are helpful ingredients for a good hashCode() recipe.
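For example, a hypothetical immutable Point class might pair its overridden equals() with a hashCode() built from all of its state using Objects.hash():

```java
import java.util.Objects;

public class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof Point)) { return false; }
        Point other = (Point) obj;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Consistent with equals(): equal Points always yield equal hash codes.
        return Objects.hash(x, y);
    }
}
```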
To reason formally about the performance of a hash table, we will need to understand how well our hash function does at spreading out different objects across its buckets. We will work under the following optimistic (overly optimistic to the point of infeasibility, but good for our introductory understanding) assumption.
The Simple Uniform Hashing Assumption (SUHA) asserts that for a given hash function \(h_C \:\colon\:\) T \(\to \{0,1,\dots,C-1\}\), each object inserted in the hash table is equally likely to be placed in any of its buckets, and its placement is not affected by the objects already in the hash table.
More concretely, for a given object \(x\) of type T, \(\Pr\big(h_C(x) = i\big) = \frac{1}{C}\) for all \(i \in \{0,1,\dots,C-1\}\), and the hash value of \(x\) is independent of the hash values of all other objects.
Hash Codes and Mutability
We have just said that a good hashCode() should incorporate the entire state of an object into the computation of its hash code. However, this can lead to some unexpected behavior when we store mutable objects in a hash table. Consider the following code snippet:
```java
Set<ArrayList<Integer>> set = new HashSet<>();
ArrayList<Integer> list = new ArrayList<>();
list.add(1);    // the particular elements are unimportant; list is now [1]
set.add(list);  // the list is placed in a bucket based on its current hash value
list.add(2);    // mutate the list after adding it to the set; list is now [1, 2]
System.out.println(set.contains(list));
```
Java’s HashSet class is its implementation of the Set interface backed by a hash table. In this code snippet, we construct a set of ArrayList<Integer>s, add a list to this set, add a new element to this list, and then ask if this list is contained in the set. We’d hope to receive the answer true to this query, but unfortunately we get the answer false.
The following animation walks through this example carefully to see what has gone wrong.
In this case, the shortcutted search in the hash table was the source of our issue. In light of this possibility, it is important never to store objects in a hash table that may be mutated during their storage. In a HashMap (a Map implementation backed by a hash table), we typically choose an immutable type to serve as the key and relegate all mutable state to the value.
Hash Collisions
As we already noted, there are typically many more possible objects of a given type than there are possible int values. In addition, the indexing step of hashing maps many different hash codes to the same hash value (since the array capacity is significantly smaller than \(2^{32}\)). As a result, we may run into situations where two different objects are hashed to the same index. This is called a hash collision.
A hash collision occurs when two different objects are hashed to the same index in a hash table.
Our current, simple view of a hash table is unable to handle collisions. Each of our buckets is an array cell that stores a reference to an object of type T. Once this array cell stores one reference, there is no room for an additional reference. We will need a strategy to “robustify” our hash table so that it can accommodate collisions. Two such strategies that we will explore next are chaining and linear probing.
In a hash table with many buckets, the likelihood that any particular pair of elements collides is small (under the Simple Uniform Hashing Assumption). We might suspect that this means that we can store a decent number of elements before collisions become likely. Unfortunately, this is not the case, due to a phenomenon called the Birthday Paradox. Intuitively, the number of different pairs of elements grows quadratically in the number of elements, so collisions become likely even when there are relatively few elements. In a hash table with 365 buckets, it is more likely than not that a collision has occurred by the time that 23 elements have been added. In short, collisions are ubiquitous with even the best hash functions, so we must resolve them in a principled way to ensure that our hash table is performant.
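We can make this concrete: under the Simple Uniform Hashing Assumption, the probability that \(k\) inserted elements all land in distinct buckets of a \(C\)-bucket table is \[ \Pr(\text{no collision}) = \prod_{j=0}^{k-1}\left(1-\frac{j}{C}\right). \] With \(C = 365\) and \(k = 23\), this product is roughly \(0.49\), so a collision is more likely than not.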
Chaining
In a hash table with chaining, we loosen the restriction that each bucket references a single object. Rather, each bucket holds a LinkedList (i.e., a chain) containing all the elements with that hash value.
Let’s write our own HashSet implementation backed by a chaining hash table. The state representation of our HashSet will include two fields. First, it will have the hash table, which (due to chaining) is represented as a LinkedList<T>[] array (where T is the generic type of the set’s elements). Next, we’ll store the current size of the HashSet. Since the elements can be arbitrarily distributed across the buckets, it will be inefficient to recompute the size each time the size() method is called.
HashSet.java
```java
import java.util.LinkedList;

public class HashSet<T> {
    /**
     * The hash table backing this set. Invariant: table[i] contains exactly
     * the elements x with Math.abs(x.hashCode() % table.length) == i.
     */
    private LinkedList<T>[] table;

    /** The number of elements in this set. */
    private int size;
}
```
Take some time to develop the (preliminary) contains(), add(), and remove() methods for this HashSet. Be sure that your implementations both satisfy the specifications of these Set methods and maintain the HashSet class invariant.
contains() definition
add() definition
remove() definition
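Your implementations might look something like the following sketch, which assumes the two fields above and that every bucket is initialized to an empty LinkedList:

```java
/** Returns whether this set contains `elem`. */
public boolean contains(T elem) {
    int i = Math.abs(elem.hashCode() % table.length);
    return table[i].contains(elem);  // linear search within a single bucket
}

/** Adds `elem` to this set if it is not already present. */
public void add(T elem) {
    if (!contains(elem)) {
        int i = Math.abs(elem.hashCode() % table.length);
        table[i].add(elem);
        size += 1;
    }
}

/** Removes `elem` from this set if it is present. */
public void remove(T elem) {
    int i = Math.abs(elem.hashCode() % table.length);
    if (table[i].remove(elem)) {  // remove() returns whether elem was found
        size -= 1;
    }
}
```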
Let’s consider the runtime complexity of these methods. We will view the complexity of each call to hashCode() as \(O(1)\) (it may depend on the size of its target object, but shouldn’t depend on the number of elements stored in the hash table). In this case, the runtime of each of these Set methods is dominated by the linear search over a single bucket (as in contains()), so each has runtime \(O(\textrm{bucket size})\). Next, we’ll reason a bit more about this bucket size to obtain a more interpretable bound.
Load Factor and Resizing
Performing a worst-case complexity analysis for hash table operations is overly pessimistic. It is possible that every object a client adds to a hash table is unfortunately hashed to the same value, resulting in \(O(N)\) runtimes (our hash table degenerates to a linked list in a single bucket). However, this is exceedingly (exponentially) unlikely.
Instead, it is more fruitful to understand the typical performance of our hash table. Under the Simple Uniform Hashing Assumption, we think about the assignment of objects to buckets as a random process (even though in practice it is deterministic), since this allows us to quantify the likelihood of hash collisions. Since the time complexity of our hash table operations scales with the size of the bucket containing our object of interest, we would like to understand the expected size of this bucket.
Suppose that our \(N\)’th object (the one that is a parameter of the hash table operation we are analyzing) has hash value \(i \in \{0, 1, \dots, C-1\}\). By the Simple Uniform Hashing Assumption, each of the other \(N-1\) elements was (independently) hashed to bucket \(i\) with probability \(\frac{1}{C}\), so (by linearity of expectation) we can expect bucket \(i\) to contain \(1 + \frac{N-1}{C} = O(\frac{N}{C})\) objects. This factor \(\frac{N}{C}\) is important in the analysis of hash tables, so we give it a name.
The load factor of a hash table, often denoted \(\lambda\), is the ratio of its size to its capacity: \[ \lambda = \frac{N}{C} = \frac{\textrm{number of elements}}{\textrm{number of buckets}}. \]
The load factor measures the expected size of any bucket in a hash table. Therefore, the expected runtime of each hash table operation is \(O(\lambda)\). As more elements are added to a hash table (without increasing its capacity), its load factor increases. If we never adjust the number of buckets, the performance of the hash table degrades, and the runtime of its operations becomes \(O(N)\). If we wish to achieve an expected \(O(1)\) runtime, then we must occasionally increase the number of buckets to ensure that the load factor never exceeds some chosen constant (a common choice is \(\lambda = 0.75\)).
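For example, a hash table with \(C = 8\) buckets storing \(N = 6\) elements has load factor \(\lambda = \frac{6}{8} = 0.75\), so under the Simple Uniform Hashing Assumption each bucket is expected to hold \(0.75\) elements, and the expected cost of each operation remains constant.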
Hash Table Resizing
When we resize a hash table, we increase its number of buckets, \(C\). When we do this, we must change our hash function so that it can access these new buckets. Since the hashCode() is an inherent property of an object, it is the indexing step that changes; we compute the remainder of the hashCode() by a larger modulus. This will change the hash values of some of the objects already in the hash table (a good thing… the whole purpose of this resizing was to reduce the size of overcrowded buckets). Remember that the hash value is a function of not only the object’s value but also the capacity of the hash table. Therefore, we must re-hash all of the objects to determine in which bucket of the enlarged table they belong.
This resizing operation has an \(O(N)\) complexity, as it requires a linear scan over the hash table to locate and re-hash all of its elements. Therefore, we want to limit the number of resizes that are needed so we can amortize their cost over many insertions. By the same analysis as for dynamic array resizing, if we double the number of buckets during each resize, our amortized (and expected) runtime of insertions into our hash table remains \(O(1)\).
Take some time to incorporate this resizing into your HashSet implementation.
chaining HashSet with resizing
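One possible sketch of the resizing logic, assuming the fields from our chaining HashSet (add() would invoke this whenever an insertion would push the load factor above 0.75):

```java
/** Doubles the capacity of the hash table and re-hashes all elements. */
@SuppressWarnings("unchecked")
private void resize() {
    LinkedList<T>[] oldTable = table;
    table = (LinkedList<T>[]) new LinkedList[2 * oldTable.length];
    for (int i = 0; i < table.length; i++) {
        table[i] = new LinkedList<>();
    }
    for (LinkedList<T> bucket : oldTable) {
        for (T elem : bucket) {
            // The modulus changed, so each element's hash value may change.
            table[Math.abs(elem.hashCode() % table.length)].add(elem);
        }
    }
}
```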
Linear Probing
In a hash table with linear probing, each bucket can only hold a single object. When an add() operation would cause a collision in a probing hash table, we cannot place the object in its nominal bucket; there is not space for it. Instead, we check the next bucket (incrementing the index and wrapping back to 0 when we hit the end of the table), continuing until we find an empty bucket where our object can be placed. Said differently, we probe the buckets linearly (i.e., sequentially) until we locate a spot for the new object.
Now, we’ll consider the implications that probing has on the three basic operations on the hash table, add(), contains(), and remove(). We’ll describe each of these operations and illustrate them with animations, but we leave it as an exercise to implement a linear probing hash table.
add()
As we noted above, when we add an object to a probing hash table, it may not fit into the bucket corresponding to its hash value. Instead, this bucket becomes the starting position of a linear search that proceeds until we locate an empty bucket where our object can be placed.
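Here is a sketch of this insertion logic. It assumes (hypothetically) that the table is stored as an Object[] table with null marking empty buckets, and that a contains() helper exists (one is sketched later in this lecture); resizing and removals are ignored for now.

```java
/** Adds `elem` to this set if it is not already present. */
public void add(T elem) {
    if (contains(elem)) {
        return;
    }
    int i = Math.abs(elem.hashCode() % table.length);
    while (table[i] != null) {
        i = (i + 1) % table.length;  // bucket occupied: probe the next one
    }
    table[i] = elem;                 // the first empty bucket ends the probe
    size += 1;
}
```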
Just as with chaining, when the load factor grows too large in our hash table, its buckets become crowded. When this happens, the number of steps in our linear search grows larger, which deteriorates the runtime of our hash table operations. To counteract this, we will again need to periodically resize our hash table (in the same manner as the chaining hash table, doubling the number of buckets and then sequentially re-hashing and adding the existing elements).
contains()
For our contains() method, we must also contend with the possibility that in a probing hash table, an element may not be stored in its nominal bucket. When that element was added, it could have been forced into a later bucket by other elements that were previously added to the table. Therefore, we must begin a linear search for our target element beginning at the bucket indexed by its hash value.
We can stop the search and return true as soon as we find the element, but how will we know when to stop the search and return false? Recall that we place an element in the first available empty bucket after its nominal bucket. Therefore, as soon as we encounter an empty bucket in our linear search, we have found a place where the target element could have been added but was not (more on this in a bit), certifying to us that the target element is not in the hash table.
remove()
Finally, we consider the remove() method. As is typical with data structure removal, we start by locating the target element. Now, how do we update the state of the data structure to model this removal? An initial thought may be to set its bucket to null. However, this will cause a problem. To see why, let’s again consider the hash table we’ve been working with.
Suppose that we remove 74 from this hash table and replace it with null.
Now, if we run a contains() query on the element 13, what will happen? We will start at bucket 3, advance to bucket 4, encounter null and errantly return false. Our removal broke up the sequential block of elements that extended our linear search from 13’s nominal bucket 3 to the bucket 5 where it is actually located. To avoid the early curtailing of future linear probes, we need a way to signal that “an element used to be here, but was removed”. We do this by placing a tombstone into a bucket during the removal.
A tombstone is a sentinel value that indicates that a value has been removed from this bucket in a probing hash table.
In the contains() method, we can only curtail our search when we encounter an empty (i.e., null) bucket, and must continue our search when we encounter a tombstone.
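Putting these rules together, a contains() sketch for a probing hash table might look like the following (same hypothetical Object[] table as before, with TOMBSTONE as a unique sentinel object; since resizing keeps the load factor below 1, the table always contains an empty bucket and the loop terminates):

```java
private static final Object TOMBSTONE = new Object();

/** Returns whether this set contains `elem`. */
public boolean contains(T elem) {
    int i = Math.abs(elem.hashCode() % table.length);
    while (table[i] != null) {               // an empty bucket certifies absence
        if (table[i] != TOMBSTONE && table[i].equals(elem)) {
            return true;
        }
        i = (i + 1) % table.length;          // continue past tombstones and mismatches
    }
    return false;
}
```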
Linear Probing versus Chaining
We have seen two different strategies for dealing with hash collisions, chaining and linear probing. Let’s take a second to consider the trade-offs between these approaches. The data structure backing a chaining hash table (an array of LinkedLists of elements) is more complicated, so it takes up more memory at the same load factor. In addition, the decentralized storage of the LinkedList buckets means that the memory can be spread out, making it less likely that our code will benefit from faster memory access due to cache locality (a topic beyond our scope, but nonetheless useful to have at the back of mind). The direct storage of elements within the buckets of a probing hash table leads to improved memory performance.
On the other hand, probing hash tables are more sensitive to collisions since probing can cause objects to be stored outside of their nominal buckets. When many elements have nearby hash values, they can cluster together in the table, leading to probe sequences whose lengths exceed the number of true collisions. In addition, the presence of tombstones, which do not store any data but still lengthen containment queries, further degrades performance when entries are frequently added to and removed from the hash table. Chaining hash tables avoid the effects of clustering and tombstones by storing all objects in their nominal buckets.
Main Takeaways:
- In a hash table, each object is hashed to determine the nominal index where it will be located in an array. This lets us shortcut the search for an object by jumping directly to this position.
- Hashing is the process of computing this nominal index for an object. It consists of two steps. First, the object is pre-hashed to obtain its hashCode(), which can be any int value. Then, this hash code is converted to an array index using modular arithmetic.
- A good hashCode() definition will evenly spread out hash codes across the entire range of int values, using all of an object's state to make sure that "nearby" objects have different hash codes.
- Hash collisions occur when two different objects are hashed to the same index value. Chaining and linear probing are two strategies for dealing with hash collisions.
- A properly implemented hash table can support the primary Set and Map operations with an amortized and expected \(O(1)\) complexity.
Exercises
Consider a hash table with 5 buckets (labeled 0 to 4) that uses chaining to handle collisions. Elements are non-negative integers whose hash code is given by the formula \(h(i)=3i+2\), and bucket indices are derived by taking the absolute value of the remainder of the hash code divided by the table size.
String’s hashCode()
A String is backed by an array of chars. Given a length \(n\) string and \(0\le i< n\), denote the \(i\)'th element of this array as \(s_i\). For each of the following hash code functions, compute the hash code of "hash" and determine another String that hashes to the same value. State and explain whether these hash codes are good or bad.
HashMap
Like HashSet, the Map ADT can leverage hashing to achieve much faster expected runtimes.
```java
// A possible reconstruction: a chaining hash table of key-value entries,
// analogous to the HashSet representation above.
import java.util.LinkedList;

public class HashMap<K, V> {
    /** A key-value pair stored in this map. */
    private static class Entry<K, V> {
        K key;
        V value;
    }

    /**
     * The hash table backing this map. Invariant: table[i] contains exactly
     * the entries whose key k satisfies Math.abs(k.hashCode() % table.length) == i.
     */
    private LinkedList<Entry<K, V>>[] table;

    /** The number of entries in this map. */
    private int size;
}
```
The sketch above gives a possible state representation for such a HashMap. Implement size().
Define a helper method index() to derive the index of the bucket where a key would be stored.
```java
/** Returns the index of the bucket where `key` would be stored. */
private int index(K key) {
    return Math.abs(key.hashCode() % table.length);
}
```
Use index() to implement containsKey().
Implement put() and remove(). Resize the hash table when \(\lambda > 0.75\).
Implement keySet().
As with any data structure, we might want to provide a way to iterate through it. However, with Maps, it is ambiguous what we want to iterate over: keys, values, or entries. Implement all of these iterators.
Consider a HashSet<Integer> set that uses chaining. The backing hash table has an initial capacity of 10, doubling in size when \(\lambda > 0.75\). For simplicity, assume the hash code function is \(h(i) = i\). Draw the backing hash table and state the load factor after each operation in the following sequence.
(sequence of operations omitted)
Repeat the previous exercise, assuming the HashSet uses linear probing. Draw tombstones as displayed in lecture.
HashSet
Implement a Set with a hash table that resolves collisions with linear probing.
HashMap
Implement a Map with a hash table that resolves collisions with linear probing.
Implement a HashSet that uses quadratic probing to resolve collisions.
Implement a HashMap that uses quadratic probing to resolve collisions.
Modify remove() to shrink the backing array by half when \(\lambda < 0.25\).