20. Hashing
In the previous lecture, we introduced two new ADTs, Sets and Maps. We compared three implementations of these types backed by different data structures: unsorted lists, sorted lists, and (balanced) BSTs. We saw that balanced BSTs provided the best worst-case performance since they enabled adding, querying, and removing elements in \(O(\log N)\) time, where \(N\) represents the number of elements in the collection. Is it possible to do better?
As a thought experiment, suppose we wanted to store a set of ints. To make things simpler, let’s further suppose that all of these ints are between 0 and 9999. Can you think of a way to model this set that allows for \(O(1)\) (truly constant) add(), contains() and remove() methods?
Modeling a set of ints
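One possible approach is sketched below: since every element is an int between 0 and 9999, we can keep a boolean[] of length 10000 and use each element as an index into it. (This is a minimal sketch; the class name SmallIntSet is just illustrative.)

```java
/** A set of ints in the range [0, 9999] with O(1) add, contains, and remove. */
public class SmallIntSet {
    // present[i] is true exactly when i is in the set.
    private boolean[] present = new boolean[10000];

    public void add(int x) { present[x] = true; }

    public boolean contains(int x) { return present[x]; }

    public void remove(int x) { present[x] = false; }
}
```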
This example is a bit contrived, as it fundamentally relies on the fact that there are only 10000 possible elements in this set. More typically, the number of possible elements will be untenably large. If we remove the bounds on our set of ints, we will need a 4GB array to use this strategy. Moreover, this approach is completely infeasible for representing a set of doubles, longs, or Strings. Nonetheless, its core idea helps inform the design of a more practical data structure: use the value of an object to compute the index where it would be stored in a (somewhat large) array. This process of converting an object to an index is known as hashing, and it forms the basis for a new data structure called a hash table. In this lecture, we’ll give an overview of designing performant hash tables and some complications therein. Hashing is a nuanced and well-studied subject that we will only scratch the surface of in CS 2110; you’ll learn more about it in your later computer science courses.
Hash Tables
A hash table storing elements of type T is a data structure that is backed by an array.
We refer to the entries of the array backing a hash table as its buckets. The current capacity of a hash table is the number of buckets in its backing array.
At any point, each bucket may be either empty, storing no elements, or non-empty, storing one element (or possibly more, in the case of a chaining hash table). There is no requirement that the non-empty buckets occupy the leftmost slots in the array. Rather, the bucket where a particular element is stored is determined using the “value” of that element. We obtain this value through a process called hashing, which we will discuss in detail shortly.
Given a hash table with a capacity \(C\) storing entries of type T, hashing associates an index in \(\{0, 1, \dots, C-1 \}\) with each element of type T.
For now, we can view the hashing process abstractly as a function \(h_C \:\colon\:\) T \(\to \{0, 1, \dots, C-1\}\) that computes these indices. With this function, we can describe the “invariant” of a hash table.
An element \(x\) of type T will be nominally stored in the bucket with index \(h_C(x)\).
Here, we use the word nominally to hedge against the possibility that an element cannot be stored at its hashed index (which will be true for the probing hash tables that we will discuss later). However, it is good intuition to think of an element’s hash value as giving a good hint to where the element should be in the hash table, allowing us to shortcut the search for it.
Just as we have for other arrays, we’ll visualize a hash table as a row of boxes (usually omitting the outer rounded rectangle of the object for simplicity). Since the elements stored in a hash table will have a reference type, the empty buckets are visualized with null slashes, and the non-empty boxes contain arrows pointing to objects. The following animation steps through some basic operations on a hash table.
This animation illustrates one of the key features common to all operations on hash tables:
To perform an operation involving an element x, we compute the hash value of x and focus our attention on the bucket at that index.
Later in the lecture, we will consider some circumstances that can complicate these operations; however, this principle still applies. As such, this hashing step is critical to the development of a good hash table.
Hashing
As we noted above, hashing is the process of using the value of an object to compute its nominal index in a hash table. Now, we’ll further break down this computation into two steps.
1. Pre-Hashing: The object returns a representative int value (its hash code) using its hashCode() method.
2. Indexing: This hash code, which may be any int value, is converted to an array index. For our purposes, we will only consider one possible indexing, which is Math.abs(hashCode() % C) for a hash table of capacity \(C\).
It is important that we are clear with our terminology when discussing these two hashing steps. Many people use hashing to refer to the first (pre-hashing) step, whereas it actually refers to the combination of both steps. We'll use the term "hash code" to describe the int value that is produced by the first step and the word "hash value" to describe the index computed in the second step. Note that the hash value is a function of not only the object that is hashed but also the (current) capacity of the hash table.
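Putting the two steps together, the full hashing computation might look like the following sketch (the method name hashValue is just illustrative):

```java
/** Returns the hash value of `x` in a hash table of capacity `C`. */
static int hashValue(Object x, int C) {
    int hashCode = x.hashCode();      // step 1: pre-hashing (any int value)
    return Math.abs(hashCode % C);    // step 2: indexing (in {0, ..., C-1})
}
```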
hashCode() Method
In Java, the pre-hashing step is the responsibility of the object being hashed, and it is implemented by the hashCode() method. This is a method of the Object class that can be overridden by any class to offer a bespoke hash code calculation. There are only two requirements for hashCode() dictated by the Java specification.
1. Determinism: When the hashCode() method is invoked multiple times on the same object without modifying that object, then it must always return the same result.
2. Consistency with equals(): If two different objects x and y are deemed to be equal (i.e., x.equals(y) is true), then x.hashCode() == y.hashCode() should be true.
When we override equals(), we weaken its notion of equality. Recall that the default equals() implementation in Object is reference equality. Therefore, more pairs of objects will make equals() true after it is overridden, imposing a strong requirement on the hashCode() method. Java's default hashCode() implementation is only guaranteed to be consistent with reference equality. In practice, this means that you should override hashCode() whenever you override equals().
There are many more possible objects than there are possible int values. By the pigeonhole principle, this means that different (non-equal) objects must sometimes have the same hashCode(). We will soon see that this causes a hash collision, which degrades the performance of our hash table. As a result, we should try to limit hash collisions as much as we can. We prefer hashCode() implementations that “spread out” their values to the greatest extent possible. In practice, this means adhering to the following principles:
- Make sure that the possible return values of your hashCode() method span the entire int range. Don’t limit the return values to some range of small values. While this is not as crucial when using modular indexing (as we will do), it is a good practice and can make a difference when the hash table gets large.
- Incorporate all of an object’s state into its hashCode(). Similar objects that differ in a few attributes should not have the same hashCode(), so we want a pre-hashing recipe that depends on these differing attributes. Don’t, for example, hash Strings using their length or their first few characters. Instead, use a function that incorporates all of the characters in a non-commutative way, e.g., multiplying each by a different power of a large prime and adding these results (see the sketch below). If you can easily think of many pairs of objects that will have the same hashCode(), it is likely not a good candidate.
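As an illustration of that last recipe, the following sketch computes a polynomial hash over all of a String’s characters; this happens to be the same scheme used by Java’s own String.hashCode(), with 31 as the multiplier.

```java
/** Computes a polynomial hash of `s` over all of its characters. */
static int stringHash(String s) {
    int h = 0;
    for (int i = 0; i < s.length(); i++) {
        // Each character is weighted by a different power of 31, so permuting
        // the characters changes the result; int overflow harmlessly wraps around.
        h = 31 * h + s.charAt(i);
    }
    return h;
}
```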
The Objects.hash() and Arrays.hashCode() methods are helpful ingredients for a good hashCode() recipe.
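For example, a hypothetical immutable Point class might pair its overridden equals() with a hashCode() built from all of its state using Objects.hash():

```java
import java.util.Objects;

public class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object obj) {
        if (!(obj instanceof Point)) { return false; }
        Point other = (Point) obj;
        return x == other.x && y == other.y;
    }

    @Override
    public int hashCode() {
        // Consistent with equals(): equal Points always yield equal hash codes.
        return Objects.hash(x, y);
    }
}
```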
To reason formally about the performance of a hash table, we will need to understand how well our hash function does at spreading out different objects across its buckets. We will work under the following optimistic (overly optimistic to the point of infeasibility, but good for our introductory understanding) assumption.
The Simple Uniform Hashing Assumption (SUHA) asserts that for a given hash function \(h_C \:\colon\:\) T \(\to \{0,1,\dots,C-1\}\), each object inserted in the hash table is equally likely to be placed in any of its buckets, and its placement is not affected by the objects already in the hash table.
More concretely, for a given object \(x\) of type T, \(\Pr\big(h_C(x) = i\big) = \frac{1}{C}\) for all \(i \in \{0,1,\dots,C-1\}\), and the hash value of \(x\) is independent of the hash values of all other objects.
Hash Codes and Mutability
We have just said that a good hashCode() should incorporate the entire state of an object into the computation of its hash code. However, this can lead to some unexpected behavior when we store mutable objects in a hash table. Consider the following code snippet:
```java
Set<ArrayList<Integer>> set = new HashSet<>();
ArrayList<Integer> list = new ArrayList<>();
list.add(1);    // the particular elements are unimportant; list is now [1]
set.add(list);  // the list is placed in a bucket based on its current hash value
list.add(2);    // mutate the list after adding it to the set; list is now [1, 2]
System.out.println(set.contains(list));
```
Java’s HashSet class is its implementation of the Set interface backed by a hash table. In this code snippet, we construct a set of ArrayList<Integer>s, add a list to this set, add a new element to this list, and then ask if this list is contained in the set. We’d hope to receive the answer true to this query, but unfortunately we get the answer false.
The following animation walks through this example carefully to see what has gone wrong.
In this case, the shortcutted search in the hash table was the source of our issue. In light of this possibility, it is important never to store objects in a hash table that may be mutated during their storage. In a HashMap (a Map implementation backed by a hash table), we typically choose an immutable type to serve as the key and relegate all mutable state to the value.
Hash Collisions
As we already noted, there are typically many more possible objects of a given type than there are possible int values. In addition, the indexing step of hashing maps many different hash codes to the same hash value (since the array capacity is significantly smaller than \(2^{32}\)). As a result, we may run into situations where two different objects are hashed to the same index. This is called a hash collision.
A hash collision occurs when two different objects are hashed to the same index in a hash table.
Our current, simple view of a hash table is unable to handle collisions. Each of our buckets is an array cell that stores a reference to an object of type T. Once this array cell stores one reference, there is no room for an additional reference. We will need a strategy to “robustify” our hash table so that it can accommodate collisions. Two such strategies that we will explore next are chaining and linear probing.
In a hash table with many buckets, the likelihood that any particular pair of elements collides is small (under the Simple Uniform Hashing Assumption). We might suspect that this means that we can store a decent number of elements before collisions become likely. Unfortunately, this is not the case, due to a phenomenon called the Birthday Paradox. Intuitively, the number of different pairs of elements grows quadratically in the number of elements, so collisions become likely even when there are relatively few elements. In a hash table with 365 buckets, it is more likely than not that a collision has occurred by the time that 23 elements have been added. In short, collisions are ubiquitous with even the best hash functions, so we must resolve them in a principled way to ensure that our hash table is performant.
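We can make this concrete: under the Simple Uniform Hashing Assumption, the probability that \(k\) inserted elements all land in distinct buckets of a \(C\)-bucket table is \[ \Pr(\text{no collision}) = \prod_{j=0}^{k-1}\left(1-\frac{j}{C}\right). \] With \(C = 365\) and \(k = 23\), this product is roughly \(0.49\), so a collision is more likely than not.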
Chaining
In a hash table with chaining, we loosen the restriction that each bucket references a single object. Rather, each bucket holds a LinkedList (i.e., a chain) containing all the elements with that hash value.
Let’s write our own HashSet implementation backed by a chaining hash table. The state representation of our HashSet will include two fields. First, it will have the hash table, which (due to chaining) is represented as a LinkedList<T>[] array (where T is the generic type of the set’s elements). Next, we’ll store the current size of the HashSet. Since the elements can be arbitrarily distributed across the buckets, it will be inefficient to recompute the size each time the size() method is called.
HashSet.java
```java
import java.util.LinkedList;

public class HashSet<T> {
    /**
     * The hash table backing this set. Invariant: table[i] contains exactly
     * the elements x with Math.abs(x.hashCode() % table.length) == i.
     */
    private LinkedList<T>[] table;

    /** The number of elements in this set. */
    private int size;
}
```
Take some time to develop the (preliminary) contains(), add(), and remove() methods for this HashSet. Be sure that your implementations both satisfy the specifications of these Set methods and maintain the HashSet class invariant.
contains() definition
add() definition
remove() definition
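Your implementations might look something like the following sketch, which assumes the two fields above and that every bucket is initialized to an empty LinkedList:

```java
/** Returns whether this set contains `elem`. */
public boolean contains(T elem) {
    int i = Math.abs(elem.hashCode() % table.length);
    return table[i].contains(elem);  // linear search within a single bucket
}

/** Adds `elem` to this set if it is not already present. */
public void add(T elem) {
    if (!contains(elem)) {
        int i = Math.abs(elem.hashCode() % table.length);
        table[i].add(elem);
        size += 1;
    }
}

/** Removes `elem` from this set if it is present. */
public void remove(T elem) {
    int i = Math.abs(elem.hashCode() % table.length);
    if (table[i].remove(elem)) {  // remove() returns whether elem was found
        size -= 1;
    }
}
```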
Let’s consider the runtime complexity of these methods. We will view the complexity of each call to hashCode() as \(O(1)\) (it may depend on the size of its target object, but shouldn’t depend on the number of elements stored in the hash table). In this case, the runtime of each of these Set methods is dominated by the linear search over a single bucket (as in contains()), so each has runtime \(O(\textrm{bucket size})\). Next, we’ll reason a bit more about this bucket size to obtain a more interpretable bound.
Load Factor and Resizing
Performing a worst-case complexity analysis for hash table operations is overly pessimistic. It is possible that every object a client adds to a hash table is unfortunately hashed to the same value, resulting in \(O(N)\) runtimes (our hash table degenerates to a linked list in a single bucket). However, this is exceedingly (exponentially) unlikely.
Instead, it is more fruitful to understand the typical performance of our hash table. Under the Simple Uniform Hashing Assumption, we think about the assignment of objects to buckets as a random process (even though in practice it is deterministic), since this allows us to quantify the likelihood of hash collisions. Since the time complexity of our hash table operations scales with the size of the bucket containing our object of interest, we would like to understand the expected size of this bucket.
Suppose that our \(N\)’th object (the one that is a parameter of the hash table operation we are analyzing) has hash value \(i \in \{0, 1, \dots, C-1\}\). By the Simple Uniform Hashing Assumption, each of the other \(N-1\) elements was (independently) hashed to bucket \(i\) with probability \(\frac{1}{C}\), so (by linearity of expectation) we can expect bucket \(i\) to contain \(1 + \frac{N-1}{C} = O(\frac{N}{C})\) objects. This factor \(\frac{N}{C}\) is important in the analysis of hash tables, so we give it a name.
The load factor of a hash table, often denoted \(\lambda\), is the ratio of its size to its capacity: \[ \lambda = \frac{N}{C} = \frac{\textrm{number of elements}}{\textrm{number of buckets}}. \]
The load factor measures the expected size of any bucket in a hash table. Therefore, the expected runtime of each hash table operation is \(O(\lambda)\). As more elements are added to a hash table (without increasing its capacity), its load factor increases. If we never adjust the number of buckets, the performance of the hash table degrades, and the runtime of its operations becomes \(O(N)\). If we wish to achieve an expected \(O(1)\) runtime, then we must occasionally increase the number of buckets to ensure that the load factor never exceeds some chosen constant (a common choice is \(\lambda = 0.75\)).
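For example, a hash table with \(C = 8\) buckets storing \(N = 6\) elements has load factor \(\lambda = \frac{6}{8} = 0.75\), so under the Simple Uniform Hashing Assumption each bucket is expected to hold \(0.75\) elements, and the expected cost of each operation remains constant.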
Hash Table Resizing
When we resize a hash table, we increase its number of buckets, \(C\). When we do this, we must change our hash function so that it can access these new buckets. Since the hashCode() is an inherent property of an object, it is the indexing step that changes; we compute the remainder of the hashCode() by a larger modulus. This will change the hash values of some of the objects already in the hash table (a good thing… the whole purpose of this resizing was to reduce the size of overcrowded buckets). Remember that the hash value is a function of not only the object’s value but also the capacity of the hash table. Therefore, we must re-hash all of the objects to determine in which bucket of the enlarged table they belong.
This resizing operation has an \(O(N)\) complexity, as it requires a linear scan over the hash table to locate and re-hash all of its elements. Therefore, we want to limit the number of resizes that are needed so we can amortize their cost over many insertions. By the same analysis as for dynamic array resizing, if we double the number of buckets during each resize, our amortized (and expected) runtime of insertions into our hash table remains \(O(1)\).
Take some time to incorporate this resizing into your HashSet implementation.
chaining HashSet with resizing
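One possible sketch of the resizing logic, assuming the fields from our chaining HashSet (add() would invoke this whenever an insertion would push the load factor above 0.75):

```java
/** Doubles the capacity of the hash table and re-hashes all elements. */
@SuppressWarnings("unchecked")
private void resize() {
    LinkedList<T>[] oldTable = table;
    table = (LinkedList<T>[]) new LinkedList[2 * oldTable.length];
    for (int i = 0; i < table.length; i++) {
        table[i] = new LinkedList<>();
    }
    for (LinkedList<T> bucket : oldTable) {
        for (T elem : bucket) {
            // The modulus changed, so each element's hash value may change.
            table[Math.abs(elem.hashCode() % table.length)].add(elem);
        }
    }
}
```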
Linear Probing
In a hash table with linear probing, each bucket can only hold a single object. When an add() operation would cause a collision in a probing hash table, we cannot place the object in its nominal bucket; there is not space for it. Instead, we check the next bucket (incrementing the index and wrapping back to 0 when we hit the end of the table), continuing until we find an empty bucket where our object can be placed. Said differently, we probe the buckets linearly (i.e., sequentially) until we locate a spot for the new object.
Now, we’ll consider the implications that probing has on the three basic operations on the hash table, add(), contains(), and remove(). We’ll describe each of these operations and illustrate them with animations, but we leave it as an exercise to implement a linear probing hash table.
add()
As we noted above, when we add an object to a probing hash table, it may not fit into the bucket corresponding to its hash value. Instead, this bucket becomes the starting position of a linear search that proceeds until we locate an empty bucket where our object can be placed.
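Here is a sketch of this insertion logic. It assumes (hypothetically) that the table is stored as an Object[] table with null marking empty buckets, and that a contains() helper exists (one is sketched later in this lecture); resizing and removals are ignored for now.

```java
/** Adds `elem` to this set if it is not already present. */
public void add(T elem) {
    if (contains(elem)) {
        return;
    }
    int i = Math.abs(elem.hashCode() % table.length);
    while (table[i] != null) {
        i = (i + 1) % table.length;  // bucket occupied: probe the next one
    }
    table[i] = elem;                 // the first empty bucket ends the probe
    size += 1;
}
```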
Just as with chaining, when the load factor grows too large in our hash table, its buckets become crowded. When this happens, the number of steps in our linear search grows larger, which deteriorates the runtime of our hash table operations. To counteract this, we will again need to periodically resize our hash table (in the same manner as the chaining hash table, doubling the number of buckets and then sequentially re-hashing and adding the existing elements).
contains()
For our contains() method, we must also contend with the possibility that in a probing hash table, an element may not be stored in its nominal bucket. When that element was added, it could have been forced into a later bucket by other elements that were previously added to the table. Therefore, we must begin a linear search for our target element beginning at the bucket indexed by its hash value.
We can stop the search and return true as soon as we find the element, but how will we know when to stop the search and return false? Recall that we place an element in the first available empty bucket after its nominal bucket. Therefore, as soon as we encounter an empty bucket in our linear search, we have found a place where the target element could have been added but was not (more on this in a bit), certifying to us that the target element is not in the hash table.
remove()
Finally, we consider the remove() method. As is typical with data structure removal, we start by locating the target element. Now, how do we update the state of the data structure to model this removal? An initial thought may be to set its bucket to null. However, this will cause a problem. To see why, let’s again consider the hash table we’ve been working with.
Suppose that we remove 74 from this hash table and replace it with null.
Now, if we run a contains() query on the element 13, what will happen? We will start at bucket 3, advance to bucket 4, encounter null and errantly return false. Our removal broke up the sequential block of elements that extended our linear search from 13’s nominal bucket 3 to the bucket 5 where it is actually located. To avoid the early curtailing of future linear probes, we need a way to signal that “an element used to be here, but was removed”. We do this by placing a tombstone into a bucket during the removal.
A tombstone is a sentinel value that indicates that a value has been removed from this bucket in a probing hash table.
In the contains() method, we can only curtail our search when we encounter an empty (i.e., null) bucket, and must continue our search when we encounter a tombstone.
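Putting these rules together, a contains() sketch for a probing hash table might look like the following (same hypothetical Object[] table as before, with TOMBSTONE as a unique sentinel object; since resizing keeps the load factor below 1, the table always contains an empty bucket and the loop terminates):

```java
private static final Object TOMBSTONE = new Object();

/** Returns whether this set contains `elem`. */
public boolean contains(T elem) {
    int i = Math.abs(elem.hashCode() % table.length);
    while (table[i] != null) {               // an empty bucket certifies absence
        if (table[i] != TOMBSTONE && table[i].equals(elem)) {
            return true;
        }
        i = (i + 1) % table.length;          // continue past tombstones and mismatches
    }
    return false;
}
```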
Linear Probing versus Chaining
We have seen two different strategies for dealing with hash collisions, chaining and linear probing. Let’s take a second to consider the trade-offs between these approaches. The data structure backing a chaining hash table (an array of LinkedLists of elements) is more complicated, so it takes up more memory at the same load factor. In addition, the decentralized storage of the LinkedList buckets means that the memory can be spread out, making it less likely that our code will benefit from faster memory access due to cache locality (a topic beyond our scope, but nonetheless useful to have at the back of mind). The direct storage of elements within the buckets of a probing hash table leads to improved memory performance.
On the other hand, probing hash tables are more sensitive to collisions since probing can cause objects to be stored outside of their nominal buckets. When many elements have nearby hash values, they can cluster together in the table, leading to probe sequences whose lengths exceed the number of true collisions. In addition, the presence of tombstones, which do not store any data but still lengthen containment queries, further degrades performance when entries are frequently added to and removed from the hash table. Chaining hash tables avoid the effects of clustering and tombstones by storing all objects in their nominal buckets.
Main Takeaways:
- In a hash table, each object is hashed to determine the nominal index where it will be located in an array. This lets us shortcut the search for an object by jumping directly to this position.
- Hashing is the process of computing this nominal index for an object. It consists of two steps. First, the object is pre-hashed to obtain its hashCode(), which can be any int value. Then, this hash code is converted to an array index using modular arithmetic.
- A good hashCode() definition will evenly spread out hash codes across the entire range of int values, using all of an object's state to make sure that "nearby" objects have different hash codes.
- Hash collisions occur when two different objects are hashed to the same index value. Chaining and linear probing are two strategies for dealing with hash collisions.
- A properly implemented hash table can support the primary Set and Map operations with an amortized and expected \(O(1)\) complexity.
Exercises
Consider a hash table with 5 buckets (labeled 0 to 4) that uses chaining to handle collisions. Elements are non-negative integers whose hash code is given by the formula \(h(i)=3i+2\), and bucket indices are derived by taking the absolute value of the remainder of the hash code divided by the table size.
String’s hashCode()
A String is backed by an array of chars. Given a length \(n\) string and \(0\le i< n\), denote the \(i\)'th element of this array as \(s_i\). For each of the following hash code functions, compute the hash code of "hash" and determine another String that hashes to the same value. State and explain whether these hash codes are good or bad.
HashMap
Like HashSet, the Map ADT can leverage hashing to achieve much faster expected runtimes.
```java
// A possible reconstruction: a chaining hash table of key-value entries,
// analogous to the HashSet representation above.
import java.util.LinkedList;

public class HashMap<K, V> {
    /** A key-value pair stored in this map. */
    private static class Entry<K, V> {
        K key;
        V value;
    }

    /**
     * The hash table backing this map. Invariant: table[i] contains exactly
     * the entries whose key k satisfies Math.abs(k.hashCode() % table.length) == i.
     */
    private LinkedList<Entry<K, V>>[] table;

    /** The number of entries in this map. */
    private int size;
}
```
The sketch above gives a possible state representation for such a HashMap. Implement size().
Define a helper method index() to derive the index of the bucket where a key would be stored.
```java
/** Returns the index of the bucket where `key` would be stored. */
private int index(K key) {
    return Math.abs(key.hashCode() % table.length);
}
```
Use index() to implement containsKey().
Implement put() and remove(). Resize the hash table when \(\lambda > 0.75\).
Implement keySet().
As with any data structure, we might want to provide a way to iterate through it. However, with Maps, it is ambiguous what we want to iterate over: keys, values, or entries. Implement all of these iterators.
Consider a HashSet<Integer> set that uses chaining. The backing hash table has an initial capacity of 10, doubling in size when \(\lambda > 0.75\). For simplicity, assume the hash code function is \(h(i) = i\). Draw the backing hash table and state the load factor after each operation in the following sequence.
(sequence of operations omitted)
Repeat the previous exercise, assuming the HashSet uses linear probing. Draw tombstones as displayed in lecture.
HashSet
Implement a Set with a hash table that resolves collisions with linear probing.
HashMap
Implement a Map with a hash table that resolves collisions with linear probing.
Implement a HashSet that uses quadratic probing to resolve collisions.
Implement a HashMap that uses quadratic probing to resolve collisions.
Modify remove() to shrink the backing array by half when \(\lambda < 0.25\).