Writing a hashmap in Rust, Part 3b: Exception safety

2023-04-10

Links to other parts of the series:

Welcome to another post in my series on writing a hashmap in Rust!

Please check out Part 1 for an introduction. The full code for the series can be found here.

Unfortuately, it turns out that our previous implementations using MaybeUninit (fourth::Map and fifth::Map) are not exception safe, and as a result, they are unsound. (“Unsound” means “in violation of Rust’s safety guarantees”.)

In this post, I will attempt to explain exception safety and show how our maps can be made exception-safe. If you’re looking for the official/definitive description of exception safety, please see the section in the Rustonomicon.

What is exception safety?

Rust’s key memory safety guarantee is roughly as follows: it is impossible to trigger undefined behavior (UB) from safe code. To uphold this guarantee, our unsafe code has to be hardened such that incorrect or even downright bad user code cannot cause UB.

As a result, we need to be super careful any time we call user code from inside a module that uses unsafe. Possibly the most common dangerous scenario is calling a user-provided routine that panics—when the panic occurs, the program starts unwinding and control is snatched away from our unsafe code. If the panic occured while we were temporarily violating a safety invariant, the bad state of our entity could be observed (e.g., when the type is dropped), leading to memory unsafety.

Thus, exception safety is a key aspect of correctness for a type that uses unsafe: panics coming from user code cannot cause memory safety to be violated!

Why aren’t our maps exception-safe?

It turns out that our previous implementations using MaybeUninit (fourth::Map and fifth::Map) are NOT exception-safe. The problem is the following Clone implementation:

// fourth.rs and fifth.rs

impl<K, V> Clone for Map<K, V>
where
    K: Clone + PartialEq + Eq + Hash,
    V: Clone,
{
    fn clone(&self) -> Self {
        let mut other = Self {
            hasher: DefaultHashBuilder::default(),
            n_items: self.n_items,
            n_occupied: self.n_occupied,
            storage: Box::new_uninit_slice(self.n_buckets()),
            metadata: self.metadata.clone(),
            _ph: PhantomData,
        };

        for (i, m) in self.metadata.iter().enumerate().take(self.n_buckets()) {
            if metadata::is_full(*m) {
                let (k, v) = unsafe { self.storage[i].assume_init_ref() };
                other.storage[i].write((k.clone(), v.clone()));
            }
        }

        other
    }
}

Let’s walk through what’s happening above.

  • First, we make a new instance other of Map. We clone everything over to the new map except for self.storage. We don’t clone self.storage because we can’t—remember that self.storage holds an array of MaybeUninit<(K, V)>, which isn’t Clone. Instead, we set other.storage to be an array of uninitialized memory.
  • Then, we loop through self.metadata and clone each full bucket over to other.storage.

Seems fine right? (You might see the problem already, but it took me a while.)

The problem arises because we cloned all the metadata over in the beginning. Remember, we are supposed to be upholding the following safety invariant:

pub struct Map<K, V, S: BuildHasher = DefaultHashBuilder> {
    /// Safety: we maintain the following invariant:
    /// `self.storage[i]` is initialized whenever `is_full(self.metadata[i])`.
    storage: Box<[MaybeUninit<(K, V)>]>,

    // ...
}

But note that we’re clearly violating this invariant when we set other.metadata = self.metadata.clone() above: At the moment when we do this, other.storage is completely uninitialized, and therefore other is in an invalid state!

In principle, there’s nothing wrong with momentarily violating our invariant, as long as we make sure it gets fixed after. And indeed, assuming everything goes right, our safety invariant is happily restored at the point when we return from clone().

But we can’t assume everything goes right if we’ve called user code! Which we have, in the calls to k.clone() and v.clone(). What if a caller accidentally (or purposefully, it really doesn’t matter) gave us a key type with this behavior?

#[derive(PartialEq, Eq, Hash)]
pub struct Bomb(usize);

impl Clone for Bomb {
    fn clone(&self) -> Self {
        panic!("boom!")
    }
}

If the user gives us K = Bomb, then we never get the chance to restore our invariant in clone()! Everything just blows up as soon as we try to clone a key.

The bad part actually happens after the panic. Rust’s runtime will try to unwind the stack, dropping values as it goes along, including other. But remember how our Drop implementation works: we check each metadata entry and if is_full, drop the values in the corresponding bucket. In doing so, we heavily rely on our safety invariant to avoid reading uninitialized data (aka insta-UB)—unfortunately, our invariant is completely borked at this point!

Obviously, our current Clone implementation is not correct. Let’s start by adding some tests to demonstrate this.

Adding miri tests

Unfortunately, our old miri tests weren’t actually testing for this situation. Let’s add another test:

// lib.rs
#[test]
#[should_panic]
fn clone_bomb() {

    #[derive(PartialEq, Eq, Hash)]
    struct Bomb(usize);

    impl Clone for Bomb {
        fn clone(&self) -> Self {
            panic!("bomb!")
        }
    }

    let mut map: Map<Bomb, String> = Map::new();

    for i in 0..1000 {
        map.insert(Bomb(i), i.to_string());
    }

    // This line panics, but shouldn't cause UB!
    let _ = map.clone();
}

Above, we define our type Bomb from before, fill up a map with a bunch of Bombs, and then clone it. Nothing too complicated! (Note that we marked the test #[should_panic] to indicate that we expect the test to panic.)

Running this on our original Clone implementation causes miri to scream in pain/anger:

$ cargo miri test fifth::tests::clone_bomb

running 1 test
test fifth::tests::clone_bomb - should panic ... error: Undefined Behavior: using uninitialized data, but this operation requires initialized memory

(Long backtrace omitted)

Yup, just as we had argued before, we’re causing UB. Let’s fix it!

Fixing the implementation

We’re lucky that in this case, the problem is actually pretty easy to solve. We argued previously that the problem stems from setting other.metadata = self.metadata.clone() before correctly initializing other.storage. So what if we just don’t do that?

// fourth.rs and fifth.rs
impl<K, V> Clone for Map<K, V>
where
    K: Clone + PartialEq + Eq + Hash,
    V: Clone,
{
    fn clone(&self) -> Self {
        let mut other = Self::with_capacity(self.n_buckets());
        assert_eq!(self.n_buckets(), other.n_buckets());

        for (i, m) in self.metadata.iter().enumerate().take(self.n_buckets()) {
            if metadata::is_full(*m) {
                let (k, v) = unsafe { self.storage[i].assume_init_ref() };
                other.storage[i].write((k.clone(), v.clone()));

                // Important: Only update the metadata after we successfully clone!
                // If cloning panics, then updating the metadata before cloning 
                // leads to a read of uninitialized memory when `other` is dropped.
                other.set_metadata(i, *m);
                other.n_items += 1;
                other.n_occupied += 1;
            }
        }
        other
    }
}

Let’s walk through the new code above.

  • First, we make other using Self::with_capacity, and assert that other has the same number of buckets as self. At this point, other is a valid but completely empty map.
  • Second, we iterate through the self.metadata array. At each full bucket i, we clone the contents of self.storage[i] to other.storage[i]. After this succeeds, we copy self.metadata[i] to other.metadata[i] and increment other.n_items and other.n_occupied1.

This is very similar to our original code, with the key difference that other.metadata is written after other.storage. Let’s see if this fixes our problem:

$ cargo miri test fifth::tests::clone_bomb

running 1 test
test fifth::tests::clone_bomb - should panic ... ok 

Fixed!

Conclusion

In this post, I showed how our initial implementation of Clone was unsound due to a lack of exception safety. I also showed a test to demonstrate how UB could be triggered, and explained the small change needed to fix the code.

For this example, only a simple transformation of the code was required to ensure that our safety invariant was upheld at the point where a panic could possibly occur. This is very similar to the example of Vec::push_all in the Rustonomicon. The Nomicon also provides a much more complicated example, where the required code transformations are much more complicated.

Footnotes

1

Strictly speaking, we actually don’t need to increment other.n_items and other.n_occupied inside of the loop, since we don’t have any safety invariants that rely on these quantities.

#data structures #hashmap #rust #software