Writing a hashmap in Rust, Part 3c: Resizing with SIMD

2023-04-11

Welcome to another post in my series on writing a hashmap in Rust!

Please check out Part 1 for an introduction. The full code for the series can be found here.

In this short post, we’ll add another optimization to fifth::Map: using SIMD to speed up map resizing. There’s nothing too complicated here, since we did all of the hard work in Part 3a, and we get a nice performance boost: our insert_grow benchmarks are now only 20% slower than std!

Also, this post should give a good idea of what we would need to implement the standard iterator utilities (.iter(), .into_iter(), etc.) on our Map type, which are pretty conceptually similar to resizing the map. In both cases, we just want to get all the elements out as fast as possible.

The game plan

Here’s what our old resize implementation on fifth::Map looks like:

// fifth.rs

impl<K, V> Map<K, V>
where
    K: PartialEq + Eq + Hash,
{
    #[cold]
    #[inline(never)]
    fn resize(&mut self) {
        // Calculate the new capacity.
        let capacity = match self.n_buckets() {
            0 => 16,
            x => x * 2,
        };

        // Set `self.storage` to a new array.
        let new_storage = Box::new_uninit_slice(capacity);
        let old_storage = std::mem::replace(&mut self.storage, new_storage);

        let new_metadata = (0..(capacity + GROUP_SIZE))
            .map(|_| metadata::empty())
            .collect::<Vec<_>>()
            .into_boxed_slice();
        // Here, we need to keep the old metadata,
        // as it's unsafe to blindly access the old storage array.
        let old_metadata = std::mem::replace(&mut self.metadata, new_metadata);

        self.n_items = 0;
        self.n_occupied = 0;

        // Move nodes from `old_storage` to `self.storage`.
        for (&metadata, bucket) in old_metadata.iter().zip(Vec::from(old_storage)) {
            if metadata::is_full(metadata) {
                // SAFETY: we just checked the invariant above.
                let (k, v) = unsafe { bucket.assume_init() };
                self._insert(k, v);
            }
        }
    }
}

This is fine, but in the final loop we’re iterating over self.metadata one entry at a time. Using the SIMD tricks from Part 3a, we can do better!

We basically need to refactor the final loop into the following steps:

  1. Chunk up old_metadata and old_storage into GROUP_SIZE chunks. Ideally, we should try to do this in a const way (i.e., outputting array chunks instead of slices) to give the compiler more information for doing optimizations. In particular, this tells the compiler that our SIMD accesses are always aligned.
  2. Zip together our two chunked iterators and iterate. On each loop iteration, we’ll get a metadata_chunk and storage_chunk from the zipped iterator.
  3. Identify and use the full buckets. From metadata_chunk, we’ll create an sse::Group, and use SIMD operations to make a Mask representing the full buckets. Then, we’ll grab the corresponding buckets from storage_chunk, pull out k and v, and feed them back into self._insert(k, v) like before.

Sounds pretty easy!

Step 1: Chunking

This step actually ended up being the most annoying, because I had to use a pretty new unstable feature, #![feature(iter_array_chunks)]. The feature itself is great, but unfortunately it’s also very new (about 8 months old). While I haven’t been shy about using unstable features in this series, our current chunking use case seems like a pretty common one, so I was hoping to avoid adding yet another unstable feature, especially one that’s only a few months old. If I’m missing an obvious better API for doing this, please let me know!

When trying to do this, I first wanted to use std::slice::array_chunks. This is a nice feature that uses const generics like we want. Unfortunately, array_chunks returns an iterator of array references, but we need an iterator of actual arrays since we have to consume old_storage—otherwise, we’d have to clone the (K, V) pairs, which would limit our Map to only holding types that are Clone.

So instead, we’ll use iter_array_chunks, which turns an iterator over T into one over [T; N]. Here’s the code:

// Before doing anything, `old_metadata` and
// `old_storage` are boxed slices.

let metadata_chunks = Vec::from(old_metadata)
    .into_iter()
    .array_chunks::<GROUP_SIZE>();
let storage_chunks = Vec::from(old_storage)
    .into_iter()
    .array_chunks::<GROUP_SIZE>();

Pretty simple!1
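
If you haven’t seen iter_array_chunks in action before, here’s a tiny standalone demo (nightly only; the values are made up for illustration) showing how it turns an iterator over T into an iterator over owned [T; N] arrays:

// A minimal demo of `Iterator::array_chunks`, unrelated to the map itself.
#![feature(iter_array_chunks)]

fn main() {
    let bytes = vec![1u8, 2, 3, 4, 5, 6, 7, 8];
    // Each item is an owned `[u8; 4]`, not a `&[u8; 4]`.
    let chunks: Vec<[u8; 4]> = bytes.into_iter().array_chunks::<4>().collect();
    assert_eq!(chunks, vec![[1u8, 2, 3, 4], [5, 6, 7, 8]]);
}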

Step 2: Zip & iterate

This step is quite simple:

for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
    // ...
}

One nice thing about zipping metadata_chunks and storage_chunks together is that this will automatically take care of the extra metadata group at the end of self.metadata (which we want to ignore).
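
To see why (the chunk counts below are just stand-ins), recall that zip stops as soon as the shorter iterator is exhausted, so the trailing metadata group never gets paired with a storage chunk:

fn main() {
    // `self.metadata` has `capacity + GROUP_SIZE` entries while the storage
    // array has `capacity`, so there's exactly one more metadata chunk than
    // storage chunk. (The counts here are hypothetical.)
    let metadata_chunks = 0..5; // stand-in for (capacity + GROUP_SIZE) / GROUP_SIZE chunks
    let storage_chunks = 0..4; // stand-in for capacity / GROUP_SIZE chunks

    // `zip` stops at the shorter iterator, so the extra metadata chunk is ignored.
    assert_eq!(metadata_chunks.zip(storage_chunks).count(), 4);
}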

Step 3: Do a lot of stuff

To implement step 3, we’ll have to first add a few SIMD utilities in sse.rs.

Part (a): SIMD

First, we’ll want to add a method for making a Group from an array. (Our previous constructor took a slice.)

// sse.rs

impl Group {
    #[inline]
    pub fn from_array(a: [u8; GROUP_SIZE]) -> Self {
        Self(SimdType::from_array(a))
    }
}

As you can see above, this is quite easy: we just call std::simd::Simd::from_array. Adding this constructor is purely a performance optimization, since it allows us to avoid the bounds checks in Simd::from_slice.
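
For comparison, the slice-based constructor from Part 3a presumably looks something like this (a sketch; the exact name and body may differ). Simd::from_slice has to check at runtime that the slice holds at least GROUP_SIZE elements, while from_array encodes the length in the type:

// sse.rs (sketch of the older slice-based constructor, for contrast)

impl Group {
    #[inline]
    pub fn from_slice(s: &[u8]) -> Self {
        // `Simd::from_slice` panics if `s.len()` is less than the lane
        // count, so every call pays for a length check.
        Self(SimdType::from_slice(s))
    }
}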

We’ll also want to add a method for finding the full buckets in a group. Here’s the code for doing this (sorry for the terrible name):

// sse.rs

impl Group {
    #[inline]
    pub fn to_fulls(self) -> MaskType {
        let empty_mask = SimdType::splat(metadata::empty());
        let zeros = SimdType::splat(0);
        (empty_mask & self.0).simd_eq(zeros)
    }
}

The above function splats metadata::empty() = 0x80 into every lane of a vector. Since empty and tombstone entries both have the MSB set, we do a lane-wise & of the group with empty_mask and then compare for (lane-wise) equality with a vector of zeros. This gives us a mask whose true lanes correspond to full metadata entries.
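
To make that concrete, here’s a small self-contained example with hypothetical metadata bytes (0x80 for empty, 0xFE standing in for a tombstone, and MSB-clear values for full buckets), using a 4-wide vector for readability. Note that the exact std::simd import paths depend on your nightly version:

#![feature(portable_simd)]
use std::simd::prelude::*;

fn main() {
    // Lane 0 is empty, lane 2 is a tombstone, lanes 1 and 3 are full.
    let group = Simd::from_array([0x80u8, 0x23, 0xFE, 0x51]);

    let empty_mask = Simd::splat(0x80u8);
    let zeros = Simd::splat(0u8);

    // Empty/tombstone lanes have the MSB set, so `& 0x80` leaves them
    // nonzero; only the full lanes compare equal to zero.
    let fulls = (empty_mask & group).simd_eq(zeros);
    assert_eq!(fulls.to_array(), [false, true, false, true]);
}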

Great! Now on to filling out the loop.

Part (b): Loop

We’ll start by making our mask using our two new SIMD functions:

for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
    // Get a mask showing the indices with full buckets.
    let full_mask = sse::Group::from_array(m_chunk).to_fulls();
    // ... 
}

Next, convert our full_mask to an array and zip it with s_chunk. We can then iterate over both arrays, only touching buckets where the mask is true:

for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
    // Get a mask showing the indices with full buckets.
    let full_mask = sse::Group::from_array(m_chunk).to_fulls();
    
    // Re-insert each full bucket.
    for (is_full, bucket) in full_mask.to_array().into_iter().zip(s_chunk) {
        if is_full {
            // Do something!
        }
    }
}

Finally, pull our items out of full buckets and re-insert them into the new storage array:

for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
    // Get a mask showing the indices with full buckets.
    let full_mask = sse::Group::from_array(m_chunk).to_fulls();
    
    // Re-insert each full bucket.
    for (is_full, bucket) in full_mask.to_array().into_iter().zip(s_chunk) {
        if is_full {
            // Safety: if `is_full`, then we can assume the `bucket` 
            // is initialized according to our safety invariant.
            let (k, v) = unsafe { bucket.assume_init() };
            self._insert(k, v);
        }
    }
}

Done!
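
For reference, here’s the whole refactored loop stitched together from the snippets above. The rest of resize (calculating the new capacity, swapping in the new storage and metadata, resetting the counters) is unchanged from the listing at the top of the post:

// fifth.rs (inside `resize`, replacing the old one-bucket-at-a-time loop)

let metadata_chunks = Vec::from(old_metadata)
    .into_iter()
    .array_chunks::<GROUP_SIZE>();
let storage_chunks = Vec::from(old_storage)
    .into_iter()
    .array_chunks::<GROUP_SIZE>();

for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
    // Get a mask showing the indices with full buckets.
    let full_mask = sse::Group::from_array(m_chunk).to_fulls();

    // Re-insert each full bucket.
    for (is_full, bucket) in full_mask.to_array().into_iter().zip(s_chunk) {
        if is_full {
            // Safety: if `is_full`, then we can assume the `bucket`
            // is initialized according to our safety invariant.
            let (k, v) = unsafe { bucket.assume_init() };
            self._insert(k, v);
        }
    }
}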

Speed comparison with std

Here are the updated performance results. As always, the numbers shown are ratios with std’s runtime.

Benchmark                            fifth v1   fifth (new resize)
insert_grow_seq (payload: 8 bytes)   1.46       1.19
insert_grow_seq (64 bytes)           1.35       1.20
insert_grow_random (8 bytes)         1.42       1.20
insert_grow_random (64 bytes)        1.36       1.21
insert_reserved_random (8 bytes)     1.14       1.11
insert_reserved_random (64 bytes)    1.15       1.11
lookup (8 bytes)                     1.23       1.22
lookup (64 bytes)                    1.34       1.38
lookup_string (8 bytes)              1.12       1.07
lookup_string (64 bytes)             1.07       1.01
lookup_miss (8 bytes)                1.11       1.08
lookup_miss (64 bytes)               1.10       1.12
remove (8 bytes)                     1.00       1.14
remove (64 bytes)                    1.12       1.12

The left column is v1 from Part 3a and the right column is our new code.

We get a substantial improvement on the insert_grow benchmarks, from about 40% slower than std previously to only 20% slower than std now!

Conclusion

In this post, we looked at how to apply our SIMD probing optimizations to map resizing. This was pretty easy to do, given that we developed most of the SIMD utilities in Part 3a, and it gave a great performance boost.

We could also use the same optimization for our Clone implementation, and for implementations of .iter() or .into_iter() for our Map. In all of these cases, we just want to yield all the full buckets in the map, without worrying about their hashes or anything else.
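
For instance, a borrowing traversal (the shape .iter() or Clone would need) could follow the same pattern: chunk the metadata, build a mask per group, and visit only the full buckets. Here’s a rough standalone sketch, not the actual Map code; the helper name, the 0x80 empty byte, and the use of the unstable array_chunks slice method are all illustrative assumptions. Since we’re only borrowing, chunking by reference is exactly what we want here:

// A sketch of the same group-at-a-time traversal over borrowed data.
// Import paths may differ slightly across nightly versions.
#![feature(array_chunks, portable_simd)]
use std::mem::MaybeUninit;
use std::simd::prelude::*;

const GROUP_SIZE: usize = 16;

/// Call `f` on every full bucket.
///
/// Safety (assumed invariant): `storage[i]` is initialized whenever
/// `metadata[i]` does not have its MSB set.
unsafe fn for_each_full<K, V>(
    metadata: &[u8],
    storage: &[MaybeUninit<(K, V)>],
    mut f: impl FnMut(&K, &V),
) {
    let m_chunks = metadata.array_chunks::<GROUP_SIZE>();
    let s_chunks = storage.array_chunks::<GROUP_SIZE>();
    for (m_chunk, s_chunk) in m_chunks.zip(s_chunks) {
        // Same masking trick as in `resize`.
        let group = Simd::from_array(*m_chunk);
        let fulls = (Simd::splat(0x80u8) & group).simd_eq(Simd::splat(0u8));
        for (is_full, bucket) in fulls.to_array().into_iter().zip(s_chunk) {
            if is_full {
                // SAFETY: full metadata implies an initialized bucket.
                let (k, v) = unsafe { bucket.assume_init_ref() };
                f(k, v);
            }
        }
    }
}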

Footnotes

1. This could be even simpler: we don’t actually need to specify the GROUP_SIZE for metadata_chunks, since the compiler will infer the type when we create a Group from a chunk later.

#data structures #hashmap #rust #software