Welcome to another post in my series on writing a hashmap in Rust!
Please check out Part 1 for an introduction. The full code for the series can be found here.
In this short post, we'll add another optimization onto fifth::Map: using SIMD to speed up the map resizing.
There's nothing too complicated about this, as we did all of the hard work in Part 3a, and we get a nice performance boost: we're now only 20% slower than std on our insert_grow benchmarks!
Also, this post should give a good idea of what we would need to implement the standard iterator utilities (.iter(), .into_iter(), etc.) on our Map type, which are conceptually pretty similar to resizing the map.
In both cases, we just want to get all the elements out as fast as possible.
The game plan
Here's what our old resize implementation on fifth::Map looks like:
// fifth.rs
impl<K, V> Map<K, V>
where
K: PartialEq + Eq + Hash,
{
#[cold]
#[inline(never)]
fn resize(&mut self) {
// Calculate the new capacity.
let capacity = match self.n_buckets() {
0 => 16,
x => x * 2,
};
// Set `self.storage` to a new array.
let new_storage = Box::new_uninit_slice(capacity);
let old_storage = std::mem::replace(&mut self.storage, new_storage);
let new_metadata = (0..(capacity + GROUP_SIZE))
.map(|_| metadata::empty())
.collect::<Vec<_>>()
.into_boxed_slice();
// Here, we need to keep the old metadata,
// as it's unsafe to blindly access the old storage array.
let old_metadata = std::mem::replace(&mut self.metadata, new_metadata);
self.n_items = 0;
self.n_occupied = 0;
// Move nodes from `old_storage` to `self.storage`.
for (&metadata, bucket) in old_metadata.iter().zip(Vec::from(old_storage)) {
if metadata::is_full(metadata) {
// SAFETY: we just checked the invariant above.
let (k, v) = unsafe { bucket.assume_init() };
self._insert(k, v);
}
}
}
}
This is fine, but in the final loop we're iterating over old_metadata one entry at a time.
Using the SIMD tricks from Part 3a, we can do better!
We basically need to refactor the final loop into the following steps:
- Chunk up old_metadata and old_storage into GROUP_SIZE chunks. Ideally, we should try to do this in a const way (i.e., outputting array chunks instead of slices) to give the compiler more information for doing optimizations. In particular, this tells the compiler that our SIMD accesses are always aligned.
- Zip together our two chunked iterators and iterate. On each loop iteration, we'll get a metadata_chunk and storage_chunk from the zipped iterator.
- Identify and use the full buckets. From metadata_chunk, we'll create an sse::Group, and use SIMD operations to make a Mask representing the full buckets. Then, we'll grab the corresponding buckets from storage_chunk, pull out k and v, and feed them back into self._insert(k, v) like before.
Sounds pretty easy!
Step 1: Chunking
This step actually turned out to be the most annoying, because I ended up having to use a pretty new unstable feature, #![feature(iter_array_chunks)].
The feature itself is great, but unfortunately it's also very new (about 8 months old).
While I haven't been shy about using unstable features in this series, our current chunking use case seems like a pretty common one, so I was hoping to avoid adding another unstable feature, especially one that's only a few months old.
If I'm missing an obvious better API for doing this, please let me know!
When trying to do this, I first wanted to use std::slice::array_chunks.
This is a nice feature that uses const generics like we want.
Unfortunately, array_chunks returns an iterator of array references, but we need an iterator of actual arrays since we have to consume old_storage; otherwise, we'd have to clone the (K, V) pairs, which would limit our Map to only holding types that are Clone.
So instead, we'll use iter_array_chunks, which turns an iterator over T into one over [T; N].
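To make the difference concrete, here's a tiny, self-contained illustration of what iter_array_chunks does on plain integers (nightly-only, since the feature is unstable):
#![feature(iter_array_chunks)]

fn main() {
    // `array_chunks` on an iterator yields owned arrays, not references,
    // so the elements are moved out rather than borrowed.
    let mut chunks = vec![1, 2, 3, 4, 5].into_iter().array_chunks::<2>();
    assert_eq!(chunks.next(), Some([1, 2]));
    assert_eq!(chunks.next(), Some([3, 4]));
    // The trailing element doesn't fill a whole chunk, so it isn't yielded here.
    assert_eq!(chunks.next(), None);
}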
Here’s the code:
// Before doing anything, `old_metadata` and
// `old_storage` are boxed slices.
let metadata_chunks = Vec::from(old_metadata)
.into_iter()
.array_chunks::<GROUP_SIZE>();
let storage_chunks = Vec::from(old_storage)
.into_iter()
.array_chunks::<GROUP_SIZE>();
Pretty simple!1
Step 2: Zip & iterate
This step is quite simple:
for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
// ...
}
One nice thing about zipping metadata_chunks and storage_chunks together is that this will automatically take care of the extra metadata group at the end of self.metadata (which we want to ignore).
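As a quick sanity check with hypothetical sizes (say capacity = 32 and GROUP_SIZE = 16): zip stops at the shorter iterator, so the trailing metadata group is simply never paired with a storage chunk.
#![feature(iter_array_chunks)]

fn main() {
    const GROUP_SIZE: usize = 16; // hypothetical; matches an SSE-sized group
    let capacity = 32;
    // Metadata mirrors an extra group at the end; storage does not.
    let metadata_chunks = (0..capacity + GROUP_SIZE).array_chunks::<GROUP_SIZE>();
    let storage_chunks = (0..capacity).array_chunks::<GROUP_SIZE>();
    // 3 metadata chunks zipped with 2 storage chunks yields only 2 pairs.
    assert_eq!(metadata_chunks.zip(storage_chunks).count(), 2);
}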
Step 3: Do a lot of stuff
To implement step 3, we'll have to first add a few SIMD utilities in sse.rs.
Part (a): SIMD
First, we'll want to add a method for making a Group from an array.
(Our previous constructor took a slice.)
// sse.rs
impl Group {
#[inline]
pub fn from_array(a: [u8; GROUP_SIZE]) -> Self {
Self(SimdType::from_array(a))
}
}
As you can see above, this is quite easy: we just call std::simd::Simd::from_array.
Adding this constructor is purely a performance optimization, since it allows us to avoid the bounds checks in Simd::from_slice.
We’ll also want to add a method for finding the full buckets in a group. Here’s the code for doing this (sorry for the terrible name):
// sse.rs
impl Group {
#[inline]
pub fn to_fulls(self) -> MaskType {
let empty_mask = SimdType::splat(metadata::empty());
let zeros = SimdType::splat(0);
(empty_mask & self.0).simd_eq(zeros)
}
}
The above function creates a vector of bitmasks by splatting metadata::empty() = 0x80.
Since empty and tombstone entries both have the MSB set, we do a lane-wise & of the group with empty_mask and then compare for (lane-wise) equality with a vector of zeros.
This gives us a mask where the set bits correspond to full metadata entries.
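As a quick worked example using the helpers above (the specific byte values here are hypothetical), lanes whose MSB is clear come out as true in the mask:
// 0x23 and 0x7f have the MSB clear, so they are full;
// every other lane is left as empty (0x80).
let mut bytes = [metadata::empty(); GROUP_SIZE];
bytes[0] = 0x23;
bytes[2] = 0x7f;
let fulls = sse::Group::from_array(bytes).to_fulls().to_array();
assert!(fulls[0] && !fulls[1] && fulls[2]);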
Great! Now on to filling out the loop.
Part (b): Loop
We’ll start by making our mask using our two new SIMD functions:
for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
// Get a mask showing the indices with full buckets.
let full_mask = sse::Group::from_array(m_chunk).to_fulls();
// ...
}
Next, convert our full_mask to an array and zip it with s_chunk.
We can then iterate over both arrays, only touching buckets where the mask is true:
for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
// Get a mask showing the indices with full buckets.
let full_mask = sse::Group::from_array(m_chunk).to_fulls();
// Re-insert each full bucket.
for (is_full, bucket) in full_mask.to_array().into_iter().zip(s_chunk) {
if is_full {
// Do something!
}
}
}
Finally, pull our items out of full buckets and re-insert them into the new storage array:
for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
// Get a mask showing the indices with full buckets.
let full_mask = sse::Group::from_array(m_chunk).to_fulls();
// Re-insert each full bucket.
for (is_full, bucket) in full_mask.to_array().into_iter().zip(s_chunk) {
if is_full {
// Safety: if `is_full`, then we can assume the `bucket`
// is initialized according to our safety invariant.
let (k, v) = unsafe { bucket.assume_init() };
self._insert(k, v);
}
}
}
Done!
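To see everything in one place, here's roughly what the refactored resize ends up looking like (a sketch assembled from the snippets above; the setup before the loop is unchanged from the old version):
// fifth.rs (sketch): the new `resize`, assembled from the pieces above.
impl<K, V> Map<K, V>
where
    K: PartialEq + Eq + Hash,
{
    #[cold]
    #[inline(never)]
    fn resize(&mut self) {
        // Calculate the new capacity.
        let capacity = match self.n_buckets() {
            0 => 16,
            x => x * 2,
        };
        // Swap in fresh storage and metadata, keeping the old ones
        // so we can move the items over.
        let new_storage = Box::new_uninit_slice(capacity);
        let old_storage = std::mem::replace(&mut self.storage, new_storage);
        let new_metadata = (0..(capacity + GROUP_SIZE))
            .map(|_| metadata::empty())
            .collect::<Vec<_>>()
            .into_boxed_slice();
        let old_metadata = std::mem::replace(&mut self.metadata, new_metadata);
        self.n_items = 0;
        self.n_occupied = 0;
        // Step 1: chunk up the old metadata and storage.
        let metadata_chunks = Vec::from(old_metadata)
            .into_iter()
            .array_chunks::<GROUP_SIZE>();
        let storage_chunks = Vec::from(old_storage)
            .into_iter()
            .array_chunks::<GROUP_SIZE>();
        // Steps 2 and 3: zip, mask, and re-insert the full buckets.
        for (m_chunk, s_chunk) in metadata_chunks.zip(storage_chunks) {
            let full_mask = sse::Group::from_array(m_chunk).to_fulls();
            for (is_full, bucket) in full_mask.to_array().into_iter().zip(s_chunk) {
                if is_full {
                    // SAFETY: full metadata implies an initialized bucket.
                    let (k, v) = unsafe { bucket.assume_init() };
                    self._insert(k, v);
                }
            }
        }
    }
}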
Speed comparison with std
Here are the updated performance results.
As always, the numbers shown are ratios with std's runtime.
| Benchmark | fifth v1 | fifth (new resize) |
|---|---|---|
| insert_grow_seq (payload: 8 bytes) | 1.46 | 1.19 |
| insert_grow_seq (64 bytes) | 1.35 | 1.20 |
| insert_grow_random (8 bytes) | 1.42 | 1.20 |
| insert_grow_random (64 bytes) | 1.36 | 1.21 |
| insert_reserved_random (8 bytes) | 1.14 | 1.11 |
| insert_reserved_random (64 bytes) | 1.15 | 1.11 |
| lookup (8 bytes) | 1.23 | 1.22 |
| lookup (64 bytes) | 1.34 | 1.38 |
| lookup_string (8 bytes) | 1.12 | 1.07 |
| lookup_string (64 bytes) | 1.07 | 1.01 |
| lookup_miss (8 bytes) | 1.11 | 1.08 |
| lookup_miss (64 bytes) | 1.10 | 1.12 |
| remove (8 bytes) | 1.00 | 1.14 |
| remove (64 bytes) | 1.12 | 1.12 |
The left column is v1 from Part 3a and the right column is our new code.
We get a substantial improvement on the insert_grow benchmarks, from about 40% slower than std previously to only 20% slower now!
Conclusion
In this post, we looked at how to apply our SIMD probing optimizations to map resizing. This was pretty easy to do given that we developed most of the SIMD utilities in Part 3a, and gave a great performance boost.
We could also use the same optimization for our Clone implementation, and for implementations of .iter() or .into_iter() for our Map.
In all of these cases, we just want to yield all the full buckets in the map, without worrying about their hashes or anything else.
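For instance, here's a rough sketch of what a consuming iterator could look like using the same chunk-and-mask pattern. This is purely illustrative: drain_buckets is a hypothetical name, and it assumes Map has no custom Drop implementation so we can move its fields out.
// Purely illustrative sketch, not the real implementation.
impl<K, V> Map<K, V>
where
    K: PartialEq + Eq + Hash,
{
    fn drain_buckets(self) -> impl Iterator<Item = (K, V)> {
        let metadata_chunks = Vec::from(self.metadata)
            .into_iter()
            .array_chunks::<GROUP_SIZE>();
        let storage_chunks = Vec::from(self.storage)
            .into_iter()
            .array_chunks::<GROUP_SIZE>();
        metadata_chunks.zip(storage_chunks).flat_map(|(m_chunk, s_chunk)| {
            let fulls = sse::Group::from_array(m_chunk).to_fulls().to_array();
            fulls
                .into_iter()
                .zip(s_chunk)
                .filter_map(|(is_full, bucket)| {
                    // SAFETY: full metadata implies an initialized bucket.
                    is_full.then(|| unsafe { bucket.assume_init() })
                })
        })
    }
}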
Footnotes
1. This could actually be even simpler: we don't need to specify the GROUP_SIZE for metadata_chunks, since the compiler will infer the type when we create a Group from a chunk later.