New subject: [ruby-core:125337] [Ruby Feature#22011] Hash tables with swiss table

22 Apr 2026

      Issue #22011 has been reported by dsh0416 (Delton Ding).

----------------------------------------
Feature #22011: Hash tables with swiss table
https://bugs.ruby-lang.org/issues/22011

* Author: dsh0416 (Delton Ding)
* Status: Open
----------------------------------------
This change adds a Swiss-table-inspired probing layer to Ruby's core
`st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the
stored hash into a parallel array. It is built and enabled by default;
`--disable-swiss-st` reverts to the original `st.c`. The public ABI of
`struct st_table` and the iteration-order guarantee are preserved.

## Motivation

Hashes are everywhere in Ruby — instance-variable tables, ivar shapes,
constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles
of Rails-shaped workloads spend a meaningful fraction of CPU inside
`st_lookup` and `st_insert`. Two pain points stood out in the upstream
implementation:

1. **Probe loops are branch-heavy.** Every step of the perturb chain
   loads a bin, fetches the entry it points to, compares the full
   `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is
   several dependent loads per probe with no way to fast-reject groups
   of slots in parallel.
2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets
   one cache line per ~2.5 entries. Iteration and equality scans burn
   through L1 quickly, and Ruby programs typically hold a *lot* of
   small-to-mid-sized hashes (so per-table overhead matters).

The Swiss-table family of designs (Abseil `flat_hash_map`, Rust
`hashbrown`) addresses (1) with a 1-byte-per-slot **control array**
that lets a single SIMD/SWAR comparison reject or short-list 8 slots at
once. We borrow that idea but keep Ruby's two-array layout (so we don't
break ABI or insertion order) and add a third, parallel `ctrl[]` byte
array. We then attack (2) by also extracting the hash field out of
`st_table_entry` into its own parallel `uint32_t hashes[]` array.

---

## Design

### 1. Three-array layout

| array       | width                       | role                                                                       |
| ----------- | --------------------------- | -------------------------------------------------------------------------- |
| `entries[]` | 16 B (was 24 B)             | insertion-order log of `(key, record)`                                     |
| `hashes[]`  | 4 B per slot                | parallel array of truncated 32-bit hashes (also encodes the tombstone)     |
| `bins[]`    | adaptive 1 / 2 / 4 / 8 B    | hash-indexed array of indices into `entries[]`                             |
| `ctrl[]`    | 1 B per slot                | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)`             |

`entries[]` and `hashes[]` are the same length and addressed by the
same index, so iteration and slot reuse stay trivial. `ctrl[]` is the
fast-reject filter and lives alongside `bins[]`; only when a `ctrl`
byte matches H2 do we load the (now smaller) entry and the parallel
hash to confirm the match. SWAR is done on `uint64_t` reads of
`ctrl[]` so the per-group cost is a handful of bitwise ops with no
SIMD register transfer (which on Apple Silicon was not a win — see
"What we tried but didn't keep" below).

### 2. Compact `st_table_entry`

Before:

```c
struct st_table_entry {
    st_hash_t  hash;    /* 8 B */
    st_data_t  key;     /* 8 B */
    st_data_t  record;  /* 8 B */
};                       /* 24 B */
```

After (when `ST_USE_SWISS_BINS` is on, which is the default):

```c
struct st_table_entry {
    st_data_t  key;     /* 8 B */
    st_data_t  record;  /* 8 B */
};                       /* 16 B */
```

The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit
per 8-byte word, four entries per cache line). All access goes through
`ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles
unchanged when the feature is disabled. The `set_table` variant keeps
its original inline-hash layout — it already uses 16 B entries and is
not part of `st_table`'s ABI footprint.

### 3. The new hash function (the part that needed the most care)

Storing only the low 32 bits of the hash means we can no longer read
H2 from the *top* 7 bits of the original `unsigned long` hash, which is
what a textbook Swiss design does. Naively recomputing the full 64-bit
hash from the key on every rebuild / rehash / `st_shift` /
`st_general_foreach` works for correctness but kills insert and rebuild
performance — equality+hash for strings is not free.

The fix is to derive H2 from a band of the *truncated* 32-bit hash,
disjoint from the band that picks the bin:

```c
/* bin index: low `bin_power` bits, masked by `bins_mask(tab)`     */
hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); }

/* H2: bits 25..31 of the same 32-bit hash, never overlaps with     */
/* the bin index because bin_power is capped well under 25 in       */
/* practice.                                                        */
static inline unsigned char
st_swiss_h2(st_hash_t hash) {
    return (unsigned char)((hash >> 25) & 0x7f);
}
```

This makes the stored `uint32_t` self-sufficient: every probe reads
both the bin index and the H2 byte from the same word, no `do_hash()`
call required, and rebuild/rehash/shift/foreach all use
`ST_HASH_AT_IDX(tab, i)` instead of recomputing.

Two additional details that fall out of the truncation:

1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the
   tombstone marker for the 32-bit hash slot) never collides with a
   real hash value — if the truncation lands on the reserved value we
   bump it. The 64-bit reservation is preserved on platforms that
   compile without `ST_USE_SWISS_BINS`.
2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the
   table as a parameter so they can read/write the parallel hash slot.

### 4. Prefetch on H2 match

When SWAR finds a candidate H2 match in a control group, the next
operation is loading the matching `st_table_entry` and the parallel
hash word — both of which are in cold cache lines on a miss. We issue
`__builtin_prefetch` on both immediately after the match is detected
in `find_table_entry_ind` / `find_table_bin_ind` /
`find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides
a meaningful chunk of the L2 latency that the SWAR fast-filter would
otherwise expose.

### 5. What we tried but didn't keep

* **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the
  vector-to-GPR transfer for the match mask cost more than the entire
  SWAR sequence; on x86_64 the win, if any, was within noise. SWAR is
  the only shipped backend.
* **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep
  H2 in the high bits. Correct but cost ~5–10 % on insert workloads;
  superseded by the bit-band trick above.

---

## Results vs `master`

Both binaries built from the same tree (`master = 42b3cdc51a`,
`swiss = 3c0446847f`), same compiler, same flags. Each script run 5×
with `--disable-gems`, best-of-N reported. Memory is `maximum resident
set size` from `/usr/bin/time -l` on macOS arm64 (M-class).

### Throughput (lower wall time = better)

| benchmark         | master (s) | swiss (s) | speedup |
| ----------------- | ---------: | --------: | ------: |
| aref_int_large    |     0.8352 |    0.6862 | **+17.8 %** |
| aref_str_large    |     0.9915 |    0.8406 | **+15.2 %** |
| aref_miss_large   |     1.0803 |    0.7896 | **+26.9 %** |
| aref_mix_50       |     1.0337 |    0.8201 | **+20.7 %** |
| insert_grow       |     0.1138 |    0.1105 |    +2.9 % |
| churn (mixed RW)  |     0.0321 |    0.0304 |    +5.5 % |
| iterate           |     0.0566 |    0.0565 |   ±0.1 %  |

Lookups are the headline win — both successful (`+15 % … +20 %`) and
missing (`+27 %`), the latter because a missing key now short-circuits
on the first SWAR group with no entry/bin loads at all. Inserts and
churn are modestly faster because rebuilds no longer call `do_hash`.
Iteration is unaffected (it never touched bins or ctrl).

### Memory

Process RSS for the benchmark workloads (same runs):

| benchmark         | master (MB) | swiss (MB) | delta |
| ----------------- | ----------: | ---------: | ----: |
| insert_grow       |       66.62 |      60.44 | **−9.3 %** |
| aref_str_large    |       15.27 |      15.23 |   −0.3 % |
| aref_mix_50       |       16.64 |      16.61 |   −0.2 % |
| churn             |       13.19 |      13.03 |   −1.2 % |

Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes):

| workload                                 | master    | swiss     | delta       |
| ---------------------------------------- | --------: | --------: | ----------: |
| 2 000 hashes × 200 entries               | 14.66 MB  | 12.11 MB  | **−17.4 %** |
| 1 hash × 100 000 entries                 |  4.19 MB  |  3.28 MB  | **−21.9 %** |

The two memory views agree: any workload that holds a meaningful
number of entries live (whether one big hash or many small ones) sees
double-digit shrinkage from the 24 B → 16 B entry plus the 1 B
`ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot)
cost less than the 8 B saved per slot in `entries[]`. Per-table
overhead at fewer than ~64 entries is roughly flat; the Swiss path
itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64).

---Files--------------------------------
patch.diff (78 KB)

-- 
https://bugs.ruby-lang.org/

[ruby-core:125336] [Ruby Feature#22011] Hash tables with swiss table

dsh0416 (Delton Ding)

dsh0416 (Delton Ding)

dsh0416 (Delton Ding)

dsh0416 (Delton Ding)

byroot (Jean Boussier)

dsh0416 (Delton Ding)

dsh0416 (Delton Ding)

dsh0416 (Delton Ding)

dsh0416 (Delton Ding)

dsh0416 (Delton Ding)

dsh0416 (Delton Ding)

tags

participants (2)