[ruby-core:125336] [Ruby Feature#22011] Hash tables with swiss table

22 Apr 2026

      Issue #22011 has been reported by dsh0416 (Delton Ding).

----------------------------------------
Feature #22011: Hash tables with swiss table
https://bugs.ruby-lang.org/issues/22011

* Author: dsh0416 (Delton Ding)
* Status: Open
----------------------------------------
This change adds a Swiss-table-inspired probing layer to Ruby's core
`st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the
stored hash into a parallel array. It is built and enabled by default;
`--disable-swiss-st` reverts to the original `st.c`. The public ABI of
`struct st_table` and the iteration-order guarantee are preserved.

## Motivation

Hashes are everywhere in Ruby — instance-variable tables, ivar shapes,
constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles
of Rails-shaped workloads spend a meaningful fraction of CPU inside
`st_lookup` and `st_insert`. Two pain points stood out in the upstream
implementation:

1. **Probe loops are branch-heavy.** Every step of the perturb chain
   loads a bin, fetches the entry it points to, compares the full
   `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is
   several dependent loads per probe with no way to fast-reject groups
   of slots in parallel.
2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets
   one cache line per ~2.5 entries. Iteration and equality scans burn
   through L1 quickly, and Ruby programs typically hold a *lot* of
   small-to-mid-sized hashes (so per-table overhead matters).

The Swiss-table family of designs (Abseil `flat_hash_map`, Rust
`hashbrown`) addresses (1) with a 1-byte-per-slot **control array**
that lets a single SIMD/SWAR comparison reject or short-list 8 slots at
once. We borrow that idea but keep Ruby's two-array layout (so we don't
break ABI or insertion order) and add a third, parallel `ctrl[]` byte
array. We then attack (2) by also extracting the hash field out of
`st_table_entry` into its own parallel `uint32_t hashes[]` array.

---

## Design

### 1. Three-array layout

| array       | width                       | role                                                                       |
| ----------- | --------------------------- | -------------------------------------------------------------------------- |
| `entries[]` | 16 B (was 24 B)             | insertion-order log of `(key, record)`                                     |
| `hashes[]`  | 4 B per slot                | parallel array of truncated 32-bit hashes (also encodes the tombstone)     |
| `bins[]`    | adaptive 1 / 2 / 4 / 8 B    | hash-indexed array of indices into `entries[]`                             |
| `ctrl[]`    | 1 B per slot                | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)`             |

`entries[]` and `hashes[]` are the same length and addressed by the
same index, so iteration and slot reuse stay trivial. `ctrl[]` is the
fast-reject filter and lives alongside `bins[]`; only when a `ctrl`
byte matches H2 do we load the (now smaller) entry and the parallel
hash to confirm the match. SWAR is done on `uint64_t` reads of
`ctrl[]` so the per-group cost is a handful of bitwise ops with no
SIMD register transfer (which on Apple Silicon was not a win — see
"What we tried but didn't keep" below).

### 2. Compact `st_table_entry`

Before:

```c
struct st_table_entry {
    st_hash_t  hash;    /* 8 B */
    st_data_t  key;     /* 8 B */
    st_data_t  record;  /* 8 B */
};                       /* 24 B */
```

After (when `ST_USE_SWISS_BINS` is on, which is the default):

```c
struct st_table_entry {
    st_data_t  key;     /* 8 B */
    st_data_t  record;  /* 8 B */
};                       /* 16 B */
```

The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit
per 8-byte word, four entries per cache line). All access goes through
`ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles
unchanged when the feature is disabled. The `set_table` variant keeps
its original inline-hash layout — it already uses 16 B entries and is
not part of `st_table`'s ABI footprint.

### 3. The new hash function (the part that needed the most care)

Storing only the low 32 bits of the hash means we can no longer read
H2 from the *top* 7 bits of the original `unsigned long` hash, which is
what a textbook Swiss design does. Naively recomputing the full 64-bit
hash from the key on every rebuild / rehash / `st_shift` /
`st_general_foreach` works for correctness but kills insert and rebuild
performance — equality+hash for strings is not free.

The fix is to derive H2 from a band of the *truncated* 32-bit hash,
disjoint from the band that picks the bin:

```c
/* bin index: low `bin_power` bits, masked by `bins_mask(tab)`     */
hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); }

/* H2: bits 25..31 of the same 32-bit hash, never overlaps with     */
/* the bin index because bin_power is capped well under 25 in       */
/* practice.                                                        */
static inline unsigned char
st_swiss_h2(st_hash_t hash) {
    return (unsigned char)((hash >> 25) & 0x7f);
}
```

This makes the stored `uint32_t` self-sufficient: every probe reads
both the bin index and the H2 byte from the same word, no `do_hash()`
call required, and rebuild/rehash/shift/foreach all use
`ST_HASH_AT_IDX(tab, i)` instead of recomputing.

Two additional details that fall out of the truncation:

1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the
   tombstone marker for the 32-bit hash slot) never collides with a
   real hash value — if the truncation lands on the reserved value we
   bump it. The 64-bit reservation is preserved on platforms that
   compile without `ST_USE_SWISS_BINS`.
2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the
   table as a parameter so they can read/write the parallel hash slot.

### 4. Prefetch on H2 match

When SWAR finds a candidate H2 match in a control group, the next
operation is loading the matching `st_table_entry` and the parallel
hash word — both of which are in cold cache lines on a miss. We issue
`__builtin_prefetch` on both immediately after the match is detected
in `find_table_entry_ind` / `find_table_bin_ind` /
`find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides
a meaningful chunk of the L2 latency that the SWAR fast-filter would
otherwise expose.

### 5. What we tried but didn't keep

* **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the
  vector-to-GPR transfer for the match mask cost more than the entire
  SWAR sequence; on x86_64 the win, if any, was within noise. SWAR is
  the only shipped backend.
* **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep
  H2 in the high bits. Correct but cost ~5–10 % on insert workloads;
  superseded by the bit-band trick above.

---

## Results vs `master`

Both binaries built from the same tree (`master = 42b3cdc51a`,
`swiss = 3c0446847f`), same compiler, same flags. Each script run 5×
with `--disable-gems`, best-of-N reported. Memory is `maximum resident
set size` from `/usr/bin/time -l` on macOS arm64 (M-class).

### Throughput (lower wall time = better)

| benchmark         | master (s) | swiss (s) | speedup |
| ----------------- | ---------: | --------: | ------: |
| aref_int_large    |     0.8352 |    0.6862 | **+17.8 %** |
| aref_str_large    |     0.9915 |    0.8406 | **+15.2 %** |
| aref_miss_large   |     1.0803 |    0.7896 | **+26.9 %** |
| aref_mix_50       |     1.0337 |    0.8201 | **+20.7 %** |
| insert_grow       |     0.1138 |    0.1105 |    +2.9 % |
| churn (mixed RW)  |     0.0321 |    0.0304 |    +5.5 % |
| iterate           |     0.0566 |    0.0565 |   ±0.1 %  |

Lookups are the headline win — both successful (`+15 % … +20 %`) and
missing (`+27 %`), the latter because a missing key now short-circuits
on the first SWAR group with no entry/bin loads at all. Inserts and
churn are modestly faster because rebuilds no longer call `do_hash`.
Iteration is unaffected (it never touched bins or ctrl).

### Memory

Process RSS for the benchmark workloads (same runs):

| benchmark         | master (MB) | swiss (MB) | delta |
| ----------------- | ----------: | ---------: | ----: |
| insert_grow       |       66.62 |      60.44 | **−9.3 %** |
| aref_str_large    |       15.27 |      15.23 |   −0.3 % |
| aref_mix_50       |       16.64 |      16.61 |   −0.2 % |
| churn             |       13.19 |      13.03 |   −1.2 % |

Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes):

| workload                                 | master    | swiss     | delta       |
| ---------------------------------------- | --------: | --------: | ----------: |
| 2 000 hashes × 200 entries               | 14.66 MB  | 12.11 MB  | **−17.4 %** |
| 1 hash × 100 000 entries                 |  4.19 MB  |  3.28 MB  | **−21.9 %** |

The two memory views agree: any workload that holds a meaningful
number of entries live (whether one big hash or many small ones) sees
double-digit shrinkage from the 24 B → 16 B entry plus the 1 B
`ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot)
cost less than the 8 B saved per slot in `entries[]`. Per-table
overhead at fewer than ~64 entries is roughly flat; the Swiss path
itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64).

---Files--------------------------------
patch.diff (78 KB)

-- 
https://bugs.ruby-lang.org/