Issue #22011 has been reported by dsh0416 (Delton Ding). ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. --- ## Design ### 1. Three-array layout | array | width | role | | ----------- | --------------------------- | -------------------------------------------------------------------------- | | `entries[]` | 16 B (was 24 B) | insertion-order log of `(key, record)` | | `hashes[]` | 4 B per slot | parallel array of truncated 32-bit hashes (also encodes the tombstone) | | `bins[]` | adaptive 1 / 2 / 4 / 8 B | hash-indexed array of indices into `entries[]` | | `ctrl[]` | 1 B per slot | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)` | `entries[]` and `hashes[]` are the same length and addressed by the same index, so iteration and slot reuse stay trivial. `ctrl[]` is the fast-reject filter and lives alongside `bins[]`; only when a `ctrl` byte matches H2 do we load the (now smaller) entry and the parallel hash to confirm the match. SWAR is done on `uint64_t` reads of `ctrl[]` so the per-group cost is a handful of bitwise ops with no SIMD register transfer (which on Apple Silicon was not a win — see "What we tried but didn't keep" below). ### 2. Compact `st_table_entry` Before: ```c struct st_table_entry { st_hash_t hash; /* 8 B */ st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 24 B */ ``` After (when `ST_USE_SWISS_BINS` is on, which is the default): ```c struct st_table_entry { st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 16 B */ ``` The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit per 8-byte word, four entries per cache line). All access goes through `ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles unchanged when the feature is disabled. The `set_table` variant keeps its original inline-hash layout — it already uses 16 B entries and is not part of `st_table`'s ABI footprint. ### 3. The new hash function (the part that needed the most care) Storing only the low 32 bits of the hash means we can no longer read H2 from the *top* 7 bits of the original `unsigned long` hash, which is what a textbook Swiss design does. Naively recomputing the full 64-bit hash from the key on every rebuild / rehash / `st_shift` / `st_general_foreach` works for correctness but kills insert and rebuild performance — equality+hash for strings is not free. The fix is to derive H2 from a band of the *truncated* 32-bit hash, disjoint from the band that picks the bin: ```c /* bin index: low `bin_power` bits, masked by `bins_mask(tab)` */ hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); } /* H2: bits 25..31 of the same 32-bit hash, never overlaps with */ /* the bin index because bin_power is capped well under 25 in */ /* practice. */ static inline unsigned char st_swiss_h2(st_hash_t hash) { return (unsigned char)((hash >> 25) & 0x7f); } ``` This makes the stored `uint32_t` self-sufficient: every probe reads both the bin index and the H2 byte from the same word, no `do_hash()` call required, and rebuild/rehash/shift/foreach all use `ST_HASH_AT_IDX(tab, i)` instead of recomputing. Two additional details that fall out of the truncation: 1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the tombstone marker for the 32-bit hash slot) never collides with a real hash value — if the truncation lands on the reserved value we bump it. The 64-bit reservation is preserved on platforms that compile without `ST_USE_SWISS_BINS`. 2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the table as a parameter so they can read/write the parallel hash slot. ### 4. Prefetch on H2 match When SWAR finds a candidate H2 match in a control group, the next operation is loading the matching `st_table_entry` and the parallel hash word — both of which are in cold cache lines on a miss. We issue `__builtin_prefetch` on both immediately after the match is detected in `find_table_entry_ind` / `find_table_bin_ind` / `find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides a meaningful chunk of the L2 latency that the SWAR fast-filter would otherwise expose. ### 5. What we tried but didn't keep * **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the vector-to-GPR transfer for the match mask cost more than the entire SWAR sequence; on x86_64 the win, if any, was within noise. SWAR is the only shipped backend. * **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep H2 in the high bits. Correct but cost ~5–10 % on insert workloads; superseded by the bit-band trick above. --- ## Results vs `master` Both binaries built from the same tree (`master = 42b3cdc51a`, `swiss = 3c0446847f`), same compiler, same flags. Each script run 5× with `--disable-gems`, best-of-N reported. Memory is `maximum resident set size` from `/usr/bin/time -l` on macOS arm64 (M-class). ### Throughput (lower wall time = better) | benchmark | master (s) | swiss (s) | speedup | | ----------------- | ---------: | --------: | ------: | | aref_int_large | 0.8352 | 0.6862 | **+17.8 %** | | aref_str_large | 0.9915 | 0.8406 | **+15.2 %** | | aref_miss_large | 1.0803 | 0.7896 | **+26.9 %** | | aref_mix_50 | 1.0337 | 0.8201 | **+20.7 %** | | insert_grow | 0.1138 | 0.1105 | +2.9 % | | churn (mixed RW) | 0.0321 | 0.0304 | +5.5 % | | iterate | 0.0566 | 0.0565 | ±0.1 % | Lookups are the headline win — both successful (`+15 % … +20 %`) and missing (`+27 %`), the latter because a missing key now short-circuits on the first SWAR group with no entry/bin loads at all. Inserts and churn are modestly faster because rebuilds no longer call `do_hash`. Iteration is unaffected (it never touched bins or ctrl). ### Memory Process RSS for the benchmark workloads (same runs): | benchmark | master (MB) | swiss (MB) | delta | | ----------------- | ----------: | ---------: | ----: | | insert_grow | 66.62 | 60.44 | **−9.3 %** | | aref_str_large | 15.27 | 15.23 | −0.3 % | | aref_mix_50 | 16.64 | 16.61 | −0.2 % | | churn | 13.19 | 13.03 | −1.2 % | Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes): | workload | master | swiss | delta | | ---------------------------------------- | --------: | --------: | ----------: | | 2 000 hashes × 200 entries | 14.66 MB | 12.11 MB | **−17.4 %** | | 1 hash × 100 000 entries | 4.19 MB | 3.28 MB | **−21.9 %** | The two memory views agree: any workload that holds a meaningful number of entries live (whether one big hash or many small ones) sees double-digit shrinkage from the 24 B → 16 B entry plus the 1 B `ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot) cost less than the 8 B saved per slot in `entries[]`. Per-table overhead at fewer than ~64 entries is roughly flat; the Swiss path itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64). ---Files-------------------------------- patch.diff (78 KB) -- https://bugs.ruby-lang.org/