[ruby-core:125336] [Ruby Feature#22011] Hash tables with swiss table
Issue #22011 has been reported by dsh0416 (Delton Ding). ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. --- ## Design ### 1. Three-array layout | array | width | role | | ----------- | --------------------------- | -------------------------------------------------------------------------- | | `entries[]` | 16 B (was 24 B) | insertion-order log of `(key, record)` | | `hashes[]` | 4 B per slot | parallel array of truncated 32-bit hashes (also encodes the tombstone) | | `bins[]` | adaptive 1 / 2 / 4 / 8 B | hash-indexed array of indices into `entries[]` | | `ctrl[]` | 1 B per slot | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)` | `entries[]` and `hashes[]` are the same length and addressed by the same index, so iteration and slot reuse stay trivial. `ctrl[]` is the fast-reject filter and lives alongside `bins[]`; only when a `ctrl` byte matches H2 do we load the (now smaller) entry and the parallel hash to confirm the match. SWAR is done on `uint64_t` reads of `ctrl[]` so the per-group cost is a handful of bitwise ops with no SIMD register transfer (which on Apple Silicon was not a win — see "What we tried but didn't keep" below). ### 2. Compact `st_table_entry` Before: ```c struct st_table_entry { st_hash_t hash; /* 8 B */ st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 24 B */ ``` After (when `ST_USE_SWISS_BINS` is on, which is the default): ```c struct st_table_entry { st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 16 B */ ``` The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit per 8-byte word, four entries per cache line). All access goes through `ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles unchanged when the feature is disabled. The `set_table` variant keeps its original inline-hash layout — it already uses 16 B entries and is not part of `st_table`'s ABI footprint. ### 3. The new hash function (the part that needed the most care) Storing only the low 32 bits of the hash means we can no longer read H2 from the *top* 7 bits of the original `unsigned long` hash, which is what a textbook Swiss design does. Naively recomputing the full 64-bit hash from the key on every rebuild / rehash / `st_shift` / `st_general_foreach` works for correctness but kills insert and rebuild performance — equality+hash for strings is not free. The fix is to derive H2 from a band of the *truncated* 32-bit hash, disjoint from the band that picks the bin: ```c /* bin index: low `bin_power` bits, masked by `bins_mask(tab)` */ hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); } /* H2: bits 25..31 of the same 32-bit hash, never overlaps with */ /* the bin index because bin_power is capped well under 25 in */ /* practice. */ static inline unsigned char st_swiss_h2(st_hash_t hash) { return (unsigned char)((hash >> 25) & 0x7f); } ``` This makes the stored `uint32_t` self-sufficient: every probe reads both the bin index and the H2 byte from the same word, no `do_hash()` call required, and rebuild/rehash/shift/foreach all use `ST_HASH_AT_IDX(tab, i)` instead of recomputing. Two additional details that fall out of the truncation: 1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the tombstone marker for the 32-bit hash slot) never collides with a real hash value — if the truncation lands on the reserved value we bump it. The 64-bit reservation is preserved on platforms that compile without `ST_USE_SWISS_BINS`. 2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the table as a parameter so they can read/write the parallel hash slot. ### 4. Prefetch on H2 match When SWAR finds a candidate H2 match in a control group, the next operation is loading the matching `st_table_entry` and the parallel hash word — both of which are in cold cache lines on a miss. We issue `__builtin_prefetch` on both immediately after the match is detected in `find_table_entry_ind` / `find_table_bin_ind` / `find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides a meaningful chunk of the L2 latency that the SWAR fast-filter would otherwise expose. ### 5. What we tried but didn't keep * **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the vector-to-GPR transfer for the match mask cost more than the entire SWAR sequence; on x86_64 the win, if any, was within noise. SWAR is the only shipped backend. * **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep H2 in the high bits. Correct but cost ~5–10 % on insert workloads; superseded by the bit-band trick above. --- ## Results vs `master` Both binaries built from the same tree (`master = 42b3cdc51a`, `swiss = 3c0446847f`), same compiler, same flags. Each script run 5× with `--disable-gems`, best-of-N reported. Memory is `maximum resident set size` from `/usr/bin/time -l` on macOS arm64 (M-class). ### Throughput (lower wall time = better) | benchmark | master (s) | swiss (s) | speedup | | ----------------- | ---------: | --------: | ------: | | aref_int_large | 0.8352 | 0.6862 | **+17.8 %** | | aref_str_large | 0.9915 | 0.8406 | **+15.2 %** | | aref_miss_large | 1.0803 | 0.7896 | **+26.9 %** | | aref_mix_50 | 1.0337 | 0.8201 | **+20.7 %** | | insert_grow | 0.1138 | 0.1105 | +2.9 % | | churn (mixed RW) | 0.0321 | 0.0304 | +5.5 % | | iterate | 0.0566 | 0.0565 | ±0.1 % | Lookups are the headline win — both successful (`+15 % … +20 %`) and missing (`+27 %`), the latter because a missing key now short-circuits on the first SWAR group with no entry/bin loads at all. Inserts and churn are modestly faster because rebuilds no longer call `do_hash`. Iteration is unaffected (it never touched bins or ctrl). ### Memory Process RSS for the benchmark workloads (same runs): | benchmark | master (MB) | swiss (MB) | delta | | ----------------- | ----------: | ---------: | ----: | | insert_grow | 66.62 | 60.44 | **−9.3 %** | | aref_str_large | 15.27 | 15.23 | −0.3 % | | aref_mix_50 | 16.64 | 16.61 | −0.2 % | | churn | 13.19 | 13.03 | −1.2 % | Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes): | workload | master | swiss | delta | | ---------------------------------------- | --------: | --------: | ----------: | | 2 000 hashes × 200 entries | 14.66 MB | 12.11 MB | **−17.4 %** | | 1 hash × 100 000 entries | 4.19 MB | 3.28 MB | **−21.9 %** | The two memory views agree: any workload that holds a meaningful number of entries live (whether one big hash or many small ones) sees double-digit shrinkage from the 24 B → 16 B entry plus the 1 B `ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot) cost less than the 8 B saved per slot in `entries[]`. Per-table overhead at fewer than ~64 entries is roughly flat; the Swiss path itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64). ---Files-------------------------------- patch.diff (78 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by dsh0416 (Delton Ding). File swiss-hash.patch added File deleted (patch.diff) fix some regression bugs ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117091 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. --- ## Design ### 1. Three-array layout | array | width | role | | ----------- | --------------------------- | -------------------------------------------------------------------------- | | `entries[]` | 16 B (was 24 B) | insertion-order log of `(key, record)` | | `hashes[]` | 4 B per slot | parallel array of truncated 32-bit hashes (also encodes the tombstone) | | `bins[]` | adaptive 1 / 2 / 4 / 8 B | hash-indexed array of indices into `entries[]` | | `ctrl[]` | 1 B per slot | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)` | `entries[]` and `hashes[]` are the same length and addressed by the same index, so iteration and slot reuse stay trivial. `ctrl[]` is the fast-reject filter and lives alongside `bins[]`; only when a `ctrl` byte matches H2 do we load the (now smaller) entry and the parallel hash to confirm the match. SWAR is done on `uint64_t` reads of `ctrl[]` so the per-group cost is a handful of bitwise ops with no SIMD register transfer (see "What we tried but didn't keep" below). ### 2. Compact `st_table_entry` Before: ```c struct st_table_entry { st_hash_t hash; /* 8 B */ st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 24 B */ ``` After (when `ST_USE_SWISS_BINS` is on, which is the default): ```c struct st_table_entry { st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 16 B */ ``` The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit per 8-byte word, four entries per cache line). All access goes through `ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles unchanged when the feature is disabled. The `set_table` variant keeps its original inline-hash layout — it already uses 16 B entries and is not part of `st_table`'s ABI footprint. ### 3. The new hash function Storing only the low 32 bits of the hash means we can no longer read H2 from the *top* 7 bits of the original `unsigned long` hash, which is what a textbook Swiss design does. Naively recomputing the full 64-bit hash from the key on every rebuild / rehash / `st_shift` / `st_general_foreach` works for correctness but kills insert and rebuild performance — equality+hash for strings is not free. The fix is to derive H2 from a band of the *truncated* 32-bit hash, disjoint from the band that picks the bin: ```c /* bin index: low `bin_power` bits, masked by `bins_mask(tab)` */ hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); } /* H2: bits 25..31 of the same 32-bit hash, never overlaps with */ /* the bin index because bin_power is capped well under 25 in */ /* practice. */ static inline unsigned char st_swiss_h2(st_hash_t hash) { return (unsigned char)((hash >> 25) & 0x7f); } ``` This makes the stored `uint32_t` self-sufficient: every probe reads both the bin index and the H2 byte from the same word, no `do_hash()` call required, and rebuild/rehash/shift/foreach all use `ST_HASH_AT_IDX(tab, i)` instead of recomputing. Two additional details that fall out of the truncation: 1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the tombstone marker for the 32-bit hash slot) never collides with a real hash value — if the truncation lands on the reserved value we bump it. The 64-bit reservation is preserved on platforms that compile without `ST_USE_SWISS_BINS`. 2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the table as a parameter so they can read/write the parallel hash slot. ### 4. Prefetch on H2 match When SWAR finds a candidate H2 match in a control group, the next operation is loading the matching `st_table_entry` and the parallel hash word — both of which are in cold cache lines on a miss. We issue `__builtin_prefetch` on both immediately after the match is detected in `find_table_entry_ind` / `find_table_bin_ind` / `find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides a meaningful chunk of the L2 latency that the SWAR fast-filter would otherwise expose. ### 5. What we tried but didn't keep * **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the vector-to-GPR transfer for the match mask cost more than the entire SWAR sequence; on x86_64 the benefit, if any, was within noise. SWAR is the only shipped backend. * **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep H2 in the high bits. Correct but cost ~5–10 % on insert workloads; superseded by the bit-band trick above. --- ## Results vs `master` Both binaries built from the same tree (`master = 42b3cdc51a`, `swiss = 3c0446847f`), same compiler, same flags. Each script run 5× with `--disable-gems`, best-of-N reported. Memory is `maximum resident set size` from `/usr/bin/time -l` on macOS arm64 (M-class). ### Throughput (lower wall time = better) | benchmark | master (s) | swiss (s) | speedup | | ----------------- | ---------: | --------: | ------: | | aref_int_large | 0.8352 | 0.6862 | **+17.8 %** | | aref_str_large | 0.9915 | 0.8406 | **+15.2 %** | | aref_miss_large | 1.0803 | 0.7896 | **+26.9 %** | | aref_mix_50 | 1.0337 | 0.8201 | **+20.7 %** | | insert_grow | 0.1138 | 0.1105 | +2.9 % | | churn (mixed RW) | 0.0321 | 0.0304 | +5.5 % | | iterate | 0.0566 | 0.0565 | ±0.1 % | Lookups are the headline win — both successful (`+15 % … +20 %`) and missing (`+27 %`), the latter because a missing key now short-circuits on the first SWAR group with no entry/bin loads at all. Inserts and churn are modestly faster because rebuilds no longer call `do_hash`. Iteration is unaffected (it never touched bins or ctrl). ### Memory Process RSS for the benchmark workloads (same runs): | benchmark | master (MB) | swiss (MB) | delta | | ----------------- | ----------: | ---------: | ----: | | insert_grow | 66.62 | 60.44 | **−9.3 %** | | aref_str_large | 15.27 | 15.23 | −0.3 % | | aref_mix_50 | 16.64 | 16.61 | −0.2 % | | churn | 13.19 | 13.03 | −1.2 % | Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes): | workload | master | swiss | delta | | ---------------------------------------- | --------: | --------: | ----------: | | 2 000 hashes × 200 entries | 14.66 MB | 12.11 MB | **−17.4 %** | | 1 hash × 100 000 entries | 4.19 MB | 3.28 MB | **−21.9 %** | The two memory views agree: any workload that holds a meaningful number of entries live (whether one big hash or many small ones) sees double-digit shrinkage from the 24 B → 16 B entry plus the 1 B `ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot) cost less than the 8 B saved per slot in `entries[]`. Per-table overhead at fewer than ~64 entries is roughly flat; the Swiss path itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64). More benches are coming soon to make sure the patch really works with no regression, especially in real world cases. ---Files-------------------------------- swiss-hash.patch (60.1 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by dsh0416 (Delton Ding). File output_001.txt added File output_002.txt added add ruby bench results ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117092 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. --- ## Design ### 1. Three-array layout | array | width | role | | ----------- | --------------------------- | -------------------------------------------------------------------------- | | `entries[]` | 16 B (was 24 B) | insertion-order log of `(key, record)` | | `hashes[]` | 4 B per slot | parallel array of truncated 32-bit hashes (also encodes the tombstone) | | `bins[]` | adaptive 1 / 2 / 4 / 8 B | hash-indexed array of indices into `entries[]` | | `ctrl[]` | 1 B per slot | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)` | `entries[]` and `hashes[]` are the same length and addressed by the same index, so iteration and slot reuse stay trivial. `ctrl[]` is the fast-reject filter and lives alongside `bins[]`; only when a `ctrl` byte matches H2 do we load the (now smaller) entry and the parallel hash to confirm the match. SWAR is done on `uint64_t` reads of `ctrl[]` so the per-group cost is a handful of bitwise ops with no SIMD register transfer (see "What we tried but didn't keep" below). ### 2. Compact `st_table_entry` Before: ```c struct st_table_entry { st_hash_t hash; /* 8 B */ st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 24 B */ ``` After (when `ST_USE_SWISS_BINS` is on, which is the default): ```c struct st_table_entry { st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 16 B */ ``` The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit per 8-byte word, four entries per cache line). All access goes through `ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles unchanged when the feature is disabled. The `set_table` variant keeps its original inline-hash layout — it already uses 16 B entries and is not part of `st_table`'s ABI footprint. ### 3. The new hash function Storing only the low 32 bits of the hash means we can no longer read H2 from the *top* 7 bits of the original `unsigned long` hash, which is what a textbook Swiss design does. Naively recomputing the full 64-bit hash from the key on every rebuild / rehash / `st_shift` / `st_general_foreach` works for correctness but kills insert and rebuild performance — equality+hash for strings is not free. The fix is to derive H2 from a band of the *truncated* 32-bit hash, disjoint from the band that picks the bin: ```c /* bin index: low `bin_power` bits, masked by `bins_mask(tab)` */ hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); } /* H2: bits 25..31 of the same 32-bit hash, never overlaps with */ /* the bin index because bin_power is capped well under 25 in */ /* practice. */ static inline unsigned char st_swiss_h2(st_hash_t hash) { return (unsigned char)((hash >> 25) & 0x7f); } ``` This makes the stored `uint32_t` self-sufficient: every probe reads both the bin index and the H2 byte from the same word, no `do_hash()` call required, and rebuild/rehash/shift/foreach all use `ST_HASH_AT_IDX(tab, i)` instead of recomputing. Two additional details that fall out of the truncation: 1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the tombstone marker for the 32-bit hash slot) never collides with a real hash value — if the truncation lands on the reserved value we bump it. The 64-bit reservation is preserved on platforms that compile without `ST_USE_SWISS_BINS`. 2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the table as a parameter so they can read/write the parallel hash slot. ### 4. Prefetch on H2 match When SWAR finds a candidate H2 match in a control group, the next operation is loading the matching `st_table_entry` and the parallel hash word — both of which are in cold cache lines on a miss. We issue `__builtin_prefetch` on both immediately after the match is detected in `find_table_entry_ind` / `find_table_bin_ind` / `find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides a meaningful chunk of the L2 latency that the SWAR fast-filter would otherwise expose. ### 5. What we tried but didn't keep * **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the vector-to-GPR transfer for the match mask cost more than the entire SWAR sequence; on x86_64 the benefit, if any, was within noise. SWAR is the only shipped backend. * **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep H2 in the high bits. Correct but cost ~5–10 % on insert workloads; superseded by the bit-band trick above. --- ## Results vs `master` Both binaries built from the same tree (`master = 42b3cdc51a`, `swiss = 3c0446847f`), same compiler, same flags. Each script run 5× with `--disable-gems`, best-of-N reported. Memory is `maximum resident set size` from `/usr/bin/time -l` on macOS arm64 (M-class). ### Throughput (lower wall time = better) | benchmark | master (s) | swiss (s) | speedup | | ----------------- | ---------: | --------: | ------: | | aref_int_large | 0.8352 | 0.6862 | **+17.8 %** | | aref_str_large | 0.9915 | 0.8406 | **+15.2 %** | | aref_miss_large | 1.0803 | 0.7896 | **+26.9 %** | | aref_mix_50 | 1.0337 | 0.8201 | **+20.7 %** | | insert_grow | 0.1138 | 0.1105 | +2.9 % | | churn (mixed RW) | 0.0321 | 0.0304 | +5.5 % | | iterate | 0.0566 | 0.0565 | ±0.1 % | Lookups are the headline win — both successful (`+15 % … +20 %`) and missing (`+27 %`), the latter because a missing key now short-circuits on the first SWAR group with no entry/bin loads at all. Inserts and churn are modestly faster because rebuilds no longer call `do_hash`. Iteration is unaffected (it never touched bins or ctrl). ### Memory Process RSS for the benchmark workloads (same runs): | benchmark | master (MB) | swiss (MB) | delta | | ----------------- | ----------: | ---------: | ----: | | insert_grow | 66.62 | 60.44 | **−9.3 %** | | aref_str_large | 15.27 | 15.23 | −0.3 % | | aref_mix_50 | 16.64 | 16.61 | −0.2 % | | churn | 13.19 | 13.03 | −1.2 % | Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes): | workload | master | swiss | delta | | ---------------------------------------- | --------: | --------: | ----------: | | 2 000 hashes × 200 entries | 14.66 MB | 12.11 MB | **−17.4 %** | | 1 hash × 100 000 entries | 4.19 MB | 3.28 MB | **−21.9 %** | The two memory views agree: any workload that holds a meaningful number of entries live (whether one big hash or many small ones) sees double-digit shrinkage from the 24 B → 16 B entry plus the 1 B `ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot) cost less than the 8 B saved per slot in `entries[]`. Per-table overhead at fewer than ~64 entries is roughly flat; the Swiss path itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64). More benches are coming soon to make sure the patch really works with no regression, especially in real world cases. ---Files-------------------------------- swiss-hash.patch (60.1 KB) output_001.txt (5.27 KB) output_002.txt (5.29 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by dsh0416 (Delton Ding). I found some crash cases, that I need to fix it first. Also, I am running the full ruby-bench against the master branch on linux and windows to see if I've missed anything. ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117094 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. --- ## Design ### 1. Three-array layout | array | width | role | | ----------- | --------------------------- | -------------------------------------------------------------------------- | | `entries[]` | 16 B (was 24 B) | insertion-order log of `(key, record)` | | `hashes[]` | 4 B per slot | parallel array of truncated 32-bit hashes (also encodes the tombstone) | | `bins[]` | adaptive 1 / 2 / 4 / 8 B | hash-indexed array of indices into `entries[]` | | `ctrl[]` | 1 B per slot | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)` | `entries[]` and `hashes[]` are the same length and addressed by the same index, so iteration and slot reuse stay trivial. `ctrl[]` is the fast-reject filter and lives alongside `bins[]`; only when a `ctrl` byte matches H2 do we load the (now smaller) entry and the parallel hash to confirm the match. SWAR is done on `uint64_t` reads of `ctrl[]` so the per-group cost is a handful of bitwise ops with no SIMD register transfer (see "What we tried but didn't keep" below). ### 2. Compact `st_table_entry` Before: ```c struct st_table_entry { st_hash_t hash; /* 8 B */ st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 24 B */ ``` After (when `ST_USE_SWISS_BINS` is on, which is the default): ```c struct st_table_entry { st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 16 B */ ``` The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit per 8-byte word, four entries per cache line). All access goes through `ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles unchanged when the feature is disabled. The `set_table` variant keeps its original inline-hash layout — it already uses 16 B entries and is not part of `st_table`'s ABI footprint. ### 3. The new hash function Storing only the low 32 bits of the hash means we can no longer read H2 from the *top* 7 bits of the original `unsigned long` hash, which is what a textbook Swiss design does. Naively recomputing the full 64-bit hash from the key on every rebuild / rehash / `st_shift` / `st_general_foreach` works for correctness but kills insert and rebuild performance — equality+hash for strings is not free. The fix is to derive H2 from a band of the *truncated* 32-bit hash, disjoint from the band that picks the bin: ```c /* bin index: low `bin_power` bits, masked by `bins_mask(tab)` */ hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); } /* H2: bits 25..31 of the same 32-bit hash, never overlaps with */ /* the bin index because bin_power is capped well under 25 in */ /* practice. */ static inline unsigned char st_swiss_h2(st_hash_t hash) { return (unsigned char)((hash >> 25) & 0x7f); } ``` This makes the stored `uint32_t` self-sufficient: every probe reads both the bin index and the H2 byte from the same word, no `do_hash()` call required, and rebuild/rehash/shift/foreach all use `ST_HASH_AT_IDX(tab, i)` instead of recomputing. Two additional details that fall out of the truncation: 1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the tombstone marker for the 32-bit hash slot) never collides with a real hash value — if the truncation lands on the reserved value we bump it. The 64-bit reservation is preserved on platforms that compile without `ST_USE_SWISS_BINS`. 2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the table as a parameter so they can read/write the parallel hash slot. ### 4. Prefetch on H2 match When SWAR finds a candidate H2 match in a control group, the next operation is loading the matching `st_table_entry` and the parallel hash word — both of which are in cold cache lines on a miss. We issue `__builtin_prefetch` on both immediately after the match is detected in `find_table_entry_ind` / `find_table_bin_ind` / `find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides a meaningful chunk of the L2 latency that the SWAR fast-filter would otherwise expose. ### 5. What we tried but didn't keep * **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the vector-to-GPR transfer for the match mask cost more than the entire SWAR sequence; on x86_64 the benefit, if any, was within noise. SWAR is the only shipped backend. * **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep H2 in the high bits. Correct but cost ~5–10 % on insert workloads; superseded by the bit-band trick above. --- ## Results vs `master` Both binaries built from the same tree (`master = 42b3cdc51a`, `swiss = 3c0446847f`), same compiler, same flags. Each script run 5× with `--disable-gems`, best-of-N reported. Memory is `maximum resident set size` from `/usr/bin/time -l` on macOS arm64 (M-class). ### Throughput (lower wall time = better) | benchmark | master (s) | swiss (s) | speedup | | ----------------- | ---------: | --------: | ------: | | aref_int_large | 0.8352 | 0.6862 | **+17.8 %** | | aref_str_large | 0.9915 | 0.8406 | **+15.2 %** | | aref_miss_large | 1.0803 | 0.7896 | **+26.9 %** | | aref_mix_50 | 1.0337 | 0.8201 | **+20.7 %** | | insert_grow | 0.1138 | 0.1105 | +2.9 % | | churn (mixed RW) | 0.0321 | 0.0304 | +5.5 % | | iterate | 0.0566 | 0.0565 | ±0.1 % | Lookups are the headline win — both successful (`+15 % … +20 %`) and missing (`+27 %`), the latter because a missing key now short-circuits on the first SWAR group with no entry/bin loads at all. Inserts and churn are modestly faster because rebuilds no longer call `do_hash`. Iteration is unaffected (it never touched bins or ctrl). ### Memory Process RSS for the benchmark workloads (same runs): | benchmark | master (MB) | swiss (MB) | delta | | ----------------- | ----------: | ---------: | ----: | | insert_grow | 66.62 | 60.44 | **−9.3 %** | | aref_str_large | 15.27 | 15.23 | −0.3 % | | aref_mix_50 | 16.64 | 16.61 | −0.2 % | | churn | 13.19 | 13.03 | −1.2 % | Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes): | workload | master | swiss | delta | | ---------------------------------------- | --------: | --------: | ----------: | | 2 000 hashes × 200 entries | 14.66 MB | 12.11 MB | **−17.4 %** | | 1 hash × 100 000 entries | 4.19 MB | 3.28 MB | **−21.9 %** | The two memory views agree: any workload that holds a meaningful number of entries live (whether one big hash or many small ones) sees double-digit shrinkage from the 24 B → 16 B entry plus the 1 B `ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot) cost less than the 8 B saved per slot in `entries[]`. Per-table overhead at fewer than ~64 entries is roughly flat; the Swiss path itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64). More benches are coming soon to make sure the patch really works with no regression, especially in real world cases. ---Files-------------------------------- output_001.txt (5.27 KB) output_002.txt (5.29 KB) swiss-hash.patch (60.1 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by byroot (Jean Boussier). Just a side note, enlarging `struct st_table` by 16B as some consequence for various objects sizes (e.g. Hash goes from pool 80 to pool 96, but a bunch of other structs embed `struct st_table`). I had a vague plan to collocate the `bins` and `entries` buffers like I did with `set_table` (https://github.com/ruby/ruby/commit/85c52079aa35a1d2e063a5b40eebe91701c8cb9e). If we end up with two more buffers, it becomes more important. ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117097 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. --- ## Design ### 1. Three-array layout | array | width | role | | ----------- | --------------------------- | -------------------------------------------------------------------------- | | `entries[]` | 16 B (was 24 B) | insertion-order log of `(key, record)` | | `hashes[]` | 4 B per slot | parallel array of truncated 32-bit hashes (also encodes the tombstone) | | `bins[]` | adaptive 1 / 2 / 4 / 8 B | hash-indexed array of indices into `entries[]` | | `ctrl[]` | 1 B per slot | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)` | `entries[]` and `hashes[]` are the same length and addressed by the same index, so iteration and slot reuse stay trivial. `ctrl[]` is the fast-reject filter and lives alongside `bins[]`; only when a `ctrl` byte matches H2 do we load the (now smaller) entry and the parallel hash to confirm the match. SWAR is done on `uint64_t` reads of `ctrl[]` so the per-group cost is a handful of bitwise ops with no SIMD register transfer (see "What we tried but didn't keep" below). ### 2. Compact `st_table_entry` Before: ```c struct st_table_entry { st_hash_t hash; /* 8 B */ st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 24 B */ ``` After (when `ST_USE_SWISS_BINS` is on, which is the default): ```c struct st_table_entry { st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 16 B */ ``` The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit per 8-byte word, four entries per cache line). All access goes through `ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles unchanged when the feature is disabled. The `set_table` variant keeps its original inline-hash layout — it already uses 16 B entries and is not part of `st_table`'s ABI footprint. ### 3. The new hash function Storing only the low 32 bits of the hash means we can no longer read H2 from the *top* 7 bits of the original `unsigned long` hash, which is what a textbook Swiss design does. Naively recomputing the full 64-bit hash from the key on every rebuild / rehash / `st_shift` / `st_general_foreach` works for correctness but kills insert and rebuild performance — equality+hash for strings is not free. The fix is to derive H2 from a band of the *truncated* 32-bit hash, disjoint from the band that picks the bin: ```c /* bin index: low `bin_power` bits, masked by `bins_mask(tab)` */ hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); } /* H2: bits 25..31 of the same 32-bit hash, never overlaps with */ /* the bin index because bin_power is capped well under 25 in */ /* practice. */ static inline unsigned char st_swiss_h2(st_hash_t hash) { return (unsigned char)((hash >> 25) & 0x7f); } ``` This makes the stored `uint32_t` self-sufficient: every probe reads both the bin index and the H2 byte from the same word, no `do_hash()` call required, and rebuild/rehash/shift/foreach all use `ST_HASH_AT_IDX(tab, i)` instead of recomputing. Two additional details that fall out of the truncation: 1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the tombstone marker for the 32-bit hash slot) never collides with a real hash value — if the truncation lands on the reserved value we bump it. The 64-bit reservation is preserved on platforms that compile without `ST_USE_SWISS_BINS`. 2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the table as a parameter so they can read/write the parallel hash slot. ### 4. Prefetch on H2 match When SWAR finds a candidate H2 match in a control group, the next operation is loading the matching `st_table_entry` and the parallel hash word — both of which are in cold cache lines on a miss. We issue `__builtin_prefetch` on both immediately after the match is detected in `find_table_entry_ind` / `find_table_bin_ind` / `find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides a meaningful chunk of the L2 latency that the SWAR fast-filter would otherwise expose. ### 5. What we tried but didn't keep * **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the vector-to-GPR transfer for the match mask cost more than the entire SWAR sequence; on x86_64 the benefit, if any, was within noise. SWAR is the only shipped backend. * **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep H2 in the high bits. Correct but cost ~5–10 % on insert workloads; superseded by the bit-band trick above. --- ## Results vs `master` Both binaries built from the same tree (`master = 42b3cdc51a`, `swiss = 3c0446847f`), same compiler, same flags. Each script run 5× with `--disable-gems`, best-of-N reported. Memory is `maximum resident set size` from `/usr/bin/time -l` on macOS arm64 (M-class). ### Throughput (lower wall time = better) | benchmark | master (s) | swiss (s) | speedup | | ----------------- | ---------: | --------: | ------: | | aref_int_large | 0.8352 | 0.6862 | **+17.8 %** | | aref_str_large | 0.9915 | 0.8406 | **+15.2 %** | | aref_miss_large | 1.0803 | 0.7896 | **+26.9 %** | | aref_mix_50 | 1.0337 | 0.8201 | **+20.7 %** | | insert_grow | 0.1138 | 0.1105 | +2.9 % | | churn (mixed RW) | 0.0321 | 0.0304 | +5.5 % | | iterate | 0.0566 | 0.0565 | ±0.1 % | Lookups are the headline win — both successful (`+15 % … +20 %`) and missing (`+27 %`), the latter because a missing key now short-circuits on the first SWAR group with no entry/bin loads at all. Inserts and churn are modestly faster because rebuilds no longer call `do_hash`. Iteration is unaffected (it never touched bins or ctrl). ### Memory Process RSS for the benchmark workloads (same runs): | benchmark | master (MB) | swiss (MB) | delta | | ----------------- | ----------: | ---------: | ----: | | insert_grow | 66.62 | 60.44 | **−9.3 %** | | aref_str_large | 15.27 | 15.23 | −0.3 % | | aref_mix_50 | 16.64 | 16.61 | −0.2 % | | churn | 13.19 | 13.03 | −1.2 % | Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes): | workload | master | swiss | delta | | ---------------------------------------- | --------: | --------: | ----------: | | 2 000 hashes × 200 entries | 14.66 MB | 12.11 MB | **−17.4 %** | | 1 hash × 100 000 entries | 4.19 MB | 3.28 MB | **−21.9 %** | The two memory views agree: any workload that holds a meaningful number of entries live (whether one big hash or many small ones) sees double-digit shrinkage from the 24 B → 16 B entry plus the 1 B `ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot) cost less than the 8 B saved per slot in `entries[]`. Per-table overhead at fewer than ~64 entries is roughly flat; the Swiss path itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64). More benches are coming soon to make sure the patch really works with no regression, especially in real world cases. ---Files-------------------------------- output_001.txt (5.27 KB) output_002.txt (5.29 KB) swiss-hash.patch (60.1 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by dsh0416 (Delton Ding). I see that the benchmarks might be affected, since st_numhash/symbol hash path looks like constant on bits 25..31, which increases the hash collision, under investigating. ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117102 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. --- ## Design ### 1. Three-array layout | array | width | role | | ----------- | --------------------------- | -------------------------------------------------------------------------- | | `entries[]` | 16 B (was 24 B) | insertion-order log of `(key, record)` | | `hashes[]` | 4 B per slot | parallel array of truncated 32-bit hashes (also encodes the tombstone) | | `bins[]` | adaptive 1 / 2 / 4 / 8 B | hash-indexed array of indices into `entries[]` | | `ctrl[]` | 1 B per slot | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)` | `entries[]` and `hashes[]` are the same length and addressed by the same index, so iteration and slot reuse stay trivial. `ctrl[]` is the fast-reject filter and lives alongside `bins[]`; only when a `ctrl` byte matches H2 do we load the (now smaller) entry and the parallel hash to confirm the match. SWAR is done on `uint64_t` reads of `ctrl[]` so the per-group cost is a handful of bitwise ops with no SIMD register transfer (see "What we tried but didn't keep" below). ### 2. Compact `st_table_entry` Before: ```c struct st_table_entry { st_hash_t hash; /* 8 B */ st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 24 B */ ``` After (when `ST_USE_SWISS_BINS` is on, which is the default): ```c struct st_table_entry { st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 16 B */ ``` The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit per 8-byte word, four entries per cache line). All access goes through `ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles unchanged when the feature is disabled. The `set_table` variant keeps its original inline-hash layout — it already uses 16 B entries and is not part of `st_table`'s ABI footprint. ### 3. The new hash function Storing only the low 32 bits of the hash means we can no longer read H2 from the *top* 7 bits of the original `unsigned long` hash, which is what a textbook Swiss design does. Naively recomputing the full 64-bit hash from the key on every rebuild / rehash / `st_shift` / `st_general_foreach` works for correctness but kills insert and rebuild performance — equality+hash for strings is not free. The fix is to derive H2 from a band of the *truncated* 32-bit hash, disjoint from the band that picks the bin: ```c /* bin index: low `bin_power` bits, masked by `bins_mask(tab)` */ hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); } /* H2: bits 25..31 of the same 32-bit hash, never overlaps with */ /* the bin index because bin_power is capped well under 25 in */ /* practice. */ static inline unsigned char st_swiss_h2(st_hash_t hash) { return (unsigned char)((hash >> 25) & 0x7f); } ``` This makes the stored `uint32_t` self-sufficient: every probe reads both the bin index and the H2 byte from the same word, no `do_hash()` call required, and rebuild/rehash/shift/foreach all use `ST_HASH_AT_IDX(tab, i)` instead of recomputing. Two additional details that fall out of the truncation: 1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the tombstone marker for the 32-bit hash slot) never collides with a real hash value — if the truncation lands on the reserved value we bump it. The 64-bit reservation is preserved on platforms that compile without `ST_USE_SWISS_BINS`. 2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the table as a parameter so they can read/write the parallel hash slot. ### 4. Prefetch on H2 match When SWAR finds a candidate H2 match in a control group, the next operation is loading the matching `st_table_entry` and the parallel hash word — both of which are in cold cache lines on a miss. We issue `__builtin_prefetch` on both immediately after the match is detected in `find_table_entry_ind` / `find_table_bin_ind` / `find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides a meaningful chunk of the L2 latency that the SWAR fast-filter would otherwise expose. ### 5. What we tried but didn't keep * **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the vector-to-GPR transfer for the match mask cost more than the entire SWAR sequence; on x86_64 the benefit, if any, was within noise. SWAR is the only shipped backend. * **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep H2 in the high bits. Correct but cost ~5–10 % on insert workloads; superseded by the bit-band trick above. --- ## Results vs `master` Both binaries built from the same tree (`master = 42b3cdc51a`, `swiss = 3c0446847f`), same compiler, same flags. Each script run 5× with `--disable-gems`, best-of-N reported. Memory is `maximum resident set size` from `/usr/bin/time -l` on macOS arm64 (M-class). ### Throughput (lower wall time = better) | benchmark | master (s) | swiss (s) | speedup | | ----------------- | ---------: | --------: | ------: | | aref_int_large | 0.8352 | 0.6862 | **+17.8 %** | | aref_str_large | 0.9915 | 0.8406 | **+15.2 %** | | aref_miss_large | 1.0803 | 0.7896 | **+26.9 %** | | aref_mix_50 | 1.0337 | 0.8201 | **+20.7 %** | | insert_grow | 0.1138 | 0.1105 | +2.9 % | | churn (mixed RW) | 0.0321 | 0.0304 | +5.5 % | | iterate | 0.0566 | 0.0565 | ±0.1 % | Lookups are the headline win — both successful (`+15 % … +20 %`) and missing (`+27 %`), the latter because a missing key now short-circuits on the first SWAR group with no entry/bin loads at all. Inserts and churn are modestly faster because rebuilds no longer call `do_hash`. Iteration is unaffected (it never touched bins or ctrl). ### Memory Process RSS for the benchmark workloads (same runs): | benchmark | master (MB) | swiss (MB) | delta | | ----------------- | ----------: | ---------: | ----: | | insert_grow | 66.62 | 60.44 | **−9.3 %** | | aref_str_large | 15.27 | 15.23 | −0.3 % | | aref_mix_50 | 16.64 | 16.61 | −0.2 % | | churn | 13.19 | 13.03 | −1.2 % | Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes): | workload | master | swiss | delta | | ---------------------------------------- | --------: | --------: | ----------: | | 2 000 hashes × 200 entries | 14.66 MB | 12.11 MB | **−17.4 %** | | 1 hash × 100 000 entries | 4.19 MB | 3.28 MB | **−21.9 %** | The two memory views agree: any workload that holds a meaningful number of entries live (whether one big hash or many small ones) sees double-digit shrinkage from the 24 B → 16 B entry plus the 1 B `ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot) cost less than the 8 B saved per slot in `entries[]`. Per-table overhead at fewer than ~64 entries is roughly flat; the Swiss path itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64). More benches are coming soon to make sure the patch really works with no regression, especially in real world cases. ---Files-------------------------------- output_001.txt (5.27 KB) output_002.txt (5.29 KB) changes.patch (34.5 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by dsh0416 (Delton Ding). File memory_top_rss_changes.png added File time_by_test_violin_interp.png added File time_by_test_violin_yjit.png added Description updated Here are the ruby benches results. | Area | interp | yjit | | --- | ---: | ---: | | Time geomean | `+1.77%` slower | `-0.17%` faster | | Time median benchmark delta | `+1.55%` slower | `+0.02%` flat | | Benchmarks >1% slower | 42 | 15 | | Benchmarks >1% faster | 9 | 15 | | RSS geomean | `-3.66%` lower | `-3.01%` lower | | RSS median benchmark delta | `+0.09%` flat | `+0.09%` flat | Overall: `swss` is a small interpreter-mode slowdown, essentially neutral or faster under YJIT, and generally lower RSS. The RSS median is basically unchanged, so the memory improvement comes from a subset of large reductions rather than broad small wins. Largest time slowdowns: - `yjit/loops-times`: `+12.52%` - `interp/structaset`: `+9.38%` - `interp/fib`: `+8.04%` - `interp/keyword_args`: `+6.76%` - `interp/send_rubyfunc_block`: `+6.15%` Largest time speedups: - `yjit/activerecord`: `-13.38%` - `yjit/fluentd`: `-11.73%` - `yjit/send_cfunc_block`: `-8.96%` - `yjit/nqueens`: `-6.82%` - `yjit/splay`: `-4.47%` Memory is mixed. Very few benches show RSS increases, including `interp/graphql-native` `+20.15%`, `yjit/fluentd` `+19.32%`, and `yjit/graphql-native` `+17.42%`. Biggest RSS reductions are `yjit/str_concat` `-42.45%`, `interp/psych-load` `-39.31%`, and `yjit/psych-load` `-38.07%`, and RSS reductions are more often. ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117104 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. --- ## Design ### 1. Three-array layout | array | width | role | | ----------- | --------------------------- | -------------------------------------------------------------------------- | | `entries[]` | 16 B (was 24 B) | insertion-order log of `(key, record)` | | `hashes[]` | 4 B per slot | parallel array of truncated 32-bit hashes (also encodes the tombstone) | | `bins[]` | adaptive 1 / 2 / 4 / 8 B | hash-indexed array of indices into `entries[]` | | `ctrl[]` | 1 B per slot | `H2` (top 7 bits of hash) or `EMPTY (0xff)` / `DELETED (0xfe)` | `entries[]` and `hashes[]` are the same length and addressed by the same index, so iteration and slot reuse stay trivial. `ctrl[]` is the fast-reject filter and lives alongside `bins[]`; only when a `ctrl` byte matches H2 do we load the (now smaller) entry and the parallel hash to confirm the match. SWAR is done on `uint64_t` reads of `ctrl[]` so the per-group cost is a handful of bitwise ops with no SIMD register transfer (see "What we tried but didn't keep" below). ### 2. Compact `st_table_entry` Before: ```c struct st_table_entry { st_hash_t hash; /* 8 B */ st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 24 B */ ``` After (when `ST_USE_SWISS_BINS` is on, which is the default): ```c struct st_table_entry { st_data_t key; /* 8 B */ st_data_t record; /* 8 B */ }; /* 16 B */ ``` The hash moves into `tab->hashes[i]`, a `uint32_t` (so two slots fit per 8-byte word, four entries per cache line). All access goes through `ST_HASH_AT_PTR` / `ST_HASH_AT_IDX` macros so the same source compiles unchanged when the feature is disabled. The `set_table` variant keeps its original inline-hash layout — it already uses 16 B entries and is not part of `st_table`'s ABI footprint. ### 3. The new hash function Storing only the low 32 bits of the hash means we can no longer read H2 from the *top* 7 bits of the original `unsigned long` hash, which is what a textbook Swiss design does. Naively recomputing the full 64-bit hash from the key on every rebuild / rehash / `st_shift` / `st_general_foreach` works for correctness but kills insert and rebuild performance — equality+hash for strings is not free. The fix is to derive H2 from a band of the *truncated* 32-bit hash, disjoint from the band that picks the bin: ```c /* bin index: low `bin_power` bits, masked by `bins_mask(tab)` */ hash_bin(uint32_t h, st_table *tab) { return h & bins_mask(tab); } /* H2: bits 25..31 of the same 32-bit hash, never overlaps with */ /* the bin index because bin_power is capped well under 25 in */ /* practice. */ static inline unsigned char st_swiss_h2(st_hash_t hash) { return (unsigned char)((hash >> 25) & 0x7f); } ``` This makes the stored `uint32_t` self-sufficient: every probe reads both the bin index and the H2 byte from the same word, no `do_hash()` call required, and rebuild/rehash/shift/foreach all use `ST_HASH_AT_IDX(tab, i)` instead of recomputing. Two additional details that fall out of the truncation: 1. `normalize_hash_value()` is updated so that `0xFFFFFFFF` (the tombstone marker for the 32-bit hash slot) never collides with a real hash value — if the truncation lands on the reserved value we bump it. The 64-bit reservation is preserved on platforms that compile without `ST_USE_SWISS_BINS`. 2. The `MARK_ENTRY_DELETED` / `DELETED_ENTRY_P` macros now take the table as a parameter so they can read/write the parallel hash slot. ### 4. Prefetch on H2 match When SWAR finds a candidate H2 match in a control group, the next operation is loading the matching `st_table_entry` and the parallel hash word — both of which are in cold cache lines on a miss. We issue `__builtin_prefetch` on both immediately after the match is detected in `find_table_entry_ind` / `find_table_bin_ind` / `find_table_bin_ptr_and_reserve`. On lookup-heavy workloads this hides a meaningful chunk of the L2 latency that the SWAR fast-filter would otherwise expose. ### 5. What we tried but didn't keep * **SSE2 / NEON intrinsics for the group scan.** On Apple Silicon the vector-to-GPR transfer for the match mask cost more than the entire SWAR sequence; on x86_64 the benefit, if any, was within noise. SWAR is the only shipped backend. * **Recomputing the full 64-bit hash** in rebuild/rehash paths to keep H2 in the high bits. Correct but cost ~5–10 % on insert workloads; superseded by the bit-band trick above. ## Results vs `master` (microbenches) Both binaries built from the same tree (`master = 42b3cdc51a`, `swiss = 3c0446847f`), same compiler, same flags. Each script run 5× with `--disable-gems`, best-of-N reported. Memory is `maximum resident set size` from `/usr/bin/time -l` on macOS arm64 (M-class). ### Throughput (lower wall time = better) | benchmark | master (s) | swiss (s) | speedup | | ----------------- | ---------: | --------: | ------: | | aref_int_large | 0.8352 | 0.6862 | **+17.8 %** | | aref_str_large | 0.9915 | 0.8406 | **+15.2 %** | | aref_miss_large | 1.0803 | 0.7896 | **+26.9 %** | | aref_mix_50 | 1.0337 | 0.8201 | **+20.7 %** | | insert_grow | 0.1138 | 0.1105 | +2.9 % | | churn (mixed RW) | 0.0321 | 0.0304 | +5.5 % | | iterate | 0.0566 | 0.0565 | ±0.1 % | Lookups are the headline win — both successful (`+15 % … +20 %`) and missing (`+27 %`), the latter because a missing key now short-circuits on the first SWAR group with no entry/bin loads at all. Inserts and churn are modestly faster because rebuilds no longer call `do_hash`. Iteration is unaffected (it never touched bins or ctrl). ### Memory Process RSS for the benchmark workloads (same runs): | benchmark | master (MB) | swiss (MB) | delta | | ----------------- | ----------: | ---------: | ----: | | insert_grow | 66.62 | 60.44 | **−9.3 %** | | aref_str_large | 15.27 | 15.23 | −0.3 % | | aref_mix_50 | 16.64 | 16.61 | −0.2 % | | churn | 13.19 | 13.03 | −1.2 % | Per-table memory (`ObjectSpace.memsize_of`, sum across many hashes): | workload | master | swiss | delta | | ---------------------------------------- | --------: | --------: | ----------: | | 2 000 hashes × 200 entries | 14.66 MB | 12.11 MB | **−17.4 %** | | 1 hash × 100 000 entries | 4.19 MB | 3.28 MB | **−21.9 %** | The two memory views agree: any workload that holds a meaningful number of entries live (whether one big hash or many small ones) sees double-digit shrinkage from the 24 B → 16 B entry plus the 1 B `ctrl[]` / 4 B `hashes[]` pair, because they together (`5 B` per slot) cost less than the 8 B saved per slot in `entries[]`. Per-table overhead at fewer than ~64 entries is roughly flat; the Swiss path itself only kicks in at `entry_power ≥ 6` (table capacity ≥ 64). More benches are coming soon to make sure the patch really works with no regression, especially in real world cases. ---Files-------------------------------- output_001.txt (5.27 KB) output_002.txt (5.29 KB) changes.patch (34.5 KB) memory_top_rss_changes.png (111 KB) time_by_test_violin_interp.png (613 KB) time_by_test_violin_yjit.png (662 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by dsh0416 (Delton Ding). From the bench results, we could see some very positive signals in some benches, but it also introduces some regressions, I would try if I could further narrowing down the regressions. ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117108 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. ## Changes 1. Adds `ST_USE_SWISS_BINS`, enabled under `RUBY_EXPORT`, disabled otherwise so `parser_st.c` keeps the old-compatible layout. 2. Removes the stored hash field from st_table_entry in the Swiss path and stores hashes in a side array after entries, using 32-bit stored hashes. 3. Introduces packed Swiss bin groups: 8 control bytes plus 8 bin indexes per group. 4. Adds Swiss probing via control-byte matching, using 7-bit h2 fingerprints for candidate filtering. 5. Packs bins after entries when possible, reducing separate allocation pressure. 6. Updates allocation, copy, free, memsize, rebuild, rehash, insert, delete, shift, lookup, foreach, keys, and values paths to go through hash/bin abstraction macros. 7. Adjusts rebuild thresholds for Swiss load factor, using roughly 7/8 bin occupancy. 8. Adds tombstone-triggered rebuild behavior when many deleted slots accumulate. ## Analysis Earlier preliminarily analysis shows HUGE improvements on microbenchmarks after adopting ideas from swiss tables, but it is not quite true in the real-world benches. Swiss-table-style probing can look huge in isolated lookup/insert microbenchmarks because the hot loop is tight, the table stays cache-hot, and control-byte filtering avoids many full entry/key comparisons. In real Ruby workloads, the hash table operations are mixed with object allocation, method dispatch, GC barriers, string/hash callbacks, comparisons, branchy VM work, etc. That weakens the “everything is in L1” assumption. If the old table is kept at < 0.5 load factor, collision chains/probe lengths are already short, so a simpler open-addressing scheme can be very competitive because it has smaller code and fewer moving parts. The larger practical win of the Swiss-ish design is probably memory density: sustaining a much higher load factor, around 7/8, without making probe behavior collapse. With this new method, we could increase the load factor to about 7/8 with competitive speed. This saves us more memory then the introduced `ctrl[]`. ---Files-------------------------------- output_001.txt (5.27 KB) output_002.txt (5.29 KB) memory_top_rss_changes.png (111 KB) time_by_test_violin_interp.png (613 KB) time_by_test_violin_yjit.png (662 KB) changes.patch (43.2 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by dsh0416 (Delton Ding). File memory_top_rss_changes.png added File time_by_test_violin_interp.png added File time_by_test_violin_yjit.png added The regression got better controlled with the new patch. ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117111 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. ## Changes 1. Adds `ST_USE_SWISS_BINS`, enabled under `RUBY_EXPORT`, disabled otherwise so `parser_st.c` keeps the old-compatible layout. 2. Removes the stored hash field from st_table_entry in the Swiss path and stores hashes in a side array after entries, using 32-bit stored hashes. 3. Introduces packed Swiss bin groups: 8 control bytes plus 8 bin indexes per group. 4. Adds Swiss probing via control-byte matching, using 7-bit h2 fingerprints for candidate filtering. 5. Packs bins after entries when possible, reducing separate allocation pressure. 6. Updates allocation, copy, free, memsize, rebuild, rehash, insert, delete, shift, lookup, foreach, keys, and values paths to go through hash/bin abstraction macros. 7. Adjusts rebuild thresholds for Swiss load factor, using roughly 7/8 bin occupancy. 8. Adds tombstone-triggered rebuild behavior when many deleted slots accumulate. ## Analysis Earlier preliminarily analysis shows HUGE improvements on microbenchmarks after adopting ideas from swiss tables, but it is not quite true in the real-world benches. Swiss-table-style probing can look huge in isolated lookup/insert microbenchmarks because the hot loop is tight, the table stays cache-hot, and control-byte filtering avoids many full entry/key comparisons. In real Ruby workloads, the hash table operations are mixed with object allocation, method dispatch, GC barriers, string/hash callbacks, comparisons, branchy VM work, etc. That weakens the “everything is in L1” assumption. If the old table is kept at < 0.5 load factor, collision chains/probe lengths are already short, so a simpler open-addressing scheme can be very competitive because it has smaller code and fewer moving parts. The larger practical win of the Swiss-ish design is probably memory density: sustaining a much higher load factor, around 7/8, without making probe behavior collapse. With this new method, we could increase the load factor to about 7/8 with competitive speed. This saves us more memory then the introduced `ctrl[]`. ---Files-------------------------------- output_001.txt (5.27 KB) output_002.txt (5.29 KB) memory_top_rss_changes.png (111 KB) time_by_test_violin_interp.png (613 KB) time_by_test_violin_yjit.png (662 KB) changes.patch (43.2 KB) memory_top_rss_changes.png (112 KB) time_by_test_violin_interp.png (604 KB) time_by_test_violin_yjit.png (652 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by dsh0416 (Delton Ding). File memory_top_rss_changes.png added File time_by_test_violin_interp.png added File time_by_test_violin_yjit.png added File changes.patch added The regression got better controlled with the new patch. ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117112 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. ## Changes 1. Adds `ST_USE_SWISS_BINS`, enabled under `RUBY_EXPORT`, disabled otherwise so `parser_st.c` keeps the old-compatible layout. 2. Removes the stored hash field from st_table_entry in the Swiss path and stores hashes in a side array after entries, using 32-bit stored hashes. 3. Introduces packed Swiss bin groups: 8 control bytes plus 8 bin indexes per group. 4. Adds Swiss probing via control-byte matching, using 7-bit h2 fingerprints for candidate filtering. 5. Packs bins after entries when possible, reducing separate allocation pressure. 6. Updates allocation, copy, free, memsize, rebuild, rehash, insert, delete, shift, lookup, foreach, keys, and values paths to go through hash/bin abstraction macros. 7. Adjusts rebuild thresholds for Swiss load factor, using roughly 7/8 bin occupancy. 8. Adds tombstone-triggered rebuild behavior when many deleted slots accumulate. ## Analysis Earlier preliminarily analysis shows HUGE improvements on microbenchmarks after adopting ideas from swiss tables, but it is not quite true in the real-world benches. Swiss-table-style probing can look huge in isolated lookup/insert microbenchmarks because the hot loop is tight, the table stays cache-hot, and control-byte filtering avoids many full entry/key comparisons. In real Ruby workloads, the hash table operations are mixed with object allocation, method dispatch, GC barriers, string/hash callbacks, comparisons, branchy VM work, etc. That weakens the “everything is in L1” assumption. If the old table is kept at < 0.5 load factor, collision chains/probe lengths are already short, so a simpler open-addressing scheme can be very competitive because it has smaller code and fewer moving parts. The larger practical win of the Swiss-ish design is probably memory density: sustaining a much higher load factor, around 7/8, without making probe behavior collapse. With this new method, we could increase the load factor to about 7/8 with competitive speed. This saves us more memory then the introduced `ctrl[]`. ---Files-------------------------------- output_001.txt (5.27 KB) output_002.txt (5.29 KB) memory_top_rss_changes.png (111 KB) time_by_test_violin_interp.png (613 KB) time_by_test_violin_yjit.png (662 KB) changes.patch (43.2 KB) memory_top_rss_changes.png (112 KB) time_by_test_violin_interp.png (604 KB) time_by_test_violin_yjit.png (652 KB) memory_top_rss_changes.png (112 KB) time_by_test_violin_interp.png (604 KB) time_by_test_violin_yjit.png (652 KB) changes.patch (42.3 KB) -- https://bugs.ruby-lang.org/
Issue #22011 has been updated by dsh0416 (Delton Ding). File clipboard-202604251344-sv2ro.png added <img style="width: 810px;" src="clipboard-202604251344-sv2ro.png"><br> The Student T test shows that the performance is almost identical, but the memory consumption is confidently lowered by about 3%. ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117115 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. ## Changes 1. Adds `ST_USE_SWISS_BINS`, enabled under `RUBY_EXPORT`, disabled otherwise so `parser_st.c` keeps the old-compatible layout. 2. Removes the stored hash field from st_table_entry in the Swiss path and stores hashes in a side array after entries, using 32-bit stored hashes. 3. Introduces packed Swiss bin groups: 8 control bytes plus 8 bin indexes per group. 4. Adds Swiss probing via control-byte matching, using 7-bit h2 fingerprints for candidate filtering. 5. Packs bins after entries when possible, reducing separate allocation pressure. 6. Updates allocation, copy, free, memsize, rebuild, rehash, insert, delete, shift, lookup, foreach, keys, and values paths to go through hash/bin abstraction macros. 7. Adjusts rebuild thresholds for Swiss load factor, using roughly 7/8 bin occupancy. 8. Adds tombstone-triggered rebuild behavior when many deleted slots accumulate. ## Analysis Earlier preliminarily analysis shows HUGE improvements on microbenchmarks after adopting ideas from swiss tables, but it is not quite true in the real-world benches. Swiss-table-style probing can look huge in isolated lookup/insert microbenchmarks because the hot loop is tight, the table stays cache-hot, and control-byte filtering avoids many full entry/key comparisons. In real Ruby workloads, the hash table operations are mixed with object allocation, method dispatch, GC barriers, string/hash callbacks, comparisons, branchy VM work, etc. That weakens the “everything is in L1” assumption. If the old table is kept at < 0.5 load factor, collision chains/probe lengths are already short, so a simpler open-addressing scheme can be very competitive because it has smaller code and fewer moving parts. The larger practical win of the Swiss-ish design is probably memory density: sustaining a much higher load factor, around 7/8, without making probe behavior collapse. With this new method, we could increase the load factor to about 7/8 with competitive speed. This saves us more memory then the introduced `ctrl[]`. ---Files-------------------------------- output_001.txt (5.27 KB) output_002.txt (5.29 KB) memory_top_rss_changes.png (111 KB) time_by_test_violin_interp.png (613 KB) time_by_test_violin_yjit.png (662 KB) changes.patch (43.2 KB) memory_top_rss_changes.png (112 KB) time_by_test_violin_interp.png (604 KB) time_by_test_violin_yjit.png (652 KB) memory_top_rss_changes.png (112 KB) time_by_test_violin_interp.png (604 KB) time_by_test_violin_yjit.png (652 KB) changes.patch (42.3 KB) clipboard-202604251344-sv2ro.png (72.1 KB) -- https://bugs.ruby-lang.org/
participants (2)
-
byroot (Jean Boussier) -
dsh0416 (Delton Ding)