Issue #22011 has been updated by dsh0416 (Delton Ding). From the bench results, we could see some very positive signals in some benches, but it also introduces some regressions, I would try if I could further narrowing down the regressions. ---------------------------------------- Feature #22011: Hash tables with swiss table https://bugs.ruby-lang.org/issues/22011#change-117108 * Author: dsh0416 (Delton Ding) * Status: Open ---------------------------------------- This change adds a Swiss-table-inspired probing layer to Ruby's core `st_table`, and shrinks `st_table_entry` from 24 B to 16 B by moving the stored hash into a parallel array. It is built and enabled by default; `--disable-swiss-st` reverts to the original `st.c`. The public ABI of `struct st_table` and the iteration-order guarantee are preserved. ## Motivation Hashes are everywhere in Ruby — instance-variable tables, ivar shapes, constant tables, JSON/HTTP/AR rows, every `params`, every `to_h`. Profiles of Rails-shaped workloads spend a meaningful fraction of CPU inside `st_lookup` and `st_insert`. Two pain points stood out in the upstream implementation: 1. **Probe loops are branch-heavy.** Every step of the perturb chain loads a bin, fetches the entry it points to, compares the full `st_hash_t` (8 B) and only then calls `eql?`. On a miss that is several dependent loads per probe with no way to fast-reject groups of slots in parallel. 2. **`st_table_entry` is 24 B.** The `(hash, key, record)` triple gets one cache line per ~2.5 entries. Iteration and equality scans burn through L1 quickly, and Ruby programs typically hold a *lot* of small-to-mid-sized hashes (so per-table overhead matters). The Swiss-table family of designs (Abseil `flat_hash_map`, Rust `hashbrown`) addresses (1) with a 1-byte-per-slot **control array** that lets a single SIMD/SWAR comparison reject or short-list 8 slots at once. We borrow that idea but keep Ruby's two-array layout (so we don't break ABI or insertion order) and add a third, parallel `ctrl[]` byte array. We then attack (2) by also extracting the hash field out of `st_table_entry` into its own parallel `uint32_t hashes[]` array. ## Changes 1. Adds `ST_USE_SWISS_BINS`, enabled under `RUBY_EXPORT`, disabled otherwise so `parser_st.c` keeps the old-compatible layout. 2. Removes the stored hash field from st_table_entry in the Swiss path and stores hashes in a side array after entries, using 32-bit stored hashes. 3. Introduces packed Swiss bin groups: 8 control bytes plus 8 bin indexes per group. 4. Adds Swiss probing via control-byte matching, using 7-bit h2 fingerprints for candidate filtering. 5. Packs bins after entries when possible, reducing separate allocation pressure. 6. Updates allocation, copy, free, memsize, rebuild, rehash, insert, delete, shift, lookup, foreach, keys, and values paths to go through hash/bin abstraction macros. 7. Adjusts rebuild thresholds for Swiss load factor, using roughly 7/8 bin occupancy. 8. Adds tombstone-triggered rebuild behavior when many deleted slots accumulate. ## Analysis Earlier preliminarily analysis shows HUGE improvements on microbenchmarks after adopting ideas from swiss tables, but it is not quite true in the real-world benches. Swiss-table-style probing can look huge in isolated lookup/insert microbenchmarks because the hot loop is tight, the table stays cache-hot, and control-byte filtering avoids many full entry/key comparisons. In real Ruby workloads, the hash table operations are mixed with object allocation, method dispatch, GC barriers, string/hash callbacks, comparisons, branchy VM work, etc. That weakens the “everything is in L1” assumption. If the old table is kept at < 0.5 load factor, collision chains/probe lengths are already short, so a simpler open-addressing scheme can be very competitive because it has smaller code and fewer moving parts. The larger practical win of the Swiss-ish design is probably memory density: sustaining a much higher load factor, around 7/8, without making probe behavior collapse. With this new method, we could increase the load factor to about 7/8 with competitive speed. This saves us more memory then the introduced `ctrl[]`. ---Files-------------------------------- output_001.txt (5.27 KB) output_002.txt (5.29 KB) memory_top_rss_changes.png (111 KB) time_by_test_violin_interp.png (613 KB) time_by_test_violin_yjit.png (662 KB) changes.patch (43.2 KB) -- https://bugs.ruby-lang.org/