[ruby-core:124599] [Ruby Feature#21846] Add a fast path for GC sweeping
Issue #21846 has been reported by eightbitraptor (Matt V-H).

----------------------------------------
Feature #21846: Add a fast path for GC sweeping
https://bugs.ruby-lang.org/issues/21846

* Author: eightbitraptor (Matt V-H)
* Status: Open
----------------------------------------
[Github PR 15885](https://github.com/ruby/ruby/pull/15885)

## Summary

This proposal adds a fast path through the garbage collector's sweep phase that skips expensive cleanup operations for objects that don't require them. Simple embedded objects without external resources, finalizers, or weak references are added directly to the freelist, resulting in a 20-40% reduction in sweep time and a 1-2% improvement in overall application performance.

## Background

Ruby's garbage collector currently uses a single code path for sweeping all objects. During the sweep phase, every unmarked object goes through `rb_gc_obj_free_vm_weak_references` to clean up entries in the object_id table (`id2ref_tbl`) and the generic fields table, followed by `rb_gc_obj_free` for type-specific cleanup and memory deallocation. This is necessary for objects with external resources, but many objects created are "simple": they have no external allocations, no registered object_id, no generic ivars, and no finalizers.

For these simple objects, the sweep phase does significant work even though there's nothing to clean up. Every object is checked against the `id2ref_tbl` table (even if `ObjectSpace._id2ref` was never called), checked for finalizers, and passed through type-specific free functions that often just return immediately. In applications that allocate many temporary objects, this overhead accumulates significantly.

## Proposal

A new predicate function `gc_sweep_fast_path_p` determines whether an object can skip the expensive cleanup calls. Objects eligible for the fast path are added directly to the freelist after clearing the write barrier unprotected bit.

Two conditions always require the slow path:

1. The object has a finalizer (the `FL_FINALIZE` flag is set)
2. The object has an observed `object_id` stored in the `id2ref_tbl` (checked via `rb_shape_has_object_id`)

If neither condition applies, we move on to some type-specific flag checks. We use the fast path in these cases:

* `T_OBJECT`: Embedded objects only (no heap-allocated instance variable array)
* `T_STRING`: Embedded strings that aren't frozen
* `T_ARRAY`: Embedded arrays
* `T_HASH`: Embedded hashes (i.e. no external `st_table`)
* `T_BIGNUM`: Embedded bignums
* `T_STRUCT`: Embedded structs
* `T_FLOAT`, `T_RATIONAL`, `T_COMPLEX`
* `T_DATA`: TypedData that is embedded with a NULL or `RUBY_DEFAULT_FREE` dfree function
* `T_IMEMO`: Safe subtypes only (constcache, cref, ifunc, memo, svar, throw_data)

For all of the above types we also check whether any generic ivars are registered in the `generic_fields_tbl`; if any are set, we use the slow path to clean them up. All other types use the slow path.

### Compatibility

Currently this patch is disabled for modular GC builds (`BUILDING_MODULAR_GC`) because we need access to the `id2ref_tbl`, and I haven't pushed it through the GC API.

## Evaluation

### Benchmark Results

All benchmarks were run on an Amazon c5.metal bare-metal VM with turbo boost and ASLR disabled, the CPU frequency locked using the performance governor, and the process pinned to a single core using `taskset`.
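The eligibility rules in the proposal above aren't directly observable from Ruby code, but the operations that push an object onto the slow path are ordinary Ruby calls. A hedged sketch (the comments state the classification this proposal would apply; `str`, `ary`, and `hash` are just illustrative locals):

```ruby
# Fast-path candidates at allocation time: embedded, unfrozen,
# no observed object_id, no finalizer, no generic ivars.
str  = +"hello"   # short String: embedded buffer
ary  = [1, 2, 3]  # small Array: embedded elements
hash = { a: 1 }   # small Hash: no external st_table

# Each of these calls would push its receiver onto the slow path:
str.object_id                                 # registers str in id2ref_tbl
ary.instance_variable_set(:@tag, :hot)        # generic ivars on a non-T_OBJECT
ObjectSpace.define_finalizer(hash, proc { })  # sets FL_FINALIZE
```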
GC-specific benchmarks show significant sweep time improvements:

```
Benchmark      │ Wall(B)  Sweep(B)  Mark(B)  │ Wall(E)  Sweep(E)  Mark(E)  │ Wall Δ  Sweep Δ
───────────────┼─────────────────────────────┼─────────────────────────────┼─────────────────
null           │ 0.000s   1ms       4ms      │ 0.000s   1ms       4ms      │ 0%      0%
hash1          │ 4.330s   875ms     46ms     │ 3.960s   531ms     44ms     │ +8.6%   +39.3%
hash2          │ 6.356s   243ms     988ms    │ 6.298s   176ms     1.03s    │ +0.9%   +27.6%
rdoc           │ 37.337s  2.42s     1.09s    │ 36.678s  2.11s     1.20s    │ +1.8%   +13.1%
binary_trees   │ 3.366s   426ms     252ms    │ 3.082s   275ms     239ms    │ +8.4%   +35.4%
ring           │ 5.252s   14ms      2.47s    │ 5.327s   12ms      2.43s    │ -1.4%   +14.3%
redblack       │ 2.966s   28ms      41ms     │ 2.940s   21ms      38ms     │ +0.9%   +25.0%
───────────────┼─────────────────────────────┼─────────────────────────────┼─────────────────
Legend: (B) = Baseline, (E) = Experiment, Δ = improvement (positive = faster)
Wall = total wallclock, Sweep = GC sweeping time, Mark = GC marking time
Times are median of 3 runs
```

Application benchmarks (YJIT enabled):

```
--------------  -----------  ----------  ---------------  ----------  ------------------  -----------------
bench           master (ms)  stddev (%)  experiment (ms)  stddev (%)  experiment 1st itr  master/experiment
activerecord    132.5        0.9         132.5            1.0         1.056               1.001
chunky-png      577.2        0.4         580.1            0.4         0.994               0.995
erubi-rails     902.9        0.2         894.3            0.2         1.040               1.010
hexapdf         1763.9       3.3         1760.6           3.7         1.027               1.002
liquid-c        56.9         0.6         56.7             1.4         1.004               1.003
liquid-compile  46.3         2.1         46.1             2.1         1.005               1.004
liquid-render   77.8         0.8         75.1             0.9         1.023               1.036
mail            114.7        0.4         113.0            1.4         1.054               1.015
psych-load      1635.4       1.4         1625.9           0.5         0.988               1.006
railsbench      1685.4       2.4         1650.1           2.0         0.989               1.021
rubocop         133.5        8.1         130.3            7.8         1.002               1.024
ruby-lsp        140.3        1.9         137.5            1.8         1.007               1.020
sequel          64.6         0.7         63.9             0.7         1.003               1.011
shipit          1196.2       4.3         1181.5           4.2         1.003               1.012
--------------  -----------  ----------  ---------------  ----------  ------------------  -----------------
Legend:
- master/experiment: ratio of master time to experiment time. Higher is better for experiment; above 1 represents a speedup.
```

### Analysis

The GC benchmarks show sweep time improvements of 13-39% across workloads, with the largest gains in hash-heavy and tree-based benchmarks where many simple embedded objects are allocated. The `hash1` benchmark shows nearly 40% faster sweeping, translating to an 8.6% wall-clock improvement.

Application benchmarks show modest but consistent 1-2% improvements, with `railsbench` showing a 2.1% speedup. The improvement is smaller for applications because sweep time is typically a small fraction of total execution time, but the gains are additive with other optimizations.

The `ring` benchmark shows a small regression in wall time despite faster sweeping, likely due to measurement noise or allocation pattern differences; the sweep improvement (+14.3%) is still present.

The conservative nature of the fast path means some objects that could theoretically skip cleanup will still use the slow path. For example, a `T_STRING` with generic instance variables added via `instance_variable_set` will use the slow path even though the string content itself is embedded. This is intentional: we'd rather miss an optimisation opportunity than cause a memory leak.

-- 
https://bugs.ruby-lang.org/
Issue #21846 has been updated by byroot (Jean Boussier).

@eightbitraptor I personally like this patch and think it's a positive performance improvement with no user-facing consequences. I don't see any downsides, except that it makes developing Ruby a bit more complicated, as you could forget to update this logic and cause memory leaks, but that seems acceptable.

What I wonder, though, is whether you have an idea of how much time is spent in this routine. Because instead of checking for these various cases there, we could also dedicate an object flag for that purpose, and make the various types responsible for keeping that flag updated.

----------------------------------------
Feature #21846: Add a fast path for GC sweeping
https://bugs.ruby-lang.org/issues/21846#change-116200

* Author: eightbitraptor (Matt V-H)
* Status: Open

-- 
https://bugs.ruby-lang.org/
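The dedicated-flag alternative described in the comment above can be modelled in miniature: the bit starts set at allocation, and any operation that creates cleanup work clears it, so the sweep-time check collapses to a single bit test. This is a toy Ruby model, not CRuby code; all names here (`ToyObject`, `FAST_SWEEP`, the mutator methods) are hypothetical:

```ruby
class ToyObject
  FAST_SWEEP = 1 << 0  # hypothetical header bit: "nothing to clean up at sweep"

  def initialize
    @flags = FAST_SWEEP  # newly allocated objects start fast-sweepable
  end

  # The single check sweep would perform instead of the per-type predicate.
  def fast_sweep? = @flags.anybits?(FAST_SWEEP)

  # Each type would be responsible for clearing the bit whenever the
  # object acquires state that needs cleanup:
  def observe_object_id! = @flags &= ~FAST_SWEEP  # id registered in id2ref_tbl
  def add_finalizer!     = @flags &= ~FAST_SWEEP  # FL_FINALIZE equivalent
  def spill_to_heap!     = @flags &= ~FAST_SWEEP  # buffer no longer embedded
end

obj = ToyObject.new
obj.fast_sweep?        # => true: sweep could take the fast path
obj.observe_object_id!
obj.fast_sweep?        # => false: must take the slow path
```

The real change would live in the C object header flags; the open question raised in the thread is whether keeping such a bit in sync across all mutation sites costs more than the predicate it replaces.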
Issue #21846 has been updated by eightbitraptor (Matt V-H).

My original implementation of this idea had an extra bitmap used to determine whether each object needed a "full sweep" or not. I abandoned that approach partly because having to remember to manually update the flag and keep it in sync as an object's state changed was a lot of work and pretty error prone, and partly because the cache impact of having to check a separate bitmap ended up hurting performance more than it helped.

I like the idea of a single flag in the header. I'd be interested to see whether the cost of updating the flag negates the benefit of the fast-path optimisation. I'll investigate; I didn't think we had any bits spare, but it looks like we might.

----------------------------------------
Feature #21846: Add a fast path for GC sweeping
https://bugs.ruby-lang.org/issues/21846#change-116202

* Author: eightbitraptor (Matt V-H)
* Status: Open

-- 
https://bugs.ruby-lang.org/
participants (2)
- byroot (Jean Boussier)
- eightbitraptor (Matt V-H)