[ruby-core:123795] [Ruby Bug#21685] Unnecessary context-switching, especially bad on multi-core machines.
Issue #21685 has been reported by jpl-coconut (Jacob Lacouture). ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by byroot (Jean Boussier).
The fix simply defers suspending the thread until the syscall has been running for some short interval.
That's an idea we discussed in the past with @jhawthorn @tenderlovemaking and @luke-gru. IIRC that's something Go does? ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115198 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by jpl-coconut (Jacob Lacouture). How can I help move the discussion forward? Is it better if I create a PR and we discuss there? ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115266 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by jhawthorn (John Hawthorn). @jpl-coconut Please do! This seems like a really good demonstration of the issue and a good start on addressing it. The change is a lot smaller than I expected it to be ❤️. We would only make performance improvements like this to the `master` branch without a backport to 3.4 or older. We have an existing "timer thread" which serves some similar functions (thread preemption, waking sleeping threads). Do you think this could be integrated with that? There's some previous discussion in #20816 ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115268 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by luke-gru (Luke Gruber). Thanks for taking a look at this and coming up with an implementation, this is great. I haven't really played around with it much but I did read the code and I have a few thoughts: * There's 1 deferred wait thread per ractor, which isn't ideal but I understand why you did it like that. It does look like it would work well for programs that don't use ractors. I was envisioning a thread that would deal with all ractors at once and instead of sleeping for a fixed time, would loop over a registered list (maybe a min-heap?) and sleep for a variable number of microseconds depending on the first thread's registration time. * The deferred wait thread gets joined at ractor free time which isn't great. It would be better (more predictable, more safe) to join the thread at ractor termination. Since we only want 1 of these threads though, it could have a lifecycle similar to the timer thread (created on startup and on fork). * There's an ABA issue with the thread pointer because in the unlikely scenario where the thread is freed while the deferred wait thread has a pointer to it and another thread is created with same memory address of the previous one, it would try to schedule that new thread. It would have to hold on to the thread's id which could be a monotonically increasing 64 bit uint. Similarly, deferred_wait_thread_count should be 64bit uint. * deferred_wait_th_dummy is leaking :) * 50 microseconds seems reasonable but I haven't played around with it. Overall great job, and our team might look at this in the new year. If you want to work on it further or collaborate don't hesitate to reach out. Thanks. ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115276 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by jpl-coconut (Jacob Lacouture). I created a PR with my original changes applied to master. It's [here](https://github.com/ruby/ruby/pull/15294). The proposal to use a single-thread per process rather than one per ractor involves some complexity, but I think it's simpler than reusing the timer-thread. I'll work on that and update the PR when ready. ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115283 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by ko1 (Koichi Sasada). Actually MN threads does similar thing. ``` $ time RUBY_MN_THREADS=1 make run ./miniruby -I../../src/trunk/lib -I. -I.ext/common -r./x86_64-linux-fake ../../src/trunk/test.rb voluntary_ctxt_switches: 56 nonvoluntary_ctxt_switches: 0 real 0m15.720s user 0m6.858s sys 0m10.661s ``` if the working threads are not main thread (main thread doesn't use MN thread). ```ruby Thread.new{ # need to run on child threads qstress(Queue) }.join puts File.read('/proc/self/status').scan(/\w*voluntary.*?$/) ``` ``` $ time RUBY_MN_THREADS=1 taskset 1 make run ./miniruby -I../../src/trunk/lib -I. -I.ext/common -r./x86_64-linux-fake ../../src/trunk/test.rb voluntary_ctxt_switches: 1 nonvoluntary_ctxt_switches: 3 real 0m5.617s user 0m2.883s sys 0m2.541s ``` But using single CPU is faster, and it may be improved... ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115524 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by ko1 (Koichi Sasada). Previous machine is on WSL. On another vanilla Linux machine Ubuntu 24.04/Linux 6.8.0-87-generic/i7-13700H nproc:20: ``` all CPU 1 CPU MN=0 real: 0m8.919s real:0m7.874s voluntary_ctxt_switches: 1399816 voluntary_ctxt_switches: 1657676 nonvoluntary_ctxt_switches: 35 nonvoluntary_ctxt_switches: 48541 MN=1 real:0m10.118s real:0m6.915s voluntary_ctxt_switches: 1399058 voluntary_ctxt_switches: 1203349 nonvoluntary_ctxt_switches: 16 nonvoluntary_ctxt_switches: 20080 MN=1/tasks run on child threads (qstress(Queue) is on a child thread) real: 0m6.692s real:0m5.672s voluntary_ctxt_switches: 1 voluntary_ctxt_switches: 1 nonvoluntary_ctxt_switches: 1 nonvoluntary_ctxt_switches: 3 ``` There are difference but not significant on the description. It seems depends on CPU/Linux versions (scheduler). BTW single thread overhead is: ``` ko1@ruby-sp3:~/ruby/build/master$ time make run ./miniruby -I../../src/master/lib -I. -I.ext/common -r./x86_64-linux-fake ../../src/master/test.rb real 0m3.436s user 0m2.239s sys 0m1.197s ``` with the following script: ```ruby def qstress_single qclass q = qclass.new messages = [] # worker (0..1000000).each{|i| q.push i File.write('/dev/null', '0') # master m = q.pop messages << m if messages.count >= 5 messages.clear File.write('/dev/null', '0') end case m when :end break end } q.push :end end qstress_single ``` ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115557 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by jpl-coconut (Jacob Lacouture). Interesting. I get similar results on both linux and macos, on both benchmarks. But I'm using ARM in both cases, not x86. ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115558 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by jhawthorn (John Hawthorn). The benchmark from #20816 gets a little more pure results as it only deals with a fast syscall in a loop without Queue communication. I adjusted it slightly. ``` ruby THREADS = (ARGV.first || 10).to_i # choose thread count from CLI N = 1_000_000 # to make the test longer or shorter IMPL = ENV["RACTOR"] == "1" ? Ractor : Thread puts "Ruby #{RUBY_VERSION} / YJIT: #{RubyVM::YJIT.enabled?} / #{THREADS} #{IMPL}s" puts RUBY_DESCRIPTION before = Process.clock_gettime(Process::CLOCK_MONOTONIC) Array.new(THREADS).map do IMPL.new do (N/THREADS).times { File.mtime(__FILE__) } end end.each(&:join) elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - before puts "#{N} calls completed in #{elapsed.round(3)}s" ``` Running this in `perf stat` shows `1,001,669 context-switches`, every time we syscall we make a context switch. With M:N threads `193,182 context-switches`. With this branch `5,998 context-switches`. It's not totally clear to me why M:N threading helps. Comparing this branch against current Ruby master, parameterized by 1 or 8 threads, Ractors vs Threads, and M:N scheduling shows the following ``` (fastest) RUBY_MN_THREADS=0 RACTOR=1 ~/src/ruby/miniruby --disable-gems mtime.rb 8 ran 1.00 ± 0.09 times faster than RUBY_MN_THREADS=1 RACTOR=1 ~/src/ruby/miniruby --disable-gems mtime.rb 8 1.10 ± 0.08 times faster than RUBY_MN_THREADS=0 RACTOR=1 ruby --disable-gems mtime.rb 8 1.12 ± 0.05 times faster than RUBY_MN_THREADS=0 RACTOR=0 ruby --disable-gems mtime.rb 1 1.12 ± 0.07 times faster than RUBY_MN_THREADS=1 RACTOR=1 ruby --disable-gems mtime.rb 8 1.13 ± 0.05 times faster than RUBY_MN_THREADS=1 RACTOR=0 ~/src/ruby/miniruby --disable-gems mtime.rb 1 1.15 ± 0.05 times faster than RUBY_MN_THREADS=1 RACTOR=0 ~/src/ruby/miniruby --disable-gems mtime.rb 8 1.15 ± 0.05 times faster than RUBY_MN_THREADS=0 RACTOR=0 ~/src/ruby/miniruby --disable-gems mtime.rb 1 1.16 ± 0.05 times faster than RUBY_MN_THREADS=1 RACTOR=0 ruby --disable-gems mtime.rb 1 1.17 ± 0.05 times faster than RUBY_MN_THREADS=0 RACTOR=0 ~/src/ruby/miniruby --disable-gems mtime.rb 8 1.22 ± 0.06 times faster than RUBY_MN_THREADS=1 RACTOR=1 ~/src/ruby/miniruby --disable-gems mtime.rb 1 1.22 ± 0.06 times faster than RUBY_MN_THREADS=0 RACTOR=1 ~/src/ruby/miniruby --disable-gems mtime.rb 1 1.23 ± 0.06 times faster than RUBY_MN_THREADS=1 RACTOR=1 ruby --disable-gems mtime.rb 1 1.24 ± 0.06 times faster than RUBY_MN_THREADS=0 RACTOR=1 ruby --disable-gems mtime.rb 1 2.53 ± 0.29 times faster than RUBY_MN_THREADS=1 RACTOR=0 ruby --disable-gems mtime.rb 8 4.23 ± 0.33 times faster than RUBY_MN_THREADS=0 RACTOR=0 ruby --disable-gems mtime.rb 8 ``` Full results https://gist.github.com/jhawthorn/9945282384f005e0deeb2be889853727 These land in a couple groups. This branch with Ractors is clearly the fastest (as expected RUBY_MN_THREADS makes no difference) Following that it's about 10-20% slower running the benchmark in any of single threaded, in 8 Ractors, or in 1 Ractor (one Ractor seems a little slower) Then much below that it's 2x slower again to run 8 threads with `RUBY_MN_THREADS=1`, and 4x slower with `RUBY_MN_THREADS=0` --- I think this technique looks very promising. There are many syscalls that 99% of the time return immediately but we end up being pessimistic and doing expensive synchronization and context switches. It's definitely too close to Ruby 4.0, but I really think we should explore this for 4.1 ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115622 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by ko1 (Koichi Sasada). https://github.com/ruby/ruby/pull/15529 It seems enough small, so I want to introduce it to 4.0. ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115633 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by luke-gru (Luke Gruber). I'm with John that I think this is very promising but should not be introduced in ruby 4.0. It needs more testing to make sure it's working correctly and it needs more benchmarks for more realistic workloads. We can't do that in 2 weeks. Any non-minor changes to the scheduler right before release scares me. ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115635 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by byroot (Jean Boussier). From: https://github.com/ruby/ruby/pull/15529
The timer thread checks the running threads to provide timeslice every 10ms and finds the thread (th) performing blocking operation.
Isn't 10ms a massive amount of time in that context? I suspect such threshold would ruin the throughput for threaded applications (typically Puma). The original prototype had a 100usec interval, which seems more reasonable (but already a bit long?). ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-115641 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by jpl-coconut (Jacob Lacouture). I have a much better solution here: https://github.com/ruby/ruby/pull/15840 It's simpler. It dispenses with the additional thread and the magic number (sleep time). And in addition to bringing multi-core perf into line with single-core, it improves single-core perf on these two benchmarks by 20-30%. ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-116021 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by luke-gru (Luke Gruber). File puma_bench_results.txt added File ruby-bench_results.txt added I did some benchmarking based on the updated solution (https://github.com/ruby/ruby/pull/15840). I've attached Puma benchmark results that I got using the PR from the previous post (d7ef5ecc9b) vs that PR's base commit (ad6b85450d). The Puma benchmarks scripts are located in Puma's repository and are mostly exercising Puma's internals and how it handles many requests/connections. There is a minimal Rack app that is run and in some of the benchmarks a small `sleep` is added before the response is generated. This is all from Puma's benchmarking system. I also ran `ruby-bench` and have posted that as well, but `ruby-bench` doesn't use threads in many (any?) of its benchmarks and there was no noticeable difference. ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-116197 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` ---Files-------------------------------- puma_bench_results.txt (14.3 KB) ruby-bench_results.txt (4.24 KB) -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by stanhu (Stan Hu). @ko1 @hsbt Would it be possible to backport https://github.com/ruby/ruby/pull/15840 to Ruby 3.3, 3.4, and 4.0? We're seeing a significant regression after upgrading to Ruby 3.2 where this contention causes Sidekiq worker threads to silently deregister from Redis under load. ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-117039 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` ---Files-------------------------------- puma_bench_results.txt (14.3 KB) ruby-bench_results.txt (4.24 KB) -- https://bugs.ruby-lang.org/
Issue #21685 has been updated by hsbt (Hiroshi SHIBATA). Backport changed from 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN to 3.3: WONTFIX, 3.4: UNKNOWN, 4.0: UNKNOWN Ruby 3.3 is security maitenance status now. I don't backport normal bug to Ruby 3.3. ---------------------------------------- Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. https://bugs.ruby-lang.org/issues/21685#change-117042 * Author: jpl-coconut (Jacob Lacouture) * Status: Open * ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux] * Backport: 3.3: WONTFIX, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [[here]](https://gist.github.com/jpl-coconut/cb3679ce885eb578e1071c4b3a525d5c) I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec **(10x slower) on a machine with 2 or more cores** . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls. I prepared a fix in [[this branch]](https://github.com/jpl-coconut/ruby/tree/deferred_thread_wait) which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable. I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4. After the fix: - Single-core performance improves by 55%, from 3.6sec to 2sec. - Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower. - Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80 - system_time/user_time ratio drops from (1.2 - 1.6) to 0.65 Here are the benchmark results before my change: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1140773 nonvoluntary_ctxt_switches: 9487 real 0m3.619s user 0m1.653s sys 0m1.950s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 1400110 nonvoluntary_ctxt_switches: 3 real 0m36.223s user 0m9.380s sys 0m14.927s ``` And after: ``` # time taskset --cpu-list 1 ./ruby qtest_simple.rb voluntary_ctxt_switches: 88 nonvoluntary_ctxt_switches: 899 real 0m2.031s user 0m1.209s sys 0m0.743s # time taskset --cpu-list 1,2 ./ruby qtest_simple.rb voluntary_ctxt_switches: 75 nonvoluntary_ctxt_switches: 8 real 0m2.062s user 0m1.279s sys 0m0.783s ``` I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [[here]](https://gist.github.com/jpl-coconut/aa14e59354abf98f808daaf39baa9a72). Results summary: - Single-core performance improves (this time by only 30%) - Multi-core penalty drops from 4x to 0. - No change to context-switching rates. - system_time/user_time ratio drops from (0.5-1) to 0.15 Before fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.336s user 0m0.211s sys 0m0.118s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m1.424s user 0m0.468s sys 0m0.496s ``` After fix: ``` # time taskset --cpu-list 1 ./ruby mbenchmark.rb voluntary_ctxt_switches: 59 real 0m0.241s user 0m0.202s sys 0m0.032s # time taskset --cpu-list 1,2 ./ruby mbenchmark.rb voluntary_ctxt_switches: 60 real 0m0.238s user 0m0.195s sys 0m0.035s ``` ---Files-------------------------------- puma_bench_results.txt (14.3 KB) ruby-bench_results.txt (4.24 KB) -- https://bugs.ruby-lang.org/
participants (7)
-
byroot (Jean Boussier) -
hsbt (Hiroshi SHIBATA) -
jhawthorn (John Hawthorn) -
jpl-coconut (Jacob Lacouture) -
ko1 (Koichi Sasada) -
luke-gru (Luke Gruber) -
stanhu (Stan Hu)