[ruby-core:125653] [Ruby Bug#22098] Reconsider whether THREAD_EVENT_RESUMED runs with GVL held
Issue #22098 has been reported by luke-gru (Luke Gruber). ---------------------------------------- Bug #22098: Reconsider whether THREAD_EVENT_RESUMED runs with GVL held https://bugs.ruby-lang.org/issues/22098 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by byroot (Jean Boussier). I don't think I understand the bug, but:
change the documentation to be clear that the GVL is not held.
This I really don't understand. Are you saying the GVL isn't held when `THREAD_EVENT_RESUMED` triggers? That would really surprise me. The main use case for this event is to measure how long it took to acquire the GVL, as such it MUST be called after the GVL held. ---------------------------------------- Bug #22098: Reconsider whether THREAD_EVENT_RESUMED runs with GVL held https://bugs.ruby-lang.org/issues/22098#change-117513 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by luke-gru (Luke Gruber). It really depends on what "having the GVL" means. Right now, `ruby_thread_has_gvl_p()` can return `false` in hooks for this event. This is what can cause a deadlock. ---------------------------------------- Bug #22098: Reconsider whether THREAD_EVENT_RESUMED runs with GVL held https://bugs.ruby-lang.org/issues/22098#change-117514 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by byroot (Jean Boussier).
Right now, ruby_thread_has_gvl_p() can return false in hooks for this event.
Then that's a bug. ---------------------------------------- Bug #22098: THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117524 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by jhawthorn (John Hawthorn). Here's a reproduction showing that `ruby_thread_has_gvl_p()` is false https://github.com/ruby/ruby/pull/17263. I think it has been since the introduction of the callback. I think it might be fine for `ruby_thread_has_gvl_p()` to return false (we could adjust the documentation to say "about to acquire the GVL" or "has exclusive hold on the GVL, but hasn't yet fully acquired it"). But I think the bigger question is **are we allowed to allocate in this callback**? Doing so seems broken today and I also don't think it should be allowed. ---------------------------------------- Bug #22098: THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117527 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by luke-gru (Luke Gruber). Yes, this has always been buggy and I also don't think we should allow allocating objects/xmalloc in the hook. Even if we changed `ruby_thread_has_gvl_p()`, the problem is that this hook runs with the scheduler lock held (and always has). We would have to move the invocation of the hook outside of any locks, so it would need to be copied to many call sites (~10). Otherwise, under Ractors, a different type of deadlock could occur if the hook caused a GC and another Ractor was blocked on this Ractor's scheduler lock (it wouldn't be able to join the VM barrier). ---------------------------------------- Bug #22098: THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117530 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by byroot (Jean Boussier).
But I think the bigger question is are we allowed to allocate in this callback? Doing so seems broken today and I also don't think it should be allowed.
I'm fine with adding that restriction is there really isn't any other solution, but ideally we'd make the necessary changes to allow it. These callbacks typically want to record information, and `THREAD_EVENT_RESUMED` is the ideal hooks to update whatever statistics were collected.
so it would need to be copied to many call sites (~10).
Is that inevitable, or could it be solved with a bit of a refactoring. (Sorry, the scheduler code is all but fresh in my head, I'm still unclear on what you are suggesting exactly) ---------------------------------------- Bug #22098: THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117532 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by Eregon (Benoit Daloze). As I wrote in https://github.com/Shopify/gvltools/pull/34#issuecomment-4667841039, my expectation is also that this hook is called with the GVL, on the thread which just acquired the GVL. I think not being to allocate is not the end (maybe already a requirement for this hook in previous releases?), but it sure would be nice if it was. Since the hook is called after the GVL is acquired, one would expect to be able to run regular Ruby code there. However as you say the scheduler lock is held and so one should use a `postponed_job` for now like [here](https://github.com/DataDog/dd-trace-rb/blob/b75cf221d321bfea448ede1349c108a8...) (we should document that). Being able to do more in that hook would simplify things quite a bit, I actually [tried that last week](https://github.com/DataDog/dd-trace-rb/pull/5865) by coincidence but eventually found about the undocumented fact that it runs with the scheduler lock held and so probably not a good idea. FWIW `RB_INTERNAL_THREAD_HOOK(RUBY_INTERNAL_THREAD_EVENT_SUSPENDED, th);` is currently called in 6 places, so having it in 10 callers sure feels not so nice but there is also precedent, and maybe something that could be refactored. FWIW in Ruby 3.2 there were actually much fewer `RB_INTERNAL_THREAD_HOOK()` call sites, maybe the new scheduler is still "taking shape"? ---------------------------------------- Bug #22098: THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117533 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by Eregon (Benoit Daloze). jhawthorn (John Hawthorn) wrote in #note-7:
we could adjust the documentation to say "about to acquire the GVL" or "has exclusive hold on the GVL, but hasn't yet fully acquired it"
This sounds pretty weird (to me at least) because intuitively, either you (a thread) have a lock (i.e. you are its sole owner and won a compare-and-swap or so), or you don't and can't know when you'll get it (depends on another thread releasing it and this thread winning the race to acquire it). Interesting that `ruby_thread_has_gvl_p()` is false, but I think it's something to fix, not to make even harder to use :sweat: ---------------------------------------- Bug #22098: THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117534 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by jhawthorn (John Hawthorn). Subject changed from THREAD_EVENT_RESUMED runs without GVL held to RUBY_INTERNAL_THREAD_EVENT_RESUMED runs without GVL held Eregon (Benoit Daloze) wrote in #note-10:
Since the hook is called after the GVL is acquired, one would expect to be able to run regular Ruby code there.
I feel very strongly that we must not allow arbitrary Ruby code in these hooks. Arbitrary Ruby code may release the GVL and so that would require these hooks to deal with reentrancy, which is a really bad idea. As a subscriber to this API (as a profiler) I need the guarantee that these events are delivered reliably and in order (for the given thread). It's not acceptable that an earlier subscriber delays and changes the order in which events are delivered, breaking the documented invariants of the GVL state machine (ex. if we re-suspend in a hook the hook after it will see the SUSPENDED event before the previous RESUMED, will later see multiple RESUMED in a row). We could detect re-entrancy and disable the hooks (as tracepoint does), I really do not want that. It's important for my profiler to have accurate delivery of all events. This is a very low level API and we should expect consumers to all be careful and well behaved. **At best** a limited selection of Ruby APIs could be made safe to use here (currently all are unsafe). We _could_ probably support allocation here (through the rewrite discussed above to call the hooks later from N different locations). I don't think we should. To support allocations we need to make `ruby_thread_has_gvl_p()` return true. However we're in some intermediate "you have the GVL but releasing it is illegal" state that is going to result with buggy instrumentation extensions (like we just found). With the current state where `ruby_thread_has_gvl_p()` returns false, we can at least still detect these bugs via assertions. IMO the INTERNAL thread event API should adopt the [same warning in documentation the INTERNAL tracepoint events have](https://github.com/ruby/ruby/blob/master/include/ruby/debug.h#L402-L407). It effectively has this restriction for all events _except_ for RUBY_INTERNAL_THREAD_EVENT_RESUMED.
You can use any Ruby APIs (calling methods and so on) on normal event hooks. **However, in internal events, you can not use any Ruby APIs (even object creations).**
---------------------------------------- Bug #22098: RUBY_INTERNAL_THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117553 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by Eregon (Benoit Daloze). jhawthorn (John Hawthorn) wrote in #note-12:
I feel very strongly that we must not allow arbitrary Ruby code in these hooks.
I agree, I expressed that poorly. I meant to say one would expect the GVL is held (given the name of the hook); and it would be nice if we can call various `rb_*()` functions, as in regular C extension code. But the latter does conflict with what you say, good points. Anything that check interrupts or may switch threads or release the GVL, etc is clearly a no-go.
As a subscriber to this API (as a profiler)
I'm also using this API for a profiler :)
I need the guarantee that these events are delivered reliably and in order (for the given thread). It's not acceptable that an earlier subscriber delays and changes the order in which events are delivered, breaking the documented invariants of the GVL state machine (ex. if we re-suspend in a hook the hook after it will see the SUSPENDED event before the previous RESUMED, will later see multiple RESUMED in a row). We could detect re-entrancy and disable the hooks (as tracepoint does), I really do not want that. It's important for my profiler to have accurate delivery of all events. This is a very low level API and we should expect consumers to all be careful and well behaved.
Agreed. (FTR, Ruby 3.2 does emit multiple SUSPENDED events in a row with ConditionVariable, but 3.3+ does not seem to have the issue) Better documentation would go a long way. I think we still want the RESUMED hook to be called on the thread which just acquired the GVL, it's too surprising otherwise and doesn't seem advantageous either.
However, in internal events, you can not use any Ruby APIs (even object creations).
"you can not use any Ruby APIs" is too strict, at minimum we need: * rb_internal_thread_specific_get()/rb_internal_thread_specific_set() * rb_postponed_job_trigger() * TypedData_Get_Struct() * some way to check if on main Ractor or not (which is BTW what the datadog gem uses, and similar for gvltools) If we forbid APIs like allocation, would it be possibly to reliably fail if called from these hooks with a RUBY_DEBUG build? That'd be a good way to validate it (which we could even document on `rb_internal_thread_add_event_hook()`). ---------------------------------------- Bug #22098: RUBY_INTERNAL_THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117573 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by jhawthorn (John Hawthorn). Eregon (Benoit Daloze) wrote in #note-13:
"you can not use any Ruby APIs" is too strict, at minimum we need: * rb_internal_thread_specific_get()/rb_internal_thread_specific_set() * rb_postponed_job_trigger()
These are all documented to be async-signal-safe, and so can be called everywhere (including the existing hooks with the quoted restriction). I don't think that needs clarification here, but we could spell it out I guess.
* TypedData_Get_Struct()
Should not be explicitly supported as it can raise an exception (though I'm sure it doesn't in your use) and so runs arbitrary user code. `RTYPEDDATA_GET_DATA` is probably fine, but unlike the async-signal-safe examples doesn't feel like the type that should be explicitly allowed here or in the internal GC hooks. Why do you need it? Could you pass the raw pointer?
If we forbid APIs like allocation, would it be possibly to reliably fail if called from these hooks with a RUBY_DEBUG build? That'd be a good way to validate it (which we could even document on `rb_internal_thread_add_event_hook()`).
I'd love to add [`ASSERT(ruby_thread_has_gvl_p()) to newobj`](https://github.com/ruby/ruby/pull/17289) (and maybe to `rb_funcall`), but I don't know if the performance is acceptable even for debug builds. ---------------------------------------- Bug #22098: RUBY_INTERNAL_THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117580 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by Eregon (Benoit Daloze). One reason it would be valuable to be able to allocate in RESUMED is that's a good spot to set the per-thread state. I looked at the datadog gem, gvltools, gvl-tracing and vernier. All of them except vernier use `rb_internal_thread_specific_get()/rb_internal_thread_specific_set()`, but also all of them need to mark some things inside the per-thread state (to mark a sample taken in a signal handler, etc). * vernier uses a `std::vector` to keep all per-thread states and mark them (that probably means linear lookup with the # of thread then + need to lock around it so some contention). * until recently, datadog used a `st_table` to keep all per-thread states and mark them (+ an integer in `rb_internal_thread_specific`, as it couldn't touch the `st_table` in GVL events). * datadog master uses a TypedData object stored as a hidden ivar on the Thread, this is simpler and convenient, it always has the full per-thread state with just `rb_internal_thread_specific_get` (no need for a `st_table` and no need to remove dead threads) * gvltools and gvl-tracing use `rb_internal_thread_specific` and a TypedData object stored as a Fiber-local (`rb_thread_local_aset`) on the Thread Storing as a Fiber-local/Thread-local/ivar on the Thread is not ideal because that would fail if the Thread is frozen (but I think nobody freezes Threads). Maybe `rb_internal_thread_specific_*()` should support GC-marking the value to fix that, but that still wouldn't help regarding allocations. A good place to allocate that TypedData object would be in RESUMED, that's both the initial state of a thread, and a good state to start tracking at for threads that existed before the profiler. The approach in https://github.com/Shopify/gvltools/pull/34 is to iterate the threads when starting the profiler, and to use the `RUBY_INTERNAL_THREAD_EVENT_STARTED` event, but that's not guaranteed to be called with the GVL according to the docs at least (and unclear if guaranteed to be called on the new Thread, if not could be problematic if the thread runs for a very short time and we try to cleanup the state concurrently (in EXITED) with initializing it). So we're saying in this issue the event that tracks getting the GVL may not have the GVL but the STARTED event which might not have the GVL is OK to rely on having the GVL? (I'm just highlighting how surprising that may be, but I think it may be fine if well documented and if it holds on existing releases) The current approach in the datadog gem is to allocate the thread states in a postponed job after the periodic signal handler sample, where it also samples other threads. Any RESUMED event before that sample is effectively lost, because we can't allocate in RESUMED, allocating in a postponed job after RESUMED is too late since we have nowhere to store the data. We could iterate threads when the profiler starts, but that wouldn't capture new threads (and unclear if `RUBY_INTERNAL_THREAD_EVENT_STARTED` can allocate). Maybe a regular :thread_start TracePoint should be used? That has guarantees about the GVL and running on the right thread, but is heavier. BTW because we can't call Ruby APIs in the RESUMED event, we need to use a postponed job. The means the postponed job must be scheduled quickly after the RESUMED event, as notably if there would be another GVL event in between that would cause problems. ---------------------------------------- Bug #22098: RUBY_INTERNAL_THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117598 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by Eregon (Benoit Daloze). jhawthorn (John Hawthorn) wrote in #note-14:
Why do you need it? Could you pass the raw pointer?
In this case it's to access some configuration of the profiler for whether to sample after RESUMED (`state->waiting_for_gvl_threshold_ns`), i.e. if it took a while to acquire the GVL (from READY to RESUMED). There are ways to work around it, e.g. by storing that in a C global, or passing it to the GVL callback as the extra data pointer, but since this hook should effectively "have" the GVL I thought it's fine enough to get it from the Ruby object. ---------------------------------------- Bug #22098: RUBY_INTERNAL_THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117600 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by Eregon (Benoit Daloze). From https://github.com/ruby/ruby/blob/c2879d4eb1c2f232c893656a49a2c53353aebf04/i... ```c /** * Triggered when a new thread is started. * * @note The callback will be called *without* the GVL held. */ #define RUBY_INTERNAL_THREAD_EVENT_STARTED 1 << 0 /** * Triggered when a thread attempt to acquire the GVL. * * @note The callback will be called *without* the GVL held. */ #define RUBY_INTERNAL_THREAD_EVENT_READY 1 << 1 /** acquiring GVL */ /** * Triggered when a thread successfully acquired the GVL. * * @note The callback will be called *with* the GVL held. */ #define RUBY_INTERNAL_THREAD_EVENT_RESUMED 1 << 2 /** acquired GVL */ /** * Triggered when a thread released the GVL. * * @note The callback will be called *without* the GVL held. */ #define RUBY_INTERNAL_THREAD_EVENT_SUSPENDED 1 << 3 /** released GVL */ /** * Triggered when a thread exits. * * @note The callback will be called *without* the GVL held. */ #define RUBY_INTERNAL_THREAD_EVENT_EXITED 1 << 4 /** thread terminated */ ``` * `RUBY_INTERNAL_THREAD_EVENT_RESUMED` is explicitly documented as having the GVL. * `RUBY_INTERNAL_THREAD_EVENT_STARTED` is documented as being called without the GVL, but it seems to be called on the parent thread with the GVL. Documentation bug then? ---------------------------------------- Bug #22098: RUBY_INTERNAL_THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117684 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
Issue #22098 has been updated by Eregon (Benoit Daloze). Some more context from Claude on `RUBY_INTERNAL_THREAD_EVENT_STARTED`:
Ruby 3.2: STARTED fires in thread_start_func_1 (thread_pthread.c:1166) — this runs on the new thread itself, without the GVL. The new thread hasn't acquired the GVL yet (that happens inside thread_start_func_2). The doc is correct for 3.2.
Ruby 3.3, 3.4, 4.0 (master): STARTED fires in native_thread_create — this runs on the parent thread, with the GVL. The doc is wrong for these versions.
So the behavior changed between 3.2 and 3.3 (with the M:N scheduler work). In 3.2 it was called on the new thread without the GVL as documented; from 3.3 onwards it's called on the parent thread with the GVL.
---------------------------------------- Bug #22098: RUBY_INTERNAL_THREAD_EVENT_RESUMED runs without GVL held https://bugs.ruby-lang.org/issues/22098#change-117686 * Author: luke-gru (Luke Gruber) * Status: Open * Assignee: luke-gru (Luke Gruber) * Backport: 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN ---------------------------------------- Today, it's possible to get a deadlock when allocating during a hook for this event. I attached a reproduction script using the `gvltools` gem. One way to fix it would be to not allocate during this hook and change the documentation to be clear that the GVL is not held. Any gems relying on this behavior, like `gvltools`, would need to be patched. Another way to approach it would be to change the call site for this hook invocation. I believe it would need to be added to quite a few places. To maintain Ractor safety, it would also need to be invoked without any locks held. I'm curious about your thoughts @byroot. ---Files-------------------------------- repro.rb (3.11 KB) run_loop.sh (2.82 KB) -- https://bugs.ruby-lang.org/
participants (4)
-
byroot (Jean Boussier) -
Eregon (Benoit Daloze) -
jhawthorn (John Hawthorn) -
luke-gru (Luke Gruber)