Issue #19991 has been updated by ivoanjo (Ivo Anjo).
I've also documented how to workaround this issue in the Datadog Ruby profiler in
https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#segmenta…
although we're still considering putting in a workaround in the profiler itself.
----------------------------------------
Bug #19991: rb_register_postponed_job async-signal-unsafety causes crash in GC
https://bugs.ruby-lang.org/issues/19991#change-106703
* Author: kjtsanaktsidis (KJ Tsanaktsidis)
* Status: Closed
* Priority: Normal
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
Our production application experienced an interpreter crash today that I’m fairly sure is
the result of a bug in the `rb_register_postponed_job` infrastructure.
## Diagnosis
I’ve attached a more complete version of the crash dump, but the key information I think
is the following backtrace:
```
<internal:gc>:35: [BUG] Segmentation fault at 0x00000000000000c8
....
/usr/lib/libruby.so.3.1(rb_vm_bugreport+0x60c) [0xffffaaf57a6c] vm_dump.c:759
/usr/lib/libruby.so.3.1(rb_bug_for_fatal_signal+0xd8) [0xffffaad6d120] error.c:821
/usr/lib/libruby.so.3.1(sigsegv+0x58) [0xffffaaeb4350] signal.c:964
/usr/lib/libruby.so.3.1(sigill) (null):0
linux-vdso.so.1(__kernel_rt_sigreturn+0x0) [0xffffab20f78c]
/usr/lib/libruby.so.3.1(rbimpl_atomic_exchange+0x0) [0xffffaad94018] gc.c:4081
/usr/lib/libruby.so.3.1(gc_finalize_deferred) gc.c:4081
/usr/lib/libruby.so.3.1(rb_postponed_job_flush+0x218) [0xffffaaf5dd60] vm_trace.c:1728
/usr/lib/libruby.so.3.1(rb_threadptr_execute_interrupts+0x328) [0xffffaaef7520]
thread.c:2444
/usr/lib/libruby.so.3.1(vm_exec_core+0x4e54) [0xffffaaf41c6c] vm.inc:588
/usr/lib/libruby.so.3.1(rb_vm_exec+0x144) [0xffffaaf427cc] vm.c:2211
```
I disassembled `gc_finalize_deferred` in GDB (different execution, so the addresses don’t
line up - this is NOT from a core dump):
```
(gdb) disassemble gc_finalize_deferred
Dump of assembler code for function gc_finalize_deferred:
0x0000fffff7cc3ff8 <+0>: stp x29, x30, [sp, #-48]!
0x0000fffff7cc3ffc <+4>: mov w1, #0x1 // #1
0x0000fffff7cc4000 <+8>: mov x29, sp
0x0000fffff7cc4004 <+12>: stp x19, x20, [sp, #16]
0x0000fffff7cc4008 <+16>: mov x19, x0
0x0000fffff7cc400c <+20>: add x20, x0, #0xc8
0x0000fffff7cc4010 <+24>: ldaxr w0, [x20]
.... continues ....
```
Based on the line numbers from the debuginfo, the faulting instruction is likely to be
`gc_finalize_deferred+24` (which is the inlined `rbimpl_atomic_exchange`). That would
attempt a load of 0xc8 if x0 was zero - i.e. if `gc_finalize_deferred` was called with a
NULL objspace.
The enqueuing of `gc_finalize_deferred` with postponed job only happens in one place (in
gc_sweep_page,
[
here](https://github.com/ruby/ruby/blob/5cff4c5aa375787924e2df5c0b981dd922b…)
and if `objspace` was null there then the crash would have had to have already happened in
`gc_sweep_page`. Thus, I think that `gc_finalize_deferred` was _enqueued_ into the
postponed job list with a not-NULL argument, but the argument was corrupted whilst it was
in `vm->postponed_job_buffer`, and if objspace was null there then the crash would have
had to have already happened in `gc_sweep_page`. Thus, I think that `gc_finalize_deferred`
was enqueued into the postponed job list with a not-NULL argument, but the argument was
corrupted whilst it was in `vm->postponed_job_buffer`.
I had a look at the postponed job code, which is of course very tricky because it needs to
be async-signal-safe. More specifically:
* It needs to work if run from a thread not holding the GVL
* It needs to work if run from a thread, whilst another thread is actually executing
`rb_postponed_job_flush`
* It needs to work if run from a signal caught in a thread that is currently executing
`rb_postponed_job_flush` (this rules out trivial mutex-based solutions)
We use the Datadog continuous profiler in our application (CC: @ivoanjo ;), which calls
`rb_postponed_job_register` to capture profiling samples. Thus, I think our application is
likely to hit all of those scenarios semi-reguarly.
My best guess at a plausible sequence of events, to generate this particular crash, is
that:
1. `rb_postponed_job_flush` was running on thread T1.
2. There is a queued call to gc_finalize_deferred sitting in
`vm->postponed_job_buffer[vm->postponed_job_index-1]`.
3. T1 executed the `ATOMIC_CAS` at vm_trace.c:1800, decrementing
`vm->postponed_job_index` (which now equals `index - 1`) and determining that a job at
index index should be run.
4. Thread T2 received a signal, and the Datadog continuous profiler called
`rb_postponed_job_register_one(0, sample_from_postponed_job, NULL)`
5. T2 executed the `ATOMIC_CAS` at vm_trace.c:1679, re-incrementing
`vm->postponed_job_index`. It’s now equal to `index` from T1 again.
6. T2 then executes the sets on `pjob->func` and `pjob->data` at vm_trace.c:1687. It
sets `->func` to `sample_from_postponed_job` (from ddtrace), and `->data` to 0.
7. T1 then goes to call `(*pjob->func)(pjob->data)` at vm_trace.c:1802
8. Since there is no memory barrier between 6 & 7, T1 is allowed to see the effects of
the set on `pjob->data` and not see that of `pjob->func`.
9, T1 thus calls `gc_finalize_deferred` (which it was meant to do) with an argument of 0
(which it was not).
## Solution
Managing a thread-safe list of too-big-to-be-atomic objects (like `rb_postponed_job_t`) is
really tricky. I think it might be best for this code if we use a combination of masking
signals (to prevent manipulation of the postponed job list occurring during
`rb_postponed_job_flush`, and using a semaphore to protect the list (since semaphores are
required to be async-signal-safe on POSIX systems). I've implemented this in a PR
here:
https://github.com/ruby/ruby/pull/8856
It seems _slightly_ slower to do it this way - semaphores require system calls in the
uncontended case, which is why they're async-signal-safe but also makes them more
expensive than pthread mutexes, which don't, on most systems. I ran my branch through
yjit bench:
With my patch:
```
interp: ruby 3.3.0dev (2023-11-07T08:14:09Z ktsanaktsidis/post.. 342f30f566)
[arm64-darwin22]
yjit: ruby 3.3.0dev (2023-11-07T08:14:09Z ktsanaktsidis/post.. 342f30f566) +YJIT
[arm64-darwin22]
-------------- ----------- ---------- --------- ---------- ------------
-----------
bench interp (ms) stddev (%) yjit (ms) stddev (%) yjit 1st itr
interp/yjit
activerecord 31.2 3.4 17.0 5.7 1.29 1.83
chunky-png 543.5 0.5 367.0 0.7 1.40 1.48
erubi-rails 1044.2 0.6 564.7 1.3 1.69 1.85
hexapdf 1517.6 3.1 917.6 1.2 1.46 1.65
liquid-c 37.1 1.3 28.9 1.4 0.89 1.29
liquid-compile 39.0 1.4 29.9 1.6 0.76 1.30
liquid-render 89.9 1.8 39.6 1.4 1.37 2.27
lobsters 598.2 1.7 435.4 5.2 0.63 1.37
mail 79.8 3.1 52.5 1.0 0.79 1.52
psych-load 1441.5 1.7 885.4 0.5 1.60 1.63
railsbench 1010.8 1.0 609.3 1.3 1.24 1.66
ruby-lsp 40.9 3.4 29.2 30.0 0.66 1.40
sequel 39.8 1.8 33.0 2.4 1.18 1.21
-------------- ----------- ---------- --------- ---------- ------------
-----------
```
Without the patch:
```
interp: ruby 3.3.0dev (2023-11-07T07:56:43Z master 5a2779d40f) [arm64-darwin22]
yjit: ruby 3.3.0dev (2023-11-07T07:56:43Z master 5a2779d40f) +YJIT [arm64-darwin22]
-------------- ----------- ---------- --------- ---------- ------------
-----------
bench interp (ms) stddev (%) yjit (ms) stddev (%) yjit 1st itr
interp/yjit
activerecord 31.3 3.3 16.7 5.5 1.36 1.88
chunky-png 521.6 0.6 348.8 0.7 1.40 1.50
erubi-rails 1038.9 0.9 566.3 1.2 1.70 1.83
hexapdf 1501.9 1.1 951.7 3.9 1.42 1.58
liquid-c 36.7 1.2 29.3 1.7 0.86 1.25
liquid-compile 38.8 1.1 29.7 3.7 0.73 1.31
liquid-render 92.2 0.9 38.3 1.0 1.47 2.40
lobsters 582.5 2.0 429.8 5.6 0.59 1.36
mail 77.9 1.3 54.8 0.9 0.76 1.42
psych-load 1419.1 0.7 887.7 0.5 1.60 1.60
railsbench 1017.8 1.1 609.9 1.2 1.24 1.67
ruby-lsp 41.0 2.2 28.8 28.8 0.64 1.43
sequel 36.0 1.5 30.4 1.8 1.11 1.18
-------------- ----------- ---------- --------- ---------- ------------
-----------
```
Maybe this is within the noise floor, but I thought I should bring it up.
---Files--------------------------------
crash.txt (3.69 KB)
--
https://bugs.ruby-lang.org/