New subject: [ruby-core:115313] [Ruby master Bug#19991] rb_register_postponed_job async-signal-unsafety causes crash in GC

7 Nov 2023

Issue #19991 has been reported by kjtsanaktsidis (KJ Tsanaktsidis).

----------------------------------------
Bug #19991: rb_register_postponed_job async-signal-unsafety causes crash in GC
https://bugs.ruby-lang.org/issues/19991

* Author: kjtsanaktsidis (KJ Tsanaktsidis)
* Status: Open
* Priority: Normal
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
Our production application experienced an interpreter crash today that I’m fairly sure is
the result of a bug in the `rb_register_postponed_job` infrastructure.

## Diagnosis

I’ve attached a more complete version of the crash dump, but the key information I think
is the following backtrace:

```
<internal:gc>:35: [BUG] Segmentation fault at 0x00000000000000c8
....
/usr/lib/libruby.so.3.1(rb_vm_bugreport+0x60c) [0xffffaaf57a6c] vm_dump.c:759
/usr/lib/libruby.so.3.1(rb_bug_for_fatal_signal+0xd8) [0xffffaad6d120] error.c:821
/usr/lib/libruby.so.3.1(sigsegv+0x58) [0xffffaaeb4350] signal.c:964
/usr/lib/libruby.so.3.1(sigill) (null):0
linux-vdso.so.1(__kernel_rt_sigreturn+0x0) [0xffffab20f78c]
/usr/lib/libruby.so.3.1(rbimpl_atomic_exchange+0x0) [0xffffaad94018] gc.c:4081
/usr/lib/libruby.so.3.1(gc_finalize_deferred) gc.c:4081
/usr/lib/libruby.so.3.1(rb_postponed_job_flush+0x218) [0xffffaaf5dd60] vm_trace.c:1728
/usr/lib/libruby.so.3.1(rb_threadptr_execute_interrupts+0x328) [0xffffaaef7520]
thread.c:2444
/usr/lib/libruby.so.3.1(vm_exec_core+0x4e54) [0xffffaaf41c6c] vm.inc:588
/usr/lib/libruby.so.3.1(rb_vm_exec+0x144) [0xffffaaf427cc] vm.c:2211
```

I disassembled `gc_finalize_deferred` in GDB (different execution, so the addresses don’t
line up - this is NOT from a core dump):

```
(gdb) disassemble gc_finalize_deferred
Dump of assembler code for function gc_finalize_deferred:
   0x0000fffff7cc3ff8 <+0>:  stp x29, x30, [sp, #-48]!
   0x0000fffff7cc3ffc <+4>:  mov w1, #0x1 // #1
   0x0000fffff7cc4000 <+8>:  mov x29, sp
   0x0000fffff7cc4004 <+12>: stp x19, x20, [sp, #16]
   0x0000fffff7cc4008 <+16>: mov x19, x0
   0x0000fffff7cc400c <+20>: add x20, x0, #0xc8
   0x0000fffff7cc4010 <+24>: ldaxr w0, [x20]
   .... continues ....
```

Based on the line numbers from the debuginfo, the faulting instruction is likely to be
`gc_finalize_deferred+24` (which is the inlined `rbimpl_atomic_exchange`). That would
attempt a load of 0xc8 if x0 was zero - i.e. if `gc_finalize_deferred` was called with a
NULL objspace.

The enqueuing of `gc_finalize_deferred` with postponed job only happens in one place (in
gc_sweep_page,
[here](https://github.com/ruby/ruby/blob/5cff4c5aa375787924e2df5c0b981dd922b…)
and if `objspace` was null there then the crash would have had to have already happened in
`gc_sweep_page`. Thus, I think that `gc_finalize_deferred` was _enqueued_ into the
postponed job list with a not-NULL argument, but the argument was corrupted whilst it was
in `vm->postponed_job_buffer`, and if objspace was null there then the crash would have
had to have already happened in `gc_sweep_page`. Thus, I think that `gc_finalize_deferred`
was enqueued into the postponed job list with a not-NULL argument, but the argument was
corrupted whilst it was in `vm->postponed_job_buffer`.

I had a look at the postponed job code, which is of course very tricky because it needs to
be async-signal-safe. More specifically:
* It needs to work if run from a thread not holding the GVL
* It needs to work if run from a thread, whilst another thread is actually executing
`rb_postponed_job_flush`
* It needs to work if run from a signal caught in a thread that is currently executing
`rb_postponed_job_flush` (this rules out trivial mutex-based solutions)

We use the Datadog continuous profiler in our application (CC: @ivoanjo ;), which calls
`rb_postponed_job_register` to capture profiling samples. Thus, I think our application is
likely to hit all of those scenarios semi-reguarly.

My best guess at a plausible sequence of events, to generate this particular crash, is
that:
1. `rb_postponed_job_flush` was running on thread T1.
2. There is a queued call to gc_finalize_deferred sitting in
`vm->postponed_job_buffer[vm->postponed_job_index-1]`.
3. T1 executed the `ATOMIC_CAS` at vm_trace.c:1800, decrementing
`vm->postponed_job_index` (which now equals `index - 1`) and determining that a job at
index index should be run.
4. Thread T2 received a signal, and the Datadog continuous profiler called
`rb_postponed_job_register_one(0, sample_from_postponed_job, NULL)`
5. T2 executed the `ATOMIC_CAS` at vm_trace.c:1679, re-incrementing
`vm->postponed_job_index`. It’s now equal to `index` from T1 again.
6. T2 then executes the sets on `pjob->func` and `pjob->data` at vm_trace.c:1687. It
sets `->func` to `sample_from_postponed_job` (from ddtrace), and `->data` to 0.
7. T1 then goes to call `(*pjob->func)(pjob->data)` at vm_trace.c:1802
8. Since there is no memory barrier between 6 & 7, T1 is allowed to see the effects of
the set on `pjob->data` and not see that of `pjob->func`.
9, T1 thus calls `gc_finalize_deferred` (which it was meant to do) with an argument of 0
(which it was not).

## Solution

Managing a thread-safe list of too-big-to-be-atomic objects (like `rb_postponed_job_t`) is
really tricky. I think it might be best for this code if we use a combination of masking
signals (to prevent manipulation of the postponed job list occurring during
`rb_postponed_job_flush`, and using a semaphore to protect the list (since semaphores are
required to be async-signal-safe on POSIX systems). I've implemented this in a PR
here: https://github.com/ruby/ruby/pull/8856

It seems _slightly_ slower to do it this way - semaphores require system calls in the
uncontended case, which is why they're async-signal-safe but also makes them more
expensive than pthread mutexes, which don't, on most systems. I ran my branch through
yjit bench:

With my patch:

```
interp: ruby 3.3.0dev (2023-11-07T08:14:09Z ktsanaktsidis/post.. 342f30f566)
[arm64-darwin22]
yjit: ruby 3.3.0dev (2023-11-07T08:14:09Z ktsanaktsidis/post.. 342f30f566) +YJIT
[arm64-darwin22]

--------------  -----------  ----------  ---------  ----------  ------------ 
-----------
bench           interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr 
interp/yjit
activerecord    31.2         3.4         17.0       5.7         1.29          1.83
chunky-png      543.5        0.5         367.0      0.7         1.40          1.48
erubi-rails     1044.2       0.6         564.7      1.3         1.69          1.85
hexapdf         1517.6       3.1         917.6      1.2         1.46          1.65
liquid-c        37.1         1.3         28.9       1.4         0.89          1.29
liquid-compile  39.0         1.4         29.9       1.6         0.76          1.30
liquid-render   89.9         1.8         39.6       1.4         1.37          2.27
lobsters        598.2        1.7         435.4      5.2         0.63          1.37
mail            79.8         3.1         52.5       1.0         0.79          1.52
psych-load      1441.5       1.7         885.4      0.5         1.60          1.63
railsbench      1010.8       1.0         609.3      1.3         1.24          1.66
ruby-lsp        40.9         3.4         29.2       30.0        0.66          1.40
sequel          39.8         1.8         33.0       2.4         1.18          1.21
--------------  -----------  ----------  ---------  ----------  ------------ 
-----------
```

Without the patch:
```
interp: ruby 3.3.0dev (2023-11-07T07:56:43Z master 5a2779d40f) [arm64-darwin22]
yjit: ruby 3.3.0dev (2023-11-07T07:56:43Z master 5a2779d40f) +YJIT [arm64-darwin22]

--------------  -----------  ----------  ---------  ----------  ------------ 
-----------
bench           interp (ms)  stddev (%)  yjit (ms)  stddev (%)  yjit 1st itr 
interp/yjit
activerecord    31.3         3.3         16.7       5.5         1.36          1.88
chunky-png      521.6        0.6         348.8      0.7         1.40          1.50
erubi-rails     1038.9       0.9         566.3      1.2         1.70          1.83
hexapdf         1501.9       1.1         951.7      3.9         1.42          1.58
liquid-c        36.7         1.2         29.3       1.7         0.86          1.25
liquid-compile  38.8         1.1         29.7       3.7         0.73          1.31
liquid-render   92.2         0.9         38.3       1.0         1.47          2.40
lobsters        582.5        2.0         429.8      5.6         0.59          1.36
mail            77.9         1.3         54.8       0.9         0.76          1.42
psych-load      1419.1       0.7         887.7      0.5         1.60          1.60
railsbench      1017.8       1.1         609.9      1.2         1.24          1.67
ruby-lsp        41.0         2.2         28.8       28.8        0.64          1.43
sequel          36.0         1.5         30.4       1.8         1.11          1.18
--------------  -----------  ----------  ---------  ----------  ------------ 
-----------
```

Maybe this is within the noise floor, but I thought I should bring it up.

-- 
https://bugs.ruby-lang.org/

[ruby-core:115286] [Ruby master Bug#19991] rb_register_postponed_job async-signal-unsafety causes crash in GC