
Issue #20493 has been updated by kjtsanaktsidis (KJ Tsanaktsidis). OK, I believe I've worked out what's wrong here. The BLOCKING_REGION macro is used to release the GVL, run some code, and re-acquire the GVL. It looks like this: ``` #define BLOCKING_REGION(th, exec, ubf, ubfarg, fail_if_interrupted) do { \ struct rb_blocking_region_buffer __region; \ if (blocking_region_begin(th, &__region, (ubf), (ubfarg), fail_if_interrupted) || \ /* always return true unless fail_if_interrupted */ \ !only_if_constant(fail_if_interrupted, TRUE)) { \ exec; \ blocking_region_end(th, &__region); \ }; \ } while(0) ``` `blocking_region_begin` calls `RB_VM_SAVE_MACHINE_CONTEXT`, which: * Updates the "end of machine stack" pointer in `ec->machine` to be the current stack pointer, * Abuses `setjmp` to save the register state in the `ec->machine` structure as well The idea being that while the thread is off doing things in C-land, another thread performing GC can mark the portion of the machine stack from before the `BLOCKING_REGION` call. The problem is that the disassembly for `blocking_region_begin` looks like this: ``` (rr) disassemble Dump of assembler code for function blocking_region_begin: 0x000074794fa36900 <+0>: push %r15 0x000074794fa36902 <+2>: push %r14 0x000074794fa36904 <+4>: mov %rdx,%r14 0x000074794fa36907 <+7>: push %r13 0x000074794fa36909 <+9>: push %r12 0x000074794fa3690b <+11>: mov %rsi,%r12 0x000074794fa3690e <+14>: push %rbp 0x000074794fa3690f <+15>: push %rbx ``` i.e. it pushes several of the callee-saved registers to the stack, and uses them for its own purposes - including `%r15`, which had the value of `str` from the `rb_io_getline_fast` frame below. When it calls `setjmp` to capture the register state, the value of `%r15` that's saved is the value used in _this_ function, not the old value. That would be OK, because it's spilled to the stack, and the "end of stack pointer" is also incremented, so it should be reachable by the GC that way. However, `blocking_region_begin` is a function, and it's not inlined, the machine stack pointer is restored at the function end. Then, the memory where `%r15` and hence `str` were spilled to gets re-used for the C code doing the without-the-GVL work. I.e. the sequence of operations is: * Call `blocking_region_begin`. Stack pointer is decremented (stack made bigger) * Callee-saved registers including the value of `str` are spilled to the stack. * %r15 is overwritten * Current $sp is set as EC's "end of the stack" value. * Registers are saved to EC's regs structure * `blocking_region_begin` returns. Stack pointer is incremented (stack made smaller). The memory containing the value of `str` now lies beyond the end of the stack. The real value of `str` is returned to %r15, because it's callee-saved. * More work happens. The stack is made bigger again by other functions, and junk is written to that stack slot * GC marking then can't see the value of `str` A "quick fix" for this is to inline the saving of machine stack state into the `BLOCKING_REGION` macro. That way, it's not in a separate function call, and so the stack frame containing the spilled registers is not invalidated until after the blocking completes. Longer term, though, I'd like to rethink how this works a bit. I think `BLOCKING_REGION` being a kind of "macro accepting a block" is a bad idea, because it mixes up the careful stack management that it has to do with the stack state of the caller. I think it should be converted to use a function pointer callback like `rb_thread_call_without_gvl` etc, and we should also consider using hand-written assembly to perform the stack management I think (like what we use in the fiber switching code). ---------------------------------------- Bug #20493: Segfault on rb_io_getline_fast https://bugs.ruby-lang.org/issues/20493#change-108326 * Author: josegomezr (Jose Gomez) * Status: Open * Assignee: kjtsanaktsidis (KJ Tsanaktsidis) * ruby -v: 3.3.1 * Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- We've spotted a consistent segfault when running bundle install with `--jobs 4` When running: `bundle install -j 4` we'd get a Segfault at: ``` /usr/lib64/ruby/3.3.0/rubygems/ext/builder.rb:93: [BUG] Segmentation fault at 0x0000000000000014 ruby 3.3.1 (2024-04-23 revision c56cd86388) [x86_64-linux-gnu] ``` Full [log is available here][0]. I could not find a shorter reproducer besides using bundler with `--jobs 4` or `--jobs 8`. Here's a sample command to trigger the behavior (it creates the Gemfile and calls bundler) [1]. We installed all debug symbols and narrowed down the location of the segfault to `rb_io_getline_fast` in io.c At [line 4001][3] `str` is `T_NONE`, which makes further usage down the line in [`io_enc_str`][4] raise a null pointer dereference. With the notes from [extension.rdoc - Appendix E. RB_GC_GUARD to protect from premature GC][8] I've prepared a patched ruby 3.3.1 package that does not segfault. It's on [OBS Project home:josegomezr:branches:ruby/ruby3.3][6]. Adding a `RB_GC_GUARD` on `rb_io_getline_fast` @ `io.c:4004` just before the return ```diff --- ruby3.3.orig/ruby-3.3.1/io.c +++ ruby3.3/ruby-3.3.1/io.c @@ -4004,6 +4004,7 @@ rb_io_getline_fast(rb_io_t *fptr, rb_enc ENC_CODERANGE_SET(str, cr); fptr->lineno++; + RB_GC_GUARD(str); return str; } ``` Fixes the segfault in our tests. `bundle` finish the installation and the image is built. I've set up a project in OBS to provide reproduceables. - [ruby3.3.1 package][5]. - [ruby3.3.1 base image with enough dependencies to reproduce][7] with [the reproducer script][1]. And the corresponding container is exported in the `containers-patched` repository. Here I leave the docker images generated by OBS: - 3.3.1 [without patches, segfaults.] ``` registry.opensuse.org/home/josegomezr/branches/ruby/containers/containers/base-ruby33:latest ``` - 3.3.1 [with patch, does not fail] ``` registry.opensuse.org/home/josegomezr/branches/ruby/containers/containers-patched/base-ruby33:latest ``` [0]: https://gist.github.com/josegomezr/441c271cc731b0ec57213cb98743a699 [1]: https://gist.github.com/josegomezr/e17129bf2df33f3bea60e84a616a8322 [2]: https://gist.github.com/josegomezr/6f81878c979af334efee59b8f2225e58 [3]: https://github.com/ruby/ruby/blob/v3_3_1/io.c#L4001 [4]: https://github.com/ruby/ruby/blob/v3_3_1/io.c#L4003 [5]: https://build.opensuse.org/package/show/devel:languages:ruby/ruby3.3 [6]: https://build.opensuse.org/package/show/home:josegomezr:branches:ruby/ruby3.... [7]: https://build.opensuse.org/package/show/home:josegomezr:branches:ruby:contai... [8]: https://github.com/ruby/ruby/blob/master/doc/extension.rdoc#label-Appendix+E... -- https://bugs.ruby-lang.org/