Issue #19541 has been updated by kjtsanaktsidis (KJ Tsanaktsidis).
Thanks Alan for your feedback and clarifying YJIT's goals for me.
First off, let me confirm I'm on the same page as you about a couple of things.
I totally agree the unwind-info-registration APIs in GNU land are _awful_. Windows
does this way better with `RtlInstallFunctionTableCallback` - it covers both in-process
and out-of-process unwinding in a lazy way. Alas, not what we have available on
GNU/Linux.
I agree with your premise that YJIT does not muck around with the stack in very creative
ways, and very little information is actually needed to unwind through YJIT frames. My
approach in my POC was to make Ruby use the most obvious and well-exercised platform APIs
for registering unwind info, which to me seemed to be `__register_frame` and
`__jit_debug_register_code`. That is, I went through the rigmarole of DWARF CFI & ELF
generation to try and be a good platform citizen.
I also agree, however, that this is pretty heavyweight - the ELF file generation especially,
because the file has to be regenerated periodically whenever it runs out of "free" space to
jam more debuginfo in there.
Finally, I also acknowledge that adding Rust dependencies increases compile times & is
a huge pain for downstream distributors etc, and you've gone to quite some effort to
_not_ do that - I assume these are the main issues with actually adding the gimli/object
dependencies per se?
The general "vibe" I get from your feedback is that we don't want to
introduce huge implementation complexity just to make YJIT use the "standard"
unwinding mechanisms; rather, we should actually implement the simplest thing that works
for YJIT, and then tailor _that_ to platform interfaces.
One final thing to clarify though:
> For cases where Ruby already links with libunwind
> (some Linux distros and BSDs), we can register with its dynamic interface

If you're referring to `UNW_INFO_FORMAT_DYNAMIC` info, that's actually totally
unimplemented in libunwind for anything except Itanium (which... I assume is not a target
YJIT wants to support xD ). `UNW_INFO_FORMAT_TABLE` works AFAICT, but requires generating
DWARF CFI info (which is something we'd like to avoid).
---
OK, so what can we do that satisfies the following constraints?
1. Lets us unwind stacks containing YJIT frames in both GDB and the crash reporter
2. Does not require us to construct complex in-memory structures which are really designed
for on-disk use (i.e. no ELF files)
3. Does not require us to use DWARF CFI (which is far too complex for the simple stacks
that YJIT lays out)
4. Has very little runtime CPU cost to construct and register
5. Has very little runtime memory cost to have hanging around
I think I have a rough idea of something that might fit the bill.
Firstly, let's have YJIT generate a "compact unwind info format" of our own.
I definitely need to experiment with implementation before being too specific here, but
roughly...
* There would actually be two separate tables - one for inline code, and one for outlined
code.
* Each table would be sorted by IP.
* Each table would only be _appended_ to when code is generated - this is because (normally)
the IP of generated code for each code block only increases. That hopefully keeps gratuitous
memcpy'ing of data to a minimum (except when the table needs to grow).
* Need to do something about Code GC, which violates the "IP only increases"
invariant. Since Code GC frees only whole pages, perhaps the unwind info could be
per-page, and the pages would be stored in a hash table. That would make it O(1) both to
get the right block of unwind info to append to when generating code, as well as when
looking up the unwind info for a given IP.
* For each block, the unwind info would store:
  * Start/end IP of the block
  * Whether or not this block has a frame_setup prologue
  * Whether or not this block has a frame_teardown epilogue
  * Whether or not this block is split into the next inline/outline page as well
  * ~A pointer to the iseq structure~ (this can come later - it'd be needed for
  naming the block, but also introduces some fun GC mark/compaction issues).
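As a rough illustration of the table described above, here is a minimal Python model (the
real thing would live in YJIT's Rust code). The names `BlockUnwindInfo`, `PageUnwindTable`
and `UnwindInfo` and the `PAGE_SIZE` value are all hypothetical, and the sketch assumes
that a block split across pages gets one record per page.

```python
import bisect
from dataclasses import dataclass, field

PAGE_SIZE = 16 * 1024  # assumed code page size, for illustration only


@dataclass(frozen=True)
class BlockUnwindInfo:
    start_ip: int
    end_ip: int  # exclusive
    has_frame_setup: bool
    has_frame_teardown: bool
    continues_on_next_page: bool = False


@dataclass
class PageUnwindTable:
    starts: list = field(default_factory=list)  # start IPs, ascending
    blocks: list = field(default_factory=list)  # parallel to `starts`

    def append(self, info):
        # Code in a page is emitted at increasing addresses, so appending
        # keeps the table sorted without moving existing entries.
        if self.blocks and info.start_ip < self.blocks[-1].end_ip:
            raise ValueError("blocks must be appended in increasing IP order")
        self.starts.append(info.start_ip)
        self.blocks.append(info)

    def lookup(self, ip):
        # Binary search for the last block starting at or before `ip`.
        i = bisect.bisect_right(self.starts, ip) - 1
        if i >= 0 and ip < self.blocks[i].end_ip:
            return self.blocks[i]
        return None


class UnwindInfo:
    def __init__(self):
        self.pages = {}  # page index -> PageUnwindTable

    def append(self, info):
        page = info.start_ip // PAGE_SIZE
        self.pages.setdefault(page, PageUnwindTable()).append(info)

    def free_page(self, page_index):
        # Code GC frees whole pages, so dropping the page's table is enough.
        self.pages.pop(page_index, None)

    def lookup(self, ip):
        table = self.pages.get(ip // PAGE_SIZE)
        return table.lookup(ip) if table is not None else None
```

In this model, finding the page's table is a hash lookup, and finding the block inside the
page is a binary search over the sorted start IPs.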
If we're allowed to rely on the frame pointer being set up [1], and the shape of our
prologue/epilogues, I think that's all the information needed to do frame unwinding.
[1] This would mean we'd need to add it to x86_64 code generation. The register
isn't actually used for any of YJIT's generated code for any other reason, so I
doubt it'll have a big performance impact.
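To make the "very little information is needed" point concrete, here is a sketch of a
single unwind step for a frame whose prologue has already run. It assumes the prologue
stores the caller's frame pointer and the return address as an adjacent pair and then
points the frame pointer at that pair (the usual `push rbp; mov rbp, rsp` shape on x86_64,
or `stp x29, x30, [sp, #-16]!; mov x29, sp` on arm64); whether YJIT's frame_setup ends up
with exactly this shape is an assumption. `make_reader` and `unwind_one_frame` are
hypothetical names, and the stack is modelled as a bytes snapshot.

```python
import struct

WORD = 8  # size of a saved register on x86_64 and arm64


def make_reader(base, data):
    """Return a read_u64(addr) function over a bytes snapshot of the stack."""
    def read_u64(addr):
        offset = addr - base
        if offset < 0 or offset + WORD > len(data):
            raise ValueError(f"address {addr:#x} is outside the snapshot")
        return struct.unpack_from("<Q", data, offset)[0]
    return read_u64


def unwind_one_frame(fp, read_u64):
    """Return (caller_ip, caller_sp, caller_fp) for a frame with FP set up."""
    caller_fp = read_u64(fp)         # saved frame pointer of the caller
    caller_ip = read_u64(fp + WORD)  # return address into the caller
    caller_sp = fp + 2 * WORD        # stack pointer once the pair is popped
    return caller_ip, caller_sp, caller_fp
```

An IP that sits inside a prologue or epilogue, or in a block without a frame_setup at all,
would need different handling; that is what the per-block flags in the unwind info are for.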
Now, how do we connect that to GDB & the crash backtracer? Let's treat those
separately...
For GDB, there are actually _three_ JIT code registration mechanisms (that I could
count)...
1. The one using `__jit_debug_register_code` (which I used in my POC):
https://sourceware.org/gdb/onlinedocs/gdb/JIT-Interface.html
2. One that lets you load a .so file in GDB to help it understand your JIT stacks:
https://sourceware.org/gdb/onlinedocs/gdb/Writing-JIT-Debug-Info-Readers.ht…
3. One based on the Python interface:
https://sourceware.org/gdb/onlinedocs/gdb/Unwinding-Frames-in-Python.html
We already ship GDB helpers with Ruby (in `.gdbinit`). It's hopefully possible to
write some Python which can unwind YJIT stacks using the custom unwind info, and also
distribute that inside the Ruby source tree (perhaps it's even possible to distribute
it inline in `.gdbinit` - I can experiment with the specifics of this).
For the crash backtracer, I _think_ libunwind can be bent into shape for our purposes.
* We can add a configure flag `--with-libunwind` or such to compile Ruby against libunwind
if present, even when that would not normally be the case on a given platform.
* If libunwind is present, instead of using `backtrace(3)` to collect the stack all at
once, use `unw_init_local` to begin unwinding, and unwind frame-by-frame with
`unw_step`.
* If we encounter an IP we recognise as belonging to YJIT, do _NOT_ call `unw_step` to
unwind that frame.
* Instead, perform the unwinding logic ourselves using the YJIT unwind info, and then
construct a `unw_context_t` for the previous frame by hand (it looks like the necessary
struct definitions are present in the `libunwind-${arch}.h` header files).
* Start unwinding again based on this custom context struct by calling `unw_init_local`;
this _should_ start unwinding from the frame below if we've done it right.
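The control flow of the steps above can be modelled roughly as follows. This is only a
Python sketch of the dispatch logic, not libunwind code: `step_native` stands in for
`unw_step`, `step_jit` for the hand-written YJIT unwinding, `is_jit_ip` for the "do we
recognise this IP" check, and re-seeding the cursor via `unw_init_local` is not modelled.
All of these names are hypothetical.

```python
def collect_backtrace(ip, sp, fp, is_jit_ip, step_native, step_jit, max_frames=64):
    """Walk the stack, choosing the unwinder for each frame by its IP.

    Each step function takes (ip, sp, fp) and returns the caller's
    (ip, sp, fp), or None when there is no caller to unwind to.
    """
    frames = []
    while ip and len(frames) < max_frames:
        frames.append(ip)
        # JIT frames are unwound by hand; everything else goes to the
        # platform unwinder.
        step = step_jit if is_jit_ip(ip) else step_native
        caller = step(ip, sp, fp)
        if caller is None:
            break
        ip, sp, fp = caller
    return frames
```

The `max_frames` cap is there so that a corrupted stack cannot make the crash reporter loop
forever.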
Essentially, the tradeoff here is that we can make unwind info generation much simpler, at
the expense of making unwinding itself more complex (because we can't just rely on the
platform's DWARF unwinder). That seems like a reasonable tradeoff to me.
Does this sound like a fruitful path to go down? I should have a few weeks more or less
full time to work on this coming up (I'm taking a sabbatical from work to do open
source stuff!), so I'd really like to know if something along these lines would be
useful, more in line with YJIT's goals, and something which would be considered for
merging.
Thanks again for your time, I appreciate it.
---
Footnote:
> it's (GDB's jit interface) also known to be
> not have the best speed.

I think this concern only applies while GDB is actually _attached_; I don't think the
speed of running the program under a debugger should be a primary concern of this
unwinding work. This is moot anyway though because the ELF generation is a huge pain as
you point out.
----------------------------------------
Feature #19541: Proposal: Generate frame unwinding info for YJIT code
https://bugs.ruby-lang.org/issues/19541#change-102510
* Author: kjtsanaktsidis (KJ Tsanaktsidis)
* Status: Assigned
* Priority: Normal
* Assignee: yjit
----------------------------------------
## What is being proposed?
Currently, when Ruby crashes with yjit generated code on the stack, `rb_print_backtrace()`
is unable to actually show any frames underneath the yjit code. For example, if you send
SIGSEGV to a Ruby process running yjit, this is what you see:
```
/ruby/miniruby(rb_print_backtrace+0xc) [0xaaaad0276884] /ruby/vm_dump.c:785
/ruby/miniruby(rb_vm_bugreport) /ruby/vm_dump.c:1093
/ruby/miniruby(rb_bug_for_fatal_signal+0xd0) [0xaaaad0075580] /ruby/error.c:813
/ruby/miniruby(sigsegv+0x5c) [0xaaaad01bedac] /ruby/signal.c:919
linux-vdso.so.1(__kernel_rt_sigreturn+0x0) [0xffff91a3e8bc]
/ruby/miniruby(map<(usize, yjit::backend::ir::Insn), (usize, yjit::backend::ir::Insn),
yjit::backend::ir::{impl#17}::next_mapped::{closure_env#0}>+0x8c) [0xaaaad03b8b00]
/rustc/897e37553bba8b42751c67658967889d11ecd120/library/core/src/option.rs:929
/ruby/miniruby(next_mapped+0x3c) [0xaaaad0291dc0] src/backend/ir.rs:1225
/ruby/miniruby(arm64_split+0x114) [0xaaaad0287744] src/backend/arm64/mod.rs:359
/ruby/miniruby(compile_with_regs+0x80) [0xaaaad028bf84] src/backend/arm64/mod.rs:1106
/ruby/miniruby(compile+0xc4) [0xaaaad0291ae0] src/backend/ir.rs:1158
/ruby/miniruby(gen_single_block+0xe44) [0xaaaad02b1f88] src/codegen.rs:854
/ruby/miniruby(gen_block_series_body+0x9c) [0xaaaad03b0250] src/core.rs:1698
/ruby/miniruby(gen_block_series+0x50) [0xaaaad03b0100] src/core.rs:1676
/ruby/miniruby(branch_stub_hit_body+0x80c) [0xaaaad03b1f68] src/core.rs:2021
/ruby/miniruby({closure#0}+0x28) [0xaaaad02eb86c] src/core.rs:1924
/ruby/miniruby(do_call<yjit::core::branch_stub_hit::{closure_env#0}, *const
u8>+0x98) [0xaaaad035ba3c]
/rustc/897e37553bba8b42751c67658967889d11ecd120/library/std/src/panicking.rs:492
[0xaaaad035c9b4]
```
(n.b. - I compiled Ruby with `-fasynchronous-unwind-tables -rdynamic -g` in cflags to make
sure gcc generates appropriate unwind info & keeps the symbol tables).
Likewise, if you attach gdb to a Ruby process with yjit enabled, gdb can't show thread
backtraces through yjit-generated code either.
My proposal is that YJIT generate sufficient unwinding and debug information on all
platforms to allow both `rb_print_backtrace()` and the platform's debugger
(gdb/lldb/WinDbg) to show:
* Full stack traces all the way back to `main`. That is, it should be possible to see
frames _underneath_ `[0xaaaad035c9b4]` from the backtrace above.
* Names for the dynamically generated yjit blocks (e.g. instead of `[0xaaaad035c9b4]`, we
should see something like `yjit$$name_of_ruby_method`, where `name_of_ruby_method` is the
`label` for the iseq this is JIT'd code for).
## Motivation
I have a few motivations for wanting this. Firstly, I feel this functionality is
independently useful. When Ruby crashes, the more information we can get, the more likely
we are to find the root cause. Likewise, the same principle applies to debugging with
gdb - you can get a fuller understanding of what the process is doing if you see the
whole stack.
I have often found attaching gdb to the Ruby interpreter helps in understanding problems
in Ruby code or C extensions and is something I do relatively frequently; yjit breaking
that will definitely be inconvenient for me!
## Implementation
I have a draft implementation of this here:
https://github.com/ruby/ruby/pull/7567. It's currently missing tests & platform
support (it only works on Linux aarch64). Also, it implements unwind info generation, so
unwinding can work _through_ yjit code, but it does not currently emit symbols to give
_names_ to those yjit frames.
My PR contains a document which explains how the Linux interfaces for registering unwind
info for JIT'd code work, so I won't duplicate that information here.
The biggest implementation question I had is around the use of Rust crates. Currently, I
prototyped my implementation using the gimli & object crates, for generating DWARF
info and ELF binaries. However, the yjit build purposefully does not use cargo &
external crates for release builds. There are a few different ways we could go here:
* Don't use the gimli & object crates; instead, re-implement all debug info &
object file generation code in yjit.
* Don't use the crates; instead, link against C libraries to provide this functionality
& call them from Rust (perhaps some combination of libelf, libdw, libbfd, or llvm
might do what we need)
* Use cargo after all for the release build & download the crates at build-time
* Use cargo for the release build, but vendor everything, so the build doesn't need to
download anything
* Only make unwind info generation available in dev mode where cargo is used, and so mark
the gimli/object dependencies as optional in Cargo.toml.
We'd need to decide on one of these approaches for this proposal to work. I don't
really have a strong sense of the pros/cons of each.
(Side note - my PR actually depends on a _fork_ of gimli - I've been discussing adding
the needed interfaces upstream here:
https://github.com/gimli-rs/gimli/issues/648).
## Benchmarks
I ran the yjit-bench suite on my branch and compared it to Ruby master:
* My branch:
https://gist.github.com/KJTsanaktsidis/5741a9f64e5cd75cdf5fedd846091a4f
* Ruby master:
https://gist.github.com/KJTsanaktsidis/592d3ebcf98f6745dfa3efbd30a25acf
This is a (very simple) comparison:
```
-------------- ------------ ------------ ---------------
bench yjit (ms) branch (ms) branch/yjit (%)
activerecord 97.5 98.5 101.03%
hexapdf 2415.3 2458.2 101.78%
liquid-c 61.9 63.1 101.94%
liquid-render 135.3 135.0 99.78%
mail 104.6 105.5 100.86%
psych-load 1887.1 1922.0 101.85%
railsbench 1544.4 1556.0 100.75%
ruby-lsp 88.4 89.5 101.24%
sequel 147.5 151.1 102.44%
binarytrees 303 305.6 100.86%
chunky_png 1075.8 1079.4 100.33%
erubi 392.9 392.3 99.85%
erubi_rails 14.7 14.7 100.00%
etanni 792.3 791.4 99.89%
fannkuchredux 3815.9 3813.6 99.94%
lee 1030.2 1039.2 100.87%
nbody 49.2 49.3 100.20%
optcarrot 4142 4143.3 100.03%
ruby-json 2860.7 2874.0 100.46%
rubykon 7906.6 7904.2 99.97%
30k_ifelse 348.7 345.4 99.05%
30k_methods 828.6 831.8 100.39%
cfunc_itself 28.8 28.9 100.35%
fib 34.4 34.5 100.29%
getivar 115.5 109.7 94.98%
keyword_args 37.7 38.0 100.80%
respond_to 26 26.1 100.38%
setivar 33.8 33.5 99.11%
setivar_object 208.7 194.3 93.10%
str_concat 52.6 52.2 99.24%
throw 23.8 24.1 101.26%
-------------- ------------ ------------ ---------------
```
It seems like the performance impact of generating and registering the debug info is
marginal: no benchmark is more than about 2.5% slower than master, and several are
slightly faster.
--
https://bugs.ruby-lang.org/