[ruby-core:115346] [Ruby master Bug#20001] Make Ruby work properly with ASAN enabled

Issue #20001 has been reported by kjtsanaktsidis (KJ Tsanaktsidis). ---------------------------------------- Bug #20001: Make Ruby work properly with ASAN enabled https://bugs.ruby-lang.org/issues/20001 * Author: kjtsanaktsidis (KJ Tsanaktsidis) * Status: Open * Priority: Normal * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- This ticket covers some work I want to do to get the ASAN infrastructure working in Ruby working again. I don't know if it ever worked well, but if it did, it appears to have bitrotted. Here are a few of its current problems: ### Stack size calculations are wrong Ruby takes the address of a local variable as part of the process of working out the top/bottom of the native stack in `native_thread_init_stack`. Because ASAN can end up putting some local variables on a "fake stack", this calculation can wind up producing the wrong result and setting `th->ec->machine.stack_start` incorrectly. This then leads to `stack_check` thinking that the machine stack has overflowed all the time, and thus, leading to programs like the following to fail: ``` ASAN_OPTIONS=use_sigaltstack=0:detect_leaks=0 ./miniruby -e 'Thread.new { puts "hi" }.join' #<Thread:0x00007fb5d79f3f28 -e:1 run> terminated with exception (report_on_exception is true): SystemStackError -e: stack level too deep (SystemStackError) ``` Another consequence of stack size detection not working properly is that the machine stack is not properly marked during GC, so things on the stack which should be considered live get prematurely collected. ASAN provides the `__asan_addr_is_in_fake_stack` function which can be used to get the address of a local variable on the real stack; I think Ruby's various stack-detecting macros could then make use of this to make it work. ### VALUEs in fake stacks are not marked Another consequece of ASAN storing local variables in fake stacks is that we don't see them when doing the machine stack mark. Again, the `__asan_addr_is_in_fake_stack` function can help us here. ASAN leaves a pointer to the fake stack on the real stack in every frame. When marking the machine stack, we can check each word to see if it's a pointer to a fake stack frame, and then use `__asan_addr_is_in_fake_stack` to get the extents of the fake frame and scan that too. This seems to be e.g. [how V8 does it](https://github.com/v8/v8/blob/1c7d3a94fc051e0fd69584b930faab448026941e/src/h...) ### Doesn't work with GCC Our ASAN implementation doesn't work with GCC, even though GCC supports ASAN. This is because we use the `__has_feature(address_sanitizer)` macro in sanitizers.h, which is a clang-ism. The equivalent GCCism is `__SANITIZE_ADDRESS__` and we should check that too. ## Plan of attack At the moment, I can't even run a full build of ruby to run, because miniruby crashees during the build process. My plan of attack here is to: * Address those known problems I've already identified above * Get `make` to actually work with asan * Try running the test suite through ASAN, and fix any issues that turns up * I'm thinking we should add an `--enable-asan` or `--enable-address-sanitizer` or some such to our configure script, to make it easy to build Ruby with ASAN without having to poke around with individual CFLAGS/LDFLAGS * Eventually, it would be great to actually run the tests under ASAN in CI This is probably a medium term body of work, but I'll try and tackle it in bits. Also: @HParker and @peterzhu2118 - I know you folks have been working on getting Valgrind to work better with Ruby, for leak detection. I _think_ I see my efforts here as complementary to yours, rather than duplicative. The ASAN infrastructure for poisoning/unpoisoning stuff in the GC already exists and is _close_ to working properly, and it really did help me solve a bug yesterday (https://bugs.ruby-lang.org/issues/19994), so it seems useful and should be made to work. Your work on freeing memory on shutdown (https://bugs.ruby-lang.org/issues/19993) should actually help ASAN usefully detect leaks as well. I think ASAN might be better for eventually running CI checks against the full Ruby test suite, since allegedly it's faster. However, if you think solving these issues with ASAN is a waste of time and Valgrind can catch the same bugs already, please chime in! -- https://bugs.ruby-lang.org/

Issue #20001 has been updated by kjtsanaktsidis (KJ Tsanaktsidis). This is the first part of this work: https://github.com/ruby/ruby/pull/8903 With this PR, i can at least run `make` with asan enabled on both macos & linux with clang. GCC still doesn't work and i haven't tried actually running the test suite yet! ---------------------------------------- Bug #20001: Make Ruby work properly with ASAN enabled https://bugs.ruby-lang.org/issues/20001#change-105281 * Author: kjtsanaktsidis (KJ Tsanaktsidis) * Status: Open * Priority: Normal * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- This ticket covers some work I want to do to get the ASAN infrastructure working in Ruby working again. I don't know if it ever worked well, but if it did, it appears to have bitrotted. Here are a few of its current problems: ### Stack size calculations are wrong Ruby takes the address of a local variable as part of the process of working out the top/bottom of the native stack in `native_thread_init_stack`. Because ASAN can end up putting some local variables on a "fake stack", this calculation can wind up producing the wrong result and setting `th->ec->machine.stack_start` incorrectly. This then leads to `stack_check` thinking that the machine stack has overflowed all the time, and thus, leading to programs like the following to fail: ``` ASAN_OPTIONS=use_sigaltstack=0:detect_leaks=0 ./miniruby -e 'Thread.new { puts "hi" }.join' #<Thread:0x00007fb5d79f3f28 -e:1 run> terminated with exception (report_on_exception is true): SystemStackError -e: stack level too deep (SystemStackError) ``` Another consequence of stack size detection not working properly is that the machine stack is not properly marked during GC, so things on the stack which should be considered live get prematurely collected. ASAN provides the `__asan_addr_is_in_fake_stack` function which can be used to get the address of a local variable on the real stack; I think Ruby's various stack-detecting macros could then make use of this to make it work. ### VALUEs in fake stacks are not marked Another consequece of ASAN storing local variables in fake stacks is that we don't see them when doing the machine stack mark. Again, the `__asan_addr_is_in_fake_stack` function can help us here. ASAN leaves a pointer to the fake stack on the real stack in every frame. When marking the machine stack, we can check each word to see if it's a pointer to a fake stack frame, and then use `__asan_addr_is_in_fake_stack` to get the extents of the fake frame and scan that too. This seems to be e.g. [how V8 does it](https://github.com/v8/v8/blob/1c7d3a94fc051e0fd69584b930faab448026941e/src/h...) ### Doesn't work with GCC Our ASAN implementation doesn't work with GCC, even though GCC supports ASAN. This is because we use the `__has_feature(address_sanitizer)` macro in sanitizers.h, which is a clang-ism. The equivalent GCCism is `__SANITIZE_ADDRESS__` and we should check that too. ## Plan of attack At the moment, I can't even run a full build of ruby to run, because miniruby crashees during the build process. My plan of attack here is to: * Address those known problems I've already identified above * Get `make` to actually work with asan * Try running the test suite through ASAN, and fix any issues that turns up * I'm thinking we should add an `--enable-asan` or `--enable-address-sanitizer` or some such to our configure script, to make it easy to build Ruby with ASAN without having to poke around with individual CFLAGS/LDFLAGS * Eventually, it would be great to actually run the tests under ASAN in CI This is probably a medium term body of work, but I'll try and tackle it in bits. Also: @HParker and @peterzhu2118 - I know you folks have been working on getting Valgrind to work better with Ruby, for leak detection. I _think_ I see my efforts here as complementary to yours, rather than duplicative. The ASAN infrastructure for poisoning/unpoisoning stuff in the GC already exists and is _close_ to working properly, and it really did help me solve a bug yesterday (https://bugs.ruby-lang.org/issues/19994), so it seems useful and should be made to work. Your work on freeing memory on shutdown (https://bugs.ruby-lang.org/issues/19993) should actually help ASAN usefully detect leaks as well. I think ASAN might be better for eventually running CI checks against the full Ruby test suite, since allegedly it's faster. However, if you think solving these issues with ASAN is a waste of time and Valgrind can catch the same bugs already, please chime in! -- https://bugs.ruby-lang.org/

Issue #20001 has been updated by HParker (Adam Hess).
I think I see my efforts here as complementary to yours, rather than duplicative
💯 I agree. I believe having both ASAN and Valgrind available is ideal. Valgind, I tend to be very confident in the results, but ASAN will be easier to integrate with CI and run consistently. The ability to run a bigger suite in CI might make ASAN a better long term solution if it can find the same problems. Maybe we can share the environment variable since i assume we will end up needing to manage/free the same things? ---------------------------------------- Bug #20001: Make Ruby work properly with ASAN enabled https://bugs.ruby-lang.org/issues/20001#change-105290 * Author: kjtsanaktsidis (KJ Tsanaktsidis) * Status: Open * Priority: Normal * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- This ticket covers some work I want to do to get the ASAN infrastructure working in Ruby working again. I don't know if it ever worked well, but if it did, it appears to have bitrotted. Here are a few of its current problems: ### Stack size calculations are wrong Ruby takes the address of a local variable as part of the process of working out the top/bottom of the native stack in `native_thread_init_stack`. Because ASAN can end up putting some local variables on a "fake stack", this calculation can wind up producing the wrong result and setting `th->ec->machine.stack_start` incorrectly. This then leads to `stack_check` thinking that the machine stack has overflowed all the time, and thus, leading to programs like the following to fail: ``` ASAN_OPTIONS=use_sigaltstack=0:detect_leaks=0 ./miniruby -e 'Thread.new { puts "hi" }.join' #<Thread:0x00007fb5d79f3f28 -e:1 run> terminated with exception (report_on_exception is true): SystemStackError -e: stack level too deep (SystemStackError) ``` Another consequence of stack size detection not working properly is that the machine stack is not properly marked during GC, so things on the stack which should be considered live get prematurely collected. ASAN provides the `__asan_addr_is_in_fake_stack` function which can be used to get the address of a local variable on the real stack; I think Ruby's various stack-detecting macros could then make use of this to make it work. ### VALUEs in fake stacks are not marked Another consequece of ASAN storing local variables in fake stacks is that we don't see them when doing the machine stack mark. Again, the `__asan_addr_is_in_fake_stack` function can help us here. ASAN leaves a pointer to the fake stack on the real stack in every frame. When marking the machine stack, we can check each word to see if it's a pointer to a fake stack frame, and then use `__asan_addr_is_in_fake_stack` to get the extents of the fake frame and scan that too. This seems to be e.g. [how V8 does it](https://github.com/v8/v8/blob/1c7d3a94fc051e0fd69584b930faab448026941e/src/h...) ### Doesn't work with GCC Our ASAN implementation doesn't work with GCC, even though GCC supports ASAN. This is because we use the `__has_feature(address_sanitizer)` macro in sanitizers.h, which is a clang-ism. The equivalent GCCism is `__SANITIZE_ADDRESS__` and we should check that too. ## Plan of attack At the moment, I can't even run a full build of ruby to run, because miniruby crashees during the build process. My plan of attack here is to: * Address those known problems I've already identified above * Get `make` to actually work with asan * Try running the test suite through ASAN, and fix any issues that turns up * I'm thinking we should add an `--enable-asan` or `--enable-address-sanitizer` or some such to our configure script, to make it easy to build Ruby with ASAN without having to poke around with individual CFLAGS/LDFLAGS * Eventually, it would be great to actually run the tests under ASAN in CI This is probably a medium term body of work, but I'll try and tackle it in bits. Also: @HParker and @peterzhu2118 - I know you folks have been working on getting Valgrind to work better with Ruby, for leak detection. I _think_ I see my efforts here as complementary to yours, rather than duplicative. The ASAN infrastructure for poisoning/unpoisoning stuff in the GC already exists and is _close_ to working properly, and it really did help me solve a bug yesterday (https://bugs.ruby-lang.org/issues/19994), so it seems useful and should be made to work. Your work on freeing memory on shutdown (https://bugs.ruby-lang.org/issues/19993) should actually help ASAN usefully detect leaks as well. I think ASAN might be better for eventually running CI checks against the full Ruby test suite, since allegedly it's faster. However, if you think solving these issues with ASAN is a waste of time and Valgrind can catch the same bugs already, please chime in! -- https://bugs.ruby-lang.org/

Issue #20001 has been updated by kjtsanaktsidis (KJ Tsanaktsidis).
Valgind, I tend to be very confident in the results, but ASAN will be easier to integrate with CI and run consistently
That's how I see the two tools too
Maybe we can share the environment variable since i assume we will end up needing to manage/free the same things?
I don't think the ASAN stuff needs anything specific for free-on-exit above what you're working on; once you have that merged it should Just Work (tm) and make ASAN report fewer leaks at termination. ---------------------------------------- Bug #20001: Make Ruby work properly with ASAN enabled https://bugs.ruby-lang.org/issues/20001#change-105413 * Author: kjtsanaktsidis (KJ Tsanaktsidis) * Status: Open * Priority: Normal * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- This ticket covers some work I want to do to get the ASAN infrastructure working in Ruby working again. I don't know if it ever worked well, but if it did, it appears to have bitrotted. Here are a few of its current problems: ### Stack size calculations are wrong Ruby takes the address of a local variable as part of the process of working out the top/bottom of the native stack in `native_thread_init_stack`. Because ASAN can end up putting some local variables on a "fake stack", this calculation can wind up producing the wrong result and setting `th->ec->machine.stack_start` incorrectly. This then leads to `stack_check` thinking that the machine stack has overflowed all the time, and thus, leading to programs like the following to fail: ``` ASAN_OPTIONS=use_sigaltstack=0:detect_leaks=0 ./miniruby -e 'Thread.new { puts "hi" }.join' #<Thread:0x00007fb5d79f3f28 -e:1 run> terminated with exception (report_on_exception is true): SystemStackError -e: stack level too deep (SystemStackError) ``` Another consequence of stack size detection not working properly is that the machine stack is not properly marked during GC, so things on the stack which should be considered live get prematurely collected. ASAN provides the `__asan_addr_is_in_fake_stack` function which can be used to get the address of a local variable on the real stack; I think Ruby's various stack-detecting macros could then make use of this to make it work. ### VALUEs in fake stacks are not marked Another consequece of ASAN storing local variables in fake stacks is that we don't see them when doing the machine stack mark. Again, the `__asan_addr_is_in_fake_stack` function can help us here. ASAN leaves a pointer to the fake stack on the real stack in every frame. When marking the machine stack, we can check each word to see if it's a pointer to a fake stack frame, and then use `__asan_addr_is_in_fake_stack` to get the extents of the fake frame and scan that too. This seems to be e.g. [how V8 does it](https://github.com/v8/v8/blob/1c7d3a94fc051e0fd69584b930faab448026941e/src/h...) ### Doesn't work with GCC Our ASAN implementation doesn't work with GCC, even though GCC supports ASAN. This is because we use the `__has_feature(address_sanitizer)` macro in sanitizers.h, which is a clang-ism. The equivalent GCCism is `__SANITIZE_ADDRESS__` and we should check that too. ## Plan of attack At the moment, I can't even run a full build of ruby to run, because miniruby crashees during the build process. My plan of attack here is to: * Address those known problems I've already identified above * Get `make` to actually work with asan * Try running the test suite through ASAN, and fix any issues that turns up * I'm thinking we should add an `--enable-asan` or `--enable-address-sanitizer` or some such to our configure script, to make it easy to build Ruby with ASAN without having to poke around with individual CFLAGS/LDFLAGS * Eventually, it would be great to actually run the tests under ASAN in CI This is probably a medium term body of work, but I'll try and tackle it in bits. Also: @HParker and @peterzhu2118 - I know you folks have been working on getting Valgrind to work better with Ruby, for leak detection. I _think_ I see my efforts here as complementary to yours, rather than duplicative. The ASAN infrastructure for poisoning/unpoisoning stuff in the GC already exists and is _close_ to working properly, and it really did help me solve a bug yesterday (https://bugs.ruby-lang.org/issues/19994), so it seems useful and should be made to work. Your work on freeing memory on shutdown (https://bugs.ruby-lang.org/issues/19993) should actually help ASAN usefully detect leaks as well. I think ASAN might be better for eventually running CI checks against the full Ruby test suite, since allegedly it's faster. However, if you think solving these issues with ASAN is a waste of time and Valgrind can catch the same bugs already, please chime in! -- https://bugs.ruby-lang.org/
participants (2)
-
HParker (Adam Hess)
-
kjtsanaktsidis (KJ Tsanaktsidis)