
Issue #20587 has been updated by ivoanjo (Ivo Anjo). File repro.rb added File hello.c added Nice! I've tried the PR and it actually makes the issue in https://bugs.ruby-lang.org/issues/20586#How-to-reproduce disappear in my tests. (It does not fully solve the issue, but what made it trigger a lot more often was that the `readdir` call blocked with the GVL held, since the Datadog profiler tries to interrupt only threads holding the GVL. With the GVL released, this can only happen in the remote chance the profiler decided to interrupt the thread just as it was preparing to call `readdir`). I also decided recover an old experiment I had to try to trigger some of these problems using a custom "adversarial" FUSE filesystem. Here's my script to trigger the syscalls: ```ruby $liveness = true $looping = false liveness_check = Thread.new do print "Is Ruby alive? (Should print every 0.1s) " while $liveness do print '.' before_sleep = Time.now $looping = true sleep 0.1 delay = Time.now - before_sleep puts "\n[HICCUP -- VM was probably blocked for #{delay - 0.1}s!]" if delay > 0.5 end end Thread.pass while !$looping directory = ARGV.last result = case ARGV.first.to_sym when :readdir d = Dir.new(directory) print "(Opened!)" d.children when :fdopendir # This one doesn't seem to work as a reproducer, the opendir for d0 seems to get reused by libc and that one already # released the GVL raise 'Broken?' d0 = Dir.new(directory) d1 = Dir.for_fd(d0.fileno) d1.to_s when :chdir Dir.chdir(directory) { } when :fchdir d = Dir.new(directory) print "(Opened!)" d.chdir { } when :chroot begin Dir.chroot(directory) rescue => e e end when :lstat Dir.glob("#{directory}/*", File::FNM_DOTMATCH) when :stat Dir.glob("#{directory}/", File::FNM_DOTMATCH) when :fstatat raise 'Did not figure out this one?' else raise "Unknown syscall #{ARGV.first}" end $liveness = false liveness_check.join puts "\n#{result.inspect}" ``` ...and here's how it looks before your PR: ``` $ ./miniruby repro.rb readdir /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) ........................................(Opened!) [HICCUP -- VM was probably blocked for 1.9858527449999999s!] ["hello"] $ ./miniruby repro.rb chdir /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) . [HICCUP -- VM was probably blocked for 1.9008266639999998s!] nil $ ./miniruby repro.rb fchdir /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) ........................................(Opened!) [HICCUP -- VM was probably blocked for 1.987699879s!] nil $ ./miniruby repro.rb chroot /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) . [HICCUP -- VM was probably blocked for 1.9008783179999997s!] #<Errno::EPERM: Operation not permitted @ dir_s_chroot - /home/ivo.anjo/tmp-hello-mount-deleteme> $ ./miniruby repro.rb lstat /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) ........................................ [HICCUP -- VM was probably blocked for 3.9877458889999997s!] ["/home/ivo.anjo/tmp-hello-mount-deleteme/.", "/home/ivo.anjo/tmp-hello-mount-deleteme/hello"] $ ./miniruby repro.rb stat /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) . [HICCUP -- VM was probably blocked for 1.9009780909999998s!] ["/home/ivo.anjo/tmp-hello-mount-deleteme/"] ``` ...here's how it looks after your PR: ``` $ ./miniruby repro.rb readdir /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) ........................................(Opened!).................... ["hello"] $ ./miniruby repro.rb chdir /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) .................... nil $ ./miniruby repro.rb fchdir /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) ........................................(Opened!).................... nil $ ./miniruby repro.rb chroot /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) .................... #<Errno::EPERM: Operation not permitted @ dir_s_chroot - /home/ivo.anjo/tmp-hello-mount-deleteme> $ ./miniruby repro.rb lstat /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) ................................................................................ ["/home/ivo.anjo/tmp-hello-mount-deleteme/.", "/home/ivo.anjo/tmp-hello-mount-deleteme/hello"] $ ./miniruby repro.rb stat /home/ivo.anjo/tmp-hello-mount-deleteme Is Ruby alive? (Should print every 0.1s) .................... ["/home/ivo.anjo/tmp-hello-mount-deleteme/"] ``` I'm also attaching the "adversarial" FUSE filesystem -- I basically added a bunch of sleeps to https://github.com/libfuse/libfuse/blob/master/example/hello.c (my version attached), built the examples and mounted the filesystem. Hopefully this helps, and thanks again for picking this one up :)) ---------------------------------------- Bug #20587: dir.c calls blocking filesystem APIs/system calls while holding the GVL https://bugs.ruby-lang.org/issues/20587#change-109106 * Author: ivoanjo (Ivo Anjo) * Status: Open * Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Hey! I work for Datadog on the Ruby profiler part of the [`datadog` (previously `ddtrace`)](https://github.com/datadog/dd-trace-rb) gem. While I was investigating https://bugs.ruby-lang.org/issues/20586, I spotted that there's a number of cases where, in `dir.c`, blocking system calls are being made (e.g. `readdir()`, ~~`opendir()`~~, ... see comments for more detailed list...) without releasing the GVL. This means that if they block for a long time (as happens in the gcsfuse example in https://bugs.ruby-lang.org/issues/20586 ), the Ruby VM will just be blocked and not make any progress. The combination of not releasing the GVL + slow system calls actually makes the issue in https://bugs.ruby-lang.org/issues/20586 more likely to happen with the Datadog profiler, although even if the code releases the GVL the underlying issue could still happen, and this is why I decided to file this bug separately. ---Files-------------------------------- repro.rb (1.18 KB) hello.c (4.89 KB) -- https://bugs.ruby-lang.org/