[ruby-core:124342] [Ruby Feature#21800] `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name
Issue #21800 has been reported by byroot (Jean Boussier). ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall. I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name, stat| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by matz (Yukihiro Matsumoto). The point is `dirent.d_type` is **not** defined in POSIX, so the API may be unreliable. Matz. ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800#change-116084 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by byroot (Jean Boussier). @matz yes, but not we're not exposing `dirent.d_type` to users, but a lazily initialized `File::Stat`. This means that even on platform where `d_type` is missing, the Ruby level API is identical, just less performant. ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800#change-116091 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by mame (Yusuke Endoh). Discussed at the dev meeting, several points were raised. * The lack of portability is concerning (as @matz already mentioned). * The lazy `File::Stat` is concerning about the timing: the file type reflects information at the time of the `readdir` call, while other information like `mtime` reflects the time `#mtime` method was called, which could lead to inconsistencies. * It's unclear whether the lazy `File::Stat` should use `lstat` or `stat`. * Considering the above, it might be better to simply pass the information returned by readdir (i.e., `f_type`) instead of `File::Stat`. * Yielding `f_type` requires explicitly writing code to call `File.stat` (or `File.lstat`) for `DT_UNKNOWN` cases, which isn't very convenient (though it might be acceptable since this method isn't for casual use?). * About the API, checking the block's arity to yield differently is not great. It would be better to simply yield the `f_type` information in a separate method (such as `Dir.each_child_with_file_type(path) {|name, f_type| ... }`) or when keyword arguments are provided (such as `Dir.each_child(path, with_file_type: true) {|name, f_type| ... }`. * There are some options how to represent `f_type`, including (1) using the raw integer (e.g., comparing against constants like `Dir::DT_UNKNOWN`), or (2) using a symbol like `:UNKNOWN`. ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800#change-116101 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by byroot (Jean Boussier).
The lazy File::Stat
My draft was just yielding a symbol: https://github.com/ruby/ruby/pull/15667, @nobu did the `File::Stat` version, I thought it was elegant, and degraded better on platforms where `d_type` is not supported.
The lazy File::Stat is concerning about the timing
That doesn't concern me too much to be honest, given all file operation are subject to this sort of race conditions. But I understand why it concerns you.
Yielding f_type requires explicitly writing code to call File.stat (or File.lstat) for DT_UNKNOWN cases, which isn't very convenient (though it might be acceptable since this method isn't for casual use?).
There is also the option of Ruby doing the `lstat` call to translate `DT_UNKNOWN` into the actual type. If we expose `DT_UNKNOWN` I'm worried users won't handle it correctly. From my understanding all the popular systems do support `f_type`, so it would be very hard to test for `DT_UNKNOWN` and would likely be broken.
It's unclear whether the lazy File::Stat should use lstat or stat.
In my opinion it should be `lstat` because for this sort of code, circular symlinks, and following symlinks in general, is a concern. But I can be convinced otherwise. Also if we go toward a new method, we can always make it a keyword argument.
About the API, checking the block's arity to yield differently is not great.
Agreed. I did it for simplicity and to avoid a discussion about naming, but I think it would be OK to make it a new method, e.g. - `Dir.scan(path) { |name, type| }` - `Dir.scan(path) [[name, type], ...]`
here are some options how to represent f_type, including (1) using the raw integer (e.g., comparing against constants like Dir::DT_UNKNOWN), or (2) using a symbol like :UNKNOWN.
I don't have any strong opinion here. We a precedent with `File::Stat#ftype` that returns a String, but I don't think that makes for a very usable API: https://docs.ruby-lang.org/en/master/File/Stat.html#method-i-ftype On one hand, symbols are easier to discover when playing with the API in `irb` and such. On the other constants are easier to discover when searching the documentation and are typo-proof. I may have a slight preference for constants. ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800#change-116102 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by Eregon (Benoit Daloze). I think yielding Symbols are nicer, returning a plain Integer feels too low level IMO and people might misuse that e.g. `DT_UNKNOWN` is 0 on some platforms. It's also more verbose and less clear (due to unclear C naming) for the Ruby caller to `case type in File::DT_REG` vs `case type in :file`. I think we should match `File::Stat#ftype` for the naming for consistency. byroot (Jean Boussier) wrote in #note-9:
degraded better on platforms where `d_type` is not supported.
I'm not following this, wouldn't it be the same if `d_type` is not supported or returns `DT_UNKNOWN` to then call `lstat` to find out, so the Ruby caller always gets the type? I'm +1 on adding such functionality. FWIW TruffleRuby already has an internal method to achieve the same thing and it's used to implement `Dir.glob` efficiently with Ruby code. ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800#change-116130 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by byroot (Jean Boussier).
I think we should match File::Stat#ftype for the naming for consistency.
In principle in make sense, but in practice that other method returns strings, and weird ones at that:
Identifies the type of stat. The return string is one of: `“file”`, `“directory”`, `“characterSpecial”`, `“blockSpecial”`, `“fifo”`, `“link”`, `“socket”`, or `“unknown”`.
If we yield symbols instead, it's already not super consistent, but also I think it would be weird to yield camel cased symbols like `:characterSpecial` or `:blockSpecial`.
I'm not following this, wouldn't it be the same if d_type is not supported or returns DT_UNKNOWN to then call lstat to find out, so the Ruby caller always gets the type?
That is what I'm suggesting. for a symlink the `d_type` will be `:link`, so in case `d_type` isn't supported or is UNKNOWN, we should use `lstat` to return the equivalent type. ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800#change-116132 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by Eregon (Benoit Daloze). byroot (Jean Boussier) wrote in #note-11:
In principle in make sense, but in practice that other method returns strings, and weird ones at that:
Identifies the type of stat. The return string is one of: `“file”`, `“directory”`, `“characterSpecial”`, `“blockSpecial”`, `“fifo”`, `“link”`, `“socket”`, or `“unknown”`.
If we yield symbols instead, it's already not super consistent, but also I think it would be weird to yield camel cased symbols like `:characterSpecial` or `:blockSpecial`.
I noticed as well but those are pretty "special" files so I think it's not a big deal to use camelCase there. Even with these 2 camelCase names I do think they are much better and clearer than `DT_` constants (`DT_BLK, DT_CHR, DT_DIR, DT_FIFO, DT_LNK, DT_REG, DT_SOCK, DT_UNKNOWN`). We could do `:character_special` and `:block_special`, I think that's fine too. My point is the DT_ constants are unreadable/cryptic so we shouldn't use that, and the names returned by `File::Stat#ftype` are much better. ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800#change-116134 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by byroot (Jean Boussier).
those are pretty "special" files so I think it's not a big deal to use camelCase there.
Yes, they're very unlikely to be used.
My point is the DT_ constants are unreadable/cryptic so we shouldn't use that,
I'm fine either way. My initial prototype was using symbols: https://github.com/ruby/ruby/pull/15667 ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800#change-116135 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by Eregon (Benoit Daloze). I think the spec in https://bugs.ruby-lang.org/issues/21839#note-3 is perfect and IMO good to go as-is:
* I propose: *`Dir.scan(path) { |entry_name, entry_type| }` *`Dir.scan(path) # => [[entry_name, entry_type], ...]` * The type is just a symbol, similar to `File::Stat#ftype` * In case of `DT_UNKNOWN`, Ruby issue a `lstat` to obtain the real type (important for portability).
It's simple, doesn't add overhead or complexity in File::Stat and serves its purpose well. We want to avoid extra syscalls for the case that `struct dirent.d_type` is available so we should not auto-resolve symlinks. Also auto-resolving symlinks is a common mistake e.g. when traversing a directories hierarchy as it could go "back up" (and loop infinitely). So lstat() is the simplest and safest choice there. People can always resolve the symlink themselves if that's what they want, but typically they'd want to do some validation first anyway. ---------------------------------------- Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name https://bugs.ruby-lang.org/issues/21800#change-116275 * Author: byroot (Jean Boussier) * Status: Open ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
Issue #21800 has been updated by Dan0042 (Daniel DeLorme). I like the feature, but it would be nice to provide an example that is not simply to reimplement `Dir["**/*.rb"].size` with 16x the lines of code. ---------------------------------------- Feature #21800: `Dir.scan` to list directory entires with their type https://bugs.ruby-lang.org/issues/21800#change-116435 * Author: byroot (Jean Boussier) * Status: Closed ---------------------------------------- When listing a directory, it's very common to need to know the type of each children, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each children, but this is quite costly. This use case is common enough that `readdir` on most modern platforms do expose `struct dirent.d_type`, which allows to know the type of the child without an extra syscall: From the `scandir` manpage:
d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)
I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` allows to recursively scan Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667 Given that recursively scanning directories is a common task across many popular ruby tools (`zeitwerk`, `rubocop`, etc), I think it would be very valuable to provide this more efficient interface. In addition, @nobu noticed my prototype, and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a In that case the `File::Stat` is lazy, it's only if you access something other than file type, that the actual `stat(2)` call is emitted. I think this API is both more efficient and more convenient. ### Proposed API ```ruby Dir.foreach(path) { |name| } Dir.foreach(path) { |name, stat| } Dir.each_child(path) { |name| } Dir.each_child(path) { |name, stat| } Dir.new(path).each_child { |name| } Dir.new(path).each_child { |name, stat| } Dir.new(path).each { |name| } Dir.new(path).each { |name, stat| } ``` Also important to note, the `File::Stat` is expected to be equivalent to a `lstat(2)` call, as to be able to chose to follow symlinks or not. Basic use case: ```ruby def count_ruby_files(root) count = 0 queue = [root] while dir = queue.pop Dir.each_child(dir) do |name, stat| next if name.start_with?(".") if stat.directory? queue << File.join(dir, name) elsif stat.file? count += 1 if name.end_with?(".rb") end end end count end ``` -- https://bugs.ruby-lang.org/
participants (5)
-
byroot (Jean Boussier) -
Dan0042 (Daniel DeLorme) -
Eregon (Benoit Daloze) -
mame (Yusuke Endoh) -
matz (Yukihiro Matsumoto)