Issue #21800 has been updated by mame (Yusuke Endoh).

This was discussed at the dev meeting, and several points were raised.

* The lack of portability is concerning (as @matz already mentioned).
* The timing of the lazy `File::Stat` is concerning: the file type reflects the state at the time of the `readdir` call, while other information such as `mtime` reflects the state at the time the `#mtime` method is called, which could lead to inconsistencies.
* It's unclear whether the lazy `File::Stat` should use `lstat` or `stat`.
* Considering the above, it might be better to simply pass the information returned by `readdir` (i.e., `f_type`) instead of a `File::Stat`.
* Yielding `f_type` requires explicitly writing code to call `File.stat` (or `File.lstat`) for `DT_UNKNOWN` cases, which isn't very convenient (though it might be acceptable since this method isn't for casual use?).
* About the API, checking the block's arity to yield differently is not great. It would be better to yield the `f_type` information from a separate method (such as `Dir.each_child_with_file_type(path) {|name, f_type| ... }`) or when a keyword argument is provided (such as `Dir.each_child(path, with_file_type: true) {|name, f_type| ... }`).
* There are several options for representing `f_type`, including (1) using the raw integer (e.g., comparing against constants like `Dir::DT_UNKNOWN`), or (2) using a symbol like `:UNKNOWN`.

----------------------------------------
Feature #21800: `Dir.foreach` and `Dir.each_child` to optionally yield `File::Stat` object alongside the children name
https://bugs.ruby-lang.org/issues/21800#change-116101

* Author: byroot (Jean Boussier)
* Status: Open
----------------------------------------
When listing a directory, it's very common to need to know the type of each child, generally because you want to scan recursively. The naive way to do this is to call `stat(2)` for each child, but this is quite costly.
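
For reference, the naive approach could look roughly like the sketch below. The method name `count_ruby_files_with_lstat` is made up for illustration. It relies only on `Dir.each_child` and `File.lstat` as they exist today, and issues one `lstat(2)` per directory entry.

```ruby
# Naive recursive scan: every child costs one extra lstat(2) call.
# `count_ruby_files_with_lstat` is a hypothetical name used for illustration.
def count_ruby_files_with_lstat(root)
  count = 0
  queue = [root]
  while (dir = queue.pop)
    Dir.each_child(dir) do |name|
      next if name.start_with?(".")

      path = File.join(dir, name)
      stat = File.lstat(path) # does not follow symlinks
      if stat.directory?
        queue << path
      elsif stat.file?
        count += 1 if name.end_with?(".rb")
      end
    end
  end
  count
end
```

Because `lstat` is used, symlinks are neither followed nor counted, which avoids loops on circular links.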

This use case is common enough that `readdir` on most modern platforms exposes `struct dirent.d_type`, which makes it possible to know the type of the child without an extra syscall. From the `scandir` manpage:

d_type: This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2)

I wrote a quick prototype, and relying on `dirent.d_type` instead of `stat(2)` makes recursively scanning Ruby's repository twice as fast on my machine: https://github.com/ruby/ruby/pull/15667

Given that recursively scanning directories is a common task across many popular Ruby tools (`zeitwerk`, `rubocop`, etc.), I think it would be very valuable to provide this more efficient interface.

In addition, @nobu noticed my prototype and implemented a nicer version of it, where a `File::Stat` is yielded: https://github.com/ruby/ruby/commit/9acf67057b9bc6f855b2c37e41c1a2f91eae643a

In that version the `File::Stat` is lazy: the actual `stat(2)` call is only issued if you access something other than the file type.

I think this API is both more efficient and more convenient.

### Proposed API

```ruby
Dir.foreach(path) { |name| }
Dir.foreach(path) { |name, stat| }
Dir.each_child(path) { |name| }
Dir.each_child(path) { |name, stat| }
Dir.new(path).each_child { |name| }
Dir.new(path).each_child { |name, stat| }
Dir.new(path).each { |name| }
Dir.new(path).each { |name, stat| }
```

Also important to note: the `File::Stat` is expected to be equivalent to an `lstat(2)` call, so that the caller can choose whether or not to follow symlinks.

Basic use case:

```ruby
def count_ruby_files(root)
  count = 0
  queue = [root]
  while dir = queue.pop
    Dir.each_child(dir) do |name, stat|
      next if name.start_with?(".")

      if stat.directory?
        queue << File.join(dir, name)
      elsif stat.file?
        count += 1 if name.end_with?(".rb")
      end
    end
  end
  count
end
```

-- 
https://bugs.ruby-lang.org/