[ruby-core:112185] [Ruby master Bug#19402] CSV skip_lines option not behaving as documented

Issue #19402 has been reported by jamie_ca (Jamie Macey). ---------------------------------------- Bug #19402: CSV skip_lines option not behaving as documented https://bugs.ruby-lang.org/issues/19402 * Author: jamie_ca (Jamie Macey) * Status: Open * Priority: Normal * ruby -v: ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-darwin21] * Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- The [CSV documentation](https://ruby-doc.org/stdlib-3.1.0/libdoc/csv/rdoc/CSV.html#class-CSV-label-O...) for the `skip_lines` parser option says "If a String, converts it to a Regexp, ignores lines that match it." Application behaviour as well as [the source](https://github.com/ruby/csv/blob/master/lib/csv/parser.rb#L909-L919) appears to be normalizing the string encoding and running a simple substring check instead. Given the existing behaviour, this might just want a documentation update to describe it accurately? I stumbled across this on a project still on ruby 2.6.9 ([2.6 docs](https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html#method-c-new)), but it's applicable still at 3.2.0. Reproduction script: ```ruby require 'csv' data = <<CSV data,data test,data data,test CSV puts "Parsing with regexp skip_lines /^test/, expect 2 rows" CSV.parse(data, skip_lines: /^test/).each { |row| pp row } puts puts "Parsing with text skip_lines \"^test\", expect 2 rows" CSV.parse(data, skip_lines: "^test").each { |row| pp row } puts puts "Parsing with unanchored text skip_lines \"test\", expect 1 row" CSV.parse(data, skip_lines: "test").each { |row| pp row } puts ``` ``` $ ruby csv_test.rb Parsing with regexp skip_lines /^test/, expect 2 rows ["data", "data"] ["data", "test"] Parsing with text skip_lines "^test", expect 2 rows ["data", "data"] ["test", "data"] ["data", "test"] Parsing with unanchored text skip_lines "test", expect 1 row ["data", "data"] ``` -- https://bugs.ruby-lang.org/

Issue #19402 has been updated by sawa (Tsuyoshi Sawada). I agree with you that the description in the documentation is bad, but for a reason different from what you claim. The problem is that it is ambiguous. It says that the string is converted to a Regexp, but it does not specify how. That leaves a room for the reader to interpret it in one or another way. Perhaps, you interpreted that a string `str` passed as the `skip_lines:` argument is converted to a Regexp by: ```ruby Regexp.new(str) ``` However, I believe the intended interpretation was to convert the string argument to a Regexp by: ```ruby Regexp.new(Regexp.escape(str)) ``` in which case matching against the resulting Regexp is equivalent to a substring check against the original string, and the results you got is just as described in the documentation. I agree with you that the documentation can be improved. ---------------------------------------- Bug #19402: CSV skip_lines option not behaving as documented https://bugs.ruby-lang.org/issues/19402#change-101623 * Author: jamie_ca (Jamie Macey) * Status: Open * Priority: Normal * ruby -v: ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-darwin21] * Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- The [CSV documentation](https://ruby-doc.org/stdlib-3.1.0/libdoc/csv/rdoc/CSV.html#class-CSV-label-O...) for the `skip_lines` parser option says "If a String, converts it to a Regexp, ignores lines that match it." Application behaviour as well as [the source](https://github.com/ruby/csv/blob/master/lib/csv/parser.rb#L909-L919) appears to be normalizing the string encoding and running a simple substring check instead. Given the existing behaviour, this might just want a documentation update to describe it accurately? I stumbled across this on a project still on ruby 2.6.9 ([2.6 docs](https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html#method-c-new)), but it's applicable still at 3.2.0. Reproduction script: ```ruby require 'csv' data = <<CSV data,data test,data data,test CSV puts "Parsing with regexp skip_lines /^test/, expect 2 rows" CSV.parse(data, skip_lines: /^test/).each { |row| pp row } puts puts "Parsing with text skip_lines \"^test\", expect 2 rows" CSV.parse(data, skip_lines: "^test").each { |row| pp row } puts puts "Parsing with unanchored text skip_lines \"test\", expect 1 row" CSV.parse(data, skip_lines: "test").each { |row| pp row } puts ``` ``` $ ruby csv_test.rb Parsing with regexp skip_lines /^test/, expect 2 rows ["data", "data"] ["data", "test"] Parsing with text skip_lines "^test", expect 2 rows ["data", "data"] ["test", "data"] ["data", "test"] Parsing with unanchored text skip_lines "test", expect 1 row ["data", "data"] ``` -- https://bugs.ruby-lang.org/

Issue #19402 has been updated by kou (Kouhei Sutou). Status changed from Open to Third Party's Issue Assignee set to kou (Kouhei Sutou) It's intentional. `String` `skip_lines:` value is matched as-is. (You can't use special characters such as `^`.) If you have further discussion including how to improve the current documentation, could you report this to upstream? https://github.com/ruby/csv ---------------------------------------- Bug #19402: CSV skip_lines option not behaving as documented https://bugs.ruby-lang.org/issues/19402#change-101625 * Author: jamie_ca (Jamie Macey) * Status: Third Party's Issue * Priority: Normal * Assignee: kou (Kouhei Sutou) * ruby -v: ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-darwin21] * Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- The [CSV documentation](https://ruby-doc.org/stdlib-3.1.0/libdoc/csv/rdoc/CSV.html#class-CSV-label-O...) for the `skip_lines` parser option says "If a String, converts it to a Regexp, ignores lines that match it." Application behaviour as well as [the source](https://github.com/ruby/csv/blob/master/lib/csv/parser.rb#L909-L919) appears to be normalizing the string encoding and running a simple substring check instead. Given the existing behaviour, this might just want a documentation update to describe it accurately? I stumbled across this on a project still on ruby 2.6.9 ([2.6 docs](https://ruby-doc.org/stdlib-2.6.1/libdoc/csv/rdoc/CSV.html#method-c-new)), but it's applicable still at 3.2.0. Reproduction script: ```ruby require 'csv' data = <<CSV data,data test,data data,test CSV puts "Parsing with regexp skip_lines /^test/, expect 2 rows" CSV.parse(data, skip_lines: /^test/).each { |row| pp row } puts puts "Parsing with text skip_lines \"^test\", expect 2 rows" CSV.parse(data, skip_lines: "^test").each { |row| pp row } puts puts "Parsing with unanchored text skip_lines \"test\", expect 1 row" CSV.parse(data, skip_lines: "test").each { |row| pp row } puts ``` ``` $ ruby csv_test.rb Parsing with regexp skip_lines /^test/, expect 2 rows ["data", "data"] ["data", "test"] Parsing with text skip_lines "^test", expect 2 rows ["data", "data"] ["test", "data"] ["data", "test"] Parsing with unanchored text skip_lines "test", expect 1 row ["data", "data"] ``` -- https://bugs.ruby-lang.org/
participants (3)
-
jamie_ca (Jamie Macey)
-
kou (Kouhei Sutou)
-
sawa (Tsuyoshi Sawada)