[ruby-core:118299] [Ruby master Feature#20576] Add MatchData#bytebegin and MatchData#byteend

Issue #20576 has been reported by shugo (Shugo Maeda).

----------------------------------------
Feature #20576: Add MatchData#bytebegin and MatchData#byteend
https://bugs.ruby-lang.org/issues/20576

* Author: shugo (Shugo Maeda)
* Status: Open
* Target version: 3.4
----------------------------------------
I'd like to propose MatchData#bytebegin and MatchData#byteend.

These methods are similar to MatchData#begin and MatchData#end, but return offsets in bytes instead of codepoints.

Pull request: https://github.com/ruby/ruby/pull/10973

One of the use cases is scanning strings: https://github.com/ruby/net-imap/pull/286/files

MatchData#byteend is faster than MatchData#byteoffset because there is no need to allocate an Array.

Here's a benchmark result:

```
voyager:ruby$ cat b.rb
require "benchmark"
require "strscan"

text = "あ" * 100000

Benchmark.bmbm do |b|
  b.report("byteoffset(0)[1]") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteoffset(0)[1]
    end
  end
  b.report("byteend(0)") do
    pos = 0
    while text.byteindex(/\G./, pos)
      pos = $~.byteend(0)
    end
  end
end
voyager:ruby$ ./tool/runruby.rb b.rb
Rehearsal ----------------------------------------------------
byteoffset(0)[1]   0.020558   0.000393   0.020951 (  0.020963)
byteend(0)         0.018149   0.000000   0.018149 (  0.018151)
------------------------------------------- total: 0.039100sec

                       user     system      total        real
byteoffset(0)[1]   0.020821   0.000000   0.020821 (  0.020822)
byteend(0)         0.017455   0.000000   0.017455 (  0.017455)
```

-- 
https://bugs.ruby-lang.org/
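To make the distinction in the proposal concrete, here is a minimal sketch comparing character offsets with byte offsets on a multibyte string. The helper name `byte_bounds` is hypothetical and not part of the proposal. It calls the proposed `bytebegin`/`byteend` only when the running Ruby provides them, and otherwise falls back to `byteoffset` or to a computation from the character offsets. It assumes the requested group took part in the match.

```ruby
# Hypothetical helper (not part of the proposal): returns
# [begin, end] of a match group as byte offsets.
def byte_bounds(match, group = 0)
  if match.respond_to?(:bytebegin)
    # Proposed methods, used only if this Ruby has them.
    [match.bytebegin(group), match.byteend(group)]
  elsif match.respond_to?(:byteoffset)
    match.byteoffset(group)
  else
    # Fallback for older Rubies: convert character offsets to
    # byte offsets by measuring the prefixes of the string.
    str = match.string
    [str[0, match.begin(group)].bytesize,
     str[0, match.end(group)].bytesize]
  end
end

# Three hiragana characters (3 bytes each in UTF-8), then "abc".
text = "\u3042\u3044\u3046abc"
m = /abc/.match(text)

char_bounds = [m.begin(0), m.end(0)]   # => [3, 6]  (characters)
bounds_in_bytes = byte_bounds(m)       # => [9, 12] (bytes)
```

The two pairs differ because each of the first three characters occupies three bytes in UTF-8, so the match starts at character 3 but at byte 9.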

Issue #20576 has been updated by Eregon (Benoit Daloze).

Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?

Regarding naming, `byteend` seems hard to read, I think `byte_begin`/`byte_end` is much clearer.

https://bugs.ruby-lang.org/issues/20576#change-108807

Issue #20576 has been updated by shugo (Shugo Maeda). Eregon (Benoit Daloze) wrote in #note-1:
> Does this difference matter in realistic usages (e.g. that net-imap one)? How much improvement is it there?

I guess the difference doesn't matter so much compared to I/O etc., but it's frustrating to write code like `$~.byteoffset(0)[1]` when only the end offset is needed.

> Regarding naming, `byteend` seems hard to read, I think `byte_begin`/`byte_end` is much clearer.

I proposed `byteend` for consistency with existing methods such as byteoffset. If we choose `byte_end`, it may be better to introduce new aliases for such existing methods.

https://bugs.ruby-lang.org/issues/20576#change-108816
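The scanning pattern discussed in this reply, where only the end offset of each match is needed, might look like the sketch below. The names `match_byte_end` and `scan_fields` and the comma-separated input format are assumptions made for illustration; they do not come from the net-imap pull request. The sketch requires Ruby 3.2 or later for `String#byteindex` and `MatchData#byteoffset`, and uses `byteend` only when it is available.

```ruby
# Hypothetical helper: byte offset just past the whole match.
def match_byte_end(match)
  if match.respond_to?(:byteend)
    match.byteend(0)          # proposed method, no Array allocated
  else
    match.byteoffset(0)[1]    # builds a two-element Array first
  end
end

# Hypothetical scanner: splits comma-separated fields while
# tracking the current position in bytes rather than characters.
def scan_fields(text)
  fields = []
  pos = 0
  # \G anchors each match at the byte position given to byteindex.
  while text.byteindex(/\G\s*([^,]+),?/, pos)
    fields << $~[1]
    pos = match_byte_end($~)
  end
  fields
end

fields = scan_fields("\u3042\u3044, abc, \u3046")
# fields holds three strings: two hiragana, "abc", one hiragana
```

Keeping the position in bytes lets each `byteindex` call resume exactly where the previous match ended, without converting between character and byte offsets.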

Issue #20576 has been updated by matz (Yukihiro Matsumoto).

I understand the use-case. I agree with the addition of the feature, but I don't like the name. The names `bytebegin` and `byteend` follow the `byteindex` tradition, but they are very hard to read (especially `byteend`). Any other name suggestions?

Matz.

https://bugs.ruby-lang.org/issues/20576#change-108818

Issue #20576 has been updated by shugo (Shugo Maeda). matz (Yukihiro Matsumoto) wrote in #note-3:
> I understand the use-case. I agree with the addition of the feature, but I don't like the name. The names `bytebegin` and `byteend` follow the `byteindex` tradition, but they are very hard to read (especially `byteend`). Any other name suggestions?

I came up with names `begin_in_bytes` and `end_in_bytes`, but `byte_begin` / `byte_end` suggested by Eregon may be better.

https://bugs.ruby-lang.org/issues/20576#change-108822

Issue #20576 has been updated by matz (Yukihiro Matsumoto).

OK. I didn't like the names (especially byteend), but after looking at them for a while I got used to it and was ready to compromise.

Matz.

https://bugs.ruby-lang.org/issues/20576#change-109128
participants (3)
- Eregon (Benoit Daloze)
- matz (Yukihiro Matsumoto)
- shugo (Shugo Maeda)