[ruby-core:114703] [Ruby master Bug#19875] Ruby 2.7 -> 3.1 Performance regression in String#count

Issue #19875 has been reported by iz (Illia Zub). ---------------------------------------- Bug #19875: Ruby 2.7 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by nobu (Nobuyoshi Nakada). Status changed from Open to Feedback It seems compiler dependent. My results on x86_64-darwin are: ``` $ ruby -I ./benchmark/benchmark-driver/lib/ benchmark/benchmark-driver/exe/benchmark-driver ~/tmp/string_count_benchmark_driver.yml --executables=/opt/local/bin/ruby{2.7,3.{0,1,2,3}} Calculating ------------------------------------- /opt/local/bin/ruby2.7 /opt/local/bin/ruby3.0 /opt/local/bin/ruby3.1 /opt/local/bin/ruby3.2 /opt/local/bin/ruby3.3 count 171.134 334.699 668.206 666.938 666.382 i/s - 10.000k times in 58.433576s 29.877601s 14.965446s 14.993903s 15.006408s Comparison: count /opt/local/bin/ruby3.1: 668.2 i/s /opt/local/bin/ruby3.2: 666.9 i/s - 1.00x slower /opt/local/bin/ruby3.3: 666.4 i/s - 1.00x slower /opt/local/bin/ruby3.0: 334.7 i/s - 2.00x slower /opt/local/bin/ruby2.7: 171.1 i/s - 3.90x slower ``` ---------------------------------------- Bug #19875: Ruby 2.7 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104545 * Author: iz (Illia Zub) * Status: Feedback * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by iz (Illia Zub). Subject changed from Ruby 2.7 -> 3.1 Performance regression in String#count to Ruby 3.0 -> 3.1 Performance regression in String#count Thanks for your feedback, @nobu! It looks like compiler dependent. I used `x86_64-linux`.\ May you please help me to find the root cause? (*I'll do my own research, just asking for starting points.* ) ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104570 * Author: iz (Illia Zub) * Status: Feedback * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by byroot (Jean Boussier).
May you please help me to find the root cause?
A start would be to decompile `rb_str_count` on two versions. @alanwu guessed that it might be the change from signed to unsigned that is preventing some compilers from auto-vectorizing that loop. Decompiling could confirm that, however it's unclear how it could be fixed. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104571 * Author: iz (Illia Zub) * Status: Feedback * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by nobu (Nobuyoshi Nakada). byroot (Jean Boussier) wrote in #note-4:
May you please help me to find the root cause?
A start would be to decompile `rb_str_count` on two versions.
Tried `objdump --disassemble=rb_str_count` on Ubuntu 22.04, but the shared library is stripped and has no symbols.
@alanwu guessed that it might be the change from signed to unsigned that is preventing some compilers from auto-vectorizing that loop.
Decompiling could confirm that, however it's unclear how it could be fixed.
Use `ssize_t` instead? Or, it may depend on the variable size, not on the signedness. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104604 * Author: iz (Illia Zub) * Status: Feedback * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by Freaky (Thomas Hurst). File rb_str_len.fast added File rb_str_len.slow added File revert-4001.patch added Annotated disassemblies attached. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104607 * Author: iz (Illia Zub) * Status: Feedback * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by nobu (Nobuyoshi Nakada). Freaky (Thomas Hurst) wrote in #note-6:
Annotated disassemblies attached.
Thank you. These are all generated by the same compiler? The "slow" version scans per WORD (= 2bytes), while the "fast" version uses DWORD (= 4bytes). ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104611 * Author: iz (Illia Zub) * Status: Feedback * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by nobu (Nobuyoshi Nakada). File rb_str_count.S added Attaching the generated code by gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04). This seems vectorizing per 16 bytes. ```S movdqu (%rdi), %xmm2 # MEM <vector(16) unsigned char> [(unsigned char *)_95], tmp383 addq $16, %rdi #, ivtmp.4747 ``` ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104612 * Author: iz (Illia Zub) * Status: Feedback * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by Freaky (Thomas Hurst). nobu (Nobuyoshi Nakada) wrote in #note-7:
These are all generated by the same compiler?
Yes - FreeBSD clang version 14.0.5. I also try 15.0.7 to no effect. gcc 12.2.0 performs much better, though it too has a slightly more modest performance regression from 4001: ``` './ruby.gcc12.revert test.rb' ran 1.28 ± 0.01 times faster than './ruby.clang.revert test.rb' 1.61 ± 0.01 times faster than './ruby.gcc12.master test.rb' 4.38 ± 0.04 times faster than './ruby.clang.master test.rb' ``` The obvious solution works for both: ``` diff --git string.c string.c index deeed4a12a..70644b2338 100644 --- string.c +++ string.c @@ -8455,6 +8455,13 @@ rb_str_count(int argc, VALUE *argv, VALUE str) s = RSTRING_PTR(str); if (!s || RSTRING_LEN(str) == 0) return INT2FIX(0); send = RSTRING_END(str); + if (RSTRING_LEN(str) < INT_MAX) { + i = 0; + while (s < send) { + if (*(unsigned char*)s++ == c) i++; + } + return INT2NUM(i); + } while (s < send) { if (*(unsigned char*)s++ == c) n++; } ``` ``` './ruby.gcc12.fix test.rb' ran 1.04 ± 0.01 times faster than './ruby.gcc12.revert test.rb' 1.32 ± 0.01 times faster than './ruby.clang.fix test.rb' 1.33 ± 0.02 times faster than './ruby.revert test.rb' 1.68 ± 0.01 times faster than './ruby.gcc12.master test.rb' 4.57 ± 0.03 times faster than './ruby.master test.rb' ``` Oh. And you know what this reminds me of? That [one time](https://github.com/Freaky/fast-bytecount) I ported Rust's bytecount crate to C. ``` './ruby.bytecount test.rb' ran 3.73 ± 0.08 times faster than './ruby.gcc12.fix test.rb' 3.87 ± 0.08 times faster than './ruby.gcc12.revert test.rb' 4.94 ± 0.10 times faster than './ruby.clang.fix test.rb' 4.96 ± 0.11 times faster than './ruby.revert test.rb' 6.27 ± 0.13 times faster than './ruby.gcc12.master test.rb' 17.05 ± 0.36 times faster than './ruby.master test.rb' ``` Who needs a compiler to vectorise for you when you can just ... copy what someone smarter than you did. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104621 * Author: iz (Illia Zub) * Status: Feedback * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by nobu (Nobuyoshi Nakada). Freaky (Thomas Hurst) wrote in #note-9:
Oh. And you know what this reminds me of? That [one time](https://github.com/Freaky/fast-bytecount) I ported Rust's bytecount crate to C.
Thank you, tried with it. https://github.com/nobu/ruby/tree/mm_bytecount But it seems no significant difference on x86_64-darwin. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104646 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by Freaky (Thomas Hurst). nobu (Nobuyoshi Nakada) wrote in #note-11:
Freaky (Thomas Hurst) wrote in #note-9:
Oh. And you know what this reminds me of? That [one time](https://github.com/Freaky/fast-bytecount) I ported Rust's bytecount crate to C.
Thank you, tried with it. https://github.com/nobu/ruby/tree/mm_bytecount
Thanks for trying it!
But it seems no significant difference on x86_64-darwin.
I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it. There's also an AVX version of the algorithm that operates in 32-byte chunks instead of SSE's 16. I measure it as twice as fast again on my Ryzen 5700X - I might port that too if there's interest. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104647 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by nobu (Nobuyoshi Nakada). Freaky (Thomas Hurst) wrote in #note-12:
I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.
Who/what do you mean by "we"? It feels like a compiler's (or optimizer's) job.
There's also an AVX version of the algorithm that operates in 32-byte chunks instead of SSE's 16. I measure it as twice as fast again on my Ryzen 5700X - I might port that too if there's interest.
You may want to discuss it with compiler porting teams. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104651 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by Freaky (Thomas Hurst). File bytecount.c added nobu (Nobuyoshi Nakada) wrote in #note-13:
Freaky (Thomas Hurst) wrote in #note-12:
I see a difference if I configure with cflags=-msse4.2 - it's off by default. We probably want some runtime CPU feature detection if people are actually going to use it.
Who/what do you mean by "we"? It feels like a compiler's (or optimizer's) job.
If only! I've added an AVX2 path and worked on getting runtime dispatch working on both clang and gcc. I've combined the result into a single file suitable to drop straight over the top of your `missing/bytecount.c`, which yields this on my Zen 3: ``` './ruby test.rb' ran 22.37 ± 1.03 times faster than './ruby.master test.rb' ``` And this - using exactly the same binary - on an old K8 machine which lacks AVX2 and SSE4: ``` ./ruby test.rb ran 2.97 ± 0.02 times faster than ./ruby.master test.rb ``` ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104667 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) bytecount.c (7.23 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by ahorek (Pavel Rosický). @Freaky here's an alternative approach https://godbolt.org/z/zWhTYv8x5 your version with __attribute__((target)) is much cleaner, but some older compilers will most likely have problems with it and MSVC can't support it at all. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104669 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) bytecount.c (7.23 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by Freaky (Thomas Hurst). ahorek (Pavel Rosický) wrote in #note-15:
@Freaky here's an alternative approach https://godbolt.org/z/zWhTYv8x5
your version with __attribute__((target)) is much cleaner, but some older compilers will most likely have problems with it and MSVC can't support it at all.
I think it's too much of a maintenance burden to reimplement this stuff from first principles when there's compiler support on most systems *right there*. It can just fall back to scalar versions where it isn't supported. Maybe it's also doable to support __AVX2__/__SSE2__ in such cases. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104766 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) bytecount.c (7.23 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by Freaky (Thomas Hurst). I've updated [fast-bytecount](https://github.com/Freaky/fast-bytecount) with autotools support, which detects the needed compiler features and exposes flags to disable bits of it. Perhaps there are other places in MRI that would benefit from some of this, which would help amortize the cost somewhat. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104769 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) bytecount.c (7.23 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by ahorek (Pavel Rosický). I think if the platform support is sufficient, there are multiple places where this feature could be beneficial. Even without explicit SIMD code, some existing C code could be autovectorized by the compiler itself for the AVX2 target while keeping the binary compatibility with platforms that don't support these instructions. You can always recompile Ruby with optimization flags for your own platform, but most users use precompiled binaries that can't benefit from it now. could prepare a proof of concept PR for the Ruby repo? also, see the previous discussion #16487 there are already other existing examples where this feature could help. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104770 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) bytecount.c (7.23 KB) -- https://bugs.ruby-lang.org/

Issue #19875 has been updated by Freaky (Thomas Hurst). ahorek (Pavel Rosický) wrote in #note-18:
I think if the platform support is sufficient, there are multiple places where this feature could be beneficial.
I think the platform support covers the major bases. I'm testing a PHP build with ifunc and target support enabled, after it was disabled for FreeBSD in 2018 and nobody thought to try turning it back on.
Even without explicit SIMD code, some existing C code could be autovectorized by the compiler itself for the AVX2 target while keeping the binary compatibility with platforms that don't support these instructions.
Yes - `__attribute__((target_clones(..))` makes this particularly easy to do -- the difficulty lies in using it appropriately.
You can always recompile Ruby with optimization flags for your own platform, but most users use precompiled binaries that can't benefit from it now.
It would be interesting to see some benchmarks comparing Ruby (and other things) targetting basline x86-64, and the -v2, -v3, and -v4 profiles.
could prepare a proof of concept PR for the Ruby repo?
This is what I'm working towards, yes.
also, see the previous discussion #16487 there are already other existing examples where this feature could help.
Thanks. ---------------------------------------- Bug #19875: Ruby 3.0 -> 3.1 Performance regression in String#count https://bugs.ruby-lang.org/issues/19875#change-104772 * Author: iz (Illia Zub) * Status: Open * Priority: Normal * ruby -v: 3.2.2 * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- `String#count` became slower since Ruby 3.1. Originally found by `@Freaky`: https://github.com/ruby/ruby/pull/4001#issuecomment-1714779781 Compared using the [`benchmark-driver` gem](https://github.com/benchmark-driver/benchmark-driver). ``` $ benchmark-driver tmp/string_count_benchmark_driver.yml --rbenv '3.1.1;3.1.4;2.7.2;3.2.2;3.0.6' Calculating ------------------------------------- 3.1.1 3.1.4 2.7.2 3.2.2 3.0.6 count 465.804 463.741 865.783 462.711 857.395 i/s - 10.000k times in 21.468251s 21.563768s 11.550239s 21.611783s 11.663235s Comparison: count 2.7.2: 865.8 i/s 3.0.6: 857.4 i/s - 1.01x slower 3.1.1: 465.8 i/s - 1.86x slower 3.1.4: 463.7 i/s - 1.87x slower 3.2.2: 462.7 i/s - 1.87x slower ``` Benchmark: ```yml $ cat ./tmp/string_count_benchmark_driver.yml loop_count: 10_000 prelude: | html = "\nruby\n" * 1024 * 1024 benchmark: count: html.count($/) ``` --- *Initially, I noticed the difference between `str.count($/)` and `str.lines.size` when working on the performance improvement: https://serpapi.com/blog/lines-count-failed-deployments/* ---Files-------------------------------- rb_str_len.fast (31.9 KB) rb_str_len.slow (34 KB) revert-4001.patch (1.71 KB) rb_str_count.S (11.8 KB) bytecount.c (7.23 KB) -- https://bugs.ruby-lang.org/
participants (5)
-
ahorek
-
byroot (Jean Boussier)
-
Freaky (Thomas Hurst)
-
iz (Illia Zub)
-
nobu (Nobuyoshi Nakada)