[ruby-core:118037] [Ruby master Bug#20512] Order of magnitude performance difference in single character slicing UTF-8 strings before and after length method is executed

Issue #20512 has been reported by giner (Stanislav German-Evtushenko).

----------------------------------------
Bug #20512: Order of magnitude performance difference in single character slicing UTF-8 strings before and after length method is executed
https://bugs.ruby-lang.org/issues/20512

* Author: giner (Stanislav German-Evtushenko)
* Status: Open
* ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
Slicing a single character out of a UTF-8 string becomes ~15 times faster after the method "length" has been executed on the string.

```ruby
# Single byte symbols
letters = ("a".."z").to_a
length = 100000
str = length.times.map{letters[rand(26)]}.join

# Slow
start = Time.now
length.times{|i| str[i]}
puts Time.now - start # 0.169156201

str.length # performance hack

# Fast
start = Time.now
length.times{|i| str[i]}
puts Time.now - start # 0.009883919

# UTF-8 Symbols
letters = ("а".."я").to_a
length = 10000
str = length.times.map{letters[rand(26)]}.join

# Slow
start = Time.now
length.times{|i| str[i]}
puts Time.now - start # 0.326204007

str.length # performance hack

# Fast
start = Time.now
length.times{|i| str[i]}
puts Time.now - start # 0.016943093
```

--
https://bugs.ruby-lang.org/

Issue #20512 has been updated by byroot (Jean Boussier).

What is happening here is that `length` triggers a scan of the string's `coderange`. When the coderange is unknown, `String#[]` is slower for variable-length character encodings (like UTF-8).

On 3.3:

```ruby
require 'json'
require 'objspace'
require 'benchmark'

# Single byte symbols
letters = ("a".."z").to_a
length = 100000
str = length.times.map{letters[rand(26)]}.join

# Slow
p Benchmark.realtime { length.times{|i| str[i]} }
p Benchmark.realtime { length.times{|i| str[i]} }
puts JSON.parse(ObjectSpace.dump(str))["coderange"]

p Benchmark.realtime { str.length } # performance hack
puts JSON.parse(ObjectSpace.dump(str))["coderange"]

# Fast
p Benchmark.realtime { length.times{|i| str[i]} }
```

```
$ ruby -v /tmp/str.rb
ruby 3.3.1 (2024-04-23 revision c56cd86388) [arm64-darwin23]
0.17216699989512563
0.1763450000435114
unknown
5.999580025672913e-06
7bit
0.004894999787211418
```

See how `coderange` changes from `unknown` to `7bit`. This allows `String#[]` to treat the string as pure ASCII, so it can compute the substring position directly with a simple offset.

The question here is whether `String#[]` should trigger scanning the coderange. It would definitely make some code faster, but may slow down other code, so it's a bit debatable, but I'd be in favor of it.

----------------------------------------
Bug #20512: Order of magnitude performance difference in single character slicing UTF-8 strings before and after length method is executed
https://bugs.ruby-lang.org/issues/20512#change-108459

--
https://bugs.ruby-lang.org/
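As a rough illustration of the coderange behaviour discussed in this thread, the sketch below inspects the cached coderange of a string before and after `length` is called on it. The helper name `coderange_of` is made up for this example. It assumes a Ruby whose `ObjectSpace.dump` output includes a `coderange` field (as in the 3.3 run in this thread); on other versions the helper may return `nil`, and the value reported before `length` may differ.

```ruby
require 'json'
require 'objspace'

# Hypothetical helper (not part of Ruby): read the cached coderange of a
# string from ObjectSpace.dump. Returns nil if the field is not present.
def coderange_of(str)
  JSON.parse(ObjectSpace.dump(str))["coderange"]
end

ascii_letters    = ("a".."z").to_a
cyrillic_letters = ("а".."я").to_a

# Strings built with Array#join, as in the report.
ascii    = Array.new(1_000) { ascii_letters.sample }.join
cyrillic = Array.new(1_000) { cyrillic_letters.sample }.join

ascii_before    = coderange_of(ascii)
cyrillic_before = coderange_of(cyrillic)

# Calling length scans the string and caches its coderange as a side effect.
ascii.length
cyrillic.length

ascii_after    = coderange_of(ascii)
cyrillic_after = coderange_of(cyrillic)

puts "ascii:    #{ascii_before.inspect} -> #{ascii_after.inspect}"
puts "cyrillic: #{cyrillic_before.inspect} -> #{cyrillic_after.inspect}"
```

Once the coderange is known to be `7bit`, indexing an ASCII-only UTF-8 string can use a plain byte offset. The Cyrillic string still needs character-aware indexing, since each of its characters occupies more than one byte.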
participants (2)
- byroot (Jean Boussier)
- giner (Stanislav German-Evtushenko)