[ruby-core:111952] [Ruby master Bug#19361] String#[Integer] is orders slower for strings with some UTF characters

Issue #19361 has been reported by vzdor (Vladimir Zdorovenco).

----------------------------------------
Bug #19361: String#[Integer] is orders slower for strings with some UTF characters
https://bugs.ruby-lang.org/issues/19361

* Author: vzdor (Vladimir Zdorovenco)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
#[] is not only slower compared to itself, but slower compared to #each_char.

seq1

```
# s = '*' * 10e4
s = 'ф' * 10e4
count = 0
size = s.size
while count < size
  s[count]
  count += 1
end
```

seq2

```
ss = 'ф' * 10e4
s = ss.chars
count = 0
size = s.size
while count < size
  s[count]
  count += 1
end
```

On my computer seq1 runs in 11 seconds and seq2 in 0.5 second. It can be '克' symbol, too, I'm sure not only those symbols. I would not have assumed seq1 can be slower, I do not call s[n] more than once for some n.

It is a Debian package with some patches, but they do not touch string.c.

$ locale
LANG=en_US.UTF-8

--
https://bugs.ruby-lang.org/
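The comparison in the report can be reproduced side by side with a small harness. The following is only a sketch: the helper name `index_each_position` and the reduced size `N` are assumptions chosen for illustration (the report used `10e4`), and absolute timings will differ by machine and Ruby version.

```ruby
require 'benchmark'

# Hypothetical helper (not from the report): index every position of
# +seq+ exactly once, as seq1 and seq2 do. Works for String and Array.
def index_each_position(seq)
  i = 0
  size = seq.size
  while i < size
    seq[i]
    i += 1
  end
  i
end

# Assumed size, smaller than the report's 10e4 so the run stays short.
N = 20_000

ascii     = '*' * N        # single-byte characters only
multibyte = 'ф' * N        # two bytes per character in UTF-8
chars     = multibyte.chars # Array of one-character strings

Benchmark.bm(22) do |x|
  x.report('String#[] ASCII')      { index_each_position(ascii) }
  x.report('String#[] multibyte')  { index_each_position(multibyte) }
  x.report('Array#[] from #chars') { index_each_position(chars) }
end
```

Doubling `N` should roughly quadruple the multibyte timing if each lookup scans from the start of the string, while the other two rows should grow roughly linearly.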

Issue #19361 has been updated by byroot (Jean Boussier).

Status changed from Open to Rejected

This is expected. `String#[Integer]` doesn't return a byte but a character, which in UTF-8 may be of variable size, so Ruby has to scan the string from the beginning every time.

----------------------------------------
Bug #19361: String#[Integer] is orders slower for strings with some UTF characters
https://bugs.ruby-lang.org/issues/19361#change-101391

* Author: vzdor (Vladimir Zdorovenco)
* Status: Rejected
* Priority: Normal
* ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux-gnu]
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------

--
https://bugs.ruby-lang.org/
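Given that explanation, a few ways to avoid repeated scans can be sketched as follows. This is an illustration rather than anything proposed in the thread; the variable names are made up, and the remark about fixed-width encodings being cheap to index is an assumption about the implementation, not something the reply states.

```ruby
s = 'ф' * 1_000 # each 'ф' occupies two bytes in UTF-8

# 1. Sequential access: walk the string once instead of indexing.
count = 0
s.each_char { |_ch| count += 1 }

# 2. Random access: pay for one conversion to an Array up front,
#    then index the Array (this is what seq2 in the report does).
chars  = s.chars
middle = chars[500]

# 3. Assumption: in a fixed-width encoding such as UTF-32LE every
#    character has the same byte size, so a character index can be
#    turned into a byte offset without scanning.
wide        = s.encode(Encoding::UTF_32LE)
wide_middle = wide[500].encode(Encoding::UTF_8)

# 4. Byte-oriented access when byte offsets are already known.
total_bytes = s.bytesize        # 2 bytes per character
first_char  = s.byteslice(0, 2) # the first two bytes form one 'ф'
```

Which option fits depends on the access pattern: `each_char` for a single pass, `chars` for many random lookups at the cost of extra memory.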
participants (2)
- byroot (Jean Boussier)
- vzdor (Vladimir Zdorovenco)