[ruby-core:114276] [Ruby master Bug#19784] String#delete_prefix! problem


inversion (Yura Babak)

25 Jul 2023

10:37 a.m.

Issue #19784 has been reported by inversion (Yura Babak).

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784

* Author: inversion (Yura Babak)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
Here is the snippet; the question is in the comments:

``` ruby
fp = 'with_BOM_16.txt'
body = File.read(fp).force_encoding('UTF-8')

p body                          # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE")  # true

body.delete_prefix!("\xFF\xFE") # !!! why doesn't this work?
p body                          # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE")  # true

body[0, 2] = ''
p body                          # "1\u00001\u0000"
p body.start_with?("\xFF\xFE")  # false
```

It works the same on Linux (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]) and Windows (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x64-mingw-ucrt]).

--
https://bugs.ruby-lang.org/
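The bytes `\xFF\xFE` followed by `1\x00` pairs look like UTF-16LE text with a byte order mark, so labelling them as UTF-8 necessarily produces a broken string. A minimal sketch of decoding such data without calling `delete_prefix!` on a broken string, assuming the file really is UTF-16LE (the helper name `decode_utf16le` is hypothetical and does not come from the report):

```ruby
# Hypothetical helper, not part of the report: decode UTF-16LE bytes
# (with an optional BOM) into a UTF-8 string.
def decode_utf16le(bytes)
  # Work on a binary copy so the BOM comparison is purely byte-wise.
  data = bytes.b.delete_prefix("\xFF\xFE".b)
  data.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
end

raw = "\xFF\xFE1\x001\x00".b  # same bytes as the file in the report
p decode_utf16le(raw)         # prints "11"
```

With a real file, the input would come from something like `File.binread(fp)`.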


byroot (Jean Boussier)

25 Jul

11:10 a.m.

New subject: [ruby-core:114278] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by byroot (Jean Boussier).

I suspect it's because `"\xFF\xFE1\u00001\u0000" (UTF-8)` has an invalid encoding. That said, if `start_with?` returns true, `delete_prefix` should arguably work.

I'll try to find out why, but in the meantime I tested that you can work around this by casting the string to `Encoding::BINARY`:

```ruby
str = "\xff\xfe1\u00001\u0000"
str.force_encoding(Encoding::BINARY)
str.delete_prefix!("\xff\xfe".b)
str.force_encoding(Encoding::UTF_8)
p str # => "1\u00001\u0000"
```

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-103973
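The same idea can be packaged so that the original string is left untouched. This is only a sketch: `delete_byte_prefix` is a hypothetical name, and it deliberately compares raw bytes, so it can cut through the middle of a character.

```ruby
# Hypothetical helper (not from the thread): remove `prefix` from `str`
# by comparing raw bytes, keeping the receiver's encoding on the result.
def delete_byte_prefix(str, prefix)
  return str.dup unless str.b.start_with?(prefix.b)

  str.byteslice(prefix.bytesize, str.bytesize - prefix.bytesize)
end

body = "\xFF\xFE1\x001\x00".dup.force_encoding(Encoding::UTF_8)
rest = delete_byte_prefix(body, "\xFF\xFE")
p rest.bytes          # prints [49, 0, 49, 0]
p rest.encoding.name  # prints "UTF-8"
```

`String#byteslice` keeps the encoding of the receiver, which is why no second `force_encoding` is needed here.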

byroot (Jean Boussier)

11:18 a.m.

New subject: [ruby-core:114279] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by byroot (Jean Boussier).

Ok, so as suspected, the code very explicitly rejects strings with a broken encoding: https://github.com/ruby/ruby/commit/10082360b9124c3eaabfccf4fe10a3640c40a05…

```c
if (is_broken_string(prefix)) return 0;
```

This is from the initial implementation decided in [Feature #12694], and the PR implementing it was https://github.com/ruby/ruby/pull/1632

I don't see much discussion of that behavior in either the ticket or the PR, but there are tests for it, so it was intentional on the part of the PR author, though not necessarily of the committers.

I'd be in favor of changing it so that it matches `start_with?` behavior, but I think that needs to be discussed at the developer meeting. It's also unclear whether, if accepted, it would be as a feature or as a bug fix.

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-103974
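At the Ruby level, the condition that triggers this early return can be observed with `String#valid_encoding?`. The sketch below assumes that "broken" in the C code means "the bytes are not valid in the string's declared encoding"; `broken?` is a hypothetical helper name.

```ruby
# Hypothetical Ruby-level stand-in for the C check; assumes "broken"
# means the bytes are not valid in the string's declared encoding.
def broken?(str)
  !str.valid_encoding?
end

prefix = "\xFF\xFE".dup.force_encoding(Encoding::UTF_8)
p broken?(prefix)    # prints true  (0xFF never occurs in valid UTF-8)
p broken?(prefix.b)  # prints false (any byte sequence is valid as binary)
p broken?("1")       # prints false
```

This is also why the `Encoding::BINARY` workaround avoids the early return: a binary prefix is never broken.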

byroot (Jean Boussier)

11:22 a.m.

New subject: [ruby-core:114281] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by byroot (Jean Boussier).

Ticket added to the next dev meeting.

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-103976

duerst

26 Jul

10:25 a.m.

New subject: [ruby-core:114294] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by duerst (Martin Dürst).

I agree that `start_with?` and `delete_prefix!` should probably be aligned. But in general, encoding problems should lead to early errors, not be deferred.

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-103992

mame (Yusuke Endoh)

24 Aug

12:48 p.m.

New subject: [ruby-core:114486] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by mame (Yusuke Endoh).

Discussed at the dev meeting. We agreed that both `start_with?` and `delete_prefix` should compare character by character if valid, and byte by byte if invalid.

```ruby
"\xFF\xFE".start_with?("\xFF")    #=> true   # not changed
"\xFF\xFE".delete_prefix("\xFF")  #=> "\xFE" # changed

"Ä".start_with?("\xC3")           #=> false  # changed
"Ä".delete_prefix("\xC3")         #=> "Ä"    # not changed
```

Note that encoding validity is checked on a position-by-position basis:

```ruby
"\xFFÄ".delete_prefix("\xFF")         #=> should be "Ä"
"\xFFÄ".delete_prefix("\xFF\xC3")     #=> should be "\xFFÄ"
"\xFFÄ".delete_prefix("\xFF\xC3\x84") #=> should be ""
```

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-104272
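The agreed rule can be written down as a small pure-Ruby reference model. This is a sketch under stated assumptions, not the actual C implementation: the `model_*` and `utf8` names are hypothetical, both strings are assumed to be UTF-8, and it relies on `String#chars` yielding one character where the bytes are valid and one byte where they are not.

```ruby
# Reference model only (hypothetical names, not Ruby's implementation).
# For UTF-8, String#chars yields a whole character where the bytes are
# valid and a single byte where they are not.
def model_start_with?(str, prefix)
  units = prefix.chars
  str.chars.first(units.size) == units
end

def model_delete_prefix(str, prefix)
  return str.dup unless model_start_with?(str, prefix)

  str.byteslice(prefix.bytesize, str.bytesize - prefix.bytesize)
end

# Build UTF-8 strings from raw bytes, independent of the source encoding.
def utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

a_umlaut = utf8("\xC3\x84") # "Ä"
p model_start_with?(utf8("\xFF\xFE"), utf8("\xFF"))  # prints true
p model_start_with?(a_umlaut, utf8("\xC3"))          # prints false
p model_delete_prefix(utf8("\xFF\xC3\x84"), utf8("\xFF\xC3")).bytes
# prints [255, 195, 132] (unchanged, the prefix ends inside "Ä")
p model_delete_prefix(utf8("\xFF\xC3\x84"), utf8("\xFF")) == a_umlaut
# prints true
```

The model reproduces all seven expected results listed in the message above.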

Eregon (Benoit Daloze)

4:08 p.m.

New subject: [ruby-core:114494] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by Eregon (Benoit Daloze).

mame (Yusuke Endoh) wrote in #note-5:
> Note that encoding validity is checked on a position-by-position basis:

That's not clear to me. Do you mean that if the argument or the receiver String is not `String#valid_encoding?`, then we compare byte-by-byte, and otherwise we compare character-by-character?

I think we should not consider whether a given substring is valid; that would be very expensive to check. I wonder whether always comparing byte-by-byte for both would be a lot simpler, faster, and also what most users expect. And of course we would still have the encoding check to ensure both strings have a compatible encoding.

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-104279
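A minimal sketch of the "always byte-by-byte, but still check encoding compatibility" alternative proposed here. `bytewise_start_with?` is a hypothetical helper, and using `Encoding.compatible?` for the compatibility check is an assumption about what that check would look like at the Ruby level.

```ruby
# Sketch of the "always byte-by-byte" alternative (hypothetical helper).
def bytewise_start_with?(str, prefix)
  unless Encoding.compatible?(str, prefix)
    raise Encoding::CompatibilityError,
          "incompatible encodings: #{str.encoding} and #{prefix.encoding}"
  end

  # Binary copies compare as plain bytes, ignoring character boundaries.
  str.b.start_with?(prefix.b)
end

a_umlaut = "\xC3\x84".dup.force_encoding(Encoding::UTF_8) # "Ä"
lead     = "\xC3".dup.force_encoding(Encoding::UTF_8)     # its first byte only

p bytewise_start_with?(a_umlaut, lead)  # prints true
```

Under this rule a prefix may end in the middle of a character, which is the case the dev-meeting decision treats differently (`"Ä".start_with?("\xC3") #=> false`).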

mame (Yusuke Endoh)

25 Aug

7:44 a.m.

New subject: [ruby-core:114527] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by mame (Yusuke Endoh).

> Do you mean that if the argument or if the receiver String is not `String#valid_encoding?`, then we compare byte-by-byte, and otherwise we compare character-by-character ?

No.

> I think we should not consider whether a given substring is valid

I think it's this way. In the case of `"\xFF\xC3\x84"` (= `"\xFFÄ"`), the byte sequence from byteoffset 0 is invalid, so we take out one byte, `"\xFF"`; the next byte sequence, from byteoffset 1, is valid, so we take out two bytes (one character), `"\xC3\x84"`; and so on.

I think @akr or @naruse can explain the rationale.

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-104325
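The walk described above can be spelled out by hand. This is an illustrative sketch for UTF-8 only; `segments` is a hypothetical helper, and trying lengths 1 to 4 is simply a direct way to ask "does a valid character start at this offset?".

```ruby
# Illustrative walk over a UTF-8 string (hypothetical helper): at each byte
# offset take a whole character if one starts there, otherwise one byte.
def segments(str)
  out = []
  offset = 0
  while offset < str.bytesize
    # A UTF-8 character is 1 to 4 bytes long; pick the first valid length.
    len = (1..4).find do |n|
      offset + n <= str.bytesize && str.byteslice(offset, n).valid_encoding?
    end || 1
    out << str.byteslice(offset, len)
    offset += len
  end
  out
end

s = "\xFF\xC3\x84".dup.force_encoding(Encoding::UTF_8) # "\xFFÄ"
p segments(s).map(&:bytes)  # prints [[255], [195, 132]]
p segments(s) == s.chars    # prints true
```

The last line shows that, for this input, the hand-written walk agrees with what `String#chars` already returns.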

byroot (Jean Boussier)

26 Aug

10:48 a.m.

New subject: [ruby-core:114563] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by byroot (Jean Boussier).

Note that we have the same issue with `end_with?` and `delete_suffix`.

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-104377

* Status: Closed

jhawthorn (John Hawthorn)

1 Sep

1:07 a.m.

New subject: [ruby-core:114608] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by jhawthorn (John Hawthorn).

@ywenc and I found a regression from this patch. We have some code handling a broken UTF-8 String with a combination of valid and invalid bytes (UTF-8 followed by binary, which IMO should probably be binary encoded, but it's surprising that the behaviour changed).

```
"hello\xBE".start_with?("hello") #=> false in trunk, was true on 3.2
"hello\xFE".start_with?("hello") #=> true (both 3.2 and trunk, intended behaviour)

"hello\xBE".delete_prefix("hello") #=> "\xBE" (both on 3.2 and trunk), because we skip the check when the prefix is valid
"\xFFhello\xBE".delete_prefix("\xFFhello") #=> "\xFFhello\xBE" in trunk
```

This is because we're looking at the character following the prefix, observing that it looks like a UTF-8 continuation byte, and so returning false. This approach might work for `end_with?`/`delete_suffix`, where we don't break on an invalid character in the suffix, but it doesn't feel right for prefixes.

It sounds like the intended design is that, to the user, this should feel like we are comparing from the start of the strings, char-by-char for valid and byte-by-byte for invalid.

We added tests and tried using the end of the previous character, rather than the "start" of the current one, to determine if the prefix ends at a char boundary.

https://github.com/ruby/ruby/pull/8348

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-104432

* Status: Closed
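The difference between the two boundary checks can be modelled in a few lines of Ruby. This is only a sketch of the idea as described in the message, with hypothetical helper names; it is not the C code from the pull request, and it assumes UTF-8 strings.

```ruby
# Model of the two boundary checks described above (hypothetical helpers,
# not the C code from the pull request).
def continuation_byte?(byte)
  (byte & 0xC0) == 0x80 # UTF-8 continuation bytes look like 10xxxxxx
end

# "Start of the current character": inspect the byte right after the prefix.
def boundary_by_next_byte?(str, offset)
  following = str.getbyte(offset)
  following.nil? || !continuation_byte?(following)
end

# "End of the previous character": walk from the start and see whether a
# character (or a lone invalid byte) ends exactly at `offset`.
def boundary_by_previous_char?(str, offset)
  pos = 0
  str.each_char do |unit|
    break if pos >= offset
    pos += unit.bytesize
  end
  pos == offset
end

s = "hello\xBE".dup.force_encoding(Encoding::UTF_8)
p boundary_by_next_byte?(s, 5)      # prints false (0xBE looks like a continuation)
p boundary_by_previous_char?(s, 5)  # prints true  ("o" ends at offset 5)
```

For a genuinely split character such as `"Ä"` at offset 1, both checks report no boundary, so the second check only changes the outcome when a stray continuation byte follows valid text.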
