[ruby-core:114662] [Ruby master Bug#19867] Unicode line and paragraph separator are not stripped

Issue #19867 has been reported by iainbeeston (Iain Beeston).

----------------------------------------
Bug #19867: Unicode line and paragraph separator are not stripped
https://bugs.ruby-lang.org/issues/19867

* Author: iainbeeston (Iain Beeston)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin22]
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
Unicode line and paragraph separators are not removed by any of the strip methods:

`"\u2028\u2029\u0000\t\n\v\f\r ".strip # => "\u2028\u2029"`

I would have expected `strip` (and `lstrip`, `rstrip`) to remove Unicode whitespace as well.

It looks like #7154 reported something similar, but for regular expressions and way back in Ruby 1.9.

I think that fixing this should be simple (just checking for `\u2028` and `\u2029` in ctype.h), but I'm not sure whether it's supposed to behave this way or whether changing it could introduce unexpected consequences.

--
https://bugs.ruby-lang.org/

Issue #19867 has been updated by iainbeeston (Iain Beeston).

I can see that the `[[:space:]]` regex class does match Unicode whitespace characters (`"\u2028" =~ /[[:space:]]/ # => 0`) but `\s` does not (`"\u2028" =~ /\s/ # => nil`).

----------------------------------------
Bug #19867: Unicode line and paragraph separator are not stripped
https://bugs.ruby-lang.org/issues/19867#change-104491
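The difference between the two character classes suggests a workaround that needs no change to `strip` itself. The following is only a minimal sketch: `strip_unicode_space` and `EDGE_SPACE` are hypothetical names, not part of Ruby, and the pattern adds NUL explicitly because `[[:space:]]` does not cover it while `String#strip` does remove it.

```ruby
# Hypothetical helper (not part of Ruby core): removes leading and trailing
# characters matched by [[:space:]], plus NUL, which String#strip also removes.
EDGE_SPACE = /\A[[:space:]\x00]+|[[:space:]\x00]+\z/

def strip_unicode_space(str)
  # Only the anchored runs at either end are removed; interior whitespace stays.
  str.gsub(EDGE_SPACE, "")
end

sep = "\u2028" # LINE SEPARATOR
p sep.match?(/\s/)           # => false (\s is ASCII-only)
p sep.match?(/[[:space:]]/)  # => true  (POSIX bracket is Unicode-aware)
p strip_unicode_space("\u2028 text\t\u2029") # => "text"
```

Because both alternatives are anchored (`\A` and `\z`), a separator in the middle of a string is left untouched.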

Issue #19867 has been updated by nobu (Nobuyoshi Nakada).

Yes, `\s`, `\w`, etc. match only single-byte ASCII characters.

I don't think changing the behavior by default is a good idea. An optional (keyword) argument may be better.

----------------------------------------
Bug #19867: Unicode line and paragraph separator are not stripped
https://bugs.ruby-lang.org/issues/19867#change-104492
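To make the opt-in idea concrete, here is one way such an interface might look from the caller's side. This is purely illustrative: the thread does not specify a keyword name, so `unicode:` is an assumption, and `strip_ws` is a hypothetical stand-alone method rather than a proposed change to `String#strip`.

```ruby
# Hypothetical pattern: Unicode whitespace (via [[:space:]]) plus NUL at either end.
UNICODE_EDGE_WS = /\A[[:space:]\x00]+|[[:space:]\x00]+\z/

# Hypothetical opt-in wrapper. The default path delegates to the existing
# String#strip, so current behavior is unchanged unless unicode: true is passed.
def strip_ws(str, unicode: false)
  unicode ? str.gsub(UNICODE_EDGE_WS, "") : str.strip
end

s = "\u2028 text \u2029"
p strip_ws(s) == s              # => true (separators at the edges block ASCII strip)
p strip_ws(s, unicode: true)    # => "text"
```

Keeping the default at `false` matches the concern raised above: existing callers see no change in behavior.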

Issue #19867 has been updated by nobu (Nobuyoshi Nakada).

As for the implementation, changing ctype.h is not desirable. There is already an `rb_enc_isspace` function for that purpose.

----------------------------------------
Bug #19867: Unicode line and paragraph separator are not stripped
https://bugs.ruby-lang.org/issues/19867#change-104493
participants (2)
- iainbeeston (Iain Beeston)
- nobu (Nobuyoshi Nakada)