[ruby-core:112533] [Ruby master Bug#19455] Ruby 3.2: wrong Regexp encoding with non-ASCII comments

Issue #19455 has been reported by janosch-x (Janosch Müller).

----------------------------------------
Bug #19455: Ruby 3.2: wrong Regexp encoding with non-ASCII comments
https://bugs.ruby-lang.org/issues/19455

* Author: janosch-x (Janosch Müller)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.2.0 (2022-12-25 revision a528908271)
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
comments and comment groups don't trigger the correct `Regexp#encoding` on Ruby 3.2 anymore:

```ruby
# ruby 3.1
/#a/x.encoding    # => #<Encoding:US-ASCII> # OK
/(?#a)/.encoding  # => #<Encoding:US-ASCII> # OK
/#ü/x.encoding    # => #<Encoding:UTF-8> # OK
/(?#ü)/.encoding  # => #<Encoding:UTF-8> # OK

# ruby 3.2
/#a/x.encoding    # => #<Encoding:US-ASCII> # OK
/(?#a)/.encoding  # => #<Encoding:US-ASCII> # OK
/#ü/x.encoding    # => #<Encoding:US-ASCII> # BUG
/(?#ü)/.encoding  # => #<Encoding:US-ASCII> # BUG

/#ü/x.inspect     # => "/#\\xC3\\xBC/x"
/(?#ü)/.inspect   # => "/(?#\\xC3\\xBC)/"

# bug is hidden if there are non-ascii chars outside comments
/ü#ü/x.encoding   # => #<Encoding:UTF-8>
/ü(?#ü)/.encoding # => #<Encoding:UTF-8>
```

i think these changes might be the cause: https://github.com/ruby/ruby/commit/ec3542229b29ec93062e9d90e877ea29d3c19472...

@jeremyevans0 JFYI

--
https://bugs.ruby-lang.org/

Issue #19455 has been updated by jeremyevans0 (Jeremy Evans).

I'm not sure that this is a bug. If all non-comment characters considered in the regexp are in the US-ASCII range, it seems reasonable for US-ASCII to be used as the regexp encoding. I'll add this ticket to the next developer meeting and see what other committers think.

----------------------------------------
Bug #19455: Ruby 3.2: wrong Regexp encoding with non-ASCII comments
https://bugs.ruby-lang.org/issues/19455#change-101983
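To see which encoding a particular Ruby build assigns in this situation, the regexp can be inspected directly. The following is only a minimal sketch: it builds the patterns with `Regexp.new` from a `"\u00FC"` escape so that it does not depend on the file's source encoding, and it assumes (not confirmed in this thread) that this path is classified the same way as the literals in the report. `describe` is a hypothetical helper name.

```ruby
# Hypothetical helper: summarize how this Ruby build classified a regexp.
def describe(regexp)
  {
    ruby: RUBY_VERSION,
    encoding: regexp.encoding.name,
    fixed_encoding: regexp.fixed_encoding?
  }
end

# "\u00FC" is "ü"; here it appears only inside the comment group.
with_comment = Regexp.new("a(?#\u00FC)")
# Same shape, but the comment is ASCII-only.
plain = Regexp.new("a(?#a)")

p describe(with_comment) # reported encoding depends on the Ruby build
p describe(plain)

# Try the pattern against an ASCII-only subject and a UTF-8 subject.
matches = ["xa", "\u00FCa"].map { |subject| with_comment.match?(subject) }
p matches
```

The point of printing `fixed_encoding?` alongside `encoding` is that the two together show whether the regexp was pinned to UTF-8 or left as a plain US-ASCII pattern.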

Issue #19455 has been updated by mame (Yusuke Endoh).

@janosch-x Do you have any specific problem with this change? For example, a string that used to match no longer matches, or vice versa.

----------------------------------------
Bug #19455: Ruby 3.2: wrong Regexp encoding with non-ASCII comments
https://bugs.ruby-lang.org/issues/19455#change-101993

Issue #19455 has been updated by janosch-x (Janosch Müller).

i don't have a problem with this myself and the matching behavior is not affected as far as i can tell.

notable behavioral differences are:

- `/#ü/x.source == '#ü'` used to be true but is now false
  - this might break some tests or metaprogramming (not very likely IMO)
- `/#{/#ü/x.source}/` now raises `ArgumentError` (invalid multibyte character)

----------------------------------------
Bug #19455: Ruby 3.2: wrong Regexp encoding with non-ASCII comments
https://bugs.ruby-lang.org/issues/19455#change-102000
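For code that has to run on an affected 3.2 build, one possible workaround for the `source` differences listed above is to re-tag the source string as UTF-8 when it comes back with an invalid encoding. This is a sketch under assumptions, not something proposed in the thread: `utf8_source` is a hypothetical helper, it assumes the original pattern was written in UTF-8, and the pattern is built with `Regexp.new` from a `"\u00FC"` escape so the snippet does not depend on the file's source encoding.

```ruby
# Hypothetical helper: return the regexp source, re-tagged as UTF-8 when the
# string comes back tagged with an encoding its bytes are not valid in.
# Assumes the pattern was originally written in UTF-8.
def utf8_source(regexp)
  src = regexp.source
  return src if src.valid_encoding?

  # dup first: the returned source string may be frozen on some versions.
  candidate = src.dup.force_encoding(Encoding::UTF_8)
  candidate.valid_encoding? ? candidate : src
end

# "\u00FC" is "ü"; in extended mode everything after "#" is a comment.
re = Regexp.new("#\u00FC", Regexp::EXTENDED)

src = utf8_source(re)
p src.encoding
p src.bytes

# The repaired source can be fed back into a new regexp.
rebuilt = Regexp.new(src, Regexp::EXTENDED)
p rebuilt.encoding
```

On builds that already report UTF-8, the helper returns the source unchanged, so it should be safe to leave in place after upgrading.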

Issue #19455 has been updated by mame (Yusuke Endoh).

Discussed at the dev meeting. @matz said he would prefer 3.1 behavior if possible (but not high priority). @nobu said he would take a look.

----------------------------------------
Bug #19455: Ruby 3.2: wrong Regexp encoding with non-ASCII comments
https://bugs.ruby-lang.org/issues/19455#change-102280

Issue #19455 has been updated by jeremyevans0 (Jeremy Evans).

I submitted a pull request to fix this: https://github.com/ruby/ruby/pull/7592

----------------------------------------
Bug #19455: Ruby 3.2: wrong Regexp encoding with non-ASCII comments
https://bugs.ruby-lang.org/issues/19455#change-102532

Issue #19455 has been updated by nagachika (Tomoyuki Chikanaga).

Backport changed from 3.0: DONTNEED, 3.1: DONTNEED, 3.2: REQUIRED to 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONE

ruby_3_2 be09d77b966c7bcc77957927f16cefe66b365495 merged revision(s) a8ba1ddd78544b4bda749051d44f7b2a8a0ec5ff.

----------------------------------------
Bug #19455: Ruby 3.2: wrong Regexp encoding with non-ASCII comments
https://bugs.ruby-lang.org/issues/19455#change-103901

* Author: janosch-x (Janosch Müller)
* Status: Closed
* Priority: Normal
* ruby -v: ruby 3.2.0 (2022-12-25 revision a528908271)
* Backport: 3.0: DONTNEED, 3.1: DONTNEED, 3.2: DONE