[ruby-core:123894] [Ruby Bug#21709] Inconsistent encoding by Regexp.escape
Issue #21709 has been reported by thyresias (Thierry Lambert).

----------------------------------------
Bug #21709: Inconsistent encoding by Regexp.escape
https://bugs.ruby-lang.org/issues/21709

* Author: thyresias (Thierry Lambert)
* Status: Open
* ruby -v: ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [x64-mingw-ucrt]
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN
----------------------------------------

```ruby
%w(foo être).each do |s|
  puts "string: #{s.inspect} -> #{s.encoding}"
  puts "escaped: #{Regexp.escape(s).inspect} -> #{Regexp.escape(s).encoding}"
end
```

Output:

```
string: "foo" -> UTF-8
escaped: "foo" -> US-ASCII
string: "être" -> UTF-8
escaped: "être" -> UTF-8
```

The result should always match the encoding of the argument.

--
https://bugs.ruby-lang.org/
Issue #21709 has been updated by jeremyevans0 (Jeremy Evans).

Status changed from Open to Feedback

This is not a bug, it is deliberate behavior for ASCII-only strings in `rb_reg_quote` (internal function called by `Regexp.escape`):

```c
if (ascii_only) {
    rb_enc_associate(tmp, rb_usascii_encoding());
}
```

`US-ASCII` strings will be automatically converted to UTF-8 if necessary:

```ruby
("foo".encode("US-ASCII") + "\u1234").encoding # => #<Encoding:UTF-8>
```

Does this behavior cause any problems in your application?

https://bugs.ruby-lang.org/issues/21709#change-115299
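The two points made in this reply can be checked with a small sketch. It assumes the behavior reported for Ruby 3.4.7 in this thread; `escaped_encoding` and `combined_encoding` are hypothetical helper names used only for illustration, not part of Ruby.

```ruby
# Illustrative helpers (hypothetical names, not part of Ruby's API).

# Encoding of the string returned by Regexp.escape for the given text.
def escaped_encoding(text)
  Regexp.escape(text).encoding
end

# Encoding that results from combining the escaped text with `other`.
# Encoding.compatible? returns that encoding, or nil when the two
# strings cannot be combined.
def combined_encoding(text, other)
  Encoding.compatible?(Regexp.escape(text), other)
end

p escaped_encoding("foo")             # ASCII-only input
p escaped_encoding("être")            # input with a non-ASCII character
p combined_encoding("foo", "\u1234")  # escaped ASCII text combined with UTF-8
```

`Encoding.compatible?` is used here because it reports the encoding a concatenation would end up with, without building the concatenated string.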
Issue #21709 has been updated by thyresias (Thierry Lambert).

> Does this behavior cause any problems in your application?

Yes:

```ruby
search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = /^#{s_search}|(?<=– |: )#{s_search}/
#=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
```

https://bugs.ruby-lang.org/issues/21709#change-115300
Issue #21709 has been updated by jeremyevans0 (Jeremy Evans).

Status changed from Feedback to Open

thyresias (Thierry Lambert) wrote in #note-2:

> > Does this behavior cause any problems in your application?
>
> Yes:
>
> ```ruby
> search_text = "foo"
> s_search = Regexp.escape(search_text)
> re_prefix = /\p{In_Arabic}.+ /
> s_search.prepend re_prefix.source
> _re = /^#{s_search}|(?<=– |: )#{s_search}/
> #=> encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
> ```

Thank you for providing an example. This seems more like an issue with the literal Regexp support in general than with `Regexp.escape`. You can trigger the issue without `Regexp.escape`:

```ruby
re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8
```

It seems to require that you specify Unicode properties inside an interpolated string that isn't in UTF-8. You get a different error without that Unicode character at the end:

```ruby
re = /#{"\\p{In_Arabic}".encode("US-ASCII")}/
# invalid character property name {In_Arabic}: /\p{In_Arabic}/
```

Using `Regexp.new` instead of a literal Regexp may work around the issue:

```ruby
search_text = "foo"
s_search = Regexp.escape(search_text)
re_prefix = /\p{In_Arabic}.+ /
s_search.prepend re_prefix.source
_re = Regexp.new("^#{s_search}|(?<=– |: )#{s_search}")
```

https://bugs.ruby-lang.org/issues/21709#change-115301
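Another way to sidestep the mismatch, while the literal Regexp behavior stands, is to give the escaped text the encoding of its input before building the pattern, and to compile with `Regexp.new` as suggested in the message above. The sketch below is an illustration under stated assumptions, not a fix endorsed in this thread: `escape_preserving_encoding` and `build_search_regexp` are hypothetical helpers, and the `force_encoding` step assumes the input uses an ASCII-compatible encoding.

```ruby
# Hypothetical helper: escape `text`, then re-tag the result with the
# input's encoding. Assumes an ASCII-compatible input encoding, so the
# ASCII-only bytes returned by Regexp.escape are also valid in it.
def escape_preserving_encoding(text)
  Regexp.escape(text).force_encoding(text.encoding)
end

# Hypothetical helper mirroring the example from this thread, compiled
# with Regexp.new instead of a literal Regexp.
def build_search_regexp(search_text)
  s_search = escape_preserving_encoding(search_text)
  s_search.prepend(/\p{In_Arabic}.+ /.source)
  Regexp.new("^#{s_search}|(?<=– |: )#{s_search}")
end

re = build_search_regexp("foo")
puts re.encoding
# Three Arabic letters followed by " foo", written with escapes.
puts re.match?("\u0627\u0628\u062C foo")
```

Re-tagging with `force_encoding` changes only the encoding label, not the bytes, which is why it is limited to cases where the escaped bytes are valid in the target encoding.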
Issue #21709 has been updated by thyresias (Thierry Lambert).

Ok for the workaround, but don't you think all this is inconsistent? For me, it's a bug, not a feature. ^_^

https://bugs.ruby-lang.org/issues/21709#change-115302
Issue #21709 has been updated by jeremyevans0 (Jeremy Evans).

thyresias (Thierry Lambert) wrote in #note-4:

> Ok for the workaround, but don't you think all this is inconsistent? For me, it's a bug, not a feature. ^_^

I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in `Regexp.escape`. In general, US-ASCII strings are implicitly convertible to UTF-8 strings, so having `Regexp.escape` return a US-ASCII string for data that is solely US-ASCII is reasonable. This implicit use of US-ASCII happens in other cases:

```
# Literal Symbol
$ ruby -e "p :a.encoding"
#<Encoding:US-ASCII>

# Array#join
$ ruby -e "p [].join.encoding"
#<Encoding:US-ASCII>

# Literal Regexp
$ ruby -e "p //.encoding"
#<Encoding:US-ASCII>
```

https://bugs.ruby-lang.org/issues/21709#change-115303
Issue #21709 has been updated by thyresias (Thierry Lambert).

jeremyevans0 (Jeremy Evans) wrote in #note-5:

> I agree this represents a bug, which is why I changed the status back to Open. However, I think the bug is in the literal Regexp support, not in `Regexp.escape`.

Thank you. I agree with your analysis of the bug origin: should I edit this to re-qualify it as "inconsistent Regexp interpolation behavior", and update the example code using your examples?

https://bugs.ruby-lang.org/issues/21709#change-115306
Issue #21709 has been updated by jeremyevans0 (Jeremy Evans).

thyresias (Thierry Lambert) wrote in #note-6:

> Thank you. I agree with your analysis of the bug origin: should I edit this to re-qualify it as "inconsistent Regexp interpolation behavior", and update the example code using your examples?

Sure, that sounds like a good idea.

https://bugs.ruby-lang.org/issues/21709#change-115313
Issue #21709 has been updated by thyresias (Thierry Lambert).

Subject changed from Inconsistent encoding by Regexp.escape to Regexp interpolation is inconsistent with String interpolation

jeremyevans0 (Jeremy Evans) wrote in #note-7:

> Sure, that sounds like a good idea.

It seems I cannot change the description, just the title. Should I open a new bug report?

As an aside, you said about the encoding of the result of `Regexp.escape`:

> This is not a bug, it is deliberate behavior for ASCII-only strings in `rb_reg_quote` (internal function called by `Regexp.escape`):

What is the logic in this? It's surprising that the encoding of the output does not match the encoding of the input, and I read somewhere that Matz follows the principle of least surprise...

https://bugs.ruby-lang.org/issues/21709#change-115330
Issue #21709 has been updated by jeremyevans0 (Jeremy Evans).

thyresias (Thierry Lambert) wrote in #note-8:

> jeremyevans0 (Jeremy Evans) wrote in #note-7:
>
> > Sure, that sounds like a good idea.
>
> It seems I cannot change the description, just the title. Should I open a new bug report?

Updating just the title is fine. I don't think you need to open a new bug report.

> As an aside, you said about the encoding of the result of `Regexp.escape`:
>
> > This is not a bug, it is deliberate behavior for ASCII-only strings in `rb_reg_quote` (internal function called by `Regexp.escape`):
>
> What is the logic in this? It's surprising that the encoding of the output does not match the encoding of the input, and I read somewhere that Matz follows the principle of least surprise...

The related line was last changed in commit:0f4199fb56ec12dae32a6fa099f15aaa7e55d10f. However, that appears to be a bug fix, and even before that, the function was designed to return US-ASCII for ASCII-only strings. Looks like the actual change was made in commit:b2e60b2ce7a7cbcb8a67ac78606a18d3c2591d81. The reasoning given:

```
(rb_reg_quote): return ascii-8bit string if the argument is ascii-only to generate encoding generic regexp if possible.
```

https://bugs.ruby-lang.org/issues/21709#change-115345
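The phrase "encoding generic regexp" in that commit message can be illustrated with `Regexp#fixed_encoding?`. This is only a sketch of the idea, assuming that a pattern built from ASCII-only text compiles to a regexp without a fixed encoding; `generic_regexp?` is a hypothetical helper name.

```ruby
# Hypothetical helper: true when the pattern built from the escaped text
# is not tied to a single encoding (Regexp#fixed_encoding? is false).
def generic_regexp?(text)
  !Regexp.new(Regexp.escape(text)).fixed_encoding?
end

ascii_re = Regexp.new(Regexp.escape("a.b"))

# A regexp without a fixed encoding can be matched against strings in
# different ASCII-compatible encodings, even when they hold non-ASCII data.
utf8_text   = "être a.b"
latin1_text = "être a.b".encode("ISO-8859-1")

p generic_regexp?("a.b")         # ASCII-only pattern
p generic_regexp?("être")        # pattern containing a non-ASCII character
p ascii_re.match?(utf8_text)
p ascii_re.match?(latin1_text)
```

A pattern containing non-ASCII characters is tied to its source encoding, which is why the two calls to `generic_regexp?` are expected to differ.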
Issue #21709 has been updated by thyresias (Thierry Lambert).

Ok. Here is code, based on your examples, that shows the inconsistency between Regexp and String interpolation:

```ruby
# inconsistent Regexp/String interpolation behavior

prefix = '\p{In_Arabic}'
suffix = '\p{In_Arabic}'.encode('US-ASCII')

begin
  re = /#{prefix}#{suffix}/
rescue => ex
  puts "fail: #{ex.message} (#{ex.class})"
  # fail: encoding mismatch in dynamic regexp : US-ASCII and UTF-8 (RegexpError)
end

s = "#{prefix}#{suffix}"
re = /#{s}/
puts "ok: #{s.inspect} (#{s.encoding}) -> #{re.inspect} (#{re.encoding})"
# ok: "\\p{In_Arabic}\\p{In_Arabic}" (UTF-8) -> /\p{In_Arabic}\p{In_Arabic}/ (UTF-8)

begin
  re = /#{suffix}/
rescue => ex
  puts "fail: #{ex.message} (#{ex.class})"
  # fail: invalid character property name {In_Arabic}: /\p{In_Arabic}/ (RegexpError)
end

s = "#{suffix}"
re = /#{s}/
puts "ok: #{s.inspect} (#{s.encoding}) -> #{re.inspect} (#{re.encoding})"
# ok: "\\p{In_Arabic}" (UTF-8) -> /\p{In_Arabic}/ (UTF-8)
```

https://bugs.ruby-lang.org/issues/21709#change-115347
Issue #21709 has been updated by naruse (Yui NARUSE).

```ruby
re = /#{"\\p{In_Arabic}".encode("US-ASCII")}\u1234/
# encoding mismatch in dynamic regexp : US-ASCII and UTF-8
```

This behavior looks like a bug.

https://bugs.ruby-lang.org/issues/21709#change-115590
Issue #21709 has been updated by Eregon (Benoit Daloze).

Right, I think Regexp interpolation should be closer to String interpolation; currently it's its own separate thing with rather weird rules. It reminds me of some other issues related to Regexp interpolation, like #20407 and its linked issues.

https://bugs.ruby-lang.org/issues/21709#change-115693
Issue #21709 has been updated by augustingbpe (Augustin Gottlieb).

Hi everyone, I made an attempt at fixing this issue in the PR below. I hope it helps, and that it also helps me get deeper into the issue. All the tests are passing.

https://github.com/ruby/ruby/pull/16224

https://bugs.ruby-lang.org/issues/21709#change-116546
participants (5)
- augustingbpe (Augustin Gottlieb)
- Eregon (Benoit Daloze)
- jeremyevans0 (Jeremy Evans)
- naruse (Yui NARUSE)
- thyresias (Thierry Lambert)