[ruby-core:124664] [Ruby Bug#21859] Inconsistent behaviors in Regexp lookbehind/lookahead with capture groups
Issue #21859 has been reported by trinistr (Alexander Bulancov).

----------------------------------------
Bug #21859: Inconsistent behaviors in Regexp lookbehind/lookahead with capture groups
https://bugs.ruby-lang.org/issues/21859

* Author: trinistr (Alexander Bulancov)
* Status: Open
* ruby -v: ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN
----------------------------------------
First issue: `Regexp.linear_time?` is `false` for a positive lookahead with a capture, but `true` for a positive lookbehind:

```ruby
irb(main):002> Regexp.linear_time?(/(?=(a))/)
=> false
irb(main):003> Regexp.linear_time?(/(?<=(a))/)
=> true
```

This should be `false` in both cases.

Second issue: Capture group is allowed in a negative lookahead, but causes a `SyntaxError` in a negative lookbehind:

```ruby
irb(main):001> /(?!(a))b/
=> /(?!(a))b/
irb(main):002> /(?<!(a))b/
/home/alex/.local/share/mise/installs/ruby/4.0.1/lib/ruby/gems/4.0.0/gems/irb-1.16.0/exe/irb:9:in '<top (required)>': (irb):2: invalid pattern in look-behind: /(?<!(a))b/ (SyntaxError)
```

I believe such a capture group can never capture anything, so it probably should be an error in both cases.

--
https://bugs.ruby-lang.org/
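The second issue can be reproduced without aborting the whole script. The following is a minimal sketch, assuming the patterns are compiled at runtime with `Regexp.new`, where an invalid pattern raises `RegexpError` rather than the `SyntaxError` seen for a literal. The helper `compile_error` and the constant `SOURCES` are hypothetical names introduced only for this illustration.

```ruby
# Compile a pattern source at runtime. Returns nil when the pattern is
# accepted, or the engine's error message when it is rejected.
# (compile_error is a hypothetical helper, not part of Ruby.)
def compile_error(source)
  Regexp.new(source)
  nil
rescue RegexpError => e
  e.message
end

SOURCES = [
  '(?!(a))b',     # capture inside a negative lookahead
  '(?<!(a))b',    # capture inside a negative lookbehind
  '(?<!(?:a))b',  # non-capturing group inside a negative lookbehind
  '(?<=(a))b'     # capture inside a positive lookbehind
].freeze

SOURCES.each do |source|
  error = compile_error(source)
  puts format('%-14s %s', source, error ? "rejected: #{error}" : 'accepted')
end
```

Only the capturing group inside the negative lookbehind is expected to be rejected; the non-capturing variant is included to show that the restriction concerns the capture, not the group as such.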
Issue #21859 has been updated by tompng (tomoya ishida).

First issue
> This should be false in both cases.
I think `Regexp.linear_time?(/(?<=(a))/)` matches in linear time. If the issue is only about the inconsistency between lookahead and lookbehind, it's not a bug. Here's an example:

~~~ruby
Regexp.linear_time?(/x+(?=(a))/) #=> false
Regexp.linear_time?(/x+(?<=(a))/) #=> true
/x+(?=(a))/.match?('x' * 100000) #=> processing time: 28.599804s not linear_time
/x+(?<=(a))/.match?('x' * 100000) #=> processing time: 0.016630s linear_time
~~~

Second issue: `/(?!(a))b/` `/(?<!(a))b/`
> I believe such a capture group can never capture anything
Capture group in negative lookahead can capture and can be used inside negative lookahead. For negative lookbehind, I think it's just a restriction of Onigmo.

~~~ruby
regexp = /(?!([a-z])\1)[a-z]{2}/
regexp.match?('ab') #=> true
regexp.match?('aa') #=> false
~~~
Issue #21859 has been updated by trinistr (Alexander Bulancov).
> I think `Regexp.linear_time?(/(?<=(a))/)` matches in linear time.
I apologize, it seems I got distracted and forgot to actually check the execution times. But this is interesting behavior. Maybe constant-size lookahead can be optimized to also be linear? It seems strange to me that these cases are so similar but behave very differently.
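Execution times like the ones quoted above can be checked with a few lines of Ruby. This is only a sketch: `seconds_to_match`, `LOOKAHEAD` and `LOOKBEHIND` are hypothetical names, the input sizes are kept small so that even a quadratic match finishes quickly, and the absolute numbers will differ between machines and Ruby versions.

```ruby
# Time a single match? call using the monotonic clock.
# (seconds_to_match is a hypothetical helper for this illustration.)
def seconds_to_match(regexp, input)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  regexp.match?(input)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

LOOKAHEAD  = /x+(?=(a))/
LOOKBEHIND = /x+(?<=(a))/

# Doubling the input size makes the growth rate visible: roughly 2x per
# step suggests linear time, roughly 4x per step suggests quadratic time.
[1_000, 2_000, 4_000].each do |n|
  input = 'x' * n
  puts format('n=%5d lookahead=%.4fs lookbehind=%.4fs',
              n,
              seconds_to_match(LOOKAHEAD, input),
              seconds_to_match(LOOKBEHIND, input))
end
```

Comparing the ratio between successive sizes is more informative than any single timing, since small inputs are dominated by measurement noise.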
> Capture group in negative lookahead can capture and can be used inside negative lookahead.
I've not been able to find a case where the capture group actually captures, rather than the overall regexp merely matching. Isn't it impossible? To match, regexp needs to satisfy *negative* lookahead, so there should *not be* anything to capture.

```ruby
regexp = /(?!([a-z])\1)[a-z]{2}/
regexp.match('ab') # => #<MatchData "ab" 1:nil>
regexp.match('aabaa') # => #<MatchData "ab" 1:nil>
regexp = /[a-z]{2}(?!([a-z])\1)/
regexp.match('aabaa') # => #<MatchData "aa" 1:nil>
```
Issue #21859 has been updated by tompng (tomoya ishida).
> Isn't it impossible? To match, regexp needs to satisfy negative lookahead, so there should not be anything to capture.
As you wrote, captures are not available OUTSIDE of negative lookahead, nor in MatchData. But in `/(?!([a-z])\1)[a-z]{2}/`, `\1` is actually using the capture. It's available INSIDE negative lookahead. As a result, `"aa"`, which matches `([a-z])\1`, does not match the overall regexp. So capture in negative lookahead is a valid and meaningful pattern.
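The point about the capture being usable only inside the lookahead can be made concrete with the same pattern. In this sketch the constant name `NO_DOUBLE` and the sample inputs are assumptions added for illustration.

```ruby
# Match two lowercase letters, but refuse any position where those two
# letters are identical. The backreference \1 is what does the refusing.
NO_DOUBLE = /(?!([a-z])\1)[a-z]{2}/

%w[ab aa ba zz].each do |input|
  match = NO_DOUBLE.match(input)
  # Group 1 is filled only while the lookahead is being tried. Once the
  # negative lookahead has succeeded, the group is reported as nil.
  group1 = match ? match[1].inspect : 'n/a'
  puts "#{input}: matched=#{!match.nil?} group1=#{group1}"
end
```

Inputs with a doubled letter are rejected because of the capture, even though the capture never shows up in the resulting `MatchData`.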
Issue #21859 has been updated by trinistr (Alexander Bulancov).

Thank you, I understand now what you meant. Should this issue be changed to a feature request?
Issue #21859 has been updated by Eregon (Benoit Daloze).

Status changed from Open to Closed

An interesting fact is that TruffleRuby/TRegex seems to report exactly the opposite for these 4 Regexps as to whether they are linear:

```
truffleruby 33.0.1 (2026-01-20), like ruby 3.3.7, Oracle GraalVM Native [x86_64-linux]
irb(main):001> Regexp.linear_time?(/(?=(a))/)
=> true
irb(main):002> Regexp.linear_time?(/(?<=(a))/)
=> false
irb(main):003> Regexp.linear_time?(/x+(?=(a))/)
=> true
irb(main):004> Regexp.linear_time?(/x+(?<=(a))/)
=> false
```

```
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [x86_64-linux]
irb(main):001> Regexp.linear_time?(/(?=(a))/)
=> false
irb(main):002> Regexp.linear_time?(/(?<=(a))/)
=> true
irb(main):003> Regexp.linear_time?(/x+(?=(a))/)
=> false
irb(main):004> Regexp.linear_time?(/x+(?<=(a))/)
=> true
```

I think that means it depends a lot on the specifics of the Regexp engine implementation. I made a summary back then in https://bugs.ruby-lang.org/issues/19104#note-3

(FWIW I think `/x+(?<=(a))/` can never match, since it would need to match both 'x' and 'a' for the same input character)
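The parenthetical remark about `/x+(?<=(a))/` can be checked directly. This sketch compares it with a variant that places the lookbehind before `x+`; the constant names `NEVER` and `AFTER_A` and the sample inputs are hypothetical and chosen only for illustration.

```ruby
# The lookbehind sits after x+, so it inspects the last character that x+
# consumed. That character is always 'x', so requiring 'a' cannot succeed.
NEVER = /x+(?<=(a))/

# Here the lookbehind sits before x+, so it inspects the character that
# precedes the run of x's, which can be 'a'.
AFTER_A = /(?<=(a))x+/

inputs = ['x', 'xa', 'ax', 'xxax', 'a' + 'x' * 5]
inputs.each do |input|
  puts "#{input.inspect}: never=#{NEVER.match?(input)} after_a=#{AFTER_A.match?(input)}"
end
```

This also explains why timing `/x+(?<=(a))/` on a string of only `x` characters measures a failing match: the engine has to rule out every start position.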
> Should this issue be changed to a feature request?
I think we should close this rather. And if you want something specific and have a use for it then filing a new feature request is best.

FWIW I saw on 4.0.1 `Regexp.linear_time?(/(?=a)/)` is true but `Regexp.linear_time?(/(?=(a))/)` is false. I don't think capture groups in lookahead or lookbehind are common though.
Issue #21859 has been updated by Eregon (Benoit Daloze).

Useful context for this issue, which would make sense to add to the description, is this `Regexp` item from the NEWS of 3.3: https://github.com/ruby/ruby/blob/master/doc/NEWS/NEWS-3.3.0.md
> The cache-based optimization now supports lookarounds and atomic groupings. That is, match for Regexp containing these extensions can now also be performed in linear time to the length of the input string. However, these cannot contain captures and cannot be nested. [Feature #19725]
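Whether a given lookaround qualifies for this optimization can be probed per pattern. The sketch below assumes a Ruby that provides `Regexp.linear_time?` and returns an empty report otherwise; `PATTERNS` and `linear_time_report` are hypothetical names. As the thread shows, the answers depend on the engine and version, so no expected values are listed.

```ruby
PATTERNS = [
  /(?=a)/,       # lookahead without a capture
  /(?=(a))/,     # lookahead with a capture
  /(?<=a)/,      # lookbehind without a capture
  /(?<=(a))/,    # lookbehind with a capture
  /x+(?=(a))/,
  /x+(?<=(a))/
].freeze

# Map each pattern source to the engine's own linear-time verdict.
# (linear_time_report is a hypothetical helper for this illustration.)
def linear_time_report(patterns)
  return {} unless Regexp.respond_to?(:linear_time?)

  patterns.to_h { |regexp| [regexp.source, Regexp.linear_time?(regexp)] }
end

linear_time_report(PATTERNS).each do |source, linear|
  puts "#{source.ljust(14)} linear_time?=#{linear}"
end
```

Running the same script under different implementations is a quick way to see the kind of disagreement reported earlier in this thread.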
Issue #21859 has been updated by trinistr (Alexander Bulancov).
> Useful context for this issue, which would make sense to add to the description, is this Regexp item from the NEWS of 3.3
Yes, thank you, that's what led me to make the wrong assumption about the linearity of lookbehind.
> I don't think capture groups in lookahead or lookbehind are common though.
Surprisingly, there is a smattering of capturing lookaheads in the Ruby distribution (for example, in `ext/extmk` and `optparse`), though they are definitely not common.
> I think we should close this rather.
Good enough for me.
participants (3)
- Eregon (Benoit Daloze)
- tompng (tomoya ishida)
- trinistr (Alexander Bulancov)