[ruby-core:116491] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

List overview All Threads
Download

newer

older

[ruby-core:116941] [Ruby master...

[ruby-core:116983] [Ruby master...

make_now_just (Hiroya Fujinami)

30 Jan 2024 30 Jan '24

2:53 a.m.

Issue #20225 has been reported by make_now_just (Hiroya Fujinami). ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Priority: Normal * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

Show replies by date

make_now_just (Hiroya Fujinami)

1 Feb 1 Feb

5:59 a.m.

New subject: [ruby-core:116544] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

Issue #20225 has been updated by make_now_just (Hiroya Fujinami). These two specs block this change: - <https://github.com/ruby/spec/blob/3267023951a99dc9b79febab10e1ae04c18d5190/language/regexp/empty_checks_spec.rb#L24-L42> - <https://github.com/ruby/spec/blob/3267023951a99dc9b79febab10e1ae04c18d5190/language/regexp/empty_checks_spec.rb#L111-L134> We need to fix these specs before fixing this issue. /cc @Eregon ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225#change-106561 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Priority: Normal * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

make_now_just (Hiroya Fujinami)

6:05 a.m.

New subject: [ruby-core:116545] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

Issue #20225 has been updated by make_now_just (Hiroya Fujinami). In another problematic case, the following example never finishes its matching. ``` p /((?=(a)))*/ =~ "a" ``` ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225#change-106562 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Priority: Normal * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

Eregon (Benoit Daloze)

3 Feb 3 Feb

2:25 p.m.

New subject: [ruby-core:116563] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

Issue #20225 has been updated by Eregon (Benoit Daloze). Could you show how the behavior would change for these specs, e.g. with a diff of the specs or by moving the existing ones under `ruby_version_is ""..."3.4" do` and adding the new behavior under `ruby_version_is "3.4" do` ? ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225#change-106580 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Priority: Normal * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

Eregon (Benoit Daloze)

2:29 p.m.

New subject: [ruby-core:116564] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

Issue #20225 has been updated by Eregon (Benoit Daloze). FWIW current TruffleRuby behavior for the two examples in the description is to return 0 for both. Is that what you propose to change the behavior to? ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225#change-106581 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Priority: Normal * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

jirkamarsik (Jirka Marsik)

10:55 p.m.

New subject: [ruby-core:116570] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

Issue #20225 has been updated by jirkamarsik (Jirka Marsik). As I understand it, the idea behind the null check is for the regex matcher to be able to identify unproductive branches in the regex execution, branches which are guaranteed to never terminate. When executing the expression `X*`, where `X` is some subexpression, the greedy `*` quantifier will always prefer to enter and execute `X` and will only stop when `X` can no longer be matched. The null check lets the regex matcher know that no part of the execution state has changed since the last iteration of the loop. At that point, the regex matcher knows that continuing the loop will never produce a match and so it can afford to terminate the loop. This is because the state of the matcher is now at a state (matcher position, matched capture groups...) such that no other alternative with higher priority could produce a different state. In other words, this is the highest-priority state which could still yield a match. We can look at a simplified example from the spec: ``` /(a|\2b|())*/.match("ab").to_a.should == ["ab", "", ""] ``` and we can look at traces of 3 possible executions of this regular expression: ``` A: ENTER * -> ALTERNATIVE 1 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 2 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ... B: ENTER * -> ALTERNATIVE 1 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 2 -> ENTER * -> ALTERNATIVE 3 -> ENTER * -> ALTERNATIVE 3 -> EXIT * C: ENTER * -> ALTERNATIVE 1 -> ENTER * -> ALTERNATIVE 3 -> EXIT * ``` Trace `A` corresponds to the endless "naive" traversal with no null check, where we always choose the highest-priority option (alternatives are tried from left to right, greedy quantifiers always try to match). In trace `B`, we identify the endless loop of `ENTER * LOOP -> ALTERNATIVE 3 -> ...` and use the null check that's implemented currently in Ruby to terminate the loop after alternative 3 is takes 2 times in a row (the first time it updates capture group 2, the second time it doesn't update any state). Note that trace `B` has lower priority than trace `A` because at one point, it chooses `EXIT *` instead of `ENTER *`. However, trace `A` never finishes the match and even though there exists an infinity of traces that have higher priority than `B` (those that would decide to terminate the loop in some later iteration than `B`), because of the null check, we know that they are all observably equivalent to `B` and so the regex matcher can proceed with `B` as the highest-priority match. In contrast, `C` represents the trace that would be produced by a regex implementation whose null check would ignore updates to the state of capture groups and would terminate the loop after the first iteration that matches the empty string. However, the result of `C` is not the highest priority match, because there exists `B`, which has higher priority, since it chose to execute the body of a greedy quantifier instead of skipping it. I would argue that since the semantics of regular expression search is to return the highest priority match, where priority is determined by the ordering of alternatives and the greediness of quantifiers, the correct answer is the result of executing `B`. Now for the example from your issue, this is clearly a bug and not intended behavior of the null check. The null check should never cause an expression not to match, since it should only prune branches which are known to not terminate. This expression should evaluate to `0`: ``` /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" ``` and this expression should terminate: ``` /((?=(a)))*/ =~ "a" ``` The fact that neither of these is true means that there is a bug (or two bugs) in Ruby's implementation of regular expressions, quite possibly in the null check. However, it shouldn't be the reason to strengthen the null check so that it starts killing branches that could produce a valid match. From my experience working on regular expression implementations, Ruby's null check behavior is not uncommon. Python regular expression's null checks take into account capture groups too, as do the null checks in Oracle Database regular expressions. ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225#change-106586 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Priority: Normal * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

jirkamarsik (Jirka Marsik)

11:04 p.m.

New subject: [ruby-core:116571] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

Issue #20225 has been updated by jirkamarsik (Jirka Marsik). In TruffleRuby, we implement null checks which take into account both matching position and captures, in the spirit of Ruby's regular expressions. Both of the examples from your original issue description evaluate to `0`, and `/((?=(a)))*/ =~ "a"` terminates, so none of these are in conflict with having null checks that care about capture groups and capture-aware null checks do not imply the inconsistent behavior that you found. ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225#change-106587 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Priority: Normal * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

mame (Yusuke Endoh)

14 Mar 14 Mar

9:42 a.m.

New subject: [ruby-core:117149] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

Issue #20225 has been updated by mame (Yusuke Endoh). This was discussed at the February dev meeting and @matz said "give it a try." @make_now_just Can you proceed it? ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225#change-107236 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

Eregon (Benoit Daloze)

2:40 p.m.

New subject: [ruby-core:117172] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

Issue #20225 has been updated by Eregon (Benoit Daloze). What is the decision, to return 0 in both cases (like TruffleRegex behaves), and fix the behavior in Onigmo? As @jirkamarsik explained above, that is the correct and intended behavior for the null loop optimization, which like any optimization should be transparent and not change semantics. Memoization can be fixed to deal with this, see https://bugs.ruby-lang.org/issues/20225#note-6, alternatively memoization could be not used for such regexps if it's too difficult. ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225#change-107262 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

make_now_just (Hiroya Fujinami)

15 Mar 15 Mar

12:37 a.m.

New subject: [ruby-core:117191] [Ruby master Bug#20225] Inconsistent behavior of regex matching for a regex has a null loop

Issue #20225 has been updated by make_now_just (Hiroya Fujinami). On dev meeting, matz concluded that, but I also think a null-loop bug must be fixed. However, there are still remaining issues: - I'm not sure "what the correct behavior of capture-aware null-loop detection". (I also am not sure TruffleRegex's behavior is correct.) - Additionally, I'm not sure it is possible to implement such correct behavior to Onigmo in an efficient way. - I also wonder if it can work with memoization. This issue is complex. ---------------------------------------- Bug #20225: Inconsistent behavior of regex matching for a regex has a null loop https://bugs.ruby-lang.org/issues/20225#change-107280 * Author: make_now_just (Hiroya Fujinami) * Status: Open * Assignee: make_now_just (Hiroya Fujinami) * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- Usually, in Ruby (Onigmo), when a null loop (a loop consuming no characters) occurs on regex matching, this loop is terminated. But, if a loop has a capture and some complex condition is satisfied, this causes backtracking. This behavior invokes unexpected results, for example, ```ruby p /(?:.B.(?<a>(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => nil p /(?:.B.(?:(?:[C-Z]|.)*)+){2}/ =~ "ABCABC" # => 0 ``` Because the above regex has a capture and the below does not, different matching results are returned. It is not very intuitive that the presence of a capture changes the matching result. The detailed condition for changing the null-loop behavior is 1) a previous capture in this loop holds the empty string, and 2) this capture's position is different from the current matching position. This condition is checked in `STACK_NULL_CHECK_MEMST` (https://github.com/ruby/ruby/blob/bbb7ab906ec64b963bd4b5d37e47b14796d64371/…). Perhaps, you cannot understand what this condition means. Don't worry, I also cannot understand. This condition has been introduced for at least 20 years, and no one may remember the reason for this necessity. (If you know, please tell me!) Even if there is a reason, I believe that there is no reasonable authority for allowing counter-intuitive behavior, such as the above example. This behavior can also cause memoization to be buggy. Memoization relies on the fact that backtracking only depends on positions and states (byte-code offsets of a regex). However, this condition additionally refers to captures, and the memoization is broken. My proposal is to **correct this inconsistent behavior**. Specifically, a null loop should be determined solely on the basis of whether the matching position has changed, without referring to captures. This fix changes the behavior of regex matching, but I believe that the probability that this will actually cause backward compatibility problems is remarkably low. This is because I have never seen any mention of this puzzling behavior before. -- https://bugs.ruby-lang.org/

days inactive

111

days old

ruby-core@ml.ruby-lang.org

Manage subscription

9 comments

4 participants

tags (0)

participants (4)

Eregon (Benoit Daloze)
jirkamarsik (Jirka Marsik)
make_now_just (Hiroya Fujinami)
mame (Yusuke Endoh)