[ruby-core:124714] [Ruby Bug#21870] Regexp: Warnings when using multiple non-overlapping \p{...} classes
Issue #21870 has been reported by jneen (Jeanine Adkisson).

----------------------------------------
Bug #21870: Regexp: Warnings when using multiple non-overlapping \p{...} classes
https://bugs.ruby-lang.org/issues/21870

* Author: jneen (Jeanine Adkisson)
* Status: Open
* ruby -v: 4.0.0, 4.0.1, earlier versions to a lesser extent
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN
----------------------------------------
```ruby
$VERBOSE = true

# warning: character class has duplicated range: /[\p{Word}\p{S}]/
regex = /[\p{Word}\p{S}]/
```

As far as I can tell this is a perfectly valid and non-overlapping set of unicode properties, but I am still being spammed with warnings. Using `/(\p{Word}|\p{S})/` is kind of a workaround, but it is slower.

--
https://bugs.ruby-lang.org/
Issue #21870 has been updated by tompng (tomoya ishida).

I found 130 characters (5 sets of 26 letters) that match both `\p{S}` and `\p{Word}`. They look like alphabet-ish symbol characters:

~~~ruby
(0..0x10ffff).select{(s=''<<it; s=~/\p{Word}/&&s=~/\p{S}/) rescue false}.map{''<<it}.join
# ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
# ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
# 🄰🄱🄲🄳🄴🄵🄶🄷🄸🄹🄺🄻🄼🄽🄾🄿🅀🅁🅂🅃🅄🅅🅆🅇🅈🅉
# 🅐🅑🅒🅓🅔🅕🅖🅗🅘🅙🅚🅛🅜🅝🅞🅟🅠🅡🅢🅣🅤🅥🅦🅧🅨🅩
# 🅰🅱🅲🅳🅴🅵🅶🅷🅸🅹🅺🅻🅼🅽🅾🅿🆀🆁🆂🆃🆄🆅🆆🆇🆈🆉
~~~

I'm not sure how to read Unicode properties, but it looks like these characters are Alphabetic:Yes and also in the Other_Symbol category:
https://util.unicode.org/UnicodeJsps/character.jsp?a=%E2%92%B6
Issue #21870 has been updated by jneen (Jeanine Adkisson).

I see! So they do have some overlap. Is it really correct to warn here though? "Fixing" the warning would require falling back to manual unicode ranges.
Issue #21870 has been updated by jneen (Jeanine Adkisson).

Another example of this is `/[\p{Word}\p{Cf}]/`, where the two classes seem to overlap precisely on ZWNJ (U+200C) and ZWJ (U+200D).

```ruby
[1] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16 }
=> ["200c", "200d"]
[2] pry(main)> /[\p{Word}\p{Cf}]/
(pry):5: warning: character class has duplicated range: /[\p{Word}\p{Cf}]/
=> /[\p{Word}\p{Cf}]/
[3] pry(main)>
```

----------------------------------------
Bug #21870: Regexp: Warnings when using slightly overlapping \p{...} classes
https://bugs.ruby-lang.org/issues/21870#change-116324

* Author: jneen (Jeanine Adkisson)
* Status: Open
* ruby -v: 4.0.1
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN
----------------------------------------
```ruby
$VERBOSE = true

# warning: character class has duplicated range: /[\p{Word}\p{S}]/
regex = /[\p{Word}\p{S}]/
```

As far as I can tell this is a perfectly valid ~~and non-overlapping~~ set of unicode properties, but I am still being spammed with warnings. Using `/(\p{Word}|\p{S})/` is kind of a workaround, but it is slower.

Edit: They do overlap somewhat, but I think the deeper issue is there is not a convenient way to express this without falling back to raw unicode ranges.

--
https://bugs.ruby-lang.org/
Issue #21870 has been updated by jneen (Jeanine Adkisson).

Description updated

That specific case also appears to have changed, e.g. on 3.4.1:

```ruby
[2] pry(main)> (0..0x10ffff).select{(s=[it].pack('U'); s=~/\p{Word}/&&s=~/\p{Cf}/) rescue false}.map{it.to_s 16}
=> []
```

Maybe for preset classes like `\p{...}` and `[[:alpha:]]` we should only warn if one range completely subsumes another?
Issue #21870 has been updated by mame (Yusuke Endoh).

jneen (Jeanine Adkisson) wrote in #note-7:
> That specific case also appears to have changed, e.g. on 3.4.1:

It is an intentional bug fix. See #21503.

While I understand your trouble, this warning is functioning exactly as intended. How do you suggest resolving it?
Issue #21870 has been updated by trinistr (Alexander Bulancov).

> Using `/(\p{Word}|\p{S})/` is kind of a workaround, but it is slower.

Have you tried a non-capturing group? `/(?:\p{Word}|\p{S})/` should have better performance.
Issue #21870 has been updated by kddnewton (Kevin Newton).

This might be a good opportunity to add the `||` operator from the Unicode spec (https://www.unicode.org/reports/tr18/#Subtraction_and_Intersection). We could make that one not warn, because it's explicitly desired. As in:

```ruby
$VERBOSE = true

regex = /[\p{Word}\p{S}]/   # warning
regex = /[\p{Word}||\p{S}]/ # no warning
```
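For context on the `||` idea: Ruby character classes already accept the `&&` intersection operator, so a union operator would be its natural counterpart. The sketch below uses only syntax that exists today; `||` itself is a proposal in this thread, not something Ruby implements. The variable names and sample strings are illustrative assumptions.

```ruby
# Intersection with && is existing Regexp syntax: [a-w] without [c-g].
no_c_to_g = /[a-w&&[^c-g]]/
puts no_c_to_g.match?("a")   # true
puts no_c_to_g.match?("d")   # false

# The same operator works with Unicode properties. This class holds the
# characters that are both \p{Word} and \p{S}, i.e. the overlap that
# triggers the warning discussed in this thread.
word_and_symbol = /[\p{Word}&&[\p{S}]]/
puts word_and_symbol.match?("a")   # false: a letter, but not a symbol
puts word_and_symbol.match?("+")   # false: a symbol, but not a word character

# U+24B6 (circled A) is one of the characters reported earlier in the
# thread as belonging to both sets on Ruby 4.0.
puts word_and_symbol.match?("\u24B6")
```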
Issue #21870 has been updated by jneen (Jeanine Adkisson).

trinistr (Alexander Bulancov) wrote in #note-11:
> > Using `/(\p{Word}|\p{S})/` is kind of a workaround, but it is slower.
>
> Have you tried a non-capturing group? `/(?:\p{Word}|\p{S})/` should have better performance.

This is what I actually tested. Still much slower.

mame (Yusuke Endoh) wrote in #note-9:
> jneen (Jeanine Adkisson) wrote in #note-7:
> > That specific case also appears to have changed, e.g. on 3.4.1:
>
> It is an intentional bug fix. See #21503.
>
> While I understand your trouble, this warning is functioning exactly as intended. How do you suggest resolving it?

I suppose the question is - what is the purpose of a warning here? What fix are you asking the code author to implement? If my downstream users are running with warnings on and Ruby prints 1000 lines of warnings loading my library, what exactly am I being warned about? Is there a specific danger to using overlapping character classes? Or should this kind of thing live in a linter like Rubocop?
Issue #21870 has been updated by maxfelsher (Max Felsher).

If I'm reading the history right, the warning was added in #1831 in order to catch mistakes like a regexp defined as `/[:lower:]/` (as opposed to `/[[:lower:]]/`, I assume). I can see the value in that, but it does seem like there should be a way to list overlapping character classes without a warning (and without turning warnings off completely).
Issue #21870 has been updated by jneen (Jeanine Adkisson).

That's a very interesting find! I do think it makes sense to warn if an explicitly written character repeats in a character class, or if the class begins and ends with a colon. But for overlapping unicode properties, there doesn't seem to be any danger in including both in a character class.

That said, there's still an argument that all of this is a job for a linter. Rubocop didn't exist until about a year after #1831 was opened.
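To make the #1831 mistake concrete, here is a small sketch of how the two spellings behave, followed by a hypothetical helper for the narrower "begins and ends with a colon" rule mentioned above. The constant names and `colon_wrapped_class?` are assumptions made for illustration; nothing here is taken from Onigmo's source.

```ruby
# [:lower:] is an ordinary character class containing ':', 'l', 'o', 'w', 'e', 'r'.
MISTAKEN = Regexp.new('[:lower:]')
# [[:lower:]] is the POSIX bracket expression for lowercase letters.
INTENDED = Regexp.new('[[:lower:]]')

puts MISTAKEN.match?("x")   # false: 'x' is not one of : l o w e r
puts MISTAKEN.match?(":")   # true
puts INTENDED.match?("x")   # true
puts INTENDED.match?(":")   # false

# Hypothetical check for the narrow rule. `inner` is the text between the
# outermost brackets of a single character class.
def colon_wrapped_class?(inner)
  inner.length >= 2 && inner.start_with?(':') && inner.end_with?(':')
end

puts colon_wrapped_class?(':lower:')        # true:  from /[:lower:]/
puts colon_wrapped_class?('[:lower:]')      # false: from /[[:lower:]]/
puts colon_wrapped_class?('\p{Word}\p{S}')  # false: overlapping properties
```

Under this rule the overlapping-property classes from this issue would stay silent, while the typo that motivated the warning would still be flagged.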
Issue #21870 has been updated by jneen (Jeanine Adkisson).

Some benchmarks:

```console
$ ruby --version
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [arm64-darwin25]
```

```ruby
require 'benchmark'

LENGTH = 1000000
REPEAT = 100
TEST_STR = 'a' * LENGTH

Benchmark.bm do |bm|
  bm.report "char class:" do
    REPEAT.times { /[\p{Word}\p{S}]*/o.match?(TEST_STR) }
  end

  bm.report "alternation:" do
    REPEAT.times { /(?:\p{Word}|\p{S})*/o.match?(TEST_STR) }
  end
end
```

output:

```
                  user     system      total        real
char class:   0.634908   0.302112   0.937020 (  0.937089)
alternation:  0.983069   0.449849   1.432918 (  1.433005)
```

The alternation syntax is understandably a bit slower, as it would be two nodes in the state machine rather than one unified range test. I expect this effect would be worse when more unicode properties are piled on (as they tend to be in practice), resulting in extra nodes.

Either way, `/[\p{Word}\p{S}]/` is a perfectly valid regular expression that as far as I know doesn't have any practical issues, so I don't think it is helpful to warn.
Issue #21870 has been updated by jneen (Jeanine Adkisson).

This isn't even possible to work around by targeting RUBY_VERSION, as Ruby warns even in unreachable cases:

```ruby
regex = if RUBY_VERSION < '4'
  /[\p{Word}\p{Cf}]/
else
  /[\p{Word}]/
end
```

still warns on Ruby 4+, even though the code is not reachable in that version.

----------------------------------------
Bug #21870: Regexp: Warnings when using slightly overlapping \p{...} classes
https://bugs.ruby-lang.org/issues/21870#change-116499

* Author: jneen (Jeanine Adkisson)
* Status: Open
* ruby -v: 4.0.1
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN
----------------------------------------
```ruby
$VERBOSE = true

# warning: character class has duplicated range: /[\p{Word}\p{S}]/
regex = /[\p{Word}\p{S}]/
```

As far as I can tell this is a perfectly valid and non-redundant set of unicode properties, but I am still being spammed with warnings. Using `/(?:\p{Word}|\p{S})/` is kind of a workaround, but it is slower (see benchmarks below), and also less clear.

They do overlap somewhat, but I think the deeper issue is there is not a convenient way to express this without falling back to raw unicode ranges. For a similar example, consider `/[\p{Word}\p{Cf}]/`, which overlap precisely on ZWJ and ZWNJ. Even with this very small overlap, Ruby issues a warning, despite neither class being removable without changing the meaning of the regexp. The regexp is valid and as far as I can tell has no practical issues - Onigmo seems to be capable of intersecting overlapping codepoint ranges.

This warning was introduced back in 2009 with #1831, to help surface instances of things like `/[:lower:]/` instead of `/[[:lower:]]/`, but even then the reporter suggested only warning if the class both begins and ends with `:`.

Is it appropriate to warn here? Is this a job best left to a static linter like Rubocop, which didn't exist at the time #1831 was opened? Or perhaps would it be better to warn only in the very specific case that #1831 was opened to address?

--
https://bugs.ruby-lang.org/
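The unreachable branch warns because regexp literals are compiled when the file is parsed, before `RUBY_VERSION` is ever compared. One possible way around that, offered as a sketch and not verified against every Ruby version, is to build the patterns from strings with `Regexp.new`, which compiles a pattern only when the call actually runs. The constant name `WORD_OR_FORMAT` is an assumption, and the two patterns simply mirror the example above.

```ruby
# Regexp.new compiles its pattern when the call executes, so only the
# branch selected for the running Ruby is ever compiled.
WORD_OR_FORMAT =
  if RUBY_VERSION < '4'
    Regexp.new('[\p{Word}\p{Cf}]')
  else
    # Per this thread, \p{Word} already covers ZWNJ and ZWJ on Ruby 4.
    Regexp.new('[\p{Word}]')
  end

puts WORD_OR_FORMAT.match?("a")   # true
puts WORD_OR_FORMAT.match?(" ")   # false
```

The trade-off is that patterns written as strings lose parse-time syntax checking, and a class that really does overlap on the running version would still warn when it is compiled.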
Issue #21870 has been updated by jneen (Jeanine Adkisson).

Having looked through the onigmo code a bit now, I can think of a few ways forward.

**a) Simply don't warn on overlapping ctype classes.** I believe this would only involve removing the check on line 1860 from regparse.c. This would preserve a warning for `/[:foo:]/`, as in #1831, as well as maybe rarer situations like `/[a-fb-g]/`. It would *not* warn on cases like `/[a-z\p{Word}]/` or `/[\p{Alnum}\p{Word}]/`. Whether this is a common enough mistake to warrant a warning I'm not entirely sure. I will also check the performance characteristics of these, in case overlapping ranges are a performance issue (which I doubt, but I think it is best to check).

**b) Find a way to check if a character class or range completely subsumes another.** I honestly am not sure how I would go about implementing this, as it is a much deeper check which would require a greater understanding of onigmo internals than I have so far. The idea would be to warn on `/[a-z\p{Word}]/` but *not* on e.g. `/[_-z\p{Word}]/`, since the range `_-z` contains a character not matched by `\p{Word}`. This would also catch `/[\p{Alnum}\p{Word}]/`.

**c) Rethink the overlapping character warning entirely, and (maybe) more specifically target things like `/[:x:]/`.** This would involve warning only if the first and last character of a char class are literal `:`. Similar to (a), it may turn out that repeated characters in classes are not a performance or correctness issue worth warning about at all. But this is a judgment I leave to the team.
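Option (b) can be sketched outside of Onigmo by treating each element of a character class as a list of codepoint ranges and asking whether one element is fully covered by another. This is only an illustration of the idea in plain Ruby, not a proposal for how regparse.c should do it. The helper names are hypothetical, and `WORD_ASCII` is a hand-written ASCII-only stand-in for `\p{Word}`.

```ruby
# Merge overlapping or adjacent inclusive integer ranges into a sorted list.
def normalize(ranges)
  ranges.sort_by(&:begin).each_with_object([]) do |r, merged|
    last = merged.last
    if last && r.begin <= last.end + 1
      merged[-1] = (last.begin..[last.end, r.end].max)
    else
      merged << r
    end
  end
end

# True when every codepoint in `inner` is also in `outer`.
def subsumes?(outer, inner)
  covering = normalize(outer)
  normalize(inner).all? do |r|
    covering.any? { |c| c.begin <= r.begin && r.end <= c.end }
  end
end

# True when some element of the class is fully covered by another element,
# i.e. it could be deleted without changing what the class matches.
def redundant_element?(elements)
  elements.each_with_index.any? do |inner, i|
    elements.each_with_index.any? { |outer, j| i != j && subsumes?(outer, inner) }
  end
end

# ASCII-only stand-in for \p{Word}: digits, uppercase, underscore, lowercase.
WORD_ASCII = [0x30..0x39, 0x41..0x5A, 0x5F..0x5F, 0x61..0x7A].freeze
A_TO_Z          = [0x61..0x7A].freeze   # a-z
UNDERSCORE_TO_Z = [0x5F..0x7A].freeze   # _-z, which includes '`' (U+0060)

puts redundant_element?([A_TO_Z, WORD_ASCII])           # true:  like /[a-z\p{Word}]/
puts redundant_element?([UNDERSCORE_TO_Z, WORD_ASCII])  # false: like /[_-z\p{Word}]/
```

With this rule a partial overlap such as the one between `\p{Word}` and `\p{S}` would not count as redundant, because neither set is contained in the other.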
Issue #21870 has been updated by jneen (Jeanine Adkisson).

A quick benchmark shows we are within error bars for matching performance:

```ruby
#!/usr/bin/env ruby
require 'benchmark'

NON_REPEAT = Regexp.new("[" + ("a-z" * 1) + "]")
YES_REPEAT = Regexp.new("[" + ("a-z" * 100000) + "]")

Benchmark.bm do |bm|
  bm.report('non-repeat') { 1000000.times { NON_REPEAT.match?('a') } }
  bm.report('yes-repeat') { 1000000.times { YES_REPEAT.match?('a') } }
end
```

Output:

```
; ruby /tmp/regex-test
                 user     system      total        real
non-repeat   0.105758   0.000233   0.105991 (  0.106004)
yes-repeat   0.103658   0.000223   0.103881 (  0.103881)
```
Issue #21870 has been updated by duerst (Martin Dürst).

Using two or more overlapping Unicode properties may not be very frequent, but in most cases isn't a mistake. If a user writes `/[\p{Word}\p{S}]/`, that expression should just match all word characters and all symbol characters, because that's most probably what the user wanted. The fact that there are some characters that are both word characters and symbol characters is irrelevant for that query, and should not produce a warning. There are many overlapping Unicode properties, because Unicode properties identify different aspects of characters (e.g. script, block, age, numeric properties,...).

If we want to continue to warn about `/[:lower:]/`, that's fine, but we should warn about that specific case, not overlapping properties in general.
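As a small illustration of properties describing different aspects of the same character, the sketch below compares a script property (`\p{Greek}`) with a general-category property (`\p{Lu}`). The sample characters and constant names are chosen here for illustration and do not come from the thread.

```ruby
GREEK = /\p{Greek}/   # script
UPPER = /\p{Lu}/      # general category: uppercase letter

SAMPLES = {
  "\u0391" => "GREEK CAPITAL LETTER ALPHA",
  "\u03B1" => "GREEK SMALL LETTER ALPHA",
  "A"      => "LATIN CAPITAL LETTER A",
}.freeze

SAMPLES.each do |char, name|
  puts format("%-28s Greek=%-5s Lu=%s", name, GREEK.match?(char), UPPER.match?(char))
end

# The union of the two properties matches all three samples. It is built
# from a string so it is compiled at run time. Because the two properties
# share characters such as U+0391, this is the kind of class the thread
# reports a warning for under $VERBOSE = true.
EITHER = Regexp.new('[\p{Greek}\p{Lu}]')
puts SAMPLES.keys.all? { |char| EITHER.match?(char) }   # true
```

Neither property can be dropped here without changing what the class matches, which is the situation described above.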
Issue #21870 has been updated by jneen (Jeanine Adkisson).

If there are no objections, I'll submit a patch with strategy (a) next week. It's straightforward to implement and stays as close as possible to the current behaviour while fixing the issue.
participants (7)
- duerst
- jneen (Jeanine Adkisson)
- kddnewton (Kevin Newton)
- mame (Yusuke Endoh)
- maxfelsher (Max Felsher)
- tompng (tomoya ishida)
- trinistr (Alexander Bulancov)