[ruby-core:124764] [Ruby Bug#21870] Regexp: Warnings when using slightly overlapping \p{...} classes

10 Feb 2026

Issue #21870 has been updated by jneen (Jeanine Adkisson).

Some benchmarks:

```console
$ ruby --version
ruby 4.0.1 (2026-01-13 revision e04267a14b) +PRISM [arm64-darwin25]
```

```ruby
require 'benchmark'

LENGTH = 1_000_000
REPEAT = 100
TEST_STR = 'a' * LENGTH

Benchmark.bm do |bm|
  bm.report "char class:" do
    REPEAT.times { /[\p{Word}\p{S}]*/o.match?(TEST_STR) }
  end

  bm.report "alternation:" do
    REPEAT.times { /(?:\p{Word}|\p{S})*/o.match?(TEST_STR) }
  end
end
```

Output:
```
                  user     system      total        real
char class:   0.634908   0.302112   0.937020 (  0.937089)
alternation:  0.983069   0.449849   1.432918 (  1.433005)
```

The alternation syntax is understandably slower (roughly 1.5x in this run), as it would be two nodes in the state machine rather than one unified range test. I expect the gap would widen when more Unicode properties are piled on (as they tend to be in practice), since each one results in an extra node.

Either way, `/[\p{Word}\p{S}]/` is a perfectly valid regular expression that, as far as I know, doesn't have any practical issues, so I don't think it is helpful to warn.
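
For anyone who wants to test the expectation that the gap grows with more properties, here is a rough sketch. The extra properties (`\p{P}`, `\p{Z}`), the `build_pair` helper, and the smaller sample size are assumptions for illustration and were not part of the benchmark above; timings will vary by machine, so none are shown.

```ruby
require 'benchmark'

# Hypothetical property list; the benchmark above only used the first two.
PROPS = ['\p{Word}', '\p{S}', '\p{P}', '\p{Z}'].freeze
SAMPLE = 'a' * 100_000

# Build the character-class form and the alternation form from the same list.
def build_pair(props)
  char_class  = Regexp.new("[#{props.join}]*")
  alternation = Regexp.new("(?:#{props.join('|')})*")
  [char_class, alternation]
end

# Time both forms with 2, 3, ... properties.
(2..PROPS.size).each do |n|
  char_class, alternation = build_pair(PROPS.first(n))
  t_class = Benchmark.realtime { 20.times { char_class.match?(SAMPLE) } }
  t_alt   = Benchmark.realtime { 20.times { alternation.match?(SAMPLE) } }
  printf("%d properties: char class %.4fs, alternation %.4fs\n", n, t_class, t_alt)
end
```

Building both patterns from one list keeps the comparison fair: the only difference between the two regexps is how the properties are combined.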

----------------------------------------
Bug #21870: Regexp: Warnings when using slightly overlapping \p{...} classes
https://bugs.ruby-lang.org/issues/21870#change-116371

* Author: jneen (Jeanine Adkisson)
* Status: Open
* ruby -v: 4.0.1
* Backport: 3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN, 4.0: UNKNOWN
----------------------------------------
```ruby
$VERBOSE = true
# warning: character class has duplicated range: /[\p{Word}\p{S}]/
regex = /[\p{Word}\p{S}]/
```

As far as I can tell this is a perfectly valid ~~and non-overlapping~~ set of Unicode properties, but I am still being spammed with warnings. Using `/(\p{Word}|\p{S})/` is kind of a workaround, but it is slower.

Edit: They do overlap somewhat, but I think the deeper issue is that there is no convenient way to express this without falling back to raw Unicode ranges.
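
To see where the two properties overlap, a brute-force scan over all code points is probably the simplest check. This is only a sketch: `overlapping_codepoints` is a hypothetical helper name, and the exact result depends on the Unicode version bundled with the Ruby in use, so no output is shown.

```ruby
# Hypothetical helper: list the code points in `range` that match both patterns.
def overlapping_codepoints(a, b, range = 0..0x10FFFF)
  range.select do |cp|
    # Surrogates are not valid characters in UTF-8, so skip them.
    next false if (0xD800..0xDFFF).cover?(cp)

    ch = cp.chr(Encoding::UTF_8)
    a.match?(ch) && b.match?(ch)
  end
end

overlap = overlapping_codepoints(/\p{Word}/, /\p{S}/)
puts "#{overlap.size} code points match both \\p{Word} and \\p{S}"
overlap.first(10).each { |cp| printf("U+%04X %s\n", cp, cp.chr(Encoding::UTF_8)) }
```

Testing each property in its own regexp avoids putting both in one character class, so the scan itself does not trigger the warning.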
