[ruby-core:112223] [Ruby master Bug#19417] Regexp \p{Word} and [[:word:]] do not match Unicode Other_Number character

Issue #19417 has been reported by ObjectBoxPC (Philip Chung).

----------------------------------------
Bug #19417: Regexp \p{Word} and [[:word:]] do not match Unicode Other_Number character
https://bugs.ruby-lang.org/issues/19417

* Author: ObjectBoxPC (Philip Chung)
* Status: Open
* Priority: Normal
* ruby -v: 3.2.0
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
According to the [documentation for Regexp](https://ruby-doc.org/3.2.0/Regexp.html), `\p{Word}` and `[[:word:]]` both match a character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation. However, neither matches U+00B2, which is in the Other_Number category (which is a subcategory of Number).

``` ruby
puts "Ruby version: %s" % RUBY_VERSION
puts "\p{Word} matches? %s" % /\p{Word}/u.match?("\u00B2")
puts "[[:word:]] matches? %s" % /[[:word:]]/u.match?("\u00B2")
puts "Is a Number character? %s" % /\p{Number}/u.match?("\u00B2")
puts "Is an Other_Number character? %s" % /\p{Other_Number}/u.match?("\u00B2")
```

Expected output:

```
Ruby version: 3.2.0
p{Word} matches? true
[[:word:]] matches? true
Is a Number character? true
Is an Other_Number character? true
```

Actual output:

```
Ruby version: 3.2.0
p{Word} matches? false
[[:word:]] matches? false
Is a Number character? true
Is an Other_Number character? true
```

I notice that the [upstream Onigmo library doc](https://github.com/k-takata/Onigmo/blob/master/doc/RE) defines the `[[:word:]]` class as "Letter | Mark | Decimal_Number | Connector_Punctuation", meaning that it only matches certain number characters (which would exclude U+00B2). I am not sure how `\p{Word}` is defined though. But perhaps the documentation needs to be changed?

-- 
https://bugs.ruby-lang.org/
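To make the report easier to reproduce, here is a minimal sketch that checks one sample character from each Number subcategory against `\p{Word}`. The helper name `word_char?` and the choice of sample characters are illustrative assumptions, not part of Ruby or the report, and results depend on the Unicode tables bundled with the Ruby in use.

```ruby
# Hypothetical helper (not part of Ruby): true if the single character
# matches the built-in \p{Word} class.
def word_char?(char)
  /\A\p{Word}\z/u.match?(char)
end

# One illustrative sample per Number subcategory, paired with the regexp
# for that subcategory.
NUMBER_SAMPLES = {
  "Decimal_Number" => ["5",      /\p{Decimal_Number}/u], # U+0035 DIGIT FIVE
  "Letter_Number"  => ["\u2167", /\p{Letter_Number}/u],  # U+2167 ROMAN NUMERAL EIGHT
  "Other_Number"   => ["\u00B2", /\p{Other_Number}/u],   # U+00B2 SUPERSCRIPT TWO
}.freeze

NUMBER_SAMPLES.each do |category, (char, category_regexp)|
  puts format("%-15s U+%04X  in category: %-5s  \\p{Word}: %s",
              category, char.ord, category_regexp.match?(char), word_char?(char))
end
```

On the reporter's Ruby 3.2.0, the Other_Number row is the one expected to show `false` for `\p{Word}`, which matches the "Actual output" above.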

Issue #19417 has been updated by jeremyevans0 (Jeremy Evans).

Assuming this is a documentation bug, I submitted a pull request to fix it: https://github.com/ruby/ruby/pull/7287

----------------------------------------
https://bugs.ruby-lang.org/issues/19417#change-101785

Issue #19417 has been updated by janosch-x (Janosch Müller).

Regarding the documentation, `letter` in the upstream doc is also incorrect, so the downstream doc actually has two errors. As implemented [here](https://github.com/ruby/ruby/blob/9821f6d0e5957a680bb4ce39708ebc86e23d85d0/t...), `word` actually matches anything with the `alphabetic` property (effectively a superset of the `letter` category comprising roughly 1600 more characters).

Demonstration:

```ruby
%w[
  word letter mark decimal_number connector_punctuation alpha
].select { |p| eval("/\\p{#{p}}/ =~ ?Ⅷ") } # roman eight
# => ["word", "alpha"]
```

A better wording might be:

```
A character with the <i>_Alphabetic_</i> unicode property or one of the following Unicode general categories: <i>Mark</i>, <i>Decimal\_Number</i>, <i>Connector\_Punctuation</i>
```

Regarding the behavior, I think it could be changed to match `number` instead of `decimal_number`. Some scripts (e.g. Malayalam) have characters for numbers higher than 9, and these would disrupt matching at the moment (e.g. the Malayalam 9 is matched but the 10 is not). This change would also make `word` match fractions and superscripts such as the one mentioned by OP ("²"). To me, this would seem like the less unexpected behavior.

----------------------------------------
https://bugs.ruby-lang.org/issues/19417#change-101838
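The Malayalam case mentioned above can be checked directly. The sketch below compares the built-in `\p{Word}` with a hand-written character class that adds the whole Number category. The constant names and the combined class `WORD_OR_NUMBER` are assumptions made for illustration: it is a possible workaround in user code, not the change proposed for Ruby itself.

```ruby
# Hypothetical workaround (not an API of Ruby or Onigmo): a character class
# that adds every Number subcategory to the built-in word class.
WORD_OR_NUMBER = /[\p{Word}\p{Number}]/u

MALAYALAM_NINE  = "\u0D6F" # MALAYALAM DIGIT NINE (Decimal_Number)
MALAYALAM_TEN   = "\u0D70" # MALAYALAM NUMBER TEN (Other_Number)
SUPERSCRIPT_TWO = "\u00B2" # the character from the original report

[MALAYALAM_NINE, MALAYALAM_TEN, SUPERSCRIPT_TWO].each do |char|
  builtin  = /\p{Word}/u.match?(char)
  extended = WORD_OR_NUMBER.match?(char)
  puts format("U+%04X  \\p{Word}: %-5s  [\\p{Word}\\p{Number}]: %s",
              char.ord, builtin, extended)
end
```

If the behavior is as described in this thread, only the first character matches `\p{Word}`, while all three match the combined class.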

Issue #19417 has been updated by naruse (Yui NARUSE).

The document is wrong. The definition of `word` is defined in Unicode® Technical Standard #18 UNICODE REGULAR EXPRESSIONS. https://unicode.org/reports/tr18/#word

----------------------------------------
https://bugs.ruby-lang.org/issues/19417#change-102252
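As a rough cross-check of that definition, the sketch below composes a character class from the Alphabetic property plus the Mark, Decimal_Number and Connector_Punctuation categories, following the wording discussed in this thread, and compares it with `\p{Word}` on a few sample characters. The names `TR18_WORD_SKETCH`, `WORD_SAMPLES` and `compare_word_classes` are hypothetical, the sample set is arbitrary, and anything else the standard lists (such as Join_Control) is deliberately left out, so agreement on these samples does not show the two classes are identical.

```ruby
# Assumed composition, following the thread's reading of UTS #18:
# Alphabetic property + Mark + Decimal_Number + Connector_Punctuation.
# Join_Control is intentionally omitted from this sketch.
TR18_WORD_SKETCH =
  /[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Connector_Punctuation}]/u

# Arbitrary samples: letter, digit, underscore, combining acute accent,
# roman numeral eight, superscript two, space.
WORD_SAMPLES = ["a", "5", "_", "\u0301", "\u2167", "\u00B2", " "].freeze

# Returns [codepoint, matches \p{Word}?, matches the sketch?] per character.
def compare_word_classes(chars)
  chars.map do |char|
    [char.ord, /\p{Word}/u.match?(char), TR18_WORD_SKETCH.match?(char)]
  end
end

compare_word_classes(WORD_SAMPLES).each do |codepoint, builtin, sketch|
  puts format("U+%04X  \\p{Word}: %-5s  sketch: %s", codepoint, builtin, sketch)
end
```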

Issue #19417 has been updated by jeremyevans0 (Jeremy Evans).

naruse (Yui NARUSE) wrote in #note-3:

> The document is wrong. The definition of `word` is defined in Unicode® Technical Standard #18 UNICODE REGULAR EXPRESSIONS. https://unicode.org/reports/tr18/#word

I've updated my pull request to match the description in the standard linked by @naruse. @janosch-x or @naruse, could you review?

----------------------------------------
https://bugs.ruby-lang.org/issues/19417#change-102528
participants (5)
- janosch-x
- jeremyevans0 (Jeremy Evans)
- jeremyevans0 (Jeremy Evans)
- naruse (Yui NARUSE)
- ObjectBoxPC (Philip Chung)