[ruby-core:114936] [Ruby master Feature#19908] Update to Unicode 15.1

Issue #19908 has been reported by nobu (Nobuyoshi Nakada).
----------------------------------------
Feature #19908: Update to Unicode 15.1
https://bugs.ruby-lang.org/issues/19908
* Author: nobu (Nobuyoshi Nakada)
* Status: Assigned
* Priority: Normal
* Assignee: duerst (Martin Dürst)
* Target version: 3.3
----------------------------------------
Unicode 15.1 has been released.
The current enc-unicode.rb seems to fail because of the `Indic_Conjunct_Break` properties, which have values.
I'm not sure how these properties should be handled.
Should it be `/\p{InCB_Linker}/`, or `/\p{InCB=Linker}/` as in the comments in that file?
https://github.com/nobu/ruby/tree/unicode-15.1 uses the former.
--
https://bugs.ruby-lang.org/

🤘👍
On Mon, 2 Oct 2023 at 12:55, nobu (Nobuyoshi Nakada) via ruby-core <ruby-core@ml.ruby-lang.org> wrote:
Issue #19908 has been reported by nobu (Nobuyoshi Nakada).
----------------------------------------
Feature #19908: Update to Unicode 15.1
https://bugs.ruby-lang.org/issues/19908
* Author: nobu (Nobuyoshi Nakada)
* Status: Assigned
* Priority: Normal
* Assignee: duerst (Martin Dürst)
* Target version: 3.3
----------------------------------------
Unicode 15.1 has been released.
The current enc-unicode.rb seems to fail because of the `Indic_Conjunct_Break` properties, which have values.
I'm not sure how these properties should be handled.
Should it be `/\p{InCB_Linker}/`, or `/\p{InCB=Linker}/` as in the comments in that file?
https://github.com/nobu/ruby/tree/unicode-15.1 is the former.

Issue #19908 has been updated by duerst (Martin Dürst).
There is a more serious issue than just whether to use '_' or '=' in the property name: Unicode 15.1 makes some serious changes to grapheme clusters. Our implementation (function 'node_extended_grapheme_cluster' in regparse.c) is based on Unicode 11.0, in particular https://www.unicode.org/reports/tr29/tr29-33.html#Grapheme_Cluster_Boundarie.... This is quite a bit different from the current version at https://www.unicode.org/reports/tr29/tr29-43.html#Grapheme_Cluster_Boundarie....
One major difference is that Unicode 11.0 provided a regular expression for grapheme clusters, which I simply implemented in the above function. Unicode 15.1 only says that it is possible to use a regular expression, but does not give one. From reading through https://www.unicode.org/versions/Unicode15.1.0/#Migration, that's the main issue affecting Ruby.
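For readers following the grapheme-cluster point, here is a minimal Ruby sketch of where extended grapheme clusters surface in user code. `String#grapheme_clusters` and the `\X` regexp escape are existing Ruby features; the helper names are hypothetical, and the sample strings were picked on the assumption that their segmentation is the same under the older and newer rules. Indic conjuncts are deliberately left out, since their result depends on the Unicode version a given Ruby is built against.

```ruby
# Hypothetical helper: split text into extended grapheme clusters with \X.
def clusters_via_regexp(text)
  text.scan(/\X/)
end

# Hypothetical helper: the same split via the built-in String method.
def clusters_via_method(text)
  text.grapheme_clusters
end

samples = [
  "e\u0301",             # "e" + COMBINING ACUTE ACCENT
  "\r\n",                # CR LF
  "\u{1F1EF}\u{1F1F5}",  # two regional indicators (a flag)
]

samples.each do |text|
  puts "#{text.inspect}: #{text.each_char.count} code points, " \
       "#{clusters_via_method(text).size} cluster(s)"
end
```

Each sample has two code points but should be reported as a single cluster.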

Issue #19908 has been updated by duerst (Martin Dürst). @nobu: We have `Grapheme_Cluster_Break=...`, so I think '=' may be appropriate. But `Grapheme_Cluster_Break=...` uses a long, explicit name. So shouldn't it be `Indic_Conjunct_Break=...`, not just `InCB=...`?
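To make the naming question concrete, the following sketch uses the existing long `Property=Value` form mentioned above and adds a small probe that reports whether a given property spelling is accepted by the running Ruby. The helper names are hypothetical, and which `InCB` spelling (if any) a particular build accepts is intentionally not assumed here.

```ruby
# Long, explicit "Property=Value" form that Ruby already accepts.
GCB_EXTEND = /\p{Grapheme_Cluster_Break=Extend}/

# Hypothetical helper: does the character have Grapheme_Cluster_Break=Extend?
def gcb_extend?(char)
  GCB_EXTEND.match?(char)
end

# Hypothetical helper: true if this Ruby build knows the property name.
# An unknown name makes Regexp.new raise RegexpError.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  false
end

puts gcb_extend?("\u0301")  # COMBINING ACUTE ACCENT
puts gcb_extend?("a")

# Probe the candidate spellings discussed in this thread.
%w[InCB_Linker InCB=Linker Indic_Conjunct_Break=Linker].each do |name|
  puts "#{name}: #{property_supported?(name)}"
end
```

The probe output for the three candidate spellings will vary with the Ruby version, which is the point of the helper.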

Issue #19908 has been updated by janosch-x (Janosch Müller).
Isn't [this](https://www.unicode.org/reports/tr29/tr29-43.html#Regex_Definitions) the updated regular expression?
```diff
 ccs-base := [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
 ccs-extend := [\p{M}\p{Join_Control}]
 extended_base := ccs-base | hangul-syllable
-crlf := CR LF
+crlf := CR LF | CR | LF
 legacy-core := hangul-syllable | ri-sequence | xpicto-sequence
 legacy-postcore := [Extend ZWJ]
 core := hangul-syllable | ri-sequence | xpicto-sequence
+| conjunctCluster
 | [^Control CR LF]
 postcore := [Extend ZWJ SpacingMark]
 precore := Prepend
 hangul-syllable := L* (V+ | LV V* | LVT) T* | L+ | T+
 xpicto-sequence := \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
+conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+
```
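The shape of the `conjunctCluster` rule quoted above can be sketched in Ruby without relying on any `InCB` property support. This is only an illustration: the three character sets below are small hand-picked Devanagari stand-ins (an assumption, not the real property tables), and the constant names are hypothetical.

```ruby
# Hand-picked stand-ins for the InCB property values (assumption: a real
# implementation would use the generated Unicode property tables instead).
CONSONANT = "[\u0915-\u0939]"  # DEVANAGARI LETTER KA..HA
LINKER    = "\u094D"           # DEVANAGARI SIGN VIRAMA
EXTEND    = "\u093C\u200D"     # DEVANAGARI SIGN NUKTA, ZERO WIDTH JOINER

# consonant ( [extend linker]* linker [extend linker]* consonant )+
CONJUNCT_CLUSTER = /
  #{CONSONANT}
  (?: [#{EXTEND}#{LINKER}]* #{LINKER} [#{EXTEND}#{LINKER}]* #{CONSONANT} )+
/x

ksha = "\u0915\u094D\u0937"  # KA + VIRAMA + SSA
puts ksha[CONJUNCT_CLUSTER] == ksha            # whole string is one conjunct
puts "\u0915\u0937".match?(CONJUNCT_CLUSTER)   # no linker, so no conjunct
```

At least one linker must sit between two consonants for the rule to apply, which is why the second string does not match.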

Issue #19908 has been updated by duerst (Martin Dürst). @janosch-x You are correct, thanks! I noticed it a few days ago, but hadn't yet gotten around to writing about it here. You beat me to it!

Issue #19908 has been updated by hsbt (Hiroshi SHIBATA). Unicode 16.0 has been released. https://www.unicode.org/versions/Unicode16.0.0/ Should we move to this instead of 15.1?

Issue #19908 has been updated by duerst (Martin Dürst). hsbt (Hiroshi SHIBATA) wrote in #note-8:
Unicode 16.0 has been released.
Should we move to this instead of 15.1?
I think it's more prudent to do 15.1 first, then 16.0. I hope to be able to work on this soon. I created a separate issue for 16.0.

Issue #19908 has been updated by hsbt (Hiroshi SHIBATA).
I think it's more prudent to do 15.1 first, then 16.0.
Agreed, thanks!

Issue #19908 has been updated by ima1zumi (Mari Imaizumi). @duerst I'm interested in working on this issue. Are you planning to start it? If not, I'd like to try.

Issue #19908 has been updated by mame (Yusuke Endoh). @duerst What do you think?

Issue #19908 has been updated by ima1zumi (Mari Imaizumi). I have created a PR to update it. https://github.com/ruby/ruby/pull/12798

Issue #19908 has been updated by naruse (Yui NARUSE). The change looks good to me. Since you have already contributed to reline and shown your engineering skill, and now you also want to contribute to ruby/ruby, I think you should have commit rights for ruby/ruby and commit this change yourself. @matz What do you think?

Issue #19908 has been updated by ima1zumi (Mari Imaizumi). @naruse Thank you so much for your review and for recommending me. I'd be happy to take on commit rights and commit this change myself.

Issue #19908 has been updated by mame (Yusuke Endoh). I'd also like to introduce ima1zumi-san as a candidate for committer. She has been actively working on irb and reline, has deep knowledge and a strong interest in character encoding, and is highly recognized, as she was endorsed by @naruse, the maintainer of Ruby's encoding system. With her contributions extending towards Ruby itself, I support her nomination.

Issue #19908 has been updated by kosaki (Motohiro KOSAKI). +1

Issue #19908 has been updated by k0kubun (Takashi Kokubun). +1

Issue #19908 has been updated by matsuda (Akira Matsuda). +1

Issue #19908 has been updated by mrkn (Kenta Murata). +1

Issue #19908 has been updated by alanwu (Alan Wu). +1

Issue #19908 has been updated by matz (Yukihiro Matsumoto). #note-16 Approved. Matz.

Issue #19908 has been updated by hsbt (Hiroshi SHIBATA). @ima1zumi Can you provide the required information to me? See https://github.com/ruby/ruby/wiki/Committer-How-To#how-to-register-you-as-a-... for details.

Issue #19908 has been updated by ima1zumi (Mari Imaizumi). @hsbt I've sent an email to cvs-admin and opened https://github.com/ruby/git.ruby-lang.org/pull/91

Issue #19908 has been updated by hsbt (Hiroshi SHIBATA). Thanks, I've finished preparing your account.
participants (14)
- alanwu (Alan Wu)
- duerst
- hsbt (Hiroshi SHIBATA)
- ima1zumi (Mari Imaizumi)
- janosch-x
- k0kubun (Takashi Kokubun)
- kosaki (Motohiro KOSAKI)
- mame (Yusuke Endoh)
- matsuda (Akira Matsuda)
- matz (Yukihiro Matsumoto)
- mrkn (Kenta Murata)
- naruse (Yui NARUSE)
- nobu (Nobuyoshi Nakada)
- Игорь Пятчиц