[ruby-core:116013] [Ruby master Bug#20148] Sorting not working as expected on Urdu words.

Issue #20148 has been reported by zohaibnadeem13@gmail.com (Zohaib Nadeem).

----------------------------------------
Bug #20148: Sorting not working as expected on Urdu words.
https://bugs.ruby-lang.org/issues/20148

* Author: zohaibnadeem13@gmail.com (Zohaib Nadeem)
* Status: Open
* Priority: Normal
* ruby -v: 3.1.4
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
I was trying to sort an array of Urdu characters and found an ambiguity in the result. Here is the script that I am using:

['ا', 'پ', 'ب', 'ت', 'ٹ'].sort

Actual Result: ["ا", "ب", "ت", "ٹ", "پ"]
Expected Result: ["ا", "ب", 'پ', "ت", "ٹ"]

--
https://bugs.ruby-lang.org/

Using Python 3.10.12 on Ubuntu 22.04.3 LTS:

l = ['ا', 'پ', 'ب', 'ت', 'ٹ']
l.sort()
l
['ا', 'ب', 'ت', 'ٹ', 'پ']

Issue #20148 has been updated by duerst (Martin Dürst).

Status changed from Open to Rejected

The characters involved (shown right-to-left in most environments) are:

U+0627 ا ARABIC LETTER ALEF
U+0628 ب ARABIC LETTER BEH
U+062A ت ARABIC LETTER TEH
U+0679 ٹ ARABIC LETTER TTEH
U+067E پ ARABIC LETTER PEH

The first three characters are widely used in most if not all languages written with the Arabic script. The last two are more specific; in the code charts (see https://www.unicode.org/charts/PDF/U0600.pdf), TTEH has an annotation of 'Urdu', and PEH has an annotation of 'Persian, Urdu,...'.

In the Urdu alphabet (see https://en.wikipedia.org/wiki/Urdu_alphabet), these are the first five letters, where PEH comes directly after BEH, and TTEH comes directly after TEH.

The Ruby `sort` method sorts these letters/strings in Unicode codepoint order, the same way it does for all characters/strings. It cannot do better by default because sorting text is language-dependent. As an example, Swedish sorts 'ä' and 'ö' after 'z', whereas German sorts them with 'a' and 'o', respectively. It's impossible for `sort` to get it correct for both languages at the same time, and supporting every language would require a lot of data. I'm not sure how Arabic-speaking people would sort PEH or TTEH, if they recognize these letters at all.

This is also similar to expecting `['a', 'A', 'b', 'B'].sort` to produce `['A', 'a', 'B', 'b']`, when it actually produces `["A", "B", "a", "b"]`.

So I'm sorry to have to reject this because it works according to the specification.

A feature request to provide language-specific string comparisons (e.g. `string1.<=>(string2, 'ur')`), so that this can be used in a block with `sort`, may be appropriate, but it will take quite some time to implement this.

Alternatively, I suggest you define a hash for the Urdu alphabet order, e.g.
```
{"ا" => 1, "ب" => 2, "پ" => 3, "ت" => 4, "ٹ" => 5 }
```
(the code above will look strange because of the effects of the Unicode Bidirectional algorithm, but it should be correct), and use that with the `sort_by` method to sort Urdu strings.

https://bugs.ruby-lang.org/issues/20148#change-106018
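The `sort_by` suggestion above could be sketched roughly as follows. This is only an illustration: the names `URDU_ORDER` and `urdu_sort_key` are hypothetical, the table covers just the five letters discussed in this thread (a real one would need the full Urdu alphabet), and letters are written as `\u` escapes to avoid bidirectional display problems. Ranks start at 0 here rather than 1, which does not affect the ordering.

```ruby
# Rank table for the first five letters of the Urdu alphabet only
# (ALEF, BEH, PEH, TEH, TTEH). Hypothetical and incomplete by design.
URDU_ORDER = ["\u0627", "\u0628", "\u067E", "\u062A", "\u0679"]
             .each_with_index.to_h.freeze

# Turn a string into an array of ranks, one per character.
# Characters missing from the table sort after all known letters,
# in codepoint order (an assumption made for this sketch).
def urdu_sort_key(word)
  word.each_char.map { |ch| URDU_ORDER.fetch(ch, URDU_ORDER.size + ch.ord) }
end

letters = ["\u0627", "\u067E", "\u0628", "\u062A", "\u0679"]
sorted_letters = letters.sort_by { |w| urdu_sort_key(w) }
# sorted_letters is ALEF, BEH, PEH, TEH, TTEH

# Multi-letter strings compare rank by rank, because Array#<=> is elementwise.
words = ["\u062A\u0627", "\u067E\u0627"] # TEH+ALEF, PEH+ALEF
sorted_words = words.sort_by { |w| urdu_sort_key(w) }
# sorted_words puts PEH+ALEF before TEH+ALEF
```

Mapping each string to an array of ranks lets the same table handle whole words, not just single letters. It does not handle diacritics, digits, or other collation rules, which is where a collation library becomes the better choice.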

Issue #20148 has been updated by naruse (Yui NARUSE).

As Martin says, Ruby's `Array<String>#sort` simply sorts by Unicode scalar value, which is not what you expect.

For use cases that depend on knowledge of the language, you need "collation". Relational databases sometimes implement collation.
In Ruby, for example, you can use twitter-cldr-rb.
https://github.com/twitter/twitter-cldr-rb?tab=readme-ov-file#sorting-collat...

```
irb(main):001> require 'twitter_cldr'
=> true
irb(main):002> ['ا', 'پ', 'ب', 'ت', 'ٹ'].sort
=> ["ا", "ب", "ت", "ٹ", "پ"]
irb(main):003> ['ا', 'پ', 'ب', 'ت', 'ٹ'].localize(:ur).sort.to_a
=> ["ا", "ب", "پ", "ت", "ٹ"]
```

https://bugs.ruby-lang.org/issues/20148#change-106026
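The scalar-value ordering described in both replies can be checked directly in plain Ruby. The following is a small sketch using only core methods; the variable names are illustrative, and the letters are written as `\u` escapes so the code is not affected by bidirectional rendering.

```ruby
# The five letters from the report, in the order the reporter wrote them:
# ALEF, PEH, BEH, TEH, TTEH.
letters = ["\u0627", "\u067E", "\u0628", "\u062A", "\u0679"]

# Show each letter's codepoint in the usual U+XXXX notation.
codepoints = letters.map { |ch| format("U+%04X", ch.ord) }

# For UTF-8 strings, the default sort agrees with sorting by codepoint,
# so PEH (U+067E) ends up last.
by_default   = letters.sort
by_codepoint = letters.sort_by(&:ord)

# The Latin analogy from the thread: uppercase sorts before lowercase by
# codepoint. A case-folded key with the original string as a tie-breaker
# gives the interleaved order instead.
latin_default = %w[a A b B].sort
latin_folded  = %w[a A b B].sort_by { |s| [s.downcase, s] }
```

Here `by_default` and `by_codepoint` are the same array, which is the behaviour the report observed. The `latin_folded` line shows that any ordering other than codepoint order has to be supplied by the caller as a sort key.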
participants (4)
- duerst
- Hanlyu Sarang
- naruse (Yui NARUSE)
- zohaibnadeem13@gmail.com (Zohaib Nadeem)