[ruby-core:116024] [Ruby master Bug#20148] Sorting not working as expected on Urdu words.

5 Jan 2024

      Issue #20148 has been updated by duerst (Martin Dürst).

Status changed from Open to Rejected

The characters involved (shown right-to-left in most environments) are:
U+0627 ا ARABIC LETTER ALEF
U+0628 ب ARABIC LETTER BEH
U+062A ت ARABIC LETTER TEH
U+0679 ٹ ARABIC LETTER TTEH
U+067E پ ARABIC LETTER PEH
The first three characters are widely used in most if not all languages written with Arabic. The last two are more specific; in the code charts (see https://www.unicode.org/charts/PDF/U0600.pdf), TTEH has an annotation of 'Urdu', and PEH has an annotation of 'Persian, Urdu,...'. In the Urdu alphabet (see https://en.wikipedia.org/wiki/Urdu_alphabet), these are the first five letters, where PEH comes directly after BEH, and TTEH comes directly after TEH.

The Ruby `sort` method sorts these letters/strings in Unicode codepoint order, the same way it does for all characters/strings. It does not attempt a language-aware order because sorting text is language-dependent. As an example, Swedish sorts 'ä' and 'ö' after 'z', whereas German sorts them with 'a' and 'o', respectively. A single `sort` cannot get it correct for both languages at the same time, and supporting each language's order would require a lot of data. I'm not sure how Arabic-speaking people would sort PEH or TTEH, if they recognize these letters at all.
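As a quick check of the codepoint ordering described above, here is a minimal sketch. The helper name `codepoint_labels` is made up for illustration, and the letters are written as `\u` escapes only to sidestep bidirectional rendering of the source.

```ruby
# Hypothetical helper: label the first character of each string with its codepoint.
def codepoint_labels(strings)
  strings.map { |s| format('U+%04X', s.ord) }
end

# Same input as in the report: ALEF, PEH, BEH, TEH, TTEH
letters = ["\u0627", "\u067E", "\u0628", "\u062A", "\u0679"]

# String#<=> compares bytes, which for UTF-8 strings matches codepoint order,
# so PEH (U+067E) ends up last rather than third.
puts codepoint_labels(letters.sort).join(' ')
```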

This is also similar to expecting `['a', 'A', 'b', 'B'].sort` to produce `['A', 'a', 'B', 'b']`, when it actually produces `["A", "B", "a", "b"]`.
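For the Latin case, the expected order can be obtained by supplying an explicit key. This is only a sketch; `case_insensitive_sort` is a name invented here, not an existing Ruby method.

```ruby
# Hypothetical helper: sort case-insensitively, with the original string as a
# tie-breaker so that 'A' is placed before 'a'.
def case_insensitive_sort(strings)
  strings.sort_by { |s| [s.downcase, s] }
end

p ['a', 'A', 'b', 'B'].sort              # plain codepoint order
p case_insensitive_sort(['a', 'A', 'b', 'B'])
```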

So I'm sorry to have to reject this because it works according to the specification. A feature request to provide language-specific string comparisons (e.g. `string1.<=>(string2, 'ur')`), so that this can be used in a block with `sort`, may be appropriate, but it will take quite some time to implement this.

Alternatively, I suggest you define a hash for the Urdu alphabet order, e.g.
```
{"ا" => 1,
 "ب" => 2,
 "پ" => 3,
 "ت" => 4,
 "ٹ" => 5
}
```
(the code above will look strange because of the effects of the Unicode Bidirectional algorithm, but it should be correct), and use that with the `sort_by` method to sort Urdu strings.
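A minimal sketch of how such a hash might be used with `sort_by` follows. The names `URDU_ORDER` and `urdu_sort_key` are made up for illustration, only the five letters discussed here are covered, and placing unknown characters after the known ones in codepoint order is an assumption, not something Urdu collation prescribes. The letters are written as `\u` escapes to avoid the bidirectional display issue.

```ruby
# Rank of each letter in Urdu alphabet order (first five letters only).
URDU_ORDER = {
  "\u0627" => 1, # ALEF
  "\u0628" => 2, # BEH
  "\u067E" => 3, # PEH
  "\u062A" => 4, # TEH
  "\u0679" => 5  # TTEH
}.freeze

# Hypothetical helper: turn a string into an array of ranks. Arrays compare
# element by element, so whole words are ordered letter by letter.
# Assumption: characters missing from the table sort after all known letters.
def urdu_sort_key(str)
  str.each_char.map { |ch| URDU_ORDER.fetch(ch, URDU_ORDER.size + 1 + ch.ord) }
end

letters = ["\u0627", "\u067E", "\u0628", "\u062A", "\u0679"]
p letters.sort_by { |s| urdu_sort_key(s) }
```

A complete solution would need the full alphabet and a decision on how to treat diacritics, digits, and non-Urdu characters.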

----------------------------------------
Bug #20148: Sorting not working as expected on Urdu words. 
https://bugs.ruby-lang.org/issues/20148#change-106018

* Author: zohaibnadeem13@gmail.com (Zohaib Nadeem)
* Status: Rejected
* Priority: Normal
* ruby -v: 3.1.4
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
I was trying to sort an array of Urdu characters and found out an ambiguity in the result. Here is the script that I am using.
['ا', 'پ', 'ب', 'ت', 'ٹ'].sort

Actual Result:
["ا", "ب", "ت", "ٹ", "پ"]

Expected Result:
["ا", "ب", 'پ', "ت", "ٹ"]

-- 
https://bugs.ruby-lang.org/