[ruby-core:111737] [Ruby master Feature#19317] Unicode ICU Full case mapping

8 Jan 2023

      Issue #19317 has been updated by noraj (Alexandre ZANNI).

zverok (Victor Shepelev) wrote in #note-5:
...
Also, I believe that having a formal language code in the API (instead of a small informal list of writing systems supported) creates a false expectation that every language specificity might be properly accounted for, otherwise "looks like a bug", no?..
```ruby
"STRASSE".downcase(lang: :de_DE) 
# => "strasse"
# But in "properly supported" German, it probably should be 
# => "straße"
```
No.

1. The correct lowercasing of STRASSE is strasse, even in German. It only works the other way around: straße should be uppercased to STRASSE, and Ruby already handles that correctly. See the [Unicode Spec](https://www.unicode.org/versions/Unicode15.0.0/ch05.pdf#G21180). This is not a reversible operation.
2. The case mapping of ß is language-invariant: it should behave the same way regardless of the language.
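
To make both points concrete, here is a small sketch using only the existing `String#upcase` and `String#downcase`. The variable names are just for illustration.

```ruby
lower = "straße"
upper = lower.upcase         # full case mapping expands ß to SS
round_trip = upper.downcase  # nothing records that SS came from ß

puts upper       # STRASSE
puts round_trip  # strasse

# The ß mapping does not depend on the language option:
puts "ß".upcase(:turkic)  # SS
```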

---

Is Ruby already using the [UCD](https://unicode.org/ucd/) (Unicode Character Database, see [UAX #44](https://unicode.org/reports/tr44/))?

----------------------------------------
Feature #19317: Unicode ICU Full case mapping
https://bugs.ruby-lang.org/issues/19317#change-101142

* Author: noraj (Alexandre ZANNI)
* Status: Assigned
* Priority: Normal
* Assignee: duerst (Martin Dürst)
----------------------------------------
As announced in [Case Mapping](https://docs.ruby-lang.org/en/master/case_mapping_rdoc.html#label-Default+Ca...), Ruby support for Unicode case mapping is not complete yet.

Unicode support in Ruby is pretty awesome: it works by default nearly everywhere, and things are implemented the right way and work as expected by the UTRs.

But some features are still missing.

To reach [ICU Full Case Mapping support](https://unicode-org.github.io/icu/userguide/transforms/casemappings.html#ful...), a few points need to be enhanced.

### context-sensitive case mapping

* [ ] cf. [Table 3-17 (Context Specification for Casing) of the Unicode standard](https://www.unicode.org/versions/Unicode15.0.0/ch03.pdf) and [ucd/SpecialCasing.txt](https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt).

```ruby
"ΣΣ".downcase # returns σσ instead of σς
```

Output examples in ECMAScript:

```
Σ    ➡️ σ
Σa   ➡️ σa
aΣ   ➡️ aς
aΣa  ➡️ aσa
ΣA   ➡️ σa
aΣ a ➡️ aς a
Σ1   ➡️ σ1
aΣ1  ➡️ aς1
ΣΣ   ➡️ σς
```
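
The table above can be reproduced in Ruby with a rough sketch. `downcase_with_final_sigma` is a hypothetical helper, not part of Ruby, and it simplifies the Final_Sigma condition: it treats any letter (`\p{L}`) as a cased letter and does not skip case-ignorable characters, so it is only an approximation of the rule in SpecialCasing.txt.

```ruby
# Hypothetical helper, not part of Ruby's String API.
# Simplification: any letter (\p{L}) counts as "cased", and case-ignorable
# characters are not skipped as the full Final_Sigma condition requires.
def downcase_with_final_sigma(str)
  # Turn a capital sigma into final sigma when it follows a letter and is
  # not followed by one, then lowercase everything else as usual.
  str.gsub(/(?<=\p{L})Σ(?!\p{L})/, "ς").downcase
end

["Σ", "Σa", "aΣ", "aΣa", "ΣA", "aΣ a", "Σ1", "aΣ1", "ΣΣ"].each do |s|
  puts "#{s} -> #{downcase_with_final_sigma(s)}"
end
```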

### language-sensitive case mapping

* [ ] Lithuanian rules
* [x] Turkish and Azeri

```ruby
"I".downcase # => "i"
"I".downcase(:turkic) # => "ı"
# "i\u0307" is a lowercase i followed by U+0307 COMBINING DOT ABOVE
"i\u0307".upcase # => "İ"
"i\u0307".upcase(:lithuanian) # => "İ" instead of "I"
```

* [ ] using some standard locale / language codes

Also, it's true that for now there are only a few language-sensitive rules (for Lithuanian, Turkish and Azeri), but why:

- add a `:turkic` symbol and not an `:azeri` one?
- use an arbitrary full English language name (why `turkic` and not `turkish`?) rather than [ICU locale IDs](https://unicode-org.github.io/icu/userguide/locale/)?
  - Language code: ISO 639 standard
  - Script code: Unicode ISO 15924 Registry
  - Country code: ISO 3166 standard

So I would rather see something like this:

```ruby
# Proposed API, not implemented in Ruby today
"placeholder".upcase(locale: :tr_TR)
"placeholder".upcase(lang: :tr)
```
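
As a rough illustration of how locale codes could sit on top of the symbols Ruby accepts today, here is a minimal sketch. `LANG_TO_CASE_OPTION` and `upcase_for` are hypothetical names invented for this example, not an existing or planned API.

```ruby
# Hypothetical mapping from ISO 639 language codes to the option symbols
# that String#upcase already accepts. Not an existing Ruby API.
LANG_TO_CASE_OPTION = {
  tr: :turkic,     # Turkish
  az: :turkic,     # Azeri
  lt: :lithuanian  # accepted today, but Lithuanian rules are not applied yet
}.freeze

def upcase_for(str, locale:)
  # Keep only the language subtag: :tr_TR, "tr-TR" and :tr all give :tr.
  lang = locale.to_s.split(/[_-]/).first.downcase.to_sym
  option = LANG_TO_CASE_OPTION[lang]
  option ? str.upcase(option) : str.upcase
end

puts upcase_for("i", locale: :tr_TR)  # dotted capital I (U+0130)
puts upcase_for("i", locale: :en_US)  # plain I
```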

-- 
https://bugs.ruby-lang.org/