- ruby-core - ml.ruby-lang.org

[ruby-core:111723] [Ruby master Bug#18518] NoMemoryError + [FATAL] failed to allocate memory for twice 1 << large
by Eregon (Benoit Daloze) 07 Jan '23


Issue #18518 has been updated by Eregon (Benoit Daloze).

CRuby can actually raise NoMemoryError and RangeError, but also ArgumentError (which seems to be a bug):

```
$ ruby -e '1 << (2**67-1)'
-e:1:in `<<': integer overflow: 4611686018427387905 * 4 > 18446744073709551615 (ArgumentError)
```

----------------------------------------
Bug #18518: NoMemoryError + [FATAL] failed to allocate memory for twice 1 << large
https://bugs.ruby-lang.org/issues/18518#change-101123

* Author: Eregon (Benoit Daloze)
* Status: Rejected
* Priority: Normal
* ruby -v: ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
Repro:

```ruby
exp = 2**40 # also fails with bignum e.g. 2**64

def exc
  begin
    yield
  rescue NoMemoryError => e
    p :NoMemoryError
  end
end

p exp
exc { (1 << exp) }
exc { (-1 << exp) }
exc { (bignum_value << exp) }
exc { (-bignum_value << exp) }
```

Output:

```
$ ruby -v mri_oom.rb
ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [x86_64-linux]
mri_oom.rb:7: warning: assigned but unused variable - e
1099511627776
:NoMemoryError
[FATAL] failed to allocate memory
```

3.1.0 seems fine:

```
$ ruby -v mri_oom.rb
ruby 3.1.0p0 (2021-12-25 revision fb4df44d16) [x86_64-linux]
mri_oom.rb:7: warning: assigned but unused variable - e
1099511627776
:NoMemoryError
:NoMemoryError
:NoMemoryError
:NoMemoryError
```

-- 
https://bugs.ruby-lang.org/


[ruby-core:111722] [Ruby master Bug#18518] NoMemoryError + [FATAL] failed to allocate memory for twice 1 << large
by Eregon (Benoit Daloze) 07 Jan '23


Issue #18518 has been updated by Eregon (Benoit Daloze).

nobu (Nobuyoshi Nakada) wrote in #note-8:
> It is a test for the development branch and unrelated to users using released versions.

It might not be clear given the original bug report, but the behavior of NoMemoryError vs RangeError on CRuby for `1 << (2**40)` is independent of dev/released version.

So in this issue I suggest https://bugs.ruby-lang.org/issues/18518#note-4:
> I think CRuby should check if RHS is bigger than 2**31 and if so raise an exception (e.g. RangeError) immediately instead of taking a lot of time and running into OOM

Current behavior on `ruby 3.2.0 (2022-12-25 revision a528908271) [x86_64-linux]`:

```
$ ruby -e '1 << 2**40'
-e: failed to allocate memory (NoMemoryError)
$ ruby -e '1 << 2**64'
-e: failed to allocate memory (NoMemoryError)
$ ruby -e '1 << 2**128'
-e:1:in `<<': shift width too big (RangeError)
$ ruby -e '1 << 2**66'
-e: failed to allocate memory (NoMemoryError)
$ ruby -e '1 << 2**67'
-e:1:in `<<': shift width too big (RangeError)
```

So the limit for RangeError on CRuby seems to be between 2^66 and 2^67, at least locally on my computer. That makes sense: the shift width is in bits, and dividing by 8 to get bytes subtracts 3 from the exponent, so 67-3 = 64, and CRuby can't allocate something whose size doesn't fit in size_t/64-bit.

Interestingly `1 << 2**32` does work on CRuby:

```
$ ruby -e 'p (1 << 2**32).bit_length'
4294967297 # works locally for me, which I did not expect
$ ruby -robjspace -e 'p ObjectSpace.memsize_of(1 << 2**32)'
536870956
```

So OK, let's keep this rejected and accept that this limit is implementation-defined (2**67 on 64-bit CRuby, 2**35 I guess on 32-bit CRuby, 2**31 on JRuby+TruffleRuby), and I'll adapt the spec.
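The exponent arithmetic in this message can be checked without allocating anything. The following is only a sketch, assuming a 64-bit `size_t`; `SIZE_MAX_64`, `bytes_for_shift` and `representable?` are hypothetical names introduced for illustration, and the byte count ignores the object header and digit rounding that CRuby adds.

```ruby
# Largest value of a 64-bit size_t (assumption: 64-bit platform).
SIZE_MAX_64 = 2**64 - 1

# Minimum number of bytes needed to hold `1 << width`, which has width + 1 bits.
# Hypothetical helper: real Integer storage adds a header and rounds up to whole digits.
def bytes_for_shift(width)
  (width + 1 + 7) / 8
end

# Whether that raw byte count is even representable as a size_t.
def representable?(width)
  bytes_for_shift(width) <= SIZE_MAX_64
end

p bytes_for_shift(2**32) # => 536870913
p representable?(2**66)  # => true
p representable?(2**67)  # => false
```

The first result is close to the 536870956 bytes reported by `ObjectSpace.memsize_of` above, and the 2**66 / 2**67 boundary lines up with the observed switch from NoMemoryError to RangeError.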


[ruby-core:111715] [Ruby master Bug#19082] Recent gRPC gem fails to build from the source in already released versions
by stanhu (Stan Hu) 07 Jan '23


Issue #19082 has been updated by stanhu (Stan Hu).

Ok, this is indeed a bug. While Ruby 2.7.7 and 3.0.5 work, Ruby 3.1.3 and up have reintroduced https://bugs.ruby-lang.org/issues/19005. See https://bugs.ruby-lang.org/issues/19005#note-25.

----------------------------------------
Bug #19082: Recent gRPC gem fails to build from the source in already released versions
https://bugs.ruby-lang.org/issues/19082#change-101113

* Author: monfresh (Moncef Belyamani)
* Status: Third Party's Issue
* Priority: Normal
* ruby -v: 3.1.3
* Backport: 2.7: DONTNEED, 3.0: DONTNEED, 3.1: DONTNEED
----------------------------------------
About 10 days ago, this commit in the ruby_3_1 branch removed the "$" from "[$flag=]" on line 3073 of `configure.ac`: https://github.com/ruby/ruby/commit/ee6cc2502664ac46edc61868d8954b626bb48e5…

This causes the installation of the grpc gem to fail, whereas before this change the gem installed fine. If I add the dollar sign back in, the grpc gem installs successfully.

Here are the steps to reproduce:

1. Clone the Ruby repo on an Apple Silicon Mac that has v14 of the command line tools
2. `git checkout -b ruby_3_1 origin/ruby_3_1`
3. Compile Ruby:
```
./autogen.sh
./configure --with-opt-dir="$(brew --prefix openssl@3):$(brew --prefix readline):$(brew --prefix libyaml):$(brew --prefix gdbm):$(brew --prefix gmp)" --prefix=/Users/moncef/.rubies/ruby-3.1.3 --disable-install-doc
make -j7 main
make -j7 install
```
4. Switch to 3.1.3 with `chruby 3.1.3`
5. `gem install grpc`

With the current branch, this fails.

6. Remove ~/.rubies/ruby-3.1.3 and ~/.gem/ruby/3.1.3
7. Add the dollar sign back in `configure.ac`
8. Compile Ruby 3.1.3 again the same way as above
9. Switch to 3.1.3
10. `gem install grpc` => This works now.

I attached a zip file of the "gem_make.out" file that shows the full stack trace for why grpc failed to build the gem native extension.

---Files--------------------------------
gem_make.out.zip (77 KB)


[ruby-core:111714] [Ruby master Bug#19005] Ruby interpreter compiled XCode 14 cannot build some native gems on macOS
by stanhu (Stan Hu) 07 Jan '23


Issue #19005 has been updated by stanhu (Stan Hu).

Status changed from Discussion to Closed

Ok, I see this was reported in https://bugs.ruby-lang.org/issues/19082.

----------------------------------------
Bug #19005: Ruby interpreter compiled XCode 14 cannot build some native gems on macOS
https://bugs.ruby-lang.org/issues/19005#change-101112

* Author: stanhu (Stan Hu)
* Status: Closed
* Priority: Normal
* ruby -v: ruby 2.7.6p219 (2022-04-12 revision 44c8bfa984) [arm64-darwin21]
* Backport: 2.7: DONE, 3.0: DONE, 3.1: DONE
----------------------------------------
This seems related to https://bugs.ruby-lang.org/issues/18912 and https://bugs.ruby-lang.org/issues/18981 .

Steps to reproduce:

1. Upgrade to XCode 14.
2. Compile a new Ruby interpreter. I used the version provided in https://github.com/ruby/ruby/pull/6297 with `./configure --prefix=/tmp/ruby --with-openssl-dir=$(brew --prefix openssl@1.1) --with-readline-dir=$(brew --prefix readline) --enable-shared`.
3. Confirm that `-Wl,-undefined,dynamic_lookup` is no longer available:

```
irb(main):001:0> RbConfig::CONFIG['DLDFLAGS']
=> "-Wl,-multiply_defined,suppress"
```

4. Run `gem install pg_query` (`gem install ffi-yajl` will also fail).

Error:

```
linking shared-object pg_query/pg_query.bundle
Undefined symbols for architecture arm64:
  "Init_pg_query", referenced from:
     -exported_symbol[s_list] command line option
     (maybe you meant: _Init_pg_query)
ld: symbol(s) not found for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
```

I can work around the problem by doing:

```
gem install pg_query -- --with-ldflags="-Wl,-undefined,dynamic_lookup"
```


[ruby-core:111713] [Ruby master Bug#19005] Ruby interpreter compiled XCode 14 cannot build some native gems on macOS
by stanhu (Stan Hu) 07 Jan '23


Issue #19005 has been updated by stanhu (Stan Hu).

I think this problem was "accidentally" fixed in Ruby 2.7.7 and 3.0.5, but it's not working in Ruby 3.1.3 and up due to a simple removal of a dollar sign (https://github.com/ruby/ruby/commit/667aa81219ca080c0a4b9f97d29bb3221bd08a33).

In Ruby 3.1.3, I'm not seeing `ADDITIONAL_DLDFLAGS` set with `-Wl,-undefined,dynamic_lookup`:

```ruby
irb(main):001:0> RUBY_VERSION
=> "3.1.3"
irb(main):002:0> RbConfig::CONFIG['ADDITIONAL_DLDFLAGS']
=> ""
```

Whereas 3.0.5 has this:

```ruby
irb(main):001:0> RUBY_VERSION
=> "3.0.5"
irb(main):002:0> RbConfig::CONFIG['ADDITIONAL_DLDFLAGS']
=> "-Wl,-undefined,dynamic_lookup"
```

If I look at the `./configure` output in Ruby 3.0.5 or 2.7.7, I see something like this:

```
checking whether -Wl,-undefined,dynamic_lookup is accepted as LDFLAGS... ./configure: line 29806: -Wl,-undefined,dynamic_lookup=: command not found
checking whether -Wl,-undefined,dynamic_lookup is accepted for bundle... no
```

But with 3.1.3 and up, the `LDFLAGS` check is a straight no:

```
checking whether -Wl,-undefined,dynamic_lookup is accepted as LDFLAGS... no
```

The generated `configure` script in Ruby 3.0.5 looks like:

```
if ac_fn_c_try_link "$LINENO"
then :
  { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
printf "%s\n" "${msg_result_yes}yes${msg_reset}" >&6 ; }
else $as_nop
  $flag=
  { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
printf "%s\n" "${msg_result_no}no${msg_reset}" >&6 ; }
fi
```

The failing line in question is `$flag=`.

In Ruby 3.1.3 and up, it appears `$flag=` has been replaced with `flag=` due to https://github.com/ruby/ruby/commit/667aa81219ca080c0a4b9f97d29bb3221bd08a33:

```
if ac_fn_c_try_link "$LINENO"
then :
  { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: yes" >&5
printf "%s\n" "${msg_result_yes}yes${msg_reset}" >&6 ; }
else $as_nop
  flag=
  { printf "%s\n" "$as_me:${as_lineno-$LINENO}: result: no" >&5
printf "%s\n" "${msg_result_no}no${msg_reset}" >&6 ; }
fi
```
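To compare interpreters quickly, the two `RbConfig` lookups shown in this thread can be wrapped in a small script. This is only a sketch: `DLDFLAGS_KEYS` and `dynamic_lookup_configured?` are hypothetical names, and it inspects just the two keys mentioned in these reports, so other keys may also carry linker flags.

```ruby
require "rbconfig"

# Keys taken from the reports in this thread (assumption: these are the relevant ones).
DLDFLAGS_KEYS = %w[DLDFLAGS ADDITIONAL_DLDFLAGS].freeze

# Hypothetical helper: true if any of the inspected keys mentions dynamic_lookup.
# `config` defaults to the running interpreter's build configuration.
def dynamic_lookup_configured?(config = RbConfig::CONFIG)
  DLDFLAGS_KEYS.any? do |key|
    config[key].to_s.include?("-undefined,dynamic_lookup")
  end
end

puts "dynamic_lookup configured: #{dynamic_lookup_configured?}"
```

Passing a plain Hash instead of `RbConfig::CONFIG` makes it easy to replay the values quoted above, for example the empty `ADDITIONAL_DLDFLAGS` from 3.1.3.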


[ruby-core:111706] [Ruby master Bug#18518] NoMemoryError + [FATAL] failed to allocate memory for twice 1 << large
by nobu (Nobuyoshi Nakada) 07 Jan '23


Issue #18518 has been updated by nobu (Nobuyoshi Nakada).

Status changed from Open to Rejected

It is a test for the development branch and unrelated to users using released versions.


[ruby-core:111704] [Ruby master Bug#18518] NoMemoryError + [FATAL] failed to allocate memory for twice 1 << large
by headius (Charles Nutter) 06 Jan '23


Issue #18518 has been updated by headius (Charles Nutter).

There's no practical reason to support left shift of greater than integer max, so I would support a fast check and RangeError. It would make more sense than just blowing up memory and raising NoMemoryError for something that should never work (`1 << (2**32)` produces a big integer at least 2^29 bytes wide, more than 0.5GB).
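The fast check suggested here could look roughly like the following in plain Ruby. It is only an illustration of the idea, not how any implementation does it: `MAX_SHIFT_WIDTH` and `checked_lshift` are hypothetical names, and the limit is taken from the 2**31 figure discussed in this thread.

```ruby
# Assumed limit (from the 2**31 figure in this thread); not CRuby's actual limit.
MAX_SHIFT_WIDTH = 2**31 - 1

# Hypothetical wrapper: reject oversized shift widths before attempting any allocation.
def checked_lshift(value, width)
  raise RangeError, "shift width too big" if width > MAX_SHIFT_WIDTH
  value << width
end

p checked_lshift(1, 10) # => 1024

begin
  checked_lshift(1, 2**40)
rescue RangeError => e
  p e.message # => "shift width too big"
end
```

The comparison costs one Integer check per call, so the error is raised immediately instead of after a long allocation attempt.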


[ruby-core:111432] [Ruby master Bug#19260] ruby/spec is failed with Ruby 3.3
by hsbt (Hiroshi SHIBATA) 06 Jan '23


Issue #19260 has been reported by hsbt (Hiroshi SHIBATA).

----------------------------------------
Bug #19260: ruby/spec is failed with Ruby 3.3
https://bugs.ruby-lang.org/issues/19260

* Author: hsbt (Hiroshi SHIBATA)
* Status: Open
* Priority: Normal
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
After bumping the version, we got some failures in ruby/spec.

https://github.com/ruby/ruby/actions/runs/3778576412/jobs/6423166914

```
1)
Literal Regexps handles a lookbehind with ss characters ERROR
RegexpError: invalid pattern in look-behind: /(?<!dss)/i
/home/runner/work/ruby/ruby/src/spec/ruby/language/regexp_spec.rb:120:in `block (3 levels) in <top (required)>'
/home/runner/work/ruby/ruby/src/spec/ruby/language/regexp_spec.rb:4:in `<top (required)>'

2)
Float#round does not lose precision during the rounding process FAILED
Expected 767573.18758
to have same value and type as 767573.18759
/home/runner/work/ruby/ruby/src/spec/ruby/core/float/round_spec.rb:148:in `block (3 levels) in <top (required)>'
/home/runner/work/ruby/ruby/src/spec/ruby/core/float/round_spec.rb:3:in `<top (required)>'

3)
Encoding#replicate has been removed FAILED
Expected #<Encoding:US-ASCII>.respond_to? :replicate, true
to be falsy but was true
/home/runner/work/ruby/ruby/src/spec/ruby/core/encoding/replicate_spec.rb:72:in `block (3 levels) in <top (required)>'
/home/runner/work/ruby/ruby/src/spec/ruby/core/encoding/replicate_spec.rb:4:in `<top (required)>'
```


[ruby-core:111702] [Ruby master Bug#18797] Third argument to Regexp.new is a bit broken
by Eregon (Benoit Daloze) 06 Jan '23


Issue #18797 has been updated by Eregon (Benoit Daloze).

Assignee set to jeremyevans0 (Jeremy Evans)
Target version set to 3.3

----------------------------------------
Bug #18797: Third argument to Regexp.new is a bit broken
https://bugs.ruby-lang.org/issues/18797#change-101098

* Author: janosch-x (Janosch Müller)
* Status: Open
* Priority: Normal
* Assignee: jeremyevans0 (Jeremy Evans)
* Target version: 3.3
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
## Situation

'n' or 'N' can be passed as a third argument to `Regexp.new`. However, the behavior is not the same as the literal `n`-flag or the `Regexp::NOENCODING` option, and it makes the `#encoding` of `Regexp` and `Regexp#source` diverge:

```ruby
/😅/n # => SyntaxError
Regexp.new('😅', Regexp::NOENCODING) # => RegexpError
re = Regexp.new('😅', nil, 'n') # => /😅/
re.options == Regexp::NOENCODING # => true
re.encoding # => ASCII-8BIT
re.source.encoding # => UTF-8
re =~ '😅' # => Encoding::CompatibilityError
```

## Code

[Here](https://github.com/ruby/ruby/blob/b41de3a1e8c36a5cc336b6f7cd3cb71126c….

There is also a test for the resulting encoding [here](https://github.com/ruby/ruby/blob/cf2bbcfff2985c116552967c7c4522f4630…, but it is a no-op because the whole file is set to that encoding via magic comment anyway.

The third argument was added when ASCII was still the default Ruby encoding, so I guess Regexp and source encoding still matched at that point.

## Solution

It could be fixed, but my impression is that it is not useful anymore. It was probably only added because `Regexp::NOENCODING` wasn't available at the time, so I think it could be deprecated like so:

> Passing a third argument to Regexp.new is deprecated. Use `Regexp::NOENCODING` as second argument instead.
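The replacement recommended in the proposed deprecation message can be sketched as follows. `noencoding_regexp` is a hypothetical helper, not part of Ruby; it only shows `Regexp::NOENCODING` being passed through the second argument, here combined with another option flag, and uses an ASCII-only pattern so it does not hit the errors listed in the issue.

```ruby
# Hypothetical helper: request NOENCODING through the options (second) argument
# instead of the third argument of Regexp.new.
def noencoding_regexp(source, options = 0)
  Regexp.new(source, options | Regexp::NOENCODING)
end

re = noencoding_regexp("abc", Regexp::IGNORECASE)

# The NOENCODING bit is visible in the regexp's options.
p((re.options & Regexp::NOENCODING) == Regexp::NOENCODING)

# ASCII-only pattern and subject, so matching works as usual.
p re.match?("ABC") # => true
```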


[ruby-core:111697] [Ruby master Feature#18949] Deprecate and remove replicate and dummy encodings
by Eregon (Benoit Daloze) 06 Jan '23


Issue #18949 has been updated by Eregon (Benoit Daloze).

Target version set to 3.3

This is all done now, only https://github.com/ruby/ruby/pull/7079 is left and I'll merge that when it passes CI.

Overall:
* We deprecated and removed `Encoding#replicate`
* We removed `get_actual_encoding()`
* We limited to 256 encodings and kept `rb_define_dummy_encoding()` with that constraint.
* There is a single flat array to lookup encodings, `rb_enc_from_index()` is fast now. Since the limit is 256 and not 128 though, it means `ENCODING_GET` is not just `RB_ENCODING_GET_INLINED` but still has the check and slow fallback.

Thank you for the discussion, @ko1 for implementing the fixed-size table and let's close this.

Of course for all builtin encodings the cost is just the extra check. Maybe the limit could be changed later to 128 if this optimization is wanted.

----------------------------------------
Feature #18949: Deprecate and remove replicate and dummy encodings
https://bugs.ruby-lang.org/issues/18949#change-101095

* Author: Eregon (Benoit Daloze)
* Status: Open
* Priority: Normal
* Assignee: Eregon (Benoit Daloze)
* Target version: 3.3
----------------------------------------
Ruby has a lot of accidental complexity. Sometimes it becomes clear some features bring a lot of complexity and yet provide little value or are used very rarely. Also most Ruby users do not even know about these features.

Replicate and dummy encodings seem to clearly fall into this category: almost nobody uses them but they add significant complexity and also a significant performance overhead. Notably, the existence of those means the number of encodings in a Ruby runtime is actually variable and not fixed. That means extra synchronization, hashtable lookups, indirections, function calls, etc.

## Replicate Encodings

Replicate encodings are created using `Encoding#replicate(name)`. It almost sounds like an alias but in fact it is more than that and creates a new Encoding object, which can be used by a String:

```ruby
e = Encoding::US_ASCII.replicate('MY-US-ASCII')
s = "abc".force_encoding(e)
p s.encoding # => e
p s.encoding.name # => 'MY-US-ASCII'
```

This seems completely useless. There is an obvious first step here which is to change `Encoding#replicate` to return the receiver, and just install an alias for it. That avoids creating more encoding instances needlessly.

I think we should also deprecate and remove this method though, it is never a good idea to have a global mutable map like this. If someone wants extra aliases for encodings, they can easily do so by having their own Hash: `{ alias => encoding }.fetch(name) { Encoding.find(name) }`.

## Dummy Encodings

Dummy encodings are not real encodings. They are artificial encodings designed to look like encodings, but don't function as encodings in Ruby. From the docs:

```
enc.dummy? -> true or false
------------------------------------------------------------------------
Returns true for dummy encodings. A dummy encoding is an encoding for
which character handling is not properly implemented. It is used for
stateful encodings.
```

I wonder why we have those half-implemented encodings in core, it sounds to me like unfinished work which should not have been merged.

The "codepoints" of dummy encodings are just "bytes" and so they behave the same as `Encoding::BINARY`, with the exception of the UTF-16 and UTF-32 dummy encodings.

### UTF-16 and UTF-32 dummy encodings

These two are special dummy encodings. What they do is they scan the first 2 or 4 bytes of the String, and if those bytes are a byte-order mark (BOM), the true "actual" encoding is resolved to UTF-16BE/UTF-16LE or UTF-32BE/UTF-32LE. Otherwise, `Encoding::BINARY` is returned. This logic is done by `get_actual_encoding()`.

What is weird is this check is not done on String creation, no, it is done *every time* the encoding of that String is accessed (and the result is not stored on the String). That is a needless overhead and really unreliable semantics. Do we really want a String which automagically changes between UTF-16LE and UTF-16BE based on mutating its bytes? I think nobody wants that:

```ruby
s = "\xFF\xFEa\x00b\x00c\x00d\x00".force_encoding("UTF-16")
p s # => "\uFEFFabcd"
s.setbyte 0, 254
s.setbyte 1, 255
p s # => "\uFEFF\u6100\u6200\u6300\u6400"
```

I think the path is clear, we should deprecate and then remove Encoding::UTF_16 and Encoding::UTF_32 (dummy encodings). And then we no longer need `get_actual_encoding()` and the overhead it adds to every String method.

We could also keep those constants and make them refer to the native-endian UTF-16/32. But that could cause confusing errors as we would change the meaning of them. We could add `Encoding::UTF_16NE` / `Encoding::UTF_16_NATIVE_ENDIAN` if that is useful.

Another possibility would be to resolve these encodings on String creation, like:

```
"\xFF\xFE".force_encoding("UTF-16").encoding # => UTF-16LE
String.new("\xFF\xFE", encoding: Encoding::UTF_16).encoding # => UTF-16LE
"ab".force_encoding("UTF-16").encoding # exception, not a BOM
String.new("ab", encoding: Encoding::UTF_16).encoding # exception, not a BOM
```

I think it is unnecessary to keep such complexity though. A class method on String or Encoding like e.g. `Encoding.find_from_bom(string)` is so much clearer and more efficient (no need to special case those encodings in String.new, #force_encoding, etc).

FWIW JRuby seems to use `getActualEncoding()` only in 2 places (scanForCodeRange, inspect), which is an indication those dummy UTF encodings are barely used if ever. Similarly, TruffleRuby only has 4 usages of `GetActualEncodingNode`.

### Existing dummy encodings

```
> Encoding.list.select(&:dummy?)
[#<Encoding:UTF-16 (dummy)>, #<Encoding:UTF-32 (dummy)>, #<Encoding:IBM037 (dummy)>, #<Encoding:UTF-7 (dummy)>, #<Encoding:ISO-2022-JP (dummy)>, #<Encoding:ISO-2022-JP-2 (dummy)>, #<Encoding:ISO-2022-JP-KDDI (dummy)>, #<Encoding:CP50220 (dummy)>, #<Encoding:CP50221 (dummy)>]
```

So besides UTF-16/UTF-32 dummy, it's only 7 encodings. Does anyone use one of these 7 dummy encodings?

What is interesting to note, is that these encodings are exactly the ones that are also not ASCII-compatible, with the exception of UTF-16BE/UTF-16LE/UTF-32BE/UTF-32LE (non-dummy). As a note, UTF-{16,32}{BE,LE} are ASCII-compatible in codepoints but not in bytes, and Ruby uses the bytes definition of ASCII-compatible. There is potential to simplify encoding compatibility rules and encoding compatibility checks based on that.

So what this means is if we removed dummy encodings, all encodings except UTF-{16,32}{BE,LE} would be ASCII-compatible, which would lead to significant simplifications for many string operations which currently need to handle dummy encodings specially. Unicode encodings like UTF-{16,32}{BE,LE} already have special behavior for some Ruby methods, so those are already handled specially in some places (they are the only encodings with minLength > 1).

```
> Encoding.list.reject(&:ascii_compatible?)
[#<Encoding:UTF-16BE>, #<Encoding:UTF-16LE>, #<Encoding:UTF-32BE>, #<Encoding:UTF-32LE>, #<Encoding:UTF-16 (dummy)>, #<Encoding:UTF-32 (dummy)>, #<Encoding:IBM037 (dummy)>, #<Encoding:UTF-7 (dummy)>, #<Encoding:ISO-2022-JP (dummy)>, #<Encoding:ISO-2022-JP-2 (dummy)>, #<Encoding:ISO-2022-JP-KDDI (dummy)>, #<Encoding:CP50220 (dummy)>, #<Encoding:CP50221 (dummy)>]
```

What can we do with such a dummy non-ASCII-compatible encoding? Almost nothing useful:

```ruby
> s = "abc".encode("IBM037")
=> "\x81\x82\x83"
> s.bytes
=> [129, 130, 131]
> s.codepoints
=> [129, 130, 131]
> s == "abc"
=> false
> "été".encode("IBM037")
=> "\x51\xA3\x51"
```

So about the only thing that works with them is `String#encode`. I think we could preserve that functionality, if actually used (does anyone use one of these 7 dummy encodings?), through:

```ruby
> "été".encode("IBM037")
=> "\x51\xA3\x51" (.encoding == BINARY)
> "\x51\xA3\x51".encode("UTF-8", "IBM037") # encode from IBM037 to UTF-8
=> "été" (.encoding == UTF-8)
```

That way there is no need for those to be Encoding instances, we would only need the conversion tables. It is even better if we can remove them, so the notion of "dummy encodings" can disappear completely and nobody needs to understand or implement them.

### rb_define_dummy_encoding(name)

The C-API has `rb_define_dummy_encoding(const char *name)`. This creates a new Encoding instance with `dummy?=true`, and it is also non-ASCII-compatible. There seems to be no purpose to this besides storing the metadata of an encoding which does not exist in Ruby. This seems a really expensive/complex way to handle that from the VM point of view (because it dynamically creates an Encoding and adds it to lists/maps/etc). A simple replacement would be to mark the String as BINARY and save the encoding name as an instance variable of that String. Since Ruby can't understand anything about that String anyway, it's just raw bytes to Ruby's eyes.

## Summary

I suggest we deprecate replicate and dummy encodings in Ruby 3.2. And then we remove them in the next version. This will significantly simplify string-related methods, and the behavior exposed to Ruby users.

It will also significantly speed up encoding lookup in CRuby (and other Ruby implementations). With a fixed number of encodings we can ensure all encoding indices fit in 7 bits, and `ENCODING_GET` can be simply `RB_ENCODING_GET_INLINED`. `get_actual_encoding()` will be gone and its overhead as well. `rb_enc_from_index()` would be just `return global_enc_table->list[index].enc;`, instead of the expensive behavior currently with `GLOBAL_ENC_TABLE_EVAL` which takes a lock and more when there are multiple Ractors. Many checks in these methods would be removed as well. Yet another improvement would be to load all encodings eagerly, that is small and fast in my experience, what is slow and big is the conversion tables, that'd simplify `must_encindex()` further.

These changes would affect most String methods, which use

```
STR_ENC_GET->get_encoding which does:
get_actual_encoding->rb_enc_from_index and possibly ->enc_from_index
ENCODING_GET->RB_ENCODING_GET_INLINED and possibly ->rb_enc_get_index->enc_get_index_str->rb_attr_get
```

Some of these details are mentioned in https://github.com/ruby/ruby/pull/6095#discussion_r915149708. The overhead is so large that it is worth handling some hardcoded encoding indices directly in String methods. This feels wrong, getting the encoding from a String should be simple, straightforward and fast.

Further optimizations will be unlocked as the encoding list becomes fixed and immutable. For example, the name-to-Encoding map is then immutable and could use perfect hashing. Inline caching those lookups also becomes easier as the map cannot change. Also that map would no longer need synchronization, etc.

## To Decide

Each item is independent. I think 1 & 2 are very important, 3 less but would be nice.

1. Deprecate and then remove `Encoding#replicate` and `rb_define_dummy_encoding()`. With that there is a fixed number of encodings, a lot of simplifications and many optimizations become available. They are used respectively in only 1 gem and 5 gems, see https://bugs.ruby-lang.org/issues/18949#note-4
2. Deprecate and then remove the dummy UTF-16 and UTF-32 encodings. This removes the need for `get_actual_encoding()` which is expensive. This functionality seems rarely used in practice, and it only works when such strings have a BOM, which is very rare.
3. Deprecate and then remove other dummy encodings, so there are no more dummy "half-implemented" encodings and all encodings are ASCII-compatible in terms of codepoints.
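The explicit BOM lookup that the issue suggests as an alternative (`Encoding.find_from_bom(string)`, which is only a proposed name there) could be sketched in plain Ruby as below. `BOM_ENCODINGS` and `encoding_from_bom` are hypothetical names; falling back to `Encoding::BINARY` mirrors the behavior the issue describes for strings without a BOM, and checking the 4-byte marks first is a choice made here because the UTF-32LE mark begins with the UTF-16LE one.

```ruby
# Byte-order marks mapped to encodings. Hashes keep insertion order, so the
# 4-byte UTF-32 marks are tested before the 2-byte UTF-16 marks.
BOM_ENCODINGS = {
  "\x00\x00\xFE\xFF".b => Encoding::UTF_32BE,
  "\xFF\xFE\x00\x00".b => Encoding::UTF_32LE,
  "\xFE\xFF".b         => Encoding::UTF_16BE,
  "\xFF\xFE".b         => Encoding::UTF_16LE,
}.freeze

# Hypothetical stand-in for the proposed Encoding.find_from_bom(string).
# Returns Encoding::BINARY when the string does not start with a known BOM.
def encoding_from_bom(string)
  bytes = string.b # binary copy, so the comparison is purely byte-based
  BOM_ENCODINGS.each do |bom, encoding|
    return encoding if bytes.start_with?(bom)
  end
  Encoding::BINARY
end

p encoding_from_bom("\xFF\xFEa\x00b\x00") # => #<Encoding:UTF-16LE>
p encoding_from_bom("ab") == Encoding::BINARY # => true
```

Unlike the dummy UTF-16/UTF-32 encodings, the result is computed once, when the caller asks for it, and does not change if the string's bytes are mutated later.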
