[ruby-core:118560] [Ruby master Feature#20594] A new String method to append bytes while preserving encoding

11 Jul 2024

      Issue #20594 has been updated by Eregon (Benoit Daloze).

mame (Yusuke Endoh) wrote in #note-7:
...
Existing methods with byte-prefix (String#byteindex, #bytesplice, etc.) mean that the unit of offset or size is in byte.
My understanding of `byte*` methods is they treat the String as a byte array, which implies indices are just byte indices but also that the encoding is ignored (it seems clear when one does `"é".getbyte(0)`).
It's (almost) as if the string had the BINARY encoding for the duration of the operation, but without the overhead of switching to BINARY and back (which notably could cause some extra code range computation, etc).
BTW, I would consider `each_byte` also a `byte*` method, and that one does not accept or pass byte indices.
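
For illustration, here is a small sketch of that reading of `byte*` methods, using only existing methods (`byteindex` needs Ruby 3.2 or later). The variable names are just for the example:

```ruby
s = "éa" # UTF-8 source literal: "é" is 2 bytes (0xC3 0xA9), "a" is 1 byte

first_byte  = s.getbyte(0)      # reads a raw byte, not a character
all_bytes   = s.each_byte.to_a  # iterates bytes, takes no index at all
char_offset = s.index("a")      # offset counted in characters
byte_offset = s.byteindex("a")  # offset counted in bytes

# The bytes seen are the same ones a BINARY copy of the string would expose.
same_as_binary = (all_bytes == s.b.bytes)
```

Here `getbyte` and `each_byte` never consult the encoding, while `byteindex` only differs from `index` in the unit of the returned offset.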

So I think it would make sense to extend the meaning of `byte*` methods to be a little more general, just like I explained above.
I don't think it was documented to be only about byte indices either.

That said, I think `String#append_bytes(String)` sounds fine too.

----------------------------------------
Feature #20594: A new String method to append bytes while preserving encoding
https://bugs.ruby-lang.org/issues/20594#change-109083

* Author: byroot (Jean Boussier)
* Status: Open
* Assignee: byroot (Jean Boussier)
----------------------------------------
### Context

When working with binary protocols such as `protobuf` or `MessagePack`, you may often need to assemble multiple
strings of different encodings:

```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title <<
      255 << body.bytesize << body
  end
end

Post.new("Hello", "World").serialize("somedata".b) # => "somedata\xFF\x05Hello\xFF\x05World" #<Encoding:ASCII-8BIT>
```

The problem in the above case is that because `Encoding::ASCII_8BIT` is declared as ASCII compatible,
if one of the appended strings contains bytes outside the ASCII range, the string is automatically promoted
to another encoding, which then leads to encoding issues:

```ruby
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => incompatible character encodings: ASCII-8BIT and UTF-8 (Encoding::CompatibilityError)
```
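
To make the two failure modes more concrete, here is a minimal sketch that separates them. It assumes a UTF-8 source file (Ruby's default), and the variable names are only illustrative:

```ruby
# Case 1: the receiver holds only ASCII bytes, so appending non-ASCII UTF-8
# silently changes the receiver's encoding.
buf = "somedata".b
buf << "H€llo"
promoted = buf.encoding # no longer binary

# Case 2: the receiver already holds a non-ASCII byte, so the same append raises.
buf = "somedata".b
buf << 255 # on a binary string this appends the single byte 0xFF
error = nil
begin
  buf << "H€llo"
rescue Encoding::CompatibilityError => e
  error = e
end
```

In the `serialize` example above, the `255` marker is appended first, which is why the call fails with `Encoding::CompatibilityError` rather than being promoted.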

In many cases, you want to append to a String without changing the receiver's encoding.

The issue isn't exclusive to binary protocols and formats; it also happens with ASCII protocols that accept arbitrary bytes inline,
like Redis's RESP protocol or even HTTP/1.1.

### Previous discussion

There was a similar feature request a while ago, but it was abandoned: https://bugs.ruby-lang.org/issues/14975

### Existing solutions

You can of course always cast the strings you append to avoid this problem:

```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf <<
      255 << title.bytesize << title.b <<
      255 << body.bytesize << body.b
  end
end
```

But this causes a lot of needless allocations.

You'd think you could also use `bytesplice`, but it actually has the same issue:

```ruby
Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf << 255 << title.bytesize
    buf.bytesplice(buf.bytesize, title.bytesize, title)
    buf << 255 << body.bytesize
    buf.bytesplice(buf.bytesize, body.bytesize, body)
  end
end
Post.new("H€llo", "Wôrld").serialize("somedata".b) # => 'String#bytesplice': incompatible character encodings: BINARY (ASCII-8BIT) and UTF-8 (Encoding::CompatibilityError)
```

And even if it worked, it would be very unergonomic.

### Proposal: a `byteconcat` method

A solution to this would be to add a new `byteconcat` method, that could be shimmed as:

```ruby
class String
  def byteconcat(*strings)
    strings.map! do |s|
      if s.is_a?(String) && s.encoding != encoding
        s.dup.force_encoding(encoding)
      else
        s
      end
    end
    concat(*strings)
  end
end

Post = Struct.new(:title, :body) do
  def serialize(buf)
    buf.byteconcat(
      255, title.bytesize, title,
      255, body.bytesize, body,
    )
  end
end

Post.new("H€llo", "Wôrld").serialize("somedata".b) # => "somedata\xFF\aH\xE2\x82\xACllo\xFF\x06W\xC3\xB4rld" #<Encoding:ASCII-8BIT>
```

But of course a builtin implementation wouldn't need to dup the arguments.

Like other `byte*` methods, it's the responsibility of the caller to ensure the resulting string has a valid encoding, or
to deal with it if not.
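
As a rough illustration of that responsibility, the sketch below repeats the shim from this proposal (so the snippet runs on its own) and appends a multi-byte character one byte at a time to a UTF-8 receiver. The variable names are hypothetical:

```ruby
class String
  # Same shim as in the proposal: reinterpret arguments in the receiver's encoding.
  def byteconcat(*strings)
    strings.map! do |s|
      if s.is_a?(String) && s.encoding != encoding
        s.dup.force_encoding(encoding)
      else
        s
      end
    end
    concat(*strings)
  end
end

utf8 = +"caf"                # mutable UTF-8 string
utf8.byteconcat("\xC3".b)    # lead byte of "é" only
encoding_kept    = utf8.encoding
valid_after_lead = utf8.valid_encoding? # the string is temporarily broken

utf8.byteconcat("\xA9".b)    # continuation byte completes "é"
valid_after_both = utf8.valid_encoding?
```

The receiver keeps its encoding throughout; whether the bytes form valid characters is left for the caller to check, for example with `valid_encoding?`.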

### Method name and signature

#### Name

This proposal suggests `String#byteconcat`, to mirror `String#concat`, but other names are possible:

  - `byteappend` (like `Array#append`)
  - `bytepush`  (like `Array#push`)

#### Signature

This proposal makes `byteconcat` accept either `String` or `Integer` (in char range) arguments like `concat`. I believe it makes sense for consistency and also because it's not uncommon for protocols to have some byte-based segments, and Integers are more convenient there.

The proposed method also accepts variable arguments for consistency with `String#concat`, `Array#push`, `Array#append`.

The proposed method returns self, like `concat` and others.
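
For reference, this is how the existing `String#concat` already behaves on a binary receiver, which is the behavior the proposed signature mirrors. This is only a sketch of current `concat` semantics, not of the new method:

```ruby
buf = "".b

# Integers and Strings can be mixed, and several arguments are accepted.
result = buf.concat(255, 5, "Hello")

returns_self = result.equal?(buf) # concat returns the receiver itself
bytes        = buf.bytes

# An Integer outside the receiver encoding's character range is rejected.
out_of_range = nil
begin
  "".b << 256
rescue RangeError => e
  out_of_range = e
end
```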

### YJIT consideration

I consulted @maximecb about this proposal, and according to her, accepting variable arguments makes it harder for YJIT to optimize.
I suspect consistency with other APIs trumps the performance consideration, but I think it's worth mentioning.

-- 
https://bugs.ruby-lang.org/