Issue #22056 has been updated by kou (Kouhei Sutou).

Eregon (Benoit Daloze) wrote in #note-12:
The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed.
This breaks `RSTRING_PTR()` guaranteeing to return a `\0`-terminated String, aka `SHARABLE_MIDDLE_SUBSTRING` not being enabled currently (#19315).
Good point. We should add "the buffer owned by another object must be zero-terminated" as a caller's responsibility. We may be able to remove the restriction later. Let's discuss that in #19315, not here.

In this issue, let's focus on whether we should add a new API as an alternative to `rb_str_new_static()` + `rb_ivar_set()` for memory-efficient `String` creation.

Some features needed to use the new API widely, such as #19315, are still missing. But we can improve the current situation step by step. If we should consider this new API only after all related improvements are done, let's do that. If we can consider this new API now, even though it still has some restrictions for the listed use cases, let's discuss it now.
My feeling is `rb_str_new_static()` + `rb_ivar_set()` is good enough for this use case.
This proposal's motivation is memory efficiency. If we think the memory overhead of `rb_ivar_set()` is not worth caring about, we don't need the proposed API.

----------------------------------------
Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object
https://bugs.ruby-lang.org/issues/22056#change-117248

* Author: himura467 (Akito Shitara)
* Status: Open
----------------------------------------
Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes:

``` c
VALUE str = rb_str_new(ptr, len);
```

For large or frequently accessed buffers this copy is wasteful in both time and memory.

One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management.

### Existing APIs and their limitations

| API | Memory behavior |
| ---- | ---- |
| `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation |
| `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String |
| `rb_str_new_static` | References static (compile-time) storage; no lifetime management |

`rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency.

The common workaround is to pin the owner via an instance variable:

``` c
VALUE str = rb_str_new_static(ptr, len);
rb_ivar_set(str, id_owner, owner);
```

The chosen key is a convention rather than a contract, and the ivar can be cleared by Ruby-level code, invalidating the pointer without any warning. It also incurs ivar table allocation overhead.
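
The fragility of the ivar workaround can be illustrated from the Ruby side. The following is only a sketch: `Owner` and `@owner` are hypothetical names, an ordinary String stands in for one built with `rb_str_new_static()`, and it assumes the pin is stored under a regular instance variable name (as with `"@bytes"` in the ruby-gnome example).

```ruby
# Hypothetical stand-in for an object that owns a native byte buffer.
class Owner; end

# Stands in for the String a C extension would build with rb_str_new_static().
str = String.new("payload")

# Ruby-level equivalent of the C-side pin, assuming an ordinary ivar name.
str.instance_variable_set(:@owner, Owner.new)
pinned_before = str.instance_variable_defined?(:@owner)

# Any Ruby code holding the String can drop the pin; nothing warns or raises.
str.remove_instance_variable(:@owner)
pinned_after = str.instance_variable_defined?(:@owner)

puts "pinned before: #{pinned_before}, after: #{pinned_after}"
```

Once the ivar is gone, nothing keeps the owner reachable through the String. In the real C scenario the String's pointer could then outlive the memory it refers to.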

### Proposal

Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below):

``` c
VALUE rb_str_new_external(const char *ptr, long len, VALUE parent);
VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent);
VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent);
VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent);
```

`parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made.

### Use cases

*IO::Buffer#get_string*

`IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` currently copies those bytes into a new String. With this API, the returned String can reference the source String directly:

``` c
return rb_enc_str_new_external((const char *)base + offset, length, encoding, self);
```

The returned String holds a GC reference to the `IO::Buffer`, which owns or transitively keeps alive the backing memory. The zero-copy string therefore remains valid for as long as it is reachable.

*GLib::Bytes#to_s (ruby-gnome)*

`GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround:

``` c
VALUE str = rb_str_new_static(data, size);
rb_iv_set(str, "@bytes", self);
```

With the proposed API this becomes:

``` c
return rb_str_new_external(data, size, self);
```

### Open questions

*Naming*

The name `rb_str_new_external` is one option.
Other candidates:

* `rb_str_new_owned_by` / `rb_enc_str_new_owned_by`
* `rb_str_new_pinned` / `rb_enc_str_new_pinned`
* `rb_str_new_with_parent` / `rb_enc_str_new_with_parent`

*Memory retention*

When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing.

### Proof of concept

A prototype implementation is at: https://github.com/ruby/ruby/pull/16834

The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child.

--
https://bugs.ruby-lang.org/