[ruby-core:125416] [Ruby Feature#22056] Add zero-copy String constructor backed by an arbitrary Ruby object
Issue #22056 has been reported by himura467 (Akito Shitara). ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- ### Description Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by kou (Kouhei Sutou). I want this feature as a user of `IO::Buffer#get_string` and the author of `GLib::Bytes#to_s`. Both of them are for processing Apache Arrow data effectively.
When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7.
If we can embed the target data, I think that `rb_str_new_external()` can behave like `rb_str_new()` (we can copy the target data). If we do so, we can reuse the `4` flag value (`STR_PRECOMPUTED_HASH`) for this case. Because `STR_PRECOMPUTED_HASH` is used only for embedded string.
`IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly:
```c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ```
Can we use `self` not `buffer->source` for `parent`? If we use `self` not `buffer->source`, we can use this optimization for `IO::Buffer.map` (that doesn't use `buffer->source`) too. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117214 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by byroot (Jean Boussier). Somewhat related proposal: [Feature #20878] I also wonder, this new "external" string is supposed to be mutable? If so that may increase `string.c` complexity significantly, as all codepaths that normally resize the string buffer will need to behave differently with this new type of string. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117217 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by himura467 (Akito Shitara). Thanks @kou
Can we use `self` not `buffer->source` for `parent`? If we use `self` not `buffer->source`, we can use this optimization for `IO::Buffer.map` (that doesn't use `buffer->source`) too.
The `io_buffer_free` method internally executes `buffer->source = Qnil`, so when `self` is used for `parent`, `str` would keep the `IO::Buffer` alive, but the `IO::Buffer` would release its `T_STRING` reference, potentially causing the `T_STRING` to be garbage collected. Using `buffer->source` for `parent` is one of the simplest workarounds for this issue. However, considering future applications to `IO::Buffer.map`, changing the behavior to use `self` for `parent` (while modifying `io_buffer_free` to skip the free operation while there are still references) would be the more clean solution to address this issue. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117218 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by himura467 (Akito Shitara). Thank you @byroot, for pointing out the concern. We are designing this new "external" string to be mutable. Regarding the complexity concern, the mechanism for handling cases where the zero-copy returned string is modified can be implemented without requiring new code paths by leveraging the existing `STR_NOFREE` mechanism. Therefore, I do not expect this to significantly increase the complexity of `string.c`. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117219 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by byroot (Jean Boussier).
by leveraging the existing STR_NOFREE mechanism.
I see. So it's mutable, but Copy-on-Write. If you mutate the string, all the content is copied in a buffer managed by Ruby. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117220 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by himura467 (Akito Shitara).
I see. So it's mutable, but Copy-on-Write. If you mutate the string, all the content is copied in a buffer managed by Ruby.
Yes, that's correct. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117221 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by byroot (Jean Boussier). Another though: what does it means for coderanges? Since the buffer is owned by another object, it can be mutated without going through one of String methods, which means things like `ENC_CODERANGE_CLEAR` won't happen. Perhaps it's acceptable, but that may cause weird behaviors for many methods. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117222 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by himura467 (Akito Shitara). Thank you for raising this, it's a good point worth keeping in mind. The zero-copy path only applies to READONLY buffers, and `T_STRING` stored in `buffer->source` is guaranteed to be frozen by the design of `IO::Buffer.for`. Since there is currently no mechanism to modify the backing memory, I believe this has no impact. However, when extended to `IO::Buffer.map`, shared mappings could cause the coderange to become stale, because an external process can modify the underlying file and those changes are immediately visible through the mapping. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117223 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by jhawthorn (John Hawthorn). himura467 (Akito Shitara) wrote:
`rb_str_new_static` avoids copying but is only safe for storage that lives forever.
For what it's worth, I'm unconvinced this existing optimization is paying off (and I intend to remove it from the fstring case, where it definitely isn't). At least in CRuby itself at boot we create ~500 strings using this and almost all of them are embeddable in a 40 byte object. The largest is ~80 bytes (RUBY_DESCRIPTION) and the second-largest is only 50, so copying is trivial. We're creating less cache friendly strings for no reason. There's also a small gotcha to strings being created with `rb_str_new_static` that any similar technique would inherit: they require that buf[0] to `buf[len+enc.termlen]` be readable for `str_fill_term` (documentation says len+1, but I think it's actually more for ex. utf16). That doesn't seem like a guarantee ex. `g_bytes_get_data` would provide when making a string containing the full buffer. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117224 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by Eregon (Benoit Daloze). himura467 (Akito Shitara) wrote:
This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller.
What do you mean by "undocumented GC behavior"? Isn't it only relying on an object keeping its ivar values alive? That's well documented.
The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed.
This breaks `RSTRING_PTR()` guaranteeing to return a `\0`-terminated String, aka `SHARABLE_MIDDLE_SUBSTRING` not being enabled currently (#19315). I don't think it makes sense to break that invariant in a single case, that would be very error-prone. We should either remove the invariant completely (or design some migration path), or keep it holding always. My feeling is `rb_str_new_static()` + `rb_ivar_set()` is good enough for this use case. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117225 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by himura467 (Akito Shitara). Thank you @jhawthorn
For what it's worth, I'm unconvinced this existing optimization is paying off (and I intend to remove it from the fstring case, where it definitely isn't). At least in CRuby itself at boot we create ~500 strings using this and almost all of them are embeddable in a 40 byte object. The largest is only ~80 bytes (RUBY_DESCRIPTION) and the second-largest is only 50, so copying is trivial. We're creating less cache friendly strings for no reason.
The use cases we have in mind involve large or frequently accessed data where copy overhead matters. `IO::Buffer#get_string` on file mappings or network buffers is one example. That's a different concern from boot-time strings.
There's also a small gotcha to strings being created with `rb_str_new_static` that any similar technique would inherit: they require that `buf[0]` to `buf[len+enc.termlen]` be readable for `str_fill_term` (documentation says len+1, but I think it's actually more for ex. utf16). That doesn't seem like a guarantee ex. `g_bytes_get_data` would provide when making a string containing the full buffer.
For the MAPPED case the implementation guards with `offset + length + termlen <= buffer->size`, ensuring those bytes fall within the mapped region. For the string-backed case there's a gap when the requested encoding has a larger `termlen` than the source, so the guard should be `offset + length + termlen <= buffer->size + src_termlen`. We've updated the implementation accordingly. The `g_bytes_get_data` concern for full-buffer strings is valid: callers need to ensure at least termlen bytes past the data are allocated. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117230 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by himura467 (Akito Shitara). Thank you @Eregon
What do you mean by "undocumented GC behavior"? Isn't it only relying on an object keeping its ivar values alive? That's well documented.
Agreed, "undocumented" was imprecise. What I meant is that the relationship is fragile at the API level: the two-step pattern can silently be omitted, the chosen key is a convention rather than a guarantee, and the ivar can be cleared by Ruby-level code (instance_variable_set), invalidating the pointer without any warning. The proposed API encodes the dependency into the allocation itself, making it impossible to accidentally omit.
This breaks `RSTRING_PTR()` guaranteeing to return a `\0`-terminated String, aka `SHARABLE_MIDDLE_SUBSTRING` not being enabled currently (#19315). I don't think it makes sense to break that invariant in a single case, that would be very error-prone. We should either remove the invariant completely (or design some migration path), or keep it holding always. My feeling is `rb_str_new_static()` + `rb_ivar_set()` is good enough for this use case.
The concern is fair in the context of #19315: `RSTRING_PTR()` is widely assumed to return a `\0`-terminated pointer, and C extensions that pass it directly to C string functions would silently misbehave if `ptr[len]` is not `\0`. That said, `rb_str_new_static` has the same property when used with arbitrary memory: it sets `STR_NOFREE` without writing `\0` at `ptr[len]` (https://github.com/ruby/ruby/blob/b6e4fa71d514796ee826b1257bfd7b2a177f5f09/s...). For the `GLib::Bytes` use case, `g_bytes_get_data()` does not guarantee a `\0` byte past the end, so the existing workaround has the identical exposure at the `RSTRING_PTR()` level. `StringValueCStr` is handled correctly for both: `str_dependent_p` returns true for `STR_NOFREE` strings, so `str_fill_term` takes the dependent path and calls `str_make_independent_expand` if `\0` is absent, writing the terminator into a fresh allocation without touching the source. If the `RSTRING_PTR()` invariant concern is a blocker for this proposal, it applies equally to `rb_str_new_static` with non-literal memory. The longer-term fix (making `RSTRING_PTR()` itself handle lazy strings as discussed in #19315) would benefit both approaches. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117231 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` This relies on undocumented GC behavior, incurs ivar table allocation overhead, and leaves lifetime management entirely to the caller. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, buffer->source); ``` The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by kou (Kouhei Sutou). byroot (Jean Boussier) wrote in #note-9:
Another though: what does it means for coderanges? Since the buffer is owned by another object, it can be mutated without going through one of String methods, which means things like `ENC_CODERANGE_CLEAR` won't happen.
Perhaps it's acceptable, but that may cause weird behaviors for many methods.
Good point. We should add "the buffer is owned by another object must not be mutated" as a caller's responsible. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117246 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` The chosen key is a convention rather than a contract, and the ivar can be cleared by Ruby-level code, invalidating the pointer without any warning. It also incurs ivar table allocation overhead. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, self); ``` The returned String holds a GC reference to the `IO::Buffer`, which owns or transitively keeps alive the backing memory. The zero-copy string therefore remains valid for as long as it is reachable. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by kou (Kouhei Sutou). jhawthorn (John Hawthorn) wrote in #note-11:
himura467 (Akito Shitara) wrote:
`rb_str_new_static` avoids copying but is only safe for storage that lives forever.
For what it's worth, I'm unconvinced this existing optimization is paying off (and I intend to remove it from the fstring case, where it definitely isn't). At least in CRuby itself at boot we create ~500 strings using this and almost all of them are embeddable in a 40 byte object. The largest is only ~80 bytes (RUBY_DESCRIPTION) and the second-largest is only 50, so copying is trivial. We're creating less cache friendly strings for no reason.
Interesting. We may be able to add an optimization for embeddable data if embedded `String` is faster: `rb_str_new_static()` returns a embedded `String` for embeddable data. But it's out-of-scope of this proposal. Let's discuss it separately. Do we already have an issue for this case?
There's also a small gotcha to strings being created with `rb_str_new_static` that any similar technique would inherit: they require that buf[0] to `buf[len+enc.termlen]` be readable for `str_fill_term` (documentation says len+1, but I think it's actually more for ex. utf16). That doesn't seem like a guarantee ex. `g_bytes_get_data` would provide when making a string containing the full buffer.
Good point. We should add "the buffer owned by another object must have len + enc.termlen size" as a caller's responsibility. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117247 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` The chosen key is a convention rather than a contract, and the ivar can be cleared by Ruby-level code, invalidating the pointer without any warning. It also incurs ivar table allocation overhead. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, self); ``` The returned String holds a GC reference to the `IO::Buffer`, which owns or transitively keeps alive the backing memory. The zero-copy string therefore remains valid for as long as it is reachable. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by kou (Kouhei Sutou). Eregon (Benoit Daloze) wrote in #note-12:
The returned String holds a direct GC reference to the source String, so it remains valid even after the buffer is freed.
This breaks `RSTRING_PTR()` guaranteeing to return a `\0`-terminated String, aka `SHARABLE_MIDDLE_SUBSTRING` not being enabled currently (#19315).
Good point. We should add "the buffer owned by another object must be zero-terminated" as a caller's responsibility. We may be able to remove the restriction later. Let's discuss it in #19315 not here. Let's focus on whether we should add a new API for `rb_str_new_static()` + `rb_ivar_set()` for memory efficient `String` creation or not in this issue. We have some missing features such as #19315 to use the new API widely for now. But we can improve the current situation step-by-step. If we should consider this new API after all related improvements are done, let's do it. If we can consider this new API that still has some restrictions for these listed use cases, let's discuss this now.
My feeling is `rb_str_new_static()` + `rb_ivar_set()` is good enough for this use case.
This proposal's motivation is memory efficiency. If we don't think memory overhead by `rb_ivar_set()` should not be cared, we don't need the proposed API. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117248 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` The chosen key is a convention rather than a contract, and the ivar can be cleared by Ruby-level code, invalidating the pointer without any warning. It also incurs ivar table allocation overhead. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, self); ``` The returned String holds a GC reference to the `IO::Buffer`, which owns or transitively keeps alive the backing memory. The zero-copy string therefore remains valid for as long as it is reachable. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
Issue #22056 has been updated by matz (Yukihiro Matsumoto). Thanks for the proposal. Let me clarify my position. I am not opposed to introducing this API in principle. The use case is real, and I do not think `rb_str_new_static` + ivar is a proper substitute. They have different lifetime semantics: `rb_str_new_static` is designed for storage that lives forever, while this proposal expresses a dependency on an arbitrary Ruby object whose lifetime is managed by GC. Encoding that dependency into the allocation itself is the right direction. The blocker for me is the `\0`-termination invariant of `RSTRING_PTR()` that Eregon raised. Many C extensions rely on it. This is the same underlying question as #19315, and I would like to settle the direction there before committing to a new C API. Once we have decided how strings backed by externally-owned memory should behave with respect to `RSTRING_PTR()`, I am open to this proposal. On the other concerns: - Ensuring the source buffer is not mutated should be the caller's responsibility. The API does not need to defend against it. - The `termlen` readability is worth thinking about, but I see it as future work, not a blocker. Let us continue the discussion on #19315. Matz. ---------------------------------------- Feature #22056: Add zero-copy String constructor backed by an arbitrary Ruby object https://bugs.ruby-lang.org/issues/22056#change-117299 * Author: himura467 (Akito Shitara) * Status: Open ---------------------------------------- Ruby has rich built-in functionality for working with byte sequences through `String`. Objects that manage their own byte buffers naturally want to expose their data through this interface. The straightforward approach is `rb_str_new()`, which copies the bytes: ``` c VALUE str = rb_str_new(str, len); ``` For large or frequently accessed buffers this copy is wasteful in both time and memory. One approach is to create a String that directly references the existing memory, with the GC keeping the owner alive for as long as the String is reachable. This avoids both the copy and the need for manual lifetime management. ### Existing APIs and their limitations | API | Memory behavior | | ---- | ---- | | `rb_str_new` / `rb_str_new_cstr` | Copies bytes; String owns the allocation | | `rb_str_new_shared` / `rb_str_new_frozen` | References another String's buffer; parent must be a String | | `rb_str_new_static` | References static (compile-time) storage; no lifetime management | `rb_str_new_static` avoids copying but is only safe for storage that lives forever. When memory is owned by a Ruby object, it is freed when that object is collected, and `rb_str_new_static` offers no way to express that dependency. The common workaround is to pin the owner via an instance variable: ``` c VALUE str = rb_str_new_static(str, len); rb_ivar_set(str, id_owner, owner); ``` The chosen key is a convention rather than a contract, and the ivar can be cleared by Ruby-level code, invalidating the pointer without any warning. It also incurs ivar table allocation overhead. ### Proposal Add zero-copy String constructors that accept an explicit parent object. The proposed names are tentative and open for discussion (see below): ``` c VALUE rb_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_usascii_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_utf8_str_new_external(const char *ptr, long len, VALUE parent); VALUE rb_enc_str_new_external(const char *ptr, long len, rb_encoding *enc, VALUE parent); ``` `parent` can be any live Ruby object. The GC guarantees it is not collected before the returned String is. `ptr` must point into memory whose lifetime is tied to `parent`; no copy is made. ### Use cases *IO::Buffer#get_string* `IO::Buffer.for(string)` wraps a String's bytes in a READONLY EXTERNAL buffer. `IO::Buffer#get_string` now copies those bytes into a new String. With this API, the returned String can reference the source String directly: ``` c return rb_enc_str_new_external((const char *)base + offset, length, encoding, self); ``` The returned String holds a GC reference to the `IO::Buffer`, which owns or transitively keeps alive the backing memory. The zero-copy string therefore remains valid for as long as it is reachable. *GLib::Bytes#to_s (ruby-gnome)* `GLib::Bytes` is an immutable, reference-counted byte buffer from GLib. The current [implementation](https://github.com/ruby-gnome/ruby-gnome/blob/1dad74d1a86f97e95c9d89eec33fbe...) uses the ivar workaround: ``` c VALUE str = rb_str_new_static(data, size); rb_iv_set(str, "@bytes", self); ``` With the proposed API this becomes: ``` c return rb_str_new_external(data, size, self); ``` ### Open questions *Naming* The name `rb_str_new_external` is one option. Other candidates: * `rb_str_new_owned_by` / `rb_enc_str_new_owned_by` * `rb_str_new_pinned` / `rb_enc_str_new_pinned` * `rb_str_new_with_parent` / `rb_enc_str_new_with_parent` *Memory retention* When a String referencing a small slice of a large buffer remains reachable, the entire backing object is kept alive. This is the same concern that led Java to remove the shared-backing optimization from `String.substring()` in Java 7. The risk was also raised in the context of Ruby's own lazy substring proposal (#19315, https://bugs.ruby-lang.org/issues/19315#note-7):
I heard that Java stopped the shared substring technique 10 years ago (https://www.infoq.com/news/2013/12/Oracle-Tunes-Java-String/) because of the potential for memory leaks
I don't disagree this proposal, but it would be nice if we could evaluate the effectiveness of this optimization.
Whether the same concern applies to this proposal, and whether the API should offer a way to force an independent copy, is worth discussing. ### Proof of concept A prototype implementation is at: https://github.com/ruby/ruby/pull/16834 The implementation introduces a new flag on non-embedded strings and stores the parent reference in `RString.as.heap.aux.parent`. The GC mark phase pins embedded parent strings to prevent compaction from invalidating the raw pointer stored in the zero-copy child. -- https://bugs.ruby-lang.org/
participants (6)
-
byroot (Jean Boussier) -
Eregon (Benoit Daloze) -
himura467 (Akito Shitara) -
jhawthorn (John Hawthorn) -
kou (Kouhei Sutou) -
matz (Yukihiro Matsumoto)