[ruby-core:111674] [Ruby master Feature#19314] String#bytesplice should support partial copy

Issue #19314 has been reported by shugo (Shugo Maeda).

----------------------------------------
Feature #19314: String#bytesplice should support partial copy
https://bugs.ruby-lang.org/issues/19314

* Author: shugo (Shugo Maeda)
* Status: Open
* Priority: Normal
----------------------------------------
String#bytesplice should support partial copy without temporary String objects.

For example, given `x = "0123456789"`, either of the following replaces the contents of `x` with `"0167856789"`:

```ruby
x.bytesplice(2, 3, x, 6, 3)
x.bytesplice(2..4, x, 6..8)
```

## Considerations

* What should be the return value?
  * The return value should be the whole source string for performance and consistency with `bytesplice(offset, len, s)`.
* Can the source and destination ranges overlap?
  * Yes.
* Can the source and destination lengths be different?
  * Yes.
* Can range form and offset/length form be mixed in the source and destination?
  * No.
* What should happen when any offset doesn't land on a character boundary in text strings?
  * IndexError should be raised.
* Can the length be omitted in the destination?
  * Maybe yes, but it may be confusing.

## Use cases

* [Gapped buffer implementation for text editors](https://github.com/shugo/textbringer)
* [NAT implementation](https://github.com/kazuho/rat)
  * https://twitter.com/kazuho/status/1611279616098070532

--
https://bugs.ruby-lang.org/
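The semantics described in the proposal can be emulated in plain Ruby, which may help when reading the considerations above. This is only a sketch: `partial_bytesplice` is a hypothetical helper name, not part of Ruby, and unlike the proposed method it allocates temporary String objects and performs no character-boundary checks.

```ruby
# Hypothetical helper (not part of Ruby): emulates the proposed
# bytesplice(dst_off, dst_len, src, src_off, src_len) form.
# Unlike the proposal, it allocates temporary String objects.
def partial_bytesplice(dst, dst_off, dst_len, src, src_off, src_len)
  piece = src.byteslice(src_off, src_len)   # bytes to copy in
  head = dst.byteslice(0, dst_off)          # bytes before the destination range
  tail_start = dst_off + dst_len
  tail = dst.byteslice(tail_start, dst.bytesize - tail_start) # bytes after it
  dst.replace(head + piece + tail)          # mutate dst in place
end

# Same-length copy within one string, as in the proposal's example.
x = +"0123456789"
partial_bytesplice(x, 2, 3, x, 6, 3)
p x # => "0167856789"

# Source and destination lengths may differ; the string grows here.
y = +"0123456789"
partial_bytesplice(y, 2, 1, y, 6, 3)
p y # => "016783456789"
```

Because the helper reads all three pieces before calling `replace`, it also behaves sensibly when the source and destination are the same string and their ranges overlap.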

Issue #19314 has been updated by Eregon (Benoit Daloze).

I think this is too hard for a human to read and parse, and 5 arguments seems like way too many for a core method. It feels like a full memcpy/arraycopy, which I don't think is a good idea for String in general. The implementation complexity in `[]=` and similar methods already hurts Ruby too much.

This is probably the 3rd or later workaround I have seen for the lack of proper lazy substrings in CRuby, i.e., `"abcdef"[1..3]` must not copy bytes. That is what needs to be solved (it already works in TruffleRuby). Yes, it means `RSTRING_PTR()` might need to allocate to \0-terminate, so be it, it's worth it.

So I am strongly against this: it's an nth workaround for something that is simpler to solve and much more helpful in general.

https://bugs.ruby-lang.org/issues/19314#change-101071

Issue #19314 has been updated by naruse (Yui NARUSE).

I agree that this is a workaround and that a VM should solve this as an optimization. But your proposal, lazy substrings, is not a solution, because it also creates an object, especially for small strings, which are embedded in the RVALUE.

I agree that this is memcpy/arraycopy. Therefore this proposal should add a description of how much this workaround, as a memcpy in Ruby, contributes to performance in such use cases.

https://bugs.ruby-lang.org/issues/19314#change-101075

Issue #19314 has been updated by Eregon (Benoit Daloze).

naruse (Yui NARUSE) wrote in #note-3:
> But your proposal: Lazy substrings is not a solution because it also creates an object especially for small strings which is embedded in RVALUE.

Yes, it creates a String instance reusing the same buffer. That shouldn't cost much compared to copying many bytes. It should be insignificant in a benchmark with a long string to copy/move; for a short string, performance shouldn't matter much anyway (it won't be the bottleneck of the program).

If it's still too much overhead, it sounds like allocations in CRuby need to be better optimized, or escape analysis should be implemented. Again, those two are more general, and their benefits are much wider than this one method change, which would be used by very few Ruby programs and only handles one specific case.

https://bugs.ruby-lang.org/issues/19314#change-101078

Issue #19314 has been updated by Eregon (Benoit Daloze).

Ah, something I missed though is that with lazy substrings, there would still need to be a copy of the bytes to "unshare" the string when writing to it. That copy would also be needed if the string was shared before (e.g. with `.dup`), but that's unknown in our case.

> It feels like a full memcpy/arraycopy which I don't think in general is a good idea for String.

To expand on that, I dislike it because it uses String as a byte array. If anything, such an operation should be supported on Array before String. Now that we have IO::Buffer and there is https://docs.ruby-lang.org/en/master/IO/Buffer.html#method-i-copy, why not use that?

https://bugs.ruby-lang.org/issues/19314#change-101080
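The "unshare" point in the note above can be illustrated from the Ruby side. Whether two strings share a buffer internally is an implementation detail that Ruby code cannot observe, so the following sketch only shows the visible semantics that any lazy-substring or copy-on-write scheme would have to preserve; the variable names are illustrative.

```ruby
original = +"0123456789" * 4 # long enough that an implementation might share the buffer
copy = original.dup          # may share bytes internally (implementation detail)

copy.setbyte(0, "X".ord)     # writing forces the copy to own its bytes

p original[0] # => "0"
p copy[0]     # => "X"
```

The write to `copy` must not be visible through `original`, which is why a shared buffer has to be copied at the latest when one of the strings is modified.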

Issue #19314 has been updated by Eregon (Benoit Daloze).

Eregon (Benoit Daloze) wrote in #note-5:
> Now that we have IO::Buffer and there is https://docs.ruby-lang.org/en/master/IO/Buffer.html#method-i-copy, why not use that?

So this does what you want, I believe:

```ruby
x = "0123456789"
IO::Buffer.for(x) do |buffer|
  # copy(source, offset, length, source_offset):
  # copy 3 bytes from offset 6 of the buffer to offset 2 of the same buffer
  buffer.copy(buffer, 2, 3, 6)
end
p x # => "0167856789"
```

I think there is therefore no need to change `String#bytesplice` (there is not even a need for `String#bytesplice` because of that, which [I think we shouldn't have added](https://bugs.ruby-lang.org/issues/18598#note-3)). And `IO::Buffer` seems better suited for byte-buffer-like operations.

https://bugs.ruby-lang.org/issues/19314#change-101081

Issue #19314 has been updated by naruse (Yui NARUSE).

> That shouldn't cost much compared to copying many bytes.

This proposal shows two use cases, a text editor and NAT, neither of which copies many bytes.

https://bugs.ruby-lang.org/issues/19314#change-101118

Issue #19314 has been updated by matz (Yukihiro Matsumoto).

Accepted.

Matz.

https://bugs.ruby-lang.org/issues/19314#change-101325

participants (4)
- Eregon (Benoit Daloze)
- matz (Yukihiro Matsumoto)
- naruse (Yui NARUSE)
- shugo (Shugo Maeda)