[ruby-core:125775] [Ruby Feature#22118] Introduce Basic Bit Operations into String
Issue #22118 has been reported by hasumikin (hitoshi hasumi). ---------------------------------------- Feature #22118: Introduce Basic Bit Operations into String https://bugs.ruby-lang.org/issues/22118 * Author: hasumikin (hitoshi hasumi) * Status: Open ---------------------------------------- This PR implements a subset of the specification proposed in the parent feature ticket #22082 PR URL: [placeholder to be updated] ## Methods implemented with the same specification as the parent ticket * `String#bit_at(offset, lsb_first: true) -> true | false | nil` * `String#bitwise_not -> String` * `String#bitwise_not! -> self` * `String#bitwise_and(other) -> String` * `String#bitwise_and!(other) -> self` * `String#bitwise_or(other) -> String` * `String#bitwise_or!(other) -> self` * `String#bitwise_xor(other) -> String` * `String#bitwise_xor!(other) -> self` ## Methods with range-oriented forms omitted for now The parent ticket also included range-oriented forms for some methods. In this PR, those range arguments are intentionally omitted, and the methods are limited to simpler forms. Range support can be considered as a future extension. * `String#bit_count -> Integer` * `String#bit_set(offset, lsb_first: true) -> self` * `String#bit_clear(offset, lsb_first: true) -> self` * `String#bit_flip(offset, lsb_first: true) -> self` In particular, `String#bit_count` currently takes no arguments. It does not accept a range, offset/length pair, or `lsb_first:` keyword. --- ## Remaining design questions even after narrowing the scope The overall direction of adding these methods has already been approved by Matz. The remaining questions are mostly about naming and API details. ### 1. Why the `bitwise_*` prefix is used In the parent ticket, all other methods are related to bit positions: they either take bit-position arguments, yield bit positions, or operate on a specified bit range. In contrast, the `bitwise_*` methods are whole-data operations. They apply to the entire string as a fixed-size bitmap and do not expose bit positions or bit ordering. The `bitwise_*` prefix is intended to make this distinction explicit. ### 2. Bit numbering #### `lsb_first: true` (default) Within each byte, `offset = 0` is the LSB. Numbering proceeds upward through byte[0] and then continues at the LSB of byte[1]: ```text byte[0] byte[1] offset: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 bit: b7 b6 b5 b4 b3 b2 b1 b0 b7 b6 b5 b4 b3 b2 b1 b0 ^ ^ LSB LSB offset = byte_index * 8 + bit_in_byte ``` #### `lsb_first: false` for MSB-first Byte order is preserved (`byte[0]` is still first), but within each byte numbering starts at the MSB: ```text byte[0] byte[1] offset: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 bit: b7 b6 b5 b4 b3 b2 b1 b0 b7 b6 b5 b4 b3 b2 b1 b0 ^ ^ MSB MSB offset = byte_index * 8 + (7 - bit_in_byte) ``` * Whether `lsb_first:` is needed at all, or whether an equivalent mechanism should exist under another name? * If such a mechanism is needed, whether `lsb_first:` is the right keyword name? * If `lsb_first:` is acceptable, should its default be `true`? My point is that `lsb_first:` is needed because both bit numbering conventions are used in real byte-buffer formats. LSB-first is the more common convention for general in-memory bitmap use, including formats and APIs such as Apache Arrow validity bitmaps, ext4 block bitmaps, Roaring bitmaps, Linux/BSD bitmap APIs, and hardware register descriptions. MSB-first is also needed for domains such as RFC-style network protocol diagrams, PNG low-bit-depth scanlines, BitTorrent bitfields, and some compressed bit streams. The keyword only controls bit numbering within each byte. Byte order itself is unchanged. ### 3. Out-of-range behavior `String#bit_at` returns `nil` when the bit offset is out of range. On the other hand, mutating methods such as `String#bit_set`, `String#bit_clear`, and `String#bit_flip` raise `IndexError` for an out-of-range bit offset. This follows the same general distinction as read-like access versus write-like access: reading a missing position can return `nil`, but mutating a missing position should not silently do nothing. ### 4. Negative bit offsets Negative bit offsets are rejected with `IndexError`. Although Ruby often uses negative indices to count from the end, combining negative bit offsets with the `lsb_first: true/false` mechanism would make the numbering rule harder to explain and reason about. The API keeps bit offsets as non-negative flat positions from the beginning of the string. ### 5. Length mismatch for binary bitwise operations `String#bitwise_and`, `String#bitwise_or`, and `String#bitwise_xor` require both operands to have the same byte size. If the byte sizes differ, they raise `ArgumentError`. This avoids implicit truncation or zero-padding. Since these methods operate on strings as fixed-size bitmaps, requiring equal sizes makes the operation explicit and predictable. -- https://bugs.ruby-lang.org/
Issue #22118 has been updated by hasumikin (hitoshi hasumi). Memorandom. I had a face-to-face conversation with Matz and Shugo yesterday.
Matz: each_bit should yield `0/1`, not `false/true` Me: (Oh, still, acceptable to me) Matz: I don't like the keyword `lsb_first:`. Shugo: How about using encoding or some internal flag? Me: (Yes, that's what I wanted) Matz: Hmmm...
(Matz didn't reject it at least) If it is realized, Pretty-Print could make something like this: ```ruby data = "\x01\x02\x03\x04\x05\x06" data.bit!(false) # Set the encoding (or internal flag) as BITS with MSB-first p data.bit_at(7) => 1 # Position 7 of 00000001 numbering from MSB pp data => 00000001 00000010 00000011 00000100 00000101 00000110 ``` ---------------------------------------- Feature #22118: Introduce Basic Bit Operations into String https://bugs.ruby-lang.org/issues/22118#change-117716 * Author: hasumikin (hitoshi hasumi) * Status: Open ---------------------------------------- This PR implements a subset of the specification proposed in the parent feature ticket #22082 PR URL: https://github.com/ruby/ruby/pull/17353 ## Methods implemented with the same specification as the parent ticket * `String#bit_at(offset, lsb_first: true) -> true | false | nil` * `String#bitwise_not -> String` * `String#bitwise_not! -> self` * `String#bitwise_and(other) -> String` * `String#bitwise_and!(other) -> self` * `String#bitwise_or(other) -> String` * `String#bitwise_or!(other) -> self` * `String#bitwise_xor(other) -> String` * `String#bitwise_xor!(other) -> self` ## Methods with range-oriented forms omitted for now The parent ticket also included range-oriented forms for some methods. In this PR, those range arguments are intentionally omitted, and the methods are limited to simpler forms. Range support can be considered as a future extension. * `String#bit_count -> Integer` * `String#bit_set(offset, lsb_first: true) -> self` * `String#bit_clear(offset, lsb_first: true) -> self` * `String#bit_flip(offset, lsb_first: true) -> self` In particular, `String#bit_count` currently takes no arguments. It does not accept a range, offset/length pair, or `lsb_first:` keyword. --- ## Remaining design questions even after narrowing the scope The overall direction of adding these methods has already been approved by Matz. The remaining questions are mostly about naming and API details. ### 1. Why the `bitwise_*` prefix is used In the parent ticket, all other methods are related to bit positions: they either take bit-position arguments, yield bit positions, or operate on a specified bit range. In contrast, the `bitwise_*` methods are whole-data operations. They apply to the entire string as a fixed-size bitmap and do not expose bit positions or bit ordering. The `bitwise_*` prefix is intended to make this distinction explicit. ### 2. Bit numbering #### `lsb_first: true` (default) Within each byte, `offset = 0` is the LSB. Numbering proceeds upward through byte[0] and then continues at the LSB of byte[1]: ```text byte[0] byte[1] offset: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 bit: b7 b6 b5 b4 b3 b2 b1 b0 b7 b6 b5 b4 b3 b2 b1 b0 ^ ^ LSB LSB offset = byte_index * 8 + bit_in_byte ``` #### `lsb_first: false` for MSB-first Byte order is preserved (`byte[0]` is still first), but within each byte numbering starts at the MSB: ```text byte[0] byte[1] offset: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 bit: b7 b6 b5 b4 b3 b2 b1 b0 b7 b6 b5 b4 b3 b2 b1 b0 ^ ^ MSB MSB offset = byte_index * 8 + (7 - bit_in_byte) ``` * Whether `lsb_first:` is needed at all, or whether an equivalent mechanism should exist under another name? * If such a mechanism is needed, whether `lsb_first:` is the right keyword name? * If `lsb_first:` is acceptable, should its default be `true`? My point is that `lsb_first:` is needed because both bit numbering conventions are used in real byte-buffer formats. LSB-first is the more common convention for general in-memory bitmap use, including formats and APIs such as Apache Arrow validity bitmaps, ext4 block bitmaps, Roaring bitmaps, Linux/BSD bitmap APIs, and hardware register descriptions. MSB-first is also needed for domains such as RFC-style network protocol diagrams, PNG low-bit-depth scanlines, BitTorrent bitfields, and some compressed bit streams. The keyword only controls bit numbering within each byte. Byte order itself is unchanged. ### 3. Out-of-range behavior `String#bit_at` returns `nil` when the bit offset is out of range. On the other hand, mutating methods such as `String#bit_set`, `String#bit_clear`, and `String#bit_flip` raise `IndexError` for an out-of-range bit offset. This follows the same general distinction as read-like access versus write-like access: reading a missing position can return `nil`, but mutating a missing position should not silently do nothing. ### 4. Negative bit offsets Negative bit offsets are rejected with `IndexError`. Although Ruby often uses negative indices to count from the end, combining negative bit offsets with the `lsb_first: true/false` mechanism would make the numbering rule harder to explain and reason about. The API keeps bit offsets as non-negative flat positions from the beginning of the string. ### 5. Length mismatch for binary bitwise operations `String#bitwise_and`, `String#bitwise_or`, and `String#bitwise_xor` require both operands to have the same byte size. If the byte sizes differ, they raise `ArgumentError`. This avoids implicit truncation or zero-padding. Since these methods operate on strings as fixed-size bitmaps, requiring equal sizes makes the operation explicit and predictable. -- https://bugs.ruby-lang.org/
Issue #22118 has been updated by Dan0042 (Daniel DeLorme). Introducing efficient bit operations is a great idea, but looking at the proposed methods, having to pass `lsb_first:` repeatedly across multiple bit-related methods suggests that the bit-endianness is a cross-cutting concern. I believe it would result in cleaner code to specify that configuration once at initialization of a dedicated `BitSet` class (perhaps as `String::BitSet` or a stdlib class) ```ruby bitmap = BitSet.new(backing_string, lsb_first: true) bitmap.bit_at(offset) ``` It could allocate its own backing string if omitted, similar to StringIO. ---------------------------------------- Feature #22118: Introduce Basic Bit Operations into String https://bugs.ruby-lang.org/issues/22118#change-117750 * Author: hasumikin (hitoshi hasumi) * Status: Open ---------------------------------------- This PR implements a subset of the specification proposed in the parent feature ticket #22082 PR URL: https://github.com/ruby/ruby/pull/17353 ## Methods implemented with the same specification as the parent ticket * `String#bit_at(offset, lsb_first: true) -> true | false | nil` * `String#bitwise_not -> String` * `String#bitwise_not! -> self` * `String#bitwise_and(other) -> String` * `String#bitwise_and!(other) -> self` * `String#bitwise_or(other) -> String` * `String#bitwise_or!(other) -> self` * `String#bitwise_xor(other) -> String` * `String#bitwise_xor!(other) -> self` ## Methods with range-oriented forms omitted for now The parent ticket also included range-oriented forms for some methods. In this PR, those range arguments are intentionally omitted, and the methods are limited to simpler forms. Range support can be considered as a future extension. * `String#bit_count -> Integer` * `String#bit_set(offset, lsb_first: true) -> self` * `String#bit_clear(offset, lsb_first: true) -> self` * `String#bit_flip(offset, lsb_first: true) -> self` In particular, `String#bit_count` currently takes no arguments. It does not accept a range, offset/length pair, or `lsb_first:` keyword. --- ## Remaining design questions even after narrowing the scope The overall direction of adding these methods has already been approved by Matz. The remaining questions are mostly about naming and API details. ### 1. Why the `bitwise_*` prefix is used In the parent ticket, all other methods are related to bit positions: they either take bit-position arguments, yield bit positions, or operate on a specified bit range. In contrast, the `bitwise_*` methods are whole-data operations. They apply to the entire string as a fixed-size bitmap and do not expose bit positions or bit ordering. The `bitwise_*` prefix is intended to make this distinction explicit. ### 2. Bit numbering #### `lsb_first: true` (default) Within each byte, `offset = 0` is the LSB. Numbering proceeds upward through byte[0] and then continues at the LSB of byte[1]: ```text byte[0] byte[1] offset: 7 6 5 4 3 2 1 0 15 14 13 12 11 10 9 8 bit: b7 b6 b5 b4 b3 b2 b1 b0 b7 b6 b5 b4 b3 b2 b1 b0 ^ ^ LSB LSB offset = byte_index * 8 + bit_in_byte ``` #### `lsb_first: false` for MSB-first Byte order is preserved (`byte[0]` is still first), but within each byte numbering starts at the MSB: ```text byte[0] byte[1] offset: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 bit: b7 b6 b5 b4 b3 b2 b1 b0 b7 b6 b5 b4 b3 b2 b1 b0 ^ ^ MSB MSB offset = byte_index * 8 + (7 - bit_in_byte) ``` * Whether `lsb_first:` is needed at all, or whether an equivalent mechanism should exist under another name? * If such a mechanism is needed, whether `lsb_first:` is the right keyword name? * If `lsb_first:` is acceptable, should its default be `true`? My point is that `lsb_first:` is needed because both bit numbering conventions are used in real byte-buffer formats. LSB-first is the more common convention for general in-memory bitmap use, including formats and APIs such as Apache Arrow validity bitmaps, ext4 block bitmaps, Roaring bitmaps, Linux/BSD bitmap APIs, and hardware register descriptions. MSB-first is also needed for domains such as RFC-style network protocol diagrams, PNG low-bit-depth scanlines, BitTorrent bitfields, and some compressed bit streams. The keyword only controls bit numbering within each byte. Byte order itself is unchanged. ### 3. Out-of-range behavior `String#bit_at` returns `nil` when the bit offset is out of range. On the other hand, mutating methods such as `String#bit_set`, `String#bit_clear`, and `String#bit_flip` raise `IndexError` for an out-of-range bit offset. This follows the same general distinction as read-like access versus write-like access: reading a missing position can return `nil`, but mutating a missing position should not silently do nothing. ### 4. Negative bit offsets Negative bit offsets are rejected with `IndexError`. Although Ruby often uses negative indices to count from the end, combining negative bit offsets with the `lsb_first: true/false` mechanism would make the numbering rule harder to explain and reason about. The API keeps bit offsets as non-negative flat positions from the beginning of the string. ### 5. Length mismatch for binary bitwise operations `String#bitwise_and`, `String#bitwise_or`, and `String#bitwise_xor` require both operands to have the same byte size. If the byte sizes differ, they raise `ArgumentError`. This avoids implicit truncation or zero-padding. Since these methods operate on strings as fixed-size bitmaps, requiring equal sizes makes the operation explicit and predictable. -- https://bugs.ruby-lang.org/
participants (2)
-
Dan0042 (Daniel DeLorme) -
hasumikin (hitoshi hasumi)