
Issue #20663 has been reported by javanthropus (Jeremy Bopp).

----------------------------------------
Bug #20663: Reading from IO does not recover gracefully from bad data pushed via IO#ungetc
https://bugs.ruby-lang.org/issues/20663

* Author: javanthropus (Jeremy Bopp)
* Status: Open
* Backport: 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
If bytes that result in at least 2 invalid characters for the internal encoding of an IO object are pushed into the internal buffer with IO#ungetc, reading from the stream returns invalid characters composed of both bytes from the internal buffer and converted bytes from the stream, even if the next character in the stream itself is completely valid.

``` ruby
require 'tempfile'

char_bytes = Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  # Push back the first 3 of the character's 4 UTF-16LE bytes
  f.ungetc("🍣".encode('utf-16le').b[0..-2])
  f.each_char.map(&:bytes)
end

puts char_bytes.inspect
```

The above outputs:

```
[[60, 216], [99, 60], [216, 99], [223]]
```

I expect it to output:

```
[[60, 216], [99], [60, 216, 99, 223]]
```

In other words, I expect it to first completely drain the internal character buffer, returning as many characters as necessary (invalid or otherwise), before reading from the stream and converting and returning the next character.

Interestingly, if the internal buffer holds only enough bytes for 1 invalid character, it behaves as expected:

``` ruby
require 'tempfile'

char_bytes = Tempfile.open(encoding: 'utf-8:utf-16le') do |f|
  f.write("🍣")
  f.rewind
  f.ungetc("🍣".encode('utf-16le').b[0..-3]) # <- Note the -3 here vs. the -2 earlier
  f.each_char.map(&:bytes)
end

puts char_bytes.inspect
```

This outputs:

```
[[60, 216], [60, 216, 99, 223]]
```

The first character is invalid, but returning it first clears the buffer. Then the next character is read, converted, and returned in full.

-- 
https://bugs.ruby-lang.org/
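
For reference, the byte values in the two examples can be checked without involving an IO object at all. The following is only a sketch of the byte layout; the variable names are illustrative and not part of the report.

``` ruby
# "🍣" (U+1F363) is a surrogate pair in UTF-16LE, i.e. 4 bytes.
full = "🍣".encode('utf-16le')
full_bytes = full.bytes

# First example: drop the last byte, leaving 3 bytes
# (one lone high surrogate plus one stray byte).
three_byte_prefix = full.b[0..-2].bytes

# Second example: drop the last two bytes, leaving 2 bytes
# (one lone high surrogate only).
two_byte_prefix = full.b[0..-3].bytes

puts full_bytes.inspect         # [60, 216, 99, 223]
puts three_byte_prefix.inspect  # [60, 216, 99]
puts two_byte_prefix.inspect    # [60, 216]
```

The 3-byte prefix is what leaves two incomplete units in the buffer, which matches the `[60, 216], [99]` pair at the start of the expected output in the first example.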