[ruby-core:114495] [Ruby master Bug#19848] Ripper BOM behavior

Issue #19848 has been reported by kddnewton (Kevin Newton).

----------------------------------------
Bug #19848: Ripper BOM behavior
https://bugs.ruby-lang.org/issues/19848

* Author: kddnewton (Kevin Newton)
* Status: Open
* Priority: Normal
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
When there is a byte-order mark in a file, the first token in the file usually begins at column -3. For example:

```ruby
Ripper.lex("\xEF\xBB\xBF[]") # => [[[1, -3], :on_lbracket, "[", BEG|LABEL], [[1, 1], :on_rbracket, "]", END]]
```

The rest of the tokens appear as if the byte-order mark never existed. This is consistent except when the file starts with a global variable, an instance variable, or a class variable. In those cases the first token begins at column 0. For example:

```ruby
Ripper.lex("\xEF\xBB\xBF@foo") # => [[[1, 0], :on_ivar, "@foo", END]]
Ripper.lex("\xEF\xBB\xBF@@foo") # => [[[1, 0], :on_cvar, "@@foo", END]]
Ripper.lex("\xEF\xBB\xBF$foo") # => [[[1, 0], :on_gvar, "$foo", END]]
```

Additionally, when there is a byte-order mark, it usually does not appear as part of the first token, unless the token is a magic encoding comment. If it's a magic encoding comment, then it's part of the value:

```ruby
Ripper.lex("\xEF\xBB\xBF# encoding: us-ascii") # => [[[1, -3], :on_comment, "\xEF\xBB\xBF# encoding: us-ascii", BEG]]
```

As for solutions: when there is a byte-order mark, I think the column information should either always start at 0 or always start at -3. Then, for the encoding comment, the byte-order mark should probably not show up as part of the value, or it should show up for all comments.

--
https://bugs.ruby-lang.org/
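Until the column behavior is settled, a caller that needs consistent positions could strip the byte-order mark before lexing. The following is only a sketch of that workaround, not the fix proposed in this issue; `UTF8_BOM` and `lex_without_bom` are hypothetical names introduced here, and it assumes the input uses a UTF-8 byte-order mark.

```ruby
require "ripper"

# Hypothetical constant: the three bytes of a UTF-8 byte-order mark.
UTF8_BOM = "\xEF\xBB\xBF".b.freeze

# Hypothetical helper: drop a leading UTF-8 BOM (if present) and lex the rest.
# The comparison is done on binary copies so it works regardless of the
# source string's encoding; the original encoding is restored afterwards.
def lex_without_bom(source)
  stripped = source.b.delete_prefix(UTF8_BOM).force_encoding(source.encoding)
  Ripper.lex(stripped)
end

tokens = lex_without_bom("\xEF\xBB\xBF[]")
p tokens.map { |token| token[0] } # => [[1, 0], [1, 1]]
```

Because the remaining tokens are already reported as if the byte-order mark did not exist, stripping it should only affect the first token's position. Whether dropping the mark is acceptable depends on the caller, for example if the original bytes must be reproduced later.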

Issue #19848 has been updated by kddnewton (Kevin Newton).

Apologies, I think I was wrong about the last part: it's part of the string, but it doesn't show up on inspect. So this is just about the column information, then.

----------------------------------------
Bug #19848: Ripper BOM behavior
https://bugs.ruby-lang.org/issues/19848#change-104280
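Since the follow-up above points out that `inspect` can hide whether the byte-order mark is part of a token's string, checking the raw bytes is a more direct test. This is a minimal sketch; `UTF8_BOM_BYTES` and `starts_with_bom?` are hypothetical names introduced here, and the printed results are deliberately not listed because they may differ between Ruby versions.

```ruby
require "ripper"

# Hypothetical constant: the UTF-8 byte-order mark as integer byte values.
UTF8_BOM_BYTES = [0xEF, 0xBB, 0xBF].freeze

# Hypothetical helper: true when the string's raw bytes begin with the BOM.
def starts_with_bom?(string)
  string.bytes.first(3) == UTF8_BOM_BYTES
end

# Compare what inspect shows with what the bytes of the first token contain.
first_value = Ripper.lex("\xEF\xBB\xBF[]").first[2]
puts "inspect: #{first_value.inspect}"
puts "bytes:   #{first_value.bytes.inspect}"
puts "BOM in value? #{starts_with_bom?(first_value)}"
```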

Issue #19848 has been updated by nobu (Nobuyoshi Nakada).

https://github.com/ruby/ruby/pull/8281

----------------------------------------
Bug #19848: Ripper BOM behavior
https://bugs.ruby-lang.org/issues/19848#change-104292
participants (2)
- kddnewton (Kevin Newton)
- nobu (Nobuyoshi Nakada)