Issue #19090 has been updated by k0kubun (Takashi Kokubun).
I get what you're saying. My position on this issue is:
* `CGI` is not a good place either unless you're writing a CGI application. ERB also has `ERB::Escape` now, and I'd say "embedded Ruby escape" is a better module name than "CGI" for the purpose.
* ERB 4.0.0 was shipped with a new file `erb/util.rb`, which allows you to load only a couple of escape methods, not loading an extra template engine.
* The way we defined the method was designed by @jeremyevans0 for Erubi. Loading the ERB template engine would be the last thing the Erubi maintainer would like to do. So, thanks to his idea, it's possible to load only escape methods.
* `ERB::Util.html_escape` is monkey-patched by Rails and it's been the only official `html_safe`-compatible HTML escape method for years (while it's been using Erubis or Erubi). It's the only Rails-official way to do it, and moving it to somewhere else would be unnecessarily disruptive.
----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-100283
* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.
More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.
## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.
## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.
The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.
[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.
The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.
## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:
* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
* We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.
## Why not a new method?
It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.
--
https://bugs.ruby-lang.org/
Issue #19090 has been updated by Eregon (Benoit Daloze).
Right, I forgot CGI is a default gem too.
I think it seems cleaner for other template engines (e.g. haml, slim, etc) to depend (as in require "cgi") on CGI vs depending on ERB, i.e. CGI feels smaller/like a subset of ERB. In other words, it doesn't seem ideal for other template engines to depend on the ERB template engine just for escaping (although practically speaking there is no issue, just from a design perspective).
----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-100282
* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.
More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.
## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.
## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.
The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.
[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.
The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.
## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:
* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
* We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.
## Why not a new method?
It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.
--
https://bugs.ruby-lang.org/
Issue #18899 has been updated by javanthropus (Jeremy Bopp).
Please also see #18995 for another example of the intricate implementation behaving unexpectedly. During my own investigation, I discovered that using `"-"` for the internal encoding name is silently ignored. According to the comments in the code, `"-"` is used to indicate no conversion, but it's completely undocumented for the method. If you use `"-"` for the external encoding name, you get similarly divergent behavior as reported for this issue if you pass `"-:utf-8"` vs. `"-"`, `"utf-8"`.
----------------------------------------
Bug #18899: Inconsistent argument handling in IO#set_encoding
https://bugs.ruby-lang.org/issues/18899#change-100280
* Author: javanthropus (Jeremy Bopp)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
`IO#set_encoding` behaves differently when processing a single String argument than it does when processing 2 arguments (whether Strings or Encodings) in the case where the external encoding is being set to binary and the internal encoding is being set to any other encoding.
This script demonstrates the resulting values of the external and internal encodings for an IO instance given different ways to equivalently call `#set_encoding`:
```ruby
#!/usr/bin/env ruby
def show(io, args)
printf(
"args: %-50s external encoding: %-25s internal encoding: %-25s\n",
args.inspect,
io.external_encoding.inspect,
io.internal_encoding.inspect
)
end
File.open('/dev/null') do |f|
args = ['binary:utf-8']
f.set_encoding(*args)
show(f, args)
args = ['binary', 'utf-8']
f.set_encoding(*args)
show(f, args)
args = [Encoding.find('binary'), Encoding.find('utf-8')]
f.set_encoding(*args)
show(f, args)
end
```
This behavior is the same from Ruby 2.7.0 to 3.1.2.
--
https://bugs.ruby-lang.org/
Issue #19090 has been updated by k0kubun (Takashi Kokubun).
I filed a PR for truffleruby https://github.com/ruby/erb/pull/39.
> I think it is important to keep the optimized HTML escape in core/stdlib (e.g., in CGI)
I didn't get what you mean in this part. CGI and ERB are both default gems. There's no distinction in its "coreness" between them.
----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-100279
* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.
More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.
## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.
## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.
The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.
[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.
The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.
## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:
* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
* We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.
## Why not a new method?
It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.
--
https://bugs.ruby-lang.org/
Issue #19102 has been updated by k0kubun (Takashi Kokubun).
Would you mind reviewing https://github.com/ruby/erb/pull/39? I skipped this ticket's change for truffleruby in the PR.
----------------------------------------
Feature #19102: Optimize ERB::Util.html_escape more than CGI.escapeHTML for template engines
https://bugs.ruby-lang.org/issues/19102#change-100278
* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Change the behavior of `ERB::Util.html_escape` in the following two parts:
1. Skip converting an argument with `#to_s` if the argument is already a `T_STRING`.
2. Do not allocate and return a new String when nothing needs to be escaped.
## Background
The current `ERB::Util.html_escape` is implemented as `CGI.escapeHTML(s.to_s)`. So the performance is almost equal to `CGI.escapeHTML` except for the `to_s` call. Because it's common to embed non-String expressions in template engines, a template engine typically calls `to_s` to convert non-String expressions to a String before escaping it, which is why the difference exists. Proposal (1) is useful for optimizing the case that something that's already a `String` is embedded. We ignore the extreme case that `String#to_s` is weirdly monkey-patched.
As to proposal (2), my original implementation of `CGI.escapeHTML` https://github.com/ruby/ruby/pull/1164 was not calling `rb_str_dup` for that case. However, because [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility with the old `gsub`-based implementation, we added the unneeded `rb_str_dup` call and the performance for that case has been compromised. This behavior is completely unnecessary for template engines. On the other hand, because `ERB::Util.html_escape` is a helper for ERB, we should not need to consider any backward compatibility that is not relevant to ERB or any template engines. So proposal (2) should be possible in `ERB::Util.html_escape` unlike `CGI.escapeHTML`.
## Benchmark
Implementation: https://github.com/ruby/erb/pull/27
```rb
require 'benchmark/ips'
require 'erb'
class << ERB::Util
def html_escape_old(s)
CGI.escapeHTML(s.to_s)
end
end
Benchmark.ips do |x|
s = 'hello world'
x.report('before') { ERB::Util.html_escape_old(s) }
x.report('after') { ERB::Util.html_escape(s) }
x.compare!
end
```
```
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
Warming up --------------------------------------
before 1.066M i/100ms
after 1.879M i/100ms
Calculating -------------------------------------
before 10.615M (± 0.3%) i/s - 53.320M in 5.023083s
after 18.742M (± 0.4%) i/s - 93.929M in 5.011847s
Comparison:
after: 18741747.6 i/s
before: 10615137.1 i/s - 1.77x (± 0.00) slower
```
--
https://bugs.ruby-lang.org/
Issue #19090 has been updated by Eregon (Benoit Daloze).
I did not expect `rb_str_dup()` is so costly on CRuby, I guess the allocation is slow and of course CRuby can't escape-analyze it.
I think it is important to keep the optimized HTML escape in core/stdlib (e.g., in CGI), as the fastest way is implementation-dependent.
See https://bugs.ruby-lang.org/issues/19102#note-4
I think we should add an optional argument to avoid copying the string when unchanged, that's easy and avoids yet another place/method with that escaping logic.
----------------------------------------
Feature #19090: Do not duplicate an unescaped string in CGI.escapeHTML
https://bugs.ruby-lang.org/issues/19090#change-100276
* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Stop guaranteeing that `GGI.escapeHTML` returns a new string even if there's nothing to be escaped.
More specifically, stop calling this `rb_str_dup` https://github.com/ruby/cgi/blob/v0.3.3/ext/cgi/escape/escape.c#L72 for the case that nothing needs to be escaped.
## Background
My original implementation https://github.com/ruby/ruby/pull/1164 was not calling it. The reason why `rb_str_dup` was added was that [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility because the original `gsub`-based implementation always returns a new object. As a result, even while many people use `CGI.escapeHTML` as an optimized implementation for escaping HTML today, it ended up having a compromised performance.
## Motivation
The motivation is to improve performance. By just doing so, escaping a pre-allocated `"string"` becomes 1.34x faster on my machine https://gist.github.com/k0kubun/f66d6fe1e6ba821e4263257e504ba28f.
The most major use case of `CGP.escapeHTML` is to safely embed a user input. When the result is just embedded in another string, the allocated new object will be just wasted. It's pretty common that an embedded string fragment doesn't contain any of `'"&<>` characters. So we should stop wasting that to optimize that case.
[Bug #11858] wasn't really a use case but just "I think this is backward incompatibility" based on frozen Hello World. Unlike user input, you usually don't need to escape your own string literal. It feels like the ticket addressed a problem that doesn't exist in actual applications. It should have cited existing code that could be broken by that, and I can't find such code with `gem-codesearch` today.
The only reason to maintain the current behavior would be to allow using a return value of `CGI.escapeHTML` as a buffer for creating another longer string starting with the escaped value, but using `CGI.escapeHTML` to initialize a string buffer feels like an abuse. Relying on the behavior never makes sense as an "optimization" either because it makes all other cases (the result is not used as a string buffer) suboptimal.
## Why not an optional flag like `CGI.escapeHTML(str, dup: false)`?
Two reasons:
* The non-dup behavior should be used 99.999..9% of the time. We shouldn't make code using `CGI.escapeHTML` less readable just for maintaining a use case that doesn't exist.
* Passing keyword arguments to a C extension is unfortunately slow, and it defeats the optimization purpose. In core classes, we could use `Primitive` to address that, but this is a default gem and we can't use that.
* We could workaround that if we choose `CGI.escapeHTML(str, false)`, but again it'd spoil the readability for maintaining an invalid use case.
## Why not a new method?
It's a good idea actually, but with `escapeHTML`, `escape_html`, and `h` aliased to it already, I can't think of a good name for it. And again, not calling it `escapeHTML` or `escape_html` would spoil the readability for no valid reason.
--
https://bugs.ruby-lang.org/
Issue #19102 has been updated by Eregon (Benoit Daloze).
I think it is unfortunate to add a C extension for ERB for that, ERB was always pure-Ruby and that was nice.
Also the C extension is slower on TruffleRuby, the Regexp is actually JIT-compiled and can use vectorization, unlike that C code. Also part of it is RSTRING_PTR() basically forces a copy from managed memory (byte[]) to native memory (char*) on TruffleRuby.
```
truffleruby 23.0.0-dev-57e53f8a, like ruby 3.1.2, GraalVM CE Native [x86_64-linux]
CGI.escapeHTML 31.985M (± 1.2%) i/s - 160.093M in 5.006001s
ERB::Util.html_escape 7.427M (± 3.3%) i/s - 37.162M in 5.009721s
```
and CRuby 3.1 is:
```
ERB::Util.html_escape 14.551M (± 0.8%) i/s - 73.308M in 5.038335s
CGI.escapeHTML 10.065M (± 0.6%) i/s - 51.054M in 5.072629s
```
Given those results, could you build the C extension only for CRuby?
I think it would also be much nicer to keep the optimized HTML escape in CGI which is in stdlib, so it can be used by all templates engines.
In #19090 I did not expect `rb_str_dup()` is so costly on CRuby, I guess the allocation is slow and of course CRuby can't escape-analyze it.
A new method in CGI sounds best, or probably nicer an optional argument whether to always return a copy.
It seems tricky to change the existing method to return a mutable string as-is, I guess the person who found #11858 actually in that incompatibility.
Ruby users seem to assume in general if a core/stdlib method returns a string either it's frozen or they are free to modify it, but here it would actually also mutate the original String passed to `CGI.escapeHTML`.
`String#to_s` really sounds like something every Ruby JIT/VM should be able to trivially optimize.
So that I think should be an insignificant cost, even on CRuby (and if it isn't it should be easy to fix).
----------------------------------------
Feature #19102: Optimize ERB::Util.html_escape more than CGI.escapeHTML for template engines
https://bugs.ruby-lang.org/issues/19102#change-100275
* Author: k0kubun (Takashi Kokubun)
* Status: Closed
* Priority: Normal
----------------------------------------
## Proposal
Change the behavior of `ERB::Util.html_escape` in the following two parts:
1. Skip converting an argument with `#to_s` if the argument is already a `T_STRING`.
2. Do not allocate and return a new String when nothing needs to be escaped.
## Background
The current `ERB::Util.html_escape` is implemented as `CGI.escapeHTML(s.to_s)`. So the performance is almost equal to `CGI.escapeHTML` except for the `to_s` call. Because it's common to embed non-String expressions in template engines, a template engine typically calls `to_s` to convert non-String expressions to a String before escaping it, which is why the difference exists. Proposal (1) is useful for optimizing the case that something that's already a `String` is embedded. We ignore the extreme case that `String#to_s` is weirdly monkey-patched.
As to proposal (2), my original implementation of `CGI.escapeHTML` https://github.com/ruby/ruby/pull/1164 was not calling `rb_str_dup` for that case. However, because [Bug #11858] claimed returning the argument object for non-escaped cases is a backward incompatibility with the old `gsub`-based implementation, we added the unneeded `rb_str_dup` call and the performance for that case has been compromised. This behavior is completely unnecessary for template engines. On the other hand, because `ERB::Util.html_escape` is a helper for ERB, we should not need to consider any backward compatibility that is not relevant to ERB or any template engines. So proposal (2) should be possible in `ERB::Util.html_escape` unlike `CGI.escapeHTML`.
## Benchmark
Implementation: https://github.com/ruby/erb/pull/27
```rb
require 'benchmark/ips'
require 'erb'
class << ERB::Util
def html_escape_old(s)
CGI.escapeHTML(s.to_s)
end
end
Benchmark.ips do |x|
s = 'hello world'
x.report('before') { ERB::Util.html_escape_old(s) }
x.report('after') { ERB::Util.html_escape(s) }
x.compare!
end
```
```
ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
Warming up --------------------------------------
before 1.066M i/100ms
after 1.879M i/100ms
Calculating -------------------------------------
before 10.615M (± 0.3%) i/s - 53.320M in 5.023083s
after 18.742M (± 0.4%) i/s - 93.929M in 5.011847s
Comparison:
after: 18741747.6 i/s
before: 10615137.1 i/s - 1.77x (± 0.00) slower
```
--
https://bugs.ruby-lang.org/
Issue #18899 has been updated by Eregon (Benoit Daloze).
I've taken a look in `IO#set_encoding` recently and it's such an unreadable mess, I think nobody would be able to explain its full semantics.
So anything to simplify it would IMHO be welcome.
I think `IO#set_encoding` should simply set the internal/external encodings for that IO, with no special cases and not caring about the default external/internal encodings.
If some cases don't make any sense they should raise an exception.
----------------------------------------
Bug #18899: Inconsistent argument handling in IO#set_encoding
https://bugs.ruby-lang.org/issues/18899#change-100274
* Author: javanthropus (Jeremy Bopp)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.1.2p20 (2022-04-12 revision 4491bb740a) [x86_64-linux]
* Backport: 2.7: UNKNOWN, 3.0: UNKNOWN, 3.1: UNKNOWN
----------------------------------------
`IO#set_encoding` behaves differently when processing a single String argument than it does when processing 2 arguments (whether Strings or Encodings) in the case where the external encoding is being set to binary and the internal encoding is being set to any other encoding.
This script demonstrates the resulting values of the external and internal encodings for an IO instance given different ways to equivalently call `#set_encoding`:
```ruby
#!/usr/bin/env ruby
def show(io, args)
printf(
"args: %-50s external encoding: %-25s internal encoding: %-25s\n",
args.inspect,
io.external_encoding.inspect,
io.internal_encoding.inspect
)
end
File.open('/dev/null') do |f|
args = ['binary:utf-8']
f.set_encoding(*args)
show(f, args)
args = ['binary', 'utf-8']
f.set_encoding(*args)
show(f, args)
args = [Encoding.find('binary'), Encoding.find('utf-8')]
f.set_encoding(*args)
show(f, args)
end
```
This behavior is the same from Ruby 2.7.0 to 3.1.2.
--
https://bugs.ruby-lang.org/
Issue #19138 has been updated by Eregon (Benoit Daloze).
schneems (Richard Schneeman) wrote in #note-6:
> Instead of adding #line though if we could attach the source that would be more impactful for syntax search.
>
> Some cases such as eval do not have source files, so if we could access the contents for casses without a path that would increase the capabilities of the gem.
Strongly agreed. We should have a way to get the original source code/text from an exception or a Method/UnboundMethod object, or any core object which has a `source_location`.
The current way trying to reread from disk is quite brittle and incomplete, as seen for `error_highlight` by @mame.
TruffleRuby already keeps this information, it's only a matter of exposing it through some method.
That should be a separate feature request though.
----------------------------------------
Feature #19138: `SyntaxError#path` for syntax_suggest
https://bugs.ruby-lang.org/issues/19138#change-100273
* Author: nobu (Nobuyoshi Nakada)
* Status: Open
* Priority: Normal
----------------------------------------
Currently syntax_suggest searches the path name from the exception message.
But extracting the info from messages for humans is fragile, I think.
So proposing a new method `SyntaxError#path`, similar to `LoadError#path`.
```patch
commit 986da132002af1cdb75c0c89ca2831fe51e6ce69
Author: Nobuyoshi Nakada <nobu(a)ruby-lang.org>
AuthorDate: 2022-11-20 22:59:52 +0900
Commit: Nobuyoshi Nakada <nobu(a)ruby-lang.org>
CommitDate: 2022-11-20 23:44:27 +0900
Add `SyntaxError#path`
diff --git a/error.c b/error.c
index 0ff4b8d6d8e..ad1bc6ee8dc 100644
--- a/error.c
+++ b/error.c
@@ -125,6 +125,8 @@ err_vcatf(VALUE str, const char *pre, const char *file, int line,
return str;
}
+static VALUE syntax_error_with_path(VALUE, VALUE, VALUE*, rb_encoding*);
+
VALUE
rb_syntax_error_append(VALUE exc, VALUE file, int line, int column,
rb_encoding *enc, const char *fmt, va_list args)
@@ -138,15 +140,7 @@ rb_syntax_error_append(VALUE exc, VALUE file, int line, int column,
}
else {
VALUE mesg;
- if (NIL_P(exc)) {
- mesg = rb_enc_str_new(0, 0, enc);
- exc = rb_class_new_instance(1, &mesg, rb_eSyntaxError);
- }
- else {
- mesg = rb_attr_get(exc, idMesg);
- if (RSTRING_LEN(mesg) > 0 && *(RSTRING_END(mesg)-1) != '\n')
- rb_str_cat_cstr(mesg, "\n");
- }
+ exc = syntax_error_with_path(exc, file, &mesg, enc);
err_vcatf(mesg, NULL, fn, line, fmt, args);
}
@@ -2353,6 +2347,25 @@ syntax_error_initialize(int argc, VALUE *argv, VALUE self)
return rb_call_super(argc, argv);
}
+static VALUE
+syntax_error_with_path(VALUE exc, VALUE path, VALUE *mesg, rb_encoding *enc)
+{
+ if (NIL_P(exc)) {
+ *mesg = rb_enc_str_new(0, 0, enc);
+ exc = rb_class_new_instance(1, mesg, rb_eSyntaxError);
+ rb_ivar_set(exc, id_i_path, path);
+ }
+ else {
+ if (rb_attr_get(exc, id_i_path) != path) {
+ rb_raise(rb_eArgError, "SyntaxError#path changed");
+ }
+ VALUE s = *mesg = rb_attr_get(exc, idMesg);
+ if (RSTRING_LEN(s) > 0 && *(RSTRING_END(s)-1) != '\n')
+ rb_str_cat_cstr(s, "\n");
+ }
+ return exc;
+}
+
/*
* Document-module: Errno
*
@@ -3011,9 +3024,14 @@ Init_Exception(void)
rb_eSyntaxError = rb_define_class("SyntaxError", rb_eScriptError);
rb_define_method(rb_eSyntaxError, "initialize", syntax_error_initialize, -1);
+ ID id_path = rb_intern_const("path");
+
+ /* the path failed to parse */
+ rb_attr(rb_eSyntaxError, id_path, TRUE, FALSE, FALSE);
+
rb_eLoadError = rb_define_class("LoadError", rb_eScriptError);
/* the path failed to load */
- rb_attr(rb_eLoadError, rb_intern_const("path"), TRUE, FALSE, FALSE);
+ rb_attr(rb_eLoadError, id_path, TRUE, FALSE, FALSE);
rb_eNotImpError = rb_define_class("NotImplementedError", rb_eScriptError);
```
With this method, syntax_suggest/core_ext.rb will no longer need `PathnameFromMessage`.
```patch
diff --git i/lib/syntax_suggest/core_ext.rb w/lib/syntax_suggest/core_ext.rb
index 40f5fe13759..616a6ed9839 100644
--- i/lib/syntax_suggest/core_ext.rb
+++ w/lib/syntax_suggest/core_ext.rb
@@ -25,15 +25,12 @@
require "syntax_suggest/api" unless defined?(SyntaxSuggest::DEFAULT_VALUE)
message = super
- file = if highlight
- SyntaxSuggest::PathnameFromMessage.new(super(highlight: false, **kwargs)).call.name
- else
- SyntaxSuggest::PathnameFromMessage.new(message).call.name
- end
-
- io = SyntaxSuggest::MiniStringIO.new
+ file = path
if file
+ file = Pathname.new(file)
+ io = SyntaxSuggest::MiniStringIO.new
+
SyntaxSuggest.call(
io: io,
source: file.read,
```
Since we have not released with `SyntaxError#detailed_message` yet, there should not be a compatibility issue.
@schneems How do you think?
--
https://bugs.ruby-lang.org/
Issue #19142 has been updated by nobu (Nobuyoshi Nakada).
I think that `test` needs `install` only when "preloadenv" and `--enable-load-relative` can't work.
At least `test-all` can run as `/install/bin/ruby test/runner.rb`, I think.
----------------------------------------
Misc #19142: Run test suites against 'install', based on ENV variable?
https://bugs.ruby-lang.org/issues/19142#change-100270
* Author: MSP-Greg (Greg L)
* Status: Open
* Priority: Normal
----------------------------------------
Some time ago I believe there was discussion about:
```
make
make test
make install
```
vs
```
make
make install
make test
```
Some people preferred to not have to run install before test (disk space, time, etc).
Might an environment variable like `RUBY_TEST_FROM_INSTALL` or `RUBY_TEST_FROM_PATH` be
added that would trigger the test suite to test against the Ruby install folder or the
Ruby in PATH?
I believe a handful of files in the test system would need changes, as they load files
using `require_relative`...
I think spec/mspec is already setup to do so.
--
https://bugs.ruby-lang.org/