[ruby-core:124311] [Ruby Feature#21795] Methods for retrieving ASTs
Issue #21795 has been reported by kddnewton (Kevin Newton). ---------------------------------------- Feature #21795: Methods for retrieving ASTs https://bugs.ruby-lang.org/issues/21795 * Author: kddnewton (Kevin Newton) * Status: Open ---------------------------------------- I would like to propose a handful of methods for retrieving ASTs from various objects that correspond to locations in code. This includes: * Proc#ast * Method#ast * UnboundMethod#ast * Thread::Backtrace::Location#ast * TracePoint#ast (on call/return events) The purpose of this is to make tooling easier to write and maintain. Specifically, this would be able to be used in irb, power_assert, error_highlight, and various other tools both in core and not that make use of source code. There have been many previous discussions of retrieving node_id, source_location, source, etc. All of these use cases are covered by returning the AST for some entity. In this case node_id becomes an implementation detail, invisible to the user. Source location can be derived from the information on the AST itself. Similarly, source can be derived from the AST. Internally, I do not think we have to store any more information than we already do (since we have node_id for the first four of these, it becomes rather trivial). For TracePoint we can have a larger discussion about it, but I think it should not be too much work. In terms of implementation, the only caveat I would put is that if the ISEQ were compiled through the old parser/compiler, this should return `nil`, as the node ids do not match up and we do not want to further propagate the RubyVM::AST API. The reason I am opening up this ticket with 5 different methods requested in it is to get approval first for the direction, then I can open individual tickets or just PRs for each method. I believe this feature would ease the maintenance burden of many core libraries, and unify otherwise disparate efforts to achieve the same thing. -- https://bugs.ruby-lang.org/
Issue #21795 has been updated by mame (Yusuke Endoh). I anticipated that we would consider this eventually, but incorporating it into the core presents significant challenges. Here are two major issues regarding feasibility. (Based on chats with @ko1, @tompng, and @yui-knk, though these are my personal views.) ## The Implementation Approach CRuby currently discards source code and ASTs after ISeq generation. The proposed `#ast` method would have to re-read and re-parse the source, which causes two problems: 1. If the file is modified after loading, `#ast` may return the wrong node. 2. It does not work for `eval` strings. `error_highlight` accepts this fragility because it displays just "hints". But I don't think that it is allowed for a built-in method. At least, we must avoid returning an incorrect node, and clarify when failures occur. I propose two approaches: 1. Keep loaded source in memory (e.g., `RubyVM.keep_script_lines = true` by default). This supports `eval` but increase memory usage. 2. Validate source hash. Store a hash in the ISeq and check it to ensure the file hasn't changed. ## The Parser Switching Problem What is the node definition returned by `#ast`? As noted in #21618, built-in Prism is not exposed as a Ruby API. If `Gemfile.lock` specifies an older version of prism gem, even `require "prism"` won't provide the expected definition. IMO, it would be good to have a node definition that does not depend on prism gem (maybe `Ruby::Node`?). I am not sure how much effort is needed for this. We would also need to consider where to place what in the ruby/prism and ruby/ruby repositories for development. We also need to decide if `#ast` should return `RubyVM::AST::Node` when `--parser=parse.y` is specified. ---------------------------------------- Feature #21795: Methods for retrieving ASTs https://bugs.ruby-lang.org/issues/21795#change-115824 * Author: kddnewton (Kevin Newton) * Status: Open ---------------------------------------- I would like to propose a handful of methods for retrieving ASTs from various objects that correspond to locations in code. This includes: * Proc#ast * Method#ast * UnboundMethod#ast * Thread::Backtrace::Location#ast * TracePoint#ast (on call/return events) The purpose of this is to make tooling easier to write and maintain. Specifically, this would be able to be used in irb, power_assert, error_highlight, and various other tools both in core and not that make use of source code. There have been many previous discussions of retrieving node_id, source_location, source, etc. All of these use cases are covered by returning the AST for some entity. In this case node_id becomes an implementation detail, invisible to the user. Source location can be derived from the information on the AST itself. Similarly, source can be derived from the AST. Internally, I do not think we have to store any more information than we already do (since we have node_id for the first four of these, it becomes rather trivial). For TracePoint we can have a larger discussion about it, but I think it should not be too much work. In terms of implementation, the only caveat I would put is that if the ISEQ were compiled through the old parser/compiler, this should return `nil`, as the node ids do not match up and we do not want to further propagate the RubyVM::AST API. The reason I am opening up this ticket with 5 different methods requested in it is to get approval first for the direction, then I can open individual tickets or just PRs for each method. I believe this feature would ease the maintenance burden of many core libraries, and unify otherwise disparate efforts to achieve the same thing. -- https://bugs.ruby-lang.org/
Issue #21795 has been updated by Eregon (Benoit Daloze). One idea to make it work with `--parser=parse.y` until universal parser supports the Prism API (#21825) would be to: * Get the `RubyVM::AST::Node` of that object, then extract the start/end line & start/end columns. Or do the same internally without needing a `RubyVM::AST::Node`, it's just converting from parse.y node_id to "bounds". * With those values, use something similar to [Prism.node_for](https://github.com/ruby/prism/pull/3808) to find a node base on those bounds. * There should be a single node matching those bounds because we are only looking for specific nodes. ---------------------------------------- Feature #21795: Methods for retrieving ASTs https://bugs.ruby-lang.org/issues/21795#change-115976 * Author: kddnewton (Kevin Newton) * Status: Open ---------------------------------------- I would like to propose a handful of methods for retrieving ASTs from various objects that correspond to locations in code. This includes: * Proc#ast * Method#ast * UnboundMethod#ast * Thread::Backtrace::Location#ast * TracePoint#ast (on call/return events) The purpose of this is to make tooling easier to write and maintain. Specifically, this would be able to be used in irb, power_assert, error_highlight, and various other tools both in core and not that make use of source code. There have been many previous discussions of retrieving node_id, source_location, source, etc. All of these use cases are covered by returning the AST for some entity. In this case node_id becomes an implementation detail, invisible to the user. Source location can be derived from the information on the AST itself. Similarly, source can be derived from the AST. Internally, I do not think we have to store any more information than we already do (since we have node_id for the first four of these, it becomes rather trivial). For TracePoint we can have a larger discussion about it, but I think it should not be too much work. In terms of implementation, the only caveat I would put is that if the ISEQ were compiled through the old parser/compiler, this should return `nil`, as the node ids do not match up and we do not want to further propagate the RubyVM::AST API. The reason I am opening up this ticket with 5 different methods requested in it is to get approval first for the direction, then I can open individual tickets or just PRs for each method. I believe this feature would ease the maintenance burden of many core libraries, and unify otherwise disparate efforts to achieve the same thing. -- https://bugs.ruby-lang.org/
Issue #21795 has been updated by kddnewton (Kevin Newton). Thanks @mame for the detailed reply! I appreciate your thoughtfulness here. With regard to the implementation approach problem, I love your solution of keeping a source hash on the iseq. I think that makes a lot of sense, and could be used in error highlight today as well. That could potentially even be used by other tools in the case of code reloading. I think we _could_ potentially store the code for eval, but I would be tempted to say let us not change anything for now and return `nil` or raise an error in that case. (In the same way we would need to return `nil` or raise an error for C methods.) For the parser switching problem, I think I would like to introduce a Prism ABI version (alongside the Prism gem version). I would update this version whenever a structural change is made (field added/renamed/removed/etc.). Then, if we could store the Prism ABI version on the ISEQ as well, we could require prism and check if the ABI version matches before attempting to re-parse. We could be clear through the error message that the Prism ABI version is a mismatch and therefore we cannot re-parse. I am not sure if we should return RubyVM::AST nodes in the case the ISEQ was compiled with parse.y/compile.c, but I am okay with it if that's the direction you would like to go. ---------------------------------------- Feature #21795: Methods for retrieving ASTs https://bugs.ruby-lang.org/issues/21795#change-115986 * Author: kddnewton (Kevin Newton) * Status: Open ---------------------------------------- I would like to propose a handful of methods for retrieving ASTs from various objects that correspond to locations in code. This includes: * Proc#ast * Method#ast * UnboundMethod#ast * Thread::Backtrace::Location#ast * TracePoint#ast (on call/return events) The purpose of this is to make tooling easier to write and maintain. Specifically, this would be able to be used in irb, power_assert, error_highlight, and various other tools both in core and not that make use of source code. There have been many previous discussions of retrieving node_id, source_location, source, etc. All of these use cases are covered by returning the AST for some entity. In this case node_id becomes an implementation detail, invisible to the user. Source location can be derived from the information on the AST itself. Similarly, source can be derived from the AST. Internally, I do not think we have to store any more information than we already do (since we have node_id for the first four of these, it becomes rather trivial). For TracePoint we can have a larger discussion about it, but I think it should not be too much work. In terms of implementation, the only caveat I would put is that if the ISEQ were compiled through the old parser/compiler, this should return `nil`, as the node ids do not match up and we do not want to further propagate the RubyVM::AST API. The reason I am opening up this ticket with 5 different methods requested in it is to get approval first for the direction, then I can open individual tickets or just PRs for each method. I believe this feature would ease the maintenance burden of many core libraries, and unify otherwise disparate efforts to achieve the same thing. -- https://bugs.ruby-lang.org/
Issue #21795 has been updated by Eregon (Benoit Daloze). Rails' `_callable_to_source_string` would be a good use case for this, see https://github.com/rails/rails/pull/56624 ---------------------------------------- Feature #21795: Methods for retrieving ASTs https://bugs.ruby-lang.org/issues/21795#change-116154 * Author: kddnewton (Kevin Newton) * Status: Open ---------------------------------------- I would like to propose a handful of methods for retrieving ASTs from various objects that correspond to locations in code. This includes: * Proc#ast * Method#ast * UnboundMethod#ast * Thread::Backtrace::Location#ast * TracePoint#ast (on call/return events) The purpose of this is to make tooling easier to write and maintain. Specifically, this would be able to be used in irb, power_assert, error_highlight, and various other tools both in core and not that make use of source code. There have been many previous discussions of retrieving node_id, source_location, source, etc. All of these use cases are covered by returning the AST for some entity. In this case node_id becomes an implementation detail, invisible to the user. Source location can be derived from the information on the AST itself. Similarly, source can be derived from the AST. Internally, I do not think we have to store any more information than we already do (since we have node_id for the first four of these, it becomes rather trivial). For TracePoint we can have a larger discussion about it, but I think it should not be too much work. In terms of implementation, the only caveat I would put is that if the ISEQ were compiled through the old parser/compiler, this should return `nil`, as the node ids do not match up and we do not want to further propagate the RubyVM::AST API. The reason I am opening up this ticket with 5 different methods requested in it is to get approval first for the direction, then I can open individual tickets or just PRs for each method. I believe this feature would ease the maintenance burden of many core libraries, and unify otherwise disparate efforts to achieve the same thing. -- https://bugs.ruby-lang.org/
Issue #21795 has been updated by mame (Yusuke Endoh). Eregon (Benoit Daloze) wrote in #note-2:
As noted in #21618, built-in Prism is not exposed as a Ruby API. If `Gemfile.lock` specifies an older version of prism gem, even `require "prism"` won't provide the expected definition.
This is basically a solved problem, as discussed there. In that case, `Prism.parse(foo, version: "current")` fails with a clear exception explaining one needs to use a newer prism gem.
I believe #21618 primarily discusses released Ruby versions. My concern is specifically about the behavior on the master branch. When new syntax is introduced to the Ruby master branch, the built-in `prism.c` is updated immediately. In this scenario, if we attempt to retrieve `#ast` using the node definitions from a released prism gem, I am concerned that we will not get a correct AST due to the node definition mismatch. kddnewton (Kevin Newton) wrote in #note-4:
For the parser switching problem, I think I would like to introduce a Prism ABI version (alongside the Prism gem version). I would update this version whenever a structural change is made (field added/renamed/removed/etc.). Then, if we could store the Prism ABI version on the ISEQ as well, we could require prism and check if the ABI version matches before attempting to re-parse. We could be clear through the error message that the Prism ABI version is a mismatch and therefore we cannot re-parse.
While this is certainly a feasible solution, I don't feel it is the optimal one. I acknowledge the engineering challenges involved, but ideally, I believe having a built-in node definition (like `Ruby::Node`) within Ruby core itself would be the simplest and best approach. ---------------------------------------- Feature #21795: Methods for retrieving ASTs https://bugs.ruby-lang.org/issues/21795#change-116429 * Author: kddnewton (Kevin Newton) * Status: Open ---------------------------------------- I would like to propose a handful of methods for retrieving ASTs from various objects that correspond to locations in code. This includes: * Proc#ast * Method#ast * UnboundMethod#ast * Thread::Backtrace::Location#ast * TracePoint#ast (on call/return events) The purpose of this is to make tooling easier to write and maintain. Specifically, this would be able to be used in irb, power_assert, error_highlight, and various other tools both in core and not that make use of source code. There have been many previous discussions of retrieving node_id, source_location, source, etc. All of these use cases are covered by returning the AST for some entity. In this case node_id becomes an implementation detail, invisible to the user. Source location can be derived from the information on the AST itself. Similarly, source can be derived from the AST. Internally, I do not think we have to store any more information than we already do (since we have node_id for the first four of these, it becomes rather trivial). For TracePoint we can have a larger discussion about it, but I think it should not be too much work. In terms of implementation, the only caveat I would put is that if the ISEQ were compiled through the old parser/compiler, this should return `nil`, as the node ids do not match up and we do not want to further propagate the RubyVM::AST API. The reason I am opening up this ticket with 5 different methods requested in it is to get approval first for the direction, then I can open individual tickets or just PRs for each method. I believe this feature would ease the maintenance burden of many core libraries, and unify otherwise disparate efforts to achieve the same thing. -- https://bugs.ruby-lang.org/
Issue #21795 has been updated by kddnewton (Kevin Newton). Would a Ruby::Node be the same thing as a Prism::Node? As in, would it basically be a Ruby API that duplicates the Prism interface? I'm not sure about how to maintain it. For example, if we add more features to Prism's Ruby API (for example the work we've been doing on the translation layers to Ripper recently) would we also duplicate it to the various live branches of the Ruby::Node API? Or would it just be a trimmed down version? Either way, I'm not sure when I would recommend using Ruby::Node, because it seems like it would always be an out-of-date version of Prism::Node. ---------------------------------------- Feature #21795: Methods for retrieving ASTs https://bugs.ruby-lang.org/issues/21795#change-116443 * Author: kddnewton (Kevin Newton) * Status: Open ---------------------------------------- I would like to propose a handful of methods for retrieving ASTs from various objects that correspond to locations in code. This includes: * Proc#ast * Method#ast * UnboundMethod#ast * Thread::Backtrace::Location#ast * TracePoint#ast (on call/return events) The purpose of this is to make tooling easier to write and maintain. Specifically, this would be able to be used in irb, power_assert, error_highlight, and various other tools both in core and not that make use of source code. There have been many previous discussions of retrieving node_id, source_location, source, etc. All of these use cases are covered by returning the AST for some entity. In this case node_id becomes an implementation detail, invisible to the user. Source location can be derived from the information on the AST itself. Similarly, source can be derived from the AST. Internally, I do not think we have to store any more information than we already do (since we have node_id for the first four of these, it becomes rather trivial). For TracePoint we can have a larger discussion about it, but I think it should not be too much work. In terms of implementation, the only caveat I would put is that if the ISEQ were compiled through the old parser/compiler, this should return `nil`, as the node ids do not match up and we do not want to further propagate the RubyVM::AST API. The reason I am opening up this ticket with 5 different methods requested in it is to get approval first for the direction, then I can open individual tickets or just PRs for each method. I believe this feature would ease the maintenance burden of many core libraries, and unify otherwise disparate efforts to achieve the same thing. -- https://bugs.ruby-lang.org/
participants (3)
-
Eregon (Benoit Daloze) -
kddnewton (Kevin Newton) -
mame (Yusuke Endoh)