Issue #21950 has been updated by Eregon (Benoit Daloze).

I think this would be good, and I agree it would avoid several gems reimplementing this in brittle, non-portable ways (e.g. working only on CRuby). See https://github.com/truffleruby/truffleruby/issues/2044#issuecomment-65484832... for a discussion of how to integrate StackProf and TruffleRuby; such an API would help there.

The API should not be under `RubyVM`: that namespace is CRuby-specific and cannot be supported on other Rubies, since gems do `if defined?(RubyVM)` and then assume other things like `RubyVM::InstructionSequence` exist, and it is generally a bad idea to put anything there. It could be `Ruby::Profiler`, which seems the most obvious fit, or `Ruby::SamplingProfiler` to be extra clear.

I think a high-level API like `start`/`stop`, plus being able to specify the sampling frequency (which would still be capped at a maximum frequency to avoid too much overhead), would be good. It's also similar to the StackProf API, which is widely used.

And then a way to retrieve the results. It could be Ruby objects representing backtraces, maybe simply an Array of Array of Thread::Backtrace::Location, where the Strings just represent the method/block name? As a note, the `rb_profile_frames()` API very much returns things similar to Thread::Backtrace::Location: https://docs.ruby-lang.org/capi/en/master/db/d16/debug_8h.html

Some already-serialized format like pprof is also possible, but then it requires every user of this new API to be able to parse that, so it seems less convenient. It could be more efficient/compact, though, due to fewer allocations or a more efficient representation for the same method.

headius (Charles Nutter) wrote in #note-1:
> Any CPU timing in Ruby must also consider JIT improvements over time, and be able to decode things like inlined method calls so that the reported execution time is associated with the correct body of code. Many of the CPU profiling tools for the JVM either disable aggressive optimizations (giving a poor view of optimized code CPU time) or indirectly break those optimizations (by triggering more safepoints and more stack traces that influence inlining heuristics).
>
> Also worth pointing out the ongoing efforts by the JIT team to eliminate artificial stack frames for leaf methods and some core method calls. As more artificial Ruby frames get elided, it will become harder to reconstruct the stack in a profiler. You can of course force those frames to be emitted, but then we're back to poisoning the profile.
This proposal is about Ruby-level profiling, not Java/C-level profiling; conceptually it's like collecting data equivalent to `caller_locations` efficiently every X milliseconds. `caller_locations` must be correct with or without JIT, so I think this is not a concern: this needs to be figured out by the JITs regardless of this API.

headius (Charles Nutter) wrote in #note-1:
> JVM-based Ruby implementations generally won't support this
This is not a concern at least for TruffleRuby, because this is Ruby-level profiling, not Java-level profiling. In fact, Truffle [CPUSampler](https://github.com/oracle/graal/blob/master/tools/src/com.oracle.truffle.too...) already provides such an API; it's just not exposed as a Ruby API yet.

----------------------------------------
Feature #21950: Add a built-in CPU-time profiler
https://bugs.ruby-lang.org/issues/21950#change-116669

* Author: osyoyu (Daisuke Aritomo)
* Status: Open
----------------------------------------
Modern CRuby workloads can consume CPU concurrently across multiple native threads, especially with multiple Ractors and C extensions which release the GVL. I'd like to propose the idea of integrating a built-in CPU-time profiler into CRuby to enable more accurate and stable profiling in such situations.

## Motivation & Background

CPU profilers indicate how much CPU time was consumed by different methods. Most CPU-time profilers rely on the kernel to track consumed CPU time: `setitimer(2)` and `timer_create(2)` are APIs to configure kernel timers, and the process receives a profiling signal (SIGPROF) for every given amount of CPU time consumed (e.g. 10 ms).

In general, a profiler needs to know which _thread_ consumed how much CPU time in order to attribute work to the correct thread. This wasn't a real requirement for Ruby in the pre-Ractor age, since only one native thread could consume CPU time at any given moment due to GVL limitations. Using the process-wide CPU-time timers provided by `setitimer()` effectively did the job: it was safe to assume that the active thread was doing some work using the CPU when a profiling signal arrived.

Of course, this assumption does not hold in all situations. One is the case where C extensions release the GVL. Another is the multi-Ractor situation. In these cases, multiple native threads may simultaneously consume CPU time. Linux 2.6.12+ provides a per-thread timer to address this task.
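As a side note on the distinction that drives all of the above: Ruby itself can already read the *calling* thread's CPU clock via `Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)` (on platforms that provide that clock). A rough sketch, purely for illustration and not part of the proposal, contrasting per-thread CPU time with wall-clock time:

```ruby
# Illustration only: compare the calling thread's CPU time against
# wall-clock time around a block. CLOCK_THREAD_CPUTIME_ID counts only
# CPU actually consumed by this thread, not time spent sleeping/blocked.
def cpu_vs_wall
  cpu0  = Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID)
  wall0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  {
    cpu:  Process.clock_gettime(Process::CLOCK_THREAD_CPUTIME_ID) - cpu0,
    wall: Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall0,
  }
end

# Sleeping burns wall time but almost no CPU time...
idle = cpu_vs_wall { sleep 0.1 }
# ...while a busy loop burns both.
busy = cpu_vs_wall { 1_000_000.times { |i| i * i } }
```

What pure Ruby cannot do, however, is arm a SIGPROF-style kernel timer against *another* thread's CPU clock, which is why the profilers discussed here have to reach into CRuby internals.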
Profilers such as [Pf2](https://github.com/osyoyu/pf2) and [dd-trace-rb](https://github.com/DataDog/dd-trace-rb) use this feature to keep track of CPU time. However, utilizing this API requires information that CRuby does not necessarily expose. Both carry copies of CRuby headers in order to access internal structs such as `rb_thread_t`. This is quite fragile and possibly unsustainable in an age where CRuby evolves towards Ractors and M:N threading.

## Proposal

Implement a built-in CPU-time profiler as `ext/profile`, and add some information extraction points to the VM built exclusively for it.

```ruby
require 'profile'

RubyVM::Profile.start
# your code here
RubyVM::Profile.stop #=> results
```

`ext/profile` will take care of the following in coordination with some VM helpers:

- Tracking creation and destruction of native threads (NTs)
- Management of kernel timers for those threads
  - i.e. calling `pthread_getcpuclockid(3)` and `timer_create(2)`
  - This will require iterating over all Ractors and shared NTs on profiler init
- Handling of the SIGPROF signals which those timers will generate
- Walking the stack (calling `rb_profile_frames()`)
  - I'm not going to make this part of this ticket, but I'm thinking we can make `rb_profile_frames()` even more granular by following VM instructions, which is probably something we don't want to expose as an API

We would need to add some APIs to the VM for `ext/profile`:

- An API returning all alive NTs
- Event hooks notifying creation and destruction of NTs
- Event hooks notifying assign/unassign of RTs to NTs

Since only Linux provides the full set of required kernel features, the initial implementation will be limited to Linux systems. I can expand support to other POSIX systems by employing `setitimer()` later, but they will receive limited granularity (process-level CPU-time timers).

### Output interface

One thing to consider is the output format.
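To make the design question concrete, here is a purely hypothetical sketch of what a Hash-shaped result could look like — every key and the overall structure are invented for illustration and are not part of the proposal:

```ruby
# Hypothetical result shape (NOT part of the proposal): one entry per
# sampled thread, mapping stacks (innermost frame first) to the number
# of CPU-time samples in which that stack was observed.
profile = {
  version: 1,
  interval_ms: 10,          # assumed sampling interval, in CPU time
  threads: {
    1 => {                  # some stable per-thread identifier
      samples: {
        ["Object#slow_work", "block in <main>"] => 412,
        ["Integer#times", "Object#slow_work"]   => 37,
      },
    },
  },
}

# A consumer could then fold the stacks into a flat "self time" view
# by attributing each sample to its innermost frame:
flat = Hash.new(0)
profile[:threads].each_value do |t|
  t[:samples].each { |stack, count| flat[stack.first] += count }
end
# flat #=> {"Object#slow_work"=>412, "Integer#times"=>37}
```

Whatever shape is chosen, the point is that a plain-Ruby structure like this is trivially consumable without a parser, which is the trade-off against pprof discussed below.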
I think we have a few practical choices here:

- Adopt pprof's `profile.proto` format.
  - This is a widely accepted format across tools, including visualizers.
  - The format itself is pretty flexible, so it shouldn't be hard to add custom fields.
- Just return a Hash containing profile results.
  - We'd need to design some good format.

### Things out of scope

- Visualization
- Interaction with external visualizers / trackers / etc.

These can be better left to normal RubyGems.

### Why not an external gem?

Through maintaining [Pf2](https://github.com/osyoyu/pf2), a CPU-time profiler library, I have encountered many difficulties obtaining the information required for accurate profiling. The rule of thumb is that the more internal information a profiler can access, the more accuracy it can achieve. However, from the CRuby maintenance perspective, I suppose not too many APIs exposing implementation details are wanted. Locating a profiler under `ext/` looks like a nice middle ground: placing the core profiling logic (sampler scheduling, sampling itself) in CRuby and abstracting it as `RubyVM::Profile` should work cleanly.

It should be noted that existing profilers have their own unique features, such as markers, unwinding of C stacks, and integration with external APMs. I don't want to make this a tradeoff between accuracy and features; instead, I'd like to design an API where both could live.

### Study on other languages

A handful of VM-based languages carry a profiler implementation in their runtime.

- Go: runtime/pprof https://github.com/golang/go/tree/master/src/runtime/pprof
- OpenJDK: JFR https://github.com/openjdk/jdk/tree/master/src/hotspot/share/jfr
- Node.js: --cpu-prof https://nodejs.org/en/learn/getting-started/profiling
- And more

Among these, OpenJDK is a notable entry. JVM profilers have traditionally used `AsyncGetCallTrace()`, which is just like `rb_profile_frames()`, to obtain stack traces from signal handlers.
The signal originates from kernel timers installed by the profilers, configured to fire every specified interval of CPU time (e.g. 10 ms). Even though `AsyncGetCallTrace()` and async-profiler (its primary user) are very sophisticated and battle-tested, the JFR folks have decided to control sampling timing within the runtime to improve accuracy and stability.

For more information on the JVM side, see:

- [JEP 509](https://openjdk.org/jeps/509)
- [JEP 518](https://openjdk.org/jeps/518)
- [Taming the Bias: Unbiased* Safepoint-Based Stack Walking in JFR](https://mostlynerdless.de/blog/2025/05/20/taming-the-bias-unbiased-safepoint...)

-- 
https://bugs.ruby-lang.org/