Issue #21950 has been updated by Eregon (Benoit Daloze).

osyoyu (Daisuke Aritomo) wrote in #note-8:

> I don't want the API/output format to restrict what a profiler could emit. I wouldn't say such design is not doable, but careful consideration would be needed.
Careful consideration is exactly what should be done when designing a new API. Using `RubyVM` as a way to skip some of those considerations is bad practice: it has led to many problems that take years to solve, and it actively prevents portability of such APIs for no valid reason. It's not so hard to define an API/output format that can be evolved and extended with extra things. It doesn't need to be restrictive (there could be some generic extra metadata or so), but it should standardize the common data we know about (like samples & blocking information).
I'm OK with the idea of returning Ruby objects in general, but I don't think an Array of Arrays of `Thread::Backtrace::Location` itself is practical. If a 50-level-deep backtrace for 8 threads gets collected 100 times a second, profiling for 5 minutes would produce 120M objects, which is a pretty hard push on memory. Some sort of dedup must be done.
For clarity, there would be an internal data structure for this that does not allocate Ruby objects during profiling and can be more footprint-efficient. But good point, it seems too big even just as a result returned by the profiler after the profiling is done. (FWIW I get `8*100*5*60*50` = 12 million objects, not 120)
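To make the dedup point above concrete, here is a minimal sketch (not a proposed API; `FrameTable`, its methods, and the string frames are all illustrative) of interning each unique frame once and representing every sampled stack as an Array of small Integer indices instead of repeated `Thread::Backtrace::Location` objects:

```ruby
# Illustrative sketch: dedup sampled stacks by interning each unique frame
# into a table, and storing stacks as arrays of integer indices.
class FrameTable
  def initialize
    @frames = []   # index => frame description
    @index  = {}   # frame description => index
  end

  # Returns a small Integer id for a frame, inserting it on first sight.
  def intern(frame)
    @index[frame] ||= (@frames << frame; @frames.size - 1)
  end

  attr_reader :frames
end

table   = FrameTable.new
samples = []

# Two samples whose stacks share frames; each unique frame is stored once.
[["main", "work", "inner"], ["main", "work", "other"]].each do |stack|
  samples << stack.map { |f| table.intern(f) }
end

p table.frames  # => ["main", "work", "inner", "other"]
p samples       # => [[0, 1, 2], [0, 1, 3]]
```

With a layout like this, memory grows with the number of *unique* frames plus one small Integer per stack entry, rather than one location object per stack entry per sample.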
pprof/profile.proto is a format proven to be flexible and efficient enough for serializing practical Go application profiles in many modes, including CPU-time, heap, allocations, goroutines, mutex contention. It is also accepted by a number of existing polished visualizers (Perfetto, Grafana Pyroscope, Google Cloud Profiler to name a few). Designing a Hash or Data structure based on profile.proto may be a good idea.
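As a hedged sketch of that idea, a Hash could loosely mirror the message layout of pprof's `profile.proto` (`string_table`, `sample_type`, `location`, `function`, `sample`, with index-0 of the string table reserved for `""`). The exact shape below is illustrative only, and it simplifies the real format (locations here point directly at a function, whereas `profile.proto` nests a `line` list):

```ruby
# Illustrative only: a Hash loosely shaped like pprof's profile.proto,
# where samples reference locations by id and all strings are interned
# into a string table (index 0 must be the empty string).
profile = {
  string_table: ["", "cpu", "nanoseconds", "Object#main", "Object#work"],
  sample_type:  [{ type: 1, unit: 2 }],   # "cpu" measured in "nanoseconds"
  location:     [
    { id: 1, function_id: 1 },            # Object#main
    { id: 2, function_id: 2 }             # Object#work
  ],
  function:     [
    { id: 1, name: 3 },                   # name is an index into string_table
    { id: 2, name: 4 }
  ],
  sample:       [
    { location_id: [2, 1], value: [10_000_000] }  # leaf-first stack, 10 ms
  ]
}

# A consumer summing the sampled CPU time across all samples:
total = profile[:sample].sum { |s| s[:value].first }
p total  # => 10000000
```

A structure like this keeps the dedup properties of the binary format (frames and strings stored once, samples as references) while remaining plain Ruby data.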
I'll look into that and also read the document @ivoanjo linked to.

----------------------------------------
Feature #21950: Add a built-in CPU-time profiler
https://bugs.ruby-lang.org/issues/21950#change-116685

* Author: osyoyu (Daisuke Aritomo)
* Status: Open
----------------------------------------

Modern CRuby workloads can consume CPU concurrently across multiple native threads, especially with multiple Ractors and C extensions which release the GVL. I'd like to propose integrating a built-in CPU-time profiler into CRuby to enable more accurate and stable profiling in such situations.

## Motivation & Background

CPU profilers indicate how much CPU time was consumed by different methods. Most CPU-time profilers rely on the kernel to track consumed CPU time. `setitimer(2)` and `timer_create(2)` are APIs to configure kernel timers. The process receives a profiling signal (SIGPROF) after every given amount of CPU time consumed (e.g. 10 ms).

In general, a profiler needs to know which _thread_ consumed how much CPU time in order to attribute work to the correct thread. This wasn't a real requirement for Ruby in the pre-Ractor age, since only one native thread could consume CPU time at any given moment due to GVL limitations. Using the process-wide CPU-time timers provided by `setitimer()` effectively did the job: it was safe to assume that the active thread was doing some work using the CPU when a profiling signal arrived.

Of course, this assumption does not hold in all situations. One is the case where C extensions release the GVL. Another is the multi-Ractor situation. In these cases, multiple native threads may simultaneously consume CPU time. Linux 2.6.12+ provides per-thread timers to address this need. Profilers such as [Pf2](https://github.com/osyoyu/pf2) and [dd-trace-rb](https://github.com/DataDog/dd-trace-rb) use this feature to keep track of CPU time. However, utilizing this API requires information that CRuby does not necessarily expose.
Both carry copies of CRuby headers in order to access internal structs such as `rb_thread_t`. This is quite fragile and possibly unsustainable in an age where CRuby evolves towards Ractors and M:N threading.

## Proposal

Implement a built-in CPU-time profiler as `ext/profile`, and add some information extraction points to the VM built exclusively for it.

```ruby
require 'profile'

RubyVM::Profile.start
# your code here
RubyVM::Profile.stop #=> results
```

`ext/profile` will take care of the following in coordination with some VM helpers:

- Tracking creation and destruction of native threads (NTs)
- Management of kernel timers for those threads
  - i.e. calling `pthread_getcpuclockid(3)` and `timer_create(2)`
  - This will require iterating over all Ractors and shared NTs on profiler init
- Handling of the SIGPROF signals which those timers will generate
- Walking the stack (calling `rb_profile_frames()`)
  - I'm not going to make this part of this ticket, but I'm thinking we can make `rb_profile_frames()` even more granular by following VM instructions, which is probably something we don't want to expose as an API

We would need to add some APIs to the VM for `ext/profile`:

- An API returning all alive NTs
- Event hooks notifying creation and destruction of NTs
- Event hooks notifying assignment/unassignment of RTs (Ruby threads) to NTs

Since only Linux provides the full set of required kernel features, the initial implementation will be limited to Linux systems. I can expand support to other POSIX systems by employing `setitimer()` later, but they will receive limited granularity (process-level CPU-time timers).

### Output interface

One thing to consider is the output format. I think we have a few practical choices here:

- Adopt pprof's `profile.proto` format.
  - This is a widely accepted format across tools, including visualizers.
  - The format itself is pretty flexible, so it shouldn't be hard to add custom fields.
- Just return a Hash containing profile results.
  - We'd need to design some good format.

### Things out of scope

- Visualization
- Interaction with external visualizers / trackers / etc.

These can be better left to normal RubyGems.

### Why not an external gem?

Through maintaining [Pf2](https://github.com/osyoyu/pf2), a CPU-time profiler library, I have encountered many difficulties obtaining the information required for accurate profiling. The rule of thumb is that the more internal information a profiler can access, the more accuracy it can achieve. However, from the CRuby maintenance perspective, I suppose not too many APIs exposing implementation details are wanted. Locating a profiler under `ext/` looks like a nice middle ground: placing the core profiling logic (sampler scheduling, sampling itself) in CRuby and abstracting it as `RubyVM::Profile` should work cleanly.

It should be noted that existing profilers have their own unique features, such as markers, unwinding of C stacks, and integration with external APMs. I don't want to make this a tradeoff between accuracy and features; instead, I'd like to design an API where both could live.

### Study on other languages

A handful of VM-based languages carry a profiler implementation in their runtime.

- Go: runtime/pprof https://github.com/golang/go/tree/master/src/runtime/pprof
- OpenJDK: JFR https://github.com/openjdk/jdk/tree/master/src/hotspot/share/jfr
- Node.js: --cpu-prof https://nodejs.org/en/learn/getting-started/profiling
- And more

Among these, OpenJDK is a notable entry. JVM profilers have traditionally used `AsyncGetCallTrace()`, which is much like `rb_profile_frames()`, to obtain stack traces from signal handlers. The signal originates from kernel timers installed by the profilers, configured to fire every specified interval of CPU time (e.g. 10 ms). Even though `AsyncGetCallTrace()` and async-profiler (its primary user) are very sophisticated and battle-tested, the JFR folks have decided to control sampling timing within the runtime to improve accuracy and stability.
For more information on the JVM side, see:

- [JEP 509](https://openjdk.org/jeps/509)
- [JEP 518](https://openjdk.org/jeps/518)
- [Taming the Bias: Unbiased* Safepoint-Based Stack Walking in JFR](https://mostlynerdless.de/blog/2025/05/20/taming-the-bias-unbiased-safepoint...)

--
https://bugs.ruby-lang.org/