[ruby-core:125117] [Ruby Feature#21963] A solution to completely avoid allocated-but-uninitialized objects
Issue #21963 has been reported by Eregon (Benoit Daloze).

----------------------------------------
Feature #21963: A solution to completely avoid allocated-but-uninitialized objects
https://bugs.ruby-lang.org/issues/21963

* Author: Eregon (Benoit Daloze)
* Status: Open
----------------------------------------
A common issue when defining a class is to handle allocated-but-uninitialized objects.
For example:
```ruby
obj = MyClass.allocate
obj.some_method
```
This can easily segfault for classes defined in C and raise an unclear exception for classes defined in Ruby.

As a workaround, many core (and non-core) classes add a check that they are initialized in *every* instance method.
This is suboptimal for performance and correctness; classes should not need to care about allocated-but-uninitialized objects.

Fundamentally, to solve this we need to guarantee that after the allocation function is used, one of `initialize`, `initialize_dup` or `initialize_clone` is called.
And we can't guarantee that for `Class#allocate`.

The current workarounds are:
* `undef allocate`, but this does not prevent `Class.instance_method(:allocate).bind_call(Foo)`.
* `rb_undef_alloc_func()`, but this breaks `dup`, `clone` and `Marshal`.

The idea is to have, in addition to the `public alloc function` (in `rb_classext_struct.as.class.allocator`), an `internal alloc function`. Then:
* `Class#new`, `dup`, `clone` and `Marshal` always use the internal alloc function, because they guarantee to call `initialize`, `initialize_dup` or `initialize_clone`.
* `rb_define_alloc_func()` sets both fields.
* `rb_undef_alloc_func()` sets both fields.
* `rb_get_alloc_func()` reads the public alloc function (unchanged).
* `Class#allocate` uses the public alloc function (unchanged).

We add a new method on `Class`, for example `Class#safe_initialization`, which:
* Sets the public alloc function to `UNDEF_ALLOC_FUNC`, same as `rb_undef_alloc_func()`, so `Class#allocate` and `rb_get_alloc_func()` will raise if they are used (as they are unsafe).
* Preserves the internal alloc function so `Class#new`, `dup`, `clone` and `Marshal` keep working.

After that the class has fully safe initialization and does not need to worry about allocated-but-uninitialized objects anymore.

From https://bugs.ruby-lang.org/issues/21852#note-7

--
https://bugs.ruby-lang.org/
Issue #21963 has been updated by Eregon (Benoit Daloze).
https://bugs.ruby-lang.org/issues/21963#change-116841

PR implementing that idea and applying it to MatchData and Regexp, removing many checks which are no longer necessary: https://github.com/ruby/ruby/pull/16528

Instead of using 2 fields, it uses the existing `allocator` field plus a boolean flag to tell if the allocator is public (default) or internal (set by `rb_class_safe_initialization()`).
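The single-field-plus-flag variant described in this message can be pictured with a small toy model in plain Ruby. This is only a sketch of the semantics as described in the thread, not CRuby's data structures: `ToyClassExt`, `new_instance` and the hash standing in for an object are hypothetical names, and resetting the flag in `define_alloc_func` is an assumption derived from "`rb_define_alloc_func()` sets both fields".

```ruby
# Toy model only: mirrors the description in the thread, not CRuby's real structs.
class ToyClassExt
  def initialize
    @allocator = nil             # nil stands in for UNDEF_ALLOC_FUNC
    @allocator_internal = false  # false means the allocator is public (default)
  end

  # Models rb_define_alloc_func(): the allocator is usable by everyone.
  # Assumption: defining an allocator makes it public again.
  def define_alloc_func(func)
    @allocator = func
    @allocator_internal = false
  end

  # Models rb_undef_alloc_func(): nothing can allocate anymore.
  def undef_alloc_func
    @allocator = nil
    @allocator_internal = false
  end

  # Models Class#safe_initialization: keep the allocator, hide it from the public view.
  def safe_initialization
    @allocator_internal = true
  end

  # Models Class#allocate and rb_get_alloc_func(): they only see a public allocator.
  def allocate
    raise TypeError, "allocator undefined" if @allocator.nil? || @allocator_internal
    @allocator.call
  end

  # Models Class#new: allowed to use an internal allocator because it always initializes.
  def new_instance
    raise TypeError, "allocator undefined" if @allocator.nil?
    obj = @allocator.call
    obj[:initialized] = true # stands in for calling initialize
    obj
  end
end

klass = ToyClassExt.new
klass.define_alloc_func(-> { { initialized: false } })
p klass.allocate # an uninitialized object is reachable

klass.safe_initialization
p klass.new_instance # still works, and the object is initialized
begin
  klass.allocate
rescue TypeError => e
  puts "allocate blocked: #{e.message}"
end
```

In this model `dup`, `clone` and `Marshal` would take the same path as `new_instance`, which is why they keep working after `safe_initialization`.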
Issue #21963 has been updated by jhawthorn (John Hawthorn).
https://bugs.ruby-lang.org/issues/21963#change-116847

Eregon (Benoit Daloze) wrote:
> `Class#new`, `dup`, `clone` and `Marshal` always use the internal alloc function, because they guarantee to call `initialize`, `initialize_dup` or `initialize_clone`.

Users have control over `initialize`, `initialize_dup` or `initialize_clone`. What's to stop them from replacing those methods with a no-op?

On your branch:
```
RUBY_DESCRIPTION
=> "ruby 4.1.0dev (2026-03-24T15:12:19Z internal_alloc_fun.. b3a027d207) +PRISM [x86_64-linux]"
match = "a".match(/./)
=> #<MatchData "a">
match.clone
=> #<MatchData "a">
def match.initialize_copy(x); end
=> :initialize_copy
match.clone
=> #<MatchData:0x00007fd8a78022c0> # <- uninitialized match data
```

I thought about introducing a flag like this in #21267, but I just don't see a way that it guarantees the inability to create one of these uninitialized objects (rather than just making it slightly more difficult).
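The same effect can be reproduced with a class written in Ruby, which may make the mechanism easier to see. The sketch below uses a hypothetical `Route` class: `clone` copies the singleton class before the copy hooks run, so a singleton no-op `initialize_copy` silently skips the class's own copy logic. For a Ruby-defined class the result is only a logically wrong object (shared state), since instance variables are still copied shallowly.

```ruby
class Route
  attr_reader :stops

  def initialize(*stops)
    @stops = stops
  end

  # Copy hook: give the copy its own array instead of sharing the original's.
  def initialize_copy(original)
    super
    @stops = original.stops.dup
  end
end

route = Route.new("A", "B")
healthy = route.clone
puts healthy.stops.equal?(route.stops) # false: the hook ran

# Singleton no-op override, in the style of the MatchData session above.
def route.initialize_copy(_original); end

broken = route.clone # the clone inherits the singleton no-op, so the hook is skipped
puts broken.stops.equal?(route.stops) # true: both objects share one array
```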
Issue #21963 has been updated by Eregon (Benoit Daloze).
https://bugs.ruby-lang.org/issues/21963#change-116851

Regarding the name in the C API, it could be `rb_class_safe_initialization()` to match `Class#safe_initialization`, or maybe the more intuitive `rb_define_internal_alloc_func()` or so.
The disadvantage of the latter is that it wouldn't be a good name for a Ruby method, and this functionality is useful for classes defined in Ruby too.
Issue #21963 has been updated by Eregon (Benoit Daloze).
https://bugs.ruby-lang.org/issues/21963#change-116857

@jhawthorn That's a good point, thank you.
I reread https://bugs.ruby-lang.org/issues/21267 and back then I also wanted to have a way for safe initialization but didn't look yet at how to achieve it.

First, I think this proposal still has value because it ensures that `initialize/initialize_dup/initialize_clone` are called after allocation, and that wasn't the case before (because the user could just call `Class#allocate` and never follow with `initialize*`).

Indeed, `initialize/initialize_clone/initialize_dup` can still be overwritten to produce a logically-broken object; that is already the case today.
Overwriting these methods is effectively breaking the object and it is a bad case of monkey-patching, so I think any exception or different behavior is fair enough there (the user is breaking the object, we cannot prevent that override but they cannot expect things to work after they broke it). However, it must not segfault in that case (I suppose we all agree on that, though I would be tempted to say it's the user's fault but I don't think that will fly).

Currently my PR removes the checks so it could segfault.
So one way to make progress without introducing segfaults would be to keep those checks.
I think that's valuable enough on its own, though not fully satisfying as it keeps these easy-to-forget checks in every instance method.

I'd like to avoid those checks. To do that without risking segfaults, I think we then need to improve the reliability of initialization and copying for classes defined in C (classes defined in Ruby should not be able to cause a segfault anyway, so that part is not a concern).

What if one could provide `initialization` and `copy` functions/hooks for `TypedData` / `rb_data_type_t`?
Then `.new`/`.dup`/`.clone` would call these hooks before `initialize/initialize_clone/initialize_dup`, so we have the guarantee they are always run before handing the object to the user.

So we'd have something like:
```c
static const rb_data_type_t my_data_type = {
    ...,
    .init = my_initialize, // VALUE (*)(int argc, VALUE *argv, VALUE self)
    .copy = my_init_copy   // VALUE (*)(VALUE copy, VALUE original)
};
```
The function signatures would match the signatures typically used for `initialize` and `initialize_copy`, so it would be easier to share logic with older Ruby versions not having those hooks.

One extra complication here is that MatchData is not a `TypedData` but a raw `struct RMatch`.
Concretely, we could redefine `dup` and `clone` on MatchData to achieve the same and call `match_init_copy` before `initialize_dup/initialize_clone` (by reusing `rb_obj_dup_setup`/`rb_obj_clone_setup`).
We'd also call `rb_undef_alloc_func()` for `MatchData` to make sure `Kernel#dup`/`Kernel#clone` cannot be used to bypass the initialization logic in the overridden `dup`/`clone`.
`MatchData` doesn't have `initialize` or `new` so we don't need to worry about that one, but if it had we could override `new` to call `match_initialize` before the `initialize` method (e.g. with `rb_obj_call_init_kw`).

What do you think?

Another idea would be to prevent redefining these crucial hooks (`initialize/initialize_clone/initialize_dup/initialize_copy`) for classes using `Class#safe_initialization`.
Preventing override of these methods entirely would be too limiting for subclasses which override the hooks correctly.
So instead we could ensure that any override would `super` into the original hook; that would be safe and it could be checked by looking at the AST/bytecode/IR of the overriding method.
It might be somewhat complicated if a module is later included and defines e.g. `initialize_copy`, but it should be possible to check that it calls `super` too when including in a `safe_initialization` class (directly or indirectly).

Preventing monkey-patching in Ruby is unusual, but maybe it would make sense here?
Such monkey-patches or overrides which don't call `super` seem inherently broken, so maybe we'd only forbid broken definitions, which is then a good thing?
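As a rough illustration of the "check that an override calls `super`" idea, the sketch below inspects a method's bytecode on CRuby. It is an assumption-laden approximation, not the proposed implementation: `calls_super?` is a hypothetical helper, it relies on the CRuby-specific `RubyVM::InstructionSequence` API, and it only detects that a `super` call exists somewhere in the method, not that it is reached on every path.

```ruby
# Returns true when the method's bytecode contains a `super` call.
# CRuby-only and purely syntactic: it does not prove `super` is always reached.
def calls_super?(meth)
  iseq = RubyVM::InstructionSequence.of(meth)
  return false if iseq.nil? # methods implemented in C have no bytecode
  iseq.disasm.include?("invokesuper")
end

class Base
  def initialize_copy(original)
    super
    @copied = true
  end
end

class Good < Base
  # Delegates to the original hook, so it would be accepted.
  def initialize_copy(original)
    super
  end
end

class Bad < Base
  # No-op override, the kind of definition the check would reject.
  def initialize_copy(original); end
end

puts calls_super?(Good.instance_method(:initialize_copy))
puts calls_super?(Bad.instance_method(:initialize_copy))
```

A real check would presumably run when the method is defined or when a module is included, which is the complication mentioned above.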
Issue #21963 has been updated by Eregon (Benoit Daloze).
https://bugs.ruby-lang.org/issues/21963#change-116858

If these `init` & `copy` C function hooks were on `RClass` instead of `rb_data_type_t`, they could be called from the (confusingly-named) function `init_copy`, which is used by `rb_obj_dup_setup/rb_obj_clone_setup` and so by `dup/clone` before `initialize_dup/initialize_clone`.
And then we could just use these new function hooks for MatchData and other core types which are not `TypedData`.

`init_copy` already does copying of the ivars, flags and GC attributes, so it seems a good fit for "minimal initialization to make the object not segfault" for classes defined in C.
That would be quite elegant I think.

The main problem there is that `RClass` is currently using all of its 160 bytes slot size, and bumping it to twice that doesn't seem great.
Issue #21963 has been updated by Eregon (Benoit Daloze).
https://bugs.ruby-lang.org/issues/21963#change-116874

I realized these `init` & `copy` C function hooks could actually be done partly with the proposal in #21852, cc @byroot.

Specifically, the `rb_copy_alloc_func_t` gets the original object, so that's equivalent to `copy` and it would be called almost at the right time.
And that function can then correctly initialize the C parts of the object so it's valid (at least can't cause segfaults) after it returns.
The one difference in timing is that for `clone` it would be called before the singleton class is copied & set (in case the original object has a singleton class), which doesn't seem like much of an issue.

The missing part is that the `rb_copy_alloc_func_t`, when called from `Class#new`, doesn't receive the arguments, and so it is hard to properly initialize the C structs without them.
So maybe the new allocator function should be like:
```c
typedef VALUE (*rb_copy_alloc_func_t)(VALUE klass, VALUE original, int initialize_argc, const VALUE *initialize_argv);
```
or so, and either `original` (when called from `dup`/`clone`) or `initialize_argc + initialize_argv` (when called from `Class#new`) would be set.
Issue #21963 has been updated by Eregon (Benoit Daloze).
https://bugs.ruby-lang.org/issues/21963#change-117019

Eregon (Benoit Daloze) wrote in #note-8:
> So maybe the new allocator function should be like:
> ```c
> typedef VALUE (*rb_copy_alloc_func_t)(VALUE klass, VALUE original, int initialize_argc, const VALUE *initialize_argv);
> ```

I'm thinking a more explicit name would be good, as `rb_copy_alloc_func_t` might sound like it's only for copying.
So how about `rb_initializing_alloc_func_t`?
```c
typedef VALUE (*rb_initializing_alloc_func_t)(VALUE klass, VALUE original, int initialize_argc, const VALUE *initialize_argv);
```
That clearly indicates such an alloc func also does some initialization (for the native parts), and the docs can make it clear that `initialize` is still called; this is just extra native initialization to avoid dangerously-uninitialized objects.

If that's deemed too long, we could go with `rb_safe_alloc_func_t`, but that's much less explicit.
Because it's a type used once per class, I think it's fine to have a longer name.
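To make the intent of an "initializing" allocator concrete, here is a toy model in plain Ruby. Everything in it is hypothetical and only mirrors the shape of the typedef discussed above: `Blob`, `initializing_alloc` and the `@native` string (standing in for a C struct) are invented for illustration, and the overridden `new`/`clone` ignore details such as freezing and singleton classes. The point it tries to show is that when the allocator receives either the original or the `initialize` arguments, the native part is already valid before the overridable hooks run, even if those hooks are no-ops.

```ruby
# Toy model in plain Ruby: @native stands in for the C-level struct.
class Blob
  attr_reader :native

  # Hypothetical stand-in for an rb_initializing_alloc_func_t:
  # `original` is set when copying, `args` when called from `new`.
  def self.initializing_alloc(original, args)
    obj = allocate
    native = original ? original.native.dup : String(args.first).dup
    obj.instance_variable_set(:@native, native)
    obj
  end

  # Models Class#new: native setup first, then the overridable hook.
  def self.new(*args)
    obj = initializing_alloc(nil, args)
    obj.send(:initialize, *args)
    obj
  end

  # Models clone: native setup first, then the overridable hook.
  def clone
    copy = self.class.initializing_alloc(self, [])
    copy.send(:initialize_copy, self)
    copy
  end

  # Both hooks are deliberately no-ops, like a careless monkey-patch.
  def initialize(*); end
  def initialize_copy(_original); end

  def bytesize
    native.bytesize
  end
end

blob = Blob.new("abc")
copy = blob.clone
puts copy.bytesize                   # native state is usable despite the no-op hooks
puts copy.native.equal?(blob.native) # the copy has its own native state
```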
participants (2)
* Eregon (Benoit Daloze)
* jhawthorn (John Hawthorn)