
Issue #20878 has been updated by mdalessio (Mike Dalessio).

> Is there a real-world use case to make a String with a pointer allocated outside of xmalloc?
> I don't personally have one. Also perhaps @mdalessio (Mike Dalessio) would have some in nokogiri?

Yes, it would be easiest for Nokogiri if non-xmalloc string pointers were supported, but if it were decided not to support this, I could work around it.

Nokogiri actively configures libxml2's memory management functions. On Windows, libxml2 is configured to use `malloc` because of bugs in some versions of libxml2. [^1] On other platforms, Nokogiri configures libxml2 to use `ruby_xmalloc` by default, but users can opt into using `malloc`, for example if they want to optimize performance and don't mind having a larger max heap size. [^2]

But! If anyone is opting into using `malloc`, it is likely for performance reasons. If the performance improvement from pointer adoption is great enough, and `malloc` strings are not supported, then I would consider removing the feature.

On Windows, the libxml2 bugs have been fixed for three years (fixed 2022-02 in v2.9.13 [^3]) and most Windows developers are using the precompiled native gem anyway, so if I have to, I would be comfortable changing the default to be `ruby_xmalloc` on Windows or working around the limitation in pointer adoption.

[^1]: https://github.com/sparklemotion/nokogiri/issues/2241o
[^2]: https://github.com/sparklemotion/nokogiri/blob/main/adr/2023-04-libxml-memor...
[^3]: https://gitlab.gnome.org/GNOME/libxml2/-/commit/a7b9f3eb

----------------------------------------
Feature #20878: A new C API to create a String by adopting a pointer: `rb_enc_str_adopt(const char *ptr, long len, long capa, rb_encoding *enc)`
https://bugs.ruby-lang.org/issues/20878#change-111327

* Author: byroot (Jean Boussier)
* Status: Open
----------------------------------------
### Context

A common use case when writing C extensions is to generate text or bytes into a buffer, and to return it wrapped in a Ruby String.
Examples are `JSON.generate(obj) -> String` and all other format serializers, compression libraries such as `Zlib.deflate`, etc., but also methods such as `Time#strftime`.

### Current Solution

#### Work in a buffer and copy the result

The most common solution is to work with a natively allocated buffer and, once the generation is done, call `rb_str_new*` to copy the result into memory managed by Ruby.

It works, but isn't very efficient because it causes an extra copy and an extra `free()`. On `ruby/json` macro-benchmarks, this represents around 5% of the time spent in `JSON.generate`.

```c
static void fbuffer_free(FBuffer *fb)
{
    if (fb->ptr && fb->type == FBUFFER_HEAP_ALLOCATED) {
        ruby_xfree(fb->ptr);
    }
}

static VALUE fbuffer_to_s(FBuffer *fb)
{
    /* Copies the native buffer into Ruby-managed memory... */
    VALUE result = rb_utf8_str_new(FBUFFER_PTR(fb), FBUFFER_LEN(fb));
    /* ...then releases the native buffer. */
    fbuffer_free(fb);
    return result;
}
```

#### Work inside RString allocated memory

Another way this is currently done is to allocate an `RString` using `rb_str_buf_new`, and write into it with various functions such as `rb_str_catf`, or by writing past `RString.len` through `RSTRING_PTR` and then resizing it with `rb_str_set_len`.

The downside of this approach is that it contains a lot of inefficiencies, as `rb_str_set_len` will perform numerous safety checks, compute the coderange, and write the string terminator on every invocation.

Another major inefficiency is that this API makes it hard to be in control of the buffer growth, so it can result in a lot more `realloc()` calls than manually managing the buffer.

This method is used by `Kernel#sprintf`, `Time#strftime`, etc., and when I attempted to improve `Time#strftime` performance, this problem showed up as the biggest bottleneck:

- https://github.com/ruby/ruby/pull/11547
- https://github.com/ruby/ruby/pull/11544
- https://github.com/ruby/ruby/pull/11542

### Proposed API

I think a more efficient way to do this would be to work with a native buffer, and then build an RString that "adopts" the memory region.

Technically, you can currently do this by reaching directly into `RString` members, but I don't think it's clean, and a dedicated API would be preferable:

```c
/**
 * Similar to rb_str_new(), but it adopts the pointer instead of copying.
 *
 * @param[in]  ptr   A memory region of `capa` bytes length. MUST have been
 *                   allocated with `ruby_xmalloc`.
 * @param[in]  len   Length of the string, in bytes, not including the
 *                   terminating NUL character, not including extra capacity.
 * @param[in]  capa  The usable length of `ptr`, in bytes, including the
 *                   terminating NUL character.
 * @param[in]  enc   Encoding of `ptr`.
 * @exception  rb_eArgError  `len` is negative.
 * @return     An instance of ::rb_cString, of `len` bytes length,
 *             `capa - 1` bytes capacity, and of `enc` encoding.
 * @pre        At least `capa` bytes of continuous memory region shall be
 *             accessible via `ptr`.
 * @pre        `ptr` MUST have been allocated with `ruby_xmalloc`.
 * @pre        `ptr` MUST not be manually freed after `rb_enc_str_adopt` has
 *             been called.
 * @note       `enc` can be a null pointer. It can also be seen as a routine
 *             identical to rb_usascii_str_new() then.
 */
VALUE rb_enc_str_adopt(const char *ptr, long len, long capa, rb_encoding *enc);
```

An alternative to the `adopt` term could be `move`.

---Files--------------------------------
Capture d’écran 2024-12-11 à 11.03.08.png (250 KB)

--
https://bugs.ruby-lang.org/