[ruby-core:115550] [Ruby master Bug#20031] Regexp using greedy quantifier and unions on a big string uses a lot of memory

Issue #20031 has been reported by FabienChaynes (Fabien Chaynes). ---------------------------------------- Bug #20031: Regexp using greedy quantifier and unions on a big string uses a lot of memory https://bugs.ruby-lang.org/issues/20031 * Author: FabienChaynes (Fabien Chaynes) * Status: Open * Priority: Normal * ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux] * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- Trying to match on some regexp using a greedy quantifier on any character (`.*`) and unions (`foo|bar`) uses a lot of memory if the string we want to match on is big. # Reproduction code ```ruby puts RUBY_DESCRIPTION unless File.exist?("/tmp/fake_email.eml") content = "Accept-Language: fr-FR, en-US#{"0" * 50_000_000}" File.write("/tmp/fake_email.eml", content) end content = File.read("/tmp/fake_email.eml") def print_size(context, sike_kb) puts "#{context}: #{(sike_kb / 1024.0).round(1)} MB" end print_size("baseline", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) print_size("content", content.bytesize / 1024) REGEXP_UNION = Regexp.union( "Accept-Language:", "Thread-Topic:", ).freeze reg = /.*#{REGEXP_UNION}/ p reg GC.start print_size("before match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) allocated = GC.stat(:total_allocated_objects) p content.match(reg) p GC.stat(:total_allocated_objects) - allocated GC.start print_size("after match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) p "---" ``` Output: ```bash $ /usr/bin/time -v ruby full_repro.rb ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux] baseline: 71.3 MB content: 47.7 MB /.*(?-mix:Accept\-Language:|Thread\-Topic:)/ before match?: 71.3 MB #<MatchData "Accept-Language:"> 14 after match?: 71.6 MB "---" Command being timed: "ruby full_repro.rb" User time (seconds): 1.15 System time (seconds): 0.74 Percent of CPU this job got: 100% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.89 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2423108 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 609575 Voluntary context switches: 52 Involuntary context switches: 12 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ``` strace: ```bash $ sudo strace -p 588834 strace: Process 588834 attached [...] mmap(NULL, 249856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92dd3000 mremap(0x7f8e92dd3000, 249856, 495616, MREMAP_MAYMOVE) = 0x7f8e92d5a000 mremap(0x7f8e92d5a000, 495616, 987136, MREMAP_MAYMOVE) = 0x7f8e92c69000 mremap(0x7f8e92c69000, 987136, 1970176, MREMAP_MAYMOVE) = 0x7f8e92a88000 mremap(0x7f8e92a88000, 1970176, 3936256, MREMAP_MAYMOVE) = 0x7f8e926c7000 mremap(0x7f8e926c7000, 3936256, 7868416, MREMAP_MAYMOVE) = 0x7f8e91f46000 mremap(0x7f8e91f46000, 7868416, 15732736, MREMAP_MAYMOVE) = 0x7f8e91045000 mremap(0x7f8e91045000, 15732736, 31461376, MREMAP_MAYMOVE) = 0x7f8e8a8af000 mremap(0x7f8e8a8af000, 31461376, 62918656, MREMAP_MAYMOVE) = 0x7f8e86cae000 mremap(0x7f8e86cae000, 62918656, 125833216, MREMAP_MAYMOVE) = 0x7f8e7f4ad000 mremap(0x7f8e7f4ad000, 125833216, 251662336, MREMAP_MAYMOVE) = 0x7f8e704ac000 mremap(0x7f8e704ac000, 251662336, 503320576, MREMAP_MAYMOVE) = 0x7f8e524ab000 mremap(0x7f8e524ab000, 503320576, 1006637056, MREMAP_MAYMOVE) = 0x7f8e164aa000 mremap(0x7f8e164aa000, 1006637056, 2013270016, MREMAP_MAYMOVE) = 0x7f8d9e4a9000 mremap(0x7f8d9e4a9000, 2013270016, 4026535936, MREMAP_MAYMOVE) = 0x7f8cae4a8000 mmap(NULL, 12500992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92224000 munmap(0x7f8cae4a8000, 4026535936) = 0 munmap(0x7f8e92224000, 12500992) = 0 writev(1, [{iov_base="#<MatchData \"Accept-Language:\">", iov_len=31}, {iov_base="\n", iov_len=1}], 2) = 32 [...] ``` We can see that the memory retained by Ruby is low (71.6 MB), but `content.match(reg)` used 2_423_108 kbytes (~2.4 GB) in the regex engine. # Findings To reproduce we need: - a greedy quantifier on any character (`.*`) - a union with at least two members (`foo|bar`). More members won't increase the memory consumption further - to perform the match on a long string The memory seems to increase linearly against the string size: |String size |Memory consumed | |--|--| | 11.9 MB | 629_988 kbytes | | 23.8 MB | 1_235_976 kbytes | | 47.7 MB | 2_423_076 kbytes | I was able to reproduce it on old Ruby versions as well. @byroot helped me to put all this together and according to his investigation the memory allocation would be coming from `regexec.c:4100` (https://github.com/ruby/ruby/blob/85092ecd6f5c4d12d0cb1d6dfa7040337a4f558b/r...): ``` size_t match_cache_buf_length = (num_match_cache_points >> 3) + (num_match_cache_points & 7 ? 1 : 0) + 1; uint8_t* match_cache_buf = (uint8_t*)xmalloc(match_cache_buf_length * sizeof(uint8_t)); ``` It seems surprising to use that much memory compared to the string size. Is it expected? -- https://bugs.ruby-lang.org/

Issue #20031 has been updated by mame (Yusuke Endoh). I think this behavior is expected. The regex engine stretches the stack by the number of charaters in the input string because of backtracking. The size of the stack is visible as memory usage. Simply put, `/.*(foo|bar)/` is inefficient because it looks (by recursive calls) for the *last* occurrence of `/foo|bar/` in the string. If the *first* occurrence is acceptable, then `/.*?(foo|bar)/` would be much efficient. Note that the code at regexec.c:4100 is unrelated to this problem, or at least not the dominant factor. This is code for ReDoS protection, introduced in Ruby 3.2, which allocates only 1/8 of the string size at most. The memory consumption in question is reproduced in Ruby 3.1, where this code does not exist. ---------------------------------------- Bug #20031: Regexp using greedy quantifier and unions on a big string uses a lot of memory https://bugs.ruby-lang.org/issues/20031#change-105491 * Author: FabienChaynes (Fabien Chaynes) * Status: Open * Priority: Normal * ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux] * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- Trying to match on some regexp using a greedy quantifier on any character (`.*`) and unions (`foo|bar`) uses a lot of memory if the string we want to match on is big. # Reproduction code ```ruby puts RUBY_DESCRIPTION unless File.exist?("/tmp/fake_email.eml") content = "Accept-Language: fr-FR, en-US#{"0" * 50_000_000}" File.write("/tmp/fake_email.eml", content) end content = File.read("/tmp/fake_email.eml") def print_size(context, sike_kb) puts "#{context}: #{(sike_kb / 1024.0).round(1)} MB" end print_size("baseline", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) print_size("content", content.bytesize / 1024) REGEXP_UNION = Regexp.union( "Accept-Language:", "Thread-Topic:", ).freeze reg = /.*#{REGEXP_UNION}/ p reg GC.start print_size("before match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) allocated = GC.stat(:total_allocated_objects) p content.match(reg) p GC.stat(:total_allocated_objects) - allocated GC.start print_size("after match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) p "---" ``` Output: ```bash $ /usr/bin/time -v ruby full_repro.rb ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux] baseline: 71.3 MB content: 47.7 MB /.*(?-mix:Accept\-Language:|Thread\-Topic:)/ before match?: 71.3 MB #<MatchData "Accept-Language:"> 14 after match?: 71.6 MB "---" Command being timed: "ruby full_repro.rb" User time (seconds): 1.15 System time (seconds): 0.74 Percent of CPU this job got: 100% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.89 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2423108 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 609575 Voluntary context switches: 52 Involuntary context switches: 12 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ``` strace: ```bash $ sudo strace -p 588834 strace: Process 588834 attached [...] mmap(NULL, 249856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92dd3000 mremap(0x7f8e92dd3000, 249856, 495616, MREMAP_MAYMOVE) = 0x7f8e92d5a000 mremap(0x7f8e92d5a000, 495616, 987136, MREMAP_MAYMOVE) = 0x7f8e92c69000 mremap(0x7f8e92c69000, 987136, 1970176, MREMAP_MAYMOVE) = 0x7f8e92a88000 mremap(0x7f8e92a88000, 1970176, 3936256, MREMAP_MAYMOVE) = 0x7f8e926c7000 mremap(0x7f8e926c7000, 3936256, 7868416, MREMAP_MAYMOVE) = 0x7f8e91f46000 mremap(0x7f8e91f46000, 7868416, 15732736, MREMAP_MAYMOVE) = 0x7f8e91045000 mremap(0x7f8e91045000, 15732736, 31461376, MREMAP_MAYMOVE) = 0x7f8e8a8af000 mremap(0x7f8e8a8af000, 31461376, 62918656, MREMAP_MAYMOVE) = 0x7f8e86cae000 mremap(0x7f8e86cae000, 62918656, 125833216, MREMAP_MAYMOVE) = 0x7f8e7f4ad000 mremap(0x7f8e7f4ad000, 125833216, 251662336, MREMAP_MAYMOVE) = 0x7f8e704ac000 mremap(0x7f8e704ac000, 251662336, 503320576, MREMAP_MAYMOVE) = 0x7f8e524ab000 mremap(0x7f8e524ab000, 503320576, 1006637056, MREMAP_MAYMOVE) = 0x7f8e164aa000 mremap(0x7f8e164aa000, 1006637056, 2013270016, MREMAP_MAYMOVE) = 0x7f8d9e4a9000 mremap(0x7f8d9e4a9000, 2013270016, 4026535936, MREMAP_MAYMOVE) = 0x7f8cae4a8000 mmap(NULL, 12500992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92224000 munmap(0x7f8cae4a8000, 4026535936) = 0 munmap(0x7f8e92224000, 12500992) = 0 writev(1, [{iov_base="#<MatchData \"Accept-Language:\">", iov_len=31}, {iov_base="\n", iov_len=1}], 2) = 32 [...] ``` We can see that the memory retained by Ruby is low (71.6 MB), but `content.match(reg)` used 2_423_108 kbytes (~2.4 GB) in the regex engine. # Findings To reproduce we need: - a greedy quantifier on any character (`.*`) - a union with at least two members (`foo|bar`). More members won't increase the memory consumption further - to perform the match on a long string The memory seems to increase linearly against the string size: |String size |Memory consumed | |--|--| | 11.9 MB | 629_988 kbytes | | 23.8 MB | 1_235_976 kbytes | | 47.7 MB | 2_423_076 kbytes | I was able to reproduce it on old Ruby versions as well. @byroot helped me to put all this together and according to his investigation the memory allocation would be coming from `regexec.c:4100` (https://github.com/ruby/ruby/blob/85092ecd6f5c4d12d0cb1d6dfa7040337a4f558b/r...): ``` size_t match_cache_buf_length = (num_match_cache_points >> 3) + (num_match_cache_points & 7 ? 1 : 0) + 1; uint8_t* match_cache_buf = (uint8_t*)xmalloc(match_cache_buf_length * sizeof(uint8_t)); ``` It seems surprising to use that much memory compared to the string size. Is it expected? -- https://bugs.ruby-lang.org/

Issue #20031 has been updated by Eregon (Benoit Daloze). I played around a bit and both time and memory seem linear in the size of the input string on CRuby. However the memory usage is indeed huge, ~2GB for 50MB string, ~4GB for 100MB string, ~8GB for 200MB string, etc. I tried it on TruffleRuby which uses TRegex (a DFA regexp engine) for curiosity, and that uses 235MB for 50MB string, 334MB for 100MB string, 530MB for 200MB string. (with `/usr/bin/time -v ruby --experimental-options --engine.Compilation=false file.rb` to exclude the JIT memory to get more stable numbers). ---------------------------------------- Bug #20031: Regexp using greedy quantifier and unions on a big string uses a lot of memory https://bugs.ruby-lang.org/issues/20031#change-105497 * Author: FabienChaynes (Fabien Chaynes) * Status: Open * Priority: Normal * ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux] * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN ---------------------------------------- Trying to match on some regexp using a greedy quantifier on any character (`.*`) and unions (`foo|bar`) uses a lot of memory if the string we want to match on is big. # Reproduction code ```ruby puts RUBY_DESCRIPTION unless File.exist?("/tmp/fake_email.eml") content = "Accept-Language: fr-FR, en-US#{"0" * 50_000_000}" File.write("/tmp/fake_email.eml", content) end content = File.read("/tmp/fake_email.eml") def print_size(context, sike_kb) puts "#{context}: #{(sike_kb / 1024.0).round(1)} MB" end print_size("baseline", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) print_size("content", content.bytesize / 1024) REGEXP_UNION = Regexp.union( "Accept-Language:", "Thread-Topic:", ).freeze reg = /.*#{REGEXP_UNION}/ p reg GC.start print_size("before match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) allocated = GC.stat(:total_allocated_objects) p content.match(reg) p GC.stat(:total_allocated_objects) - allocated GC.start print_size("after match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) p "---" ``` Output: ```bash $ /usr/bin/time -v ruby full_repro.rb ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux] baseline: 71.3 MB content: 47.7 MB /.*(?-mix:Accept\-Language:|Thread\-Topic:)/ before match?: 71.3 MB #<MatchData "Accept-Language:"> 14 after match?: 71.6 MB "---" Command being timed: "ruby full_repro.rb" User time (seconds): 1.15 System time (seconds): 0.74 Percent of CPU this job got: 100% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.89 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2423108 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 609575 Voluntary context switches: 52 Involuntary context switches: 12 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ``` strace: ```bash $ sudo strace -p 588834 strace: Process 588834 attached [...] mmap(NULL, 249856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92dd3000 mremap(0x7f8e92dd3000, 249856, 495616, MREMAP_MAYMOVE) = 0x7f8e92d5a000 mremap(0x7f8e92d5a000, 495616, 987136, MREMAP_MAYMOVE) = 0x7f8e92c69000 mremap(0x7f8e92c69000, 987136, 1970176, MREMAP_MAYMOVE) = 0x7f8e92a88000 mremap(0x7f8e92a88000, 1970176, 3936256, MREMAP_MAYMOVE) = 0x7f8e926c7000 mremap(0x7f8e926c7000, 3936256, 7868416, MREMAP_MAYMOVE) = 0x7f8e91f46000 mremap(0x7f8e91f46000, 7868416, 15732736, MREMAP_MAYMOVE) = 0x7f8e91045000 mremap(0x7f8e91045000, 15732736, 31461376, MREMAP_MAYMOVE) = 0x7f8e8a8af000 mremap(0x7f8e8a8af000, 31461376, 62918656, MREMAP_MAYMOVE) = 0x7f8e86cae000 mremap(0x7f8e86cae000, 62918656, 125833216, MREMAP_MAYMOVE) = 0x7f8e7f4ad000 mremap(0x7f8e7f4ad000, 125833216, 251662336, MREMAP_MAYMOVE) = 0x7f8e704ac000 mremap(0x7f8e704ac000, 251662336, 503320576, MREMAP_MAYMOVE) = 0x7f8e524ab000 mremap(0x7f8e524ab000, 503320576, 1006637056, MREMAP_MAYMOVE) = 0x7f8e164aa000 mremap(0x7f8e164aa000, 1006637056, 2013270016, MREMAP_MAYMOVE) = 0x7f8d9e4a9000 mremap(0x7f8d9e4a9000, 2013270016, 4026535936, MREMAP_MAYMOVE) = 0x7f8cae4a8000 mmap(NULL, 12500992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92224000 munmap(0x7f8cae4a8000, 4026535936) = 0 munmap(0x7f8e92224000, 12500992) = 0 writev(1, [{iov_base="#<MatchData \"Accept-Language:\">", iov_len=31}, {iov_base="\n", iov_len=1}], 2) = 32 [...] ``` We can see that the memory retained by Ruby is low (71.6 MB), but `content.match(reg)` used 2_423_108 kbytes (~2.4 GB) in the regex engine. # Findings To reproduce we need: - a greedy quantifier on any character (`.*`) - a union with at least two members (`foo|bar`). More members won't increase the memory consumption further - to perform the match on a long string The memory seems to increase linearly against the string size: |String size |Memory consumed | |--|--| | 11.9 MB | 629_988 kbytes | | 23.8 MB | 1_235_976 kbytes | | 47.7 MB | 2_423_076 kbytes | I was able to reproduce it on old Ruby versions as well. @byroot helped me to put all this together and according to his investigation the memory allocation would be coming from `regexec.c:4100` (https://github.com/ruby/ruby/blob/85092ecd6f5c4d12d0cb1d6dfa7040337a4f558b/r...): ``` size_t match_cache_buf_length = (num_match_cache_points >> 3) + (num_match_cache_points & 7 ? 1 : 0) + 1; uint8_t* match_cache_buf = (uint8_t*)xmalloc(match_cache_buf_length * sizeof(uint8_t)); ``` It seems surprising to use that much memory compared to the string size. Is it expected? -- https://bugs.ruby-lang.org/

Issue #20031 has been updated by mame (Yusuke Endoh). Tracker changed from Bug to Feature Status changed from Open to Feedback ruby -v deleted (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]) Backport deleted (3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN) This is not a bug. If anyone creates a patch to improve this, we may consider it. We could reduce the stack size consumed, or add a `/.*(foo|bar)/` pattern-specific optimization, if this pattern is frequent. ---------------------------------------- Feature #20031: Regexp using greedy quantifier and unions on a big string uses a lot of memory https://bugs.ruby-lang.org/issues/20031#change-105781 * Author: FabienChaynes (Fabien Chaynes) * Status: Feedback * Priority: Normal ---------------------------------------- Trying to match on some regexp using a greedy quantifier on any character (`.*`) and unions (`foo|bar`) uses a lot of memory if the string we want to match on is big. # Reproduction code ```ruby puts RUBY_DESCRIPTION unless File.exist?("/tmp/fake_email.eml") content = "Accept-Language: fr-FR, en-US#{"0" * 50_000_000}" File.write("/tmp/fake_email.eml", content) end content = File.read("/tmp/fake_email.eml") def print_size(context, sike_kb) puts "#{context}: #{(sike_kb / 1024.0).round(1)} MB" end print_size("baseline", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) print_size("content", content.bytesize / 1024) REGEXP_UNION = Regexp.union( "Accept-Language:", "Thread-Topic:", ).freeze reg = /.*#{REGEXP_UNION}/ p reg GC.start print_size("before match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) allocated = GC.stat(:total_allocated_objects) p content.match(reg) p GC.stat(:total_allocated_objects) - allocated GC.start print_size("after match?", `ps ax -o pid,rss | grep -E "^[[:space:]]*#{$$}"`.strip.split.map(&:to_i).last) p "---" ``` Output: ```bash $ /usr/bin/time -v ruby full_repro.rb ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux] baseline: 71.3 MB content: 47.7 MB /.*(?-mix:Accept\-Language:|Thread\-Topic:)/ before match?: 71.3 MB #<MatchData "Accept-Language:"> 14 after match?: 71.6 MB "---" Command being timed: "ruby full_repro.rb" User time (seconds): 1.15 System time (seconds): 0.74 Percent of CPU this job got: 100% Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.89 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 2423108 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 609575 Voluntary context switches: 52 Involuntary context switches: 12 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 ``` strace: ```bash $ sudo strace -p 588834 strace: Process 588834 attached [...] mmap(NULL, 249856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92dd3000 mremap(0x7f8e92dd3000, 249856, 495616, MREMAP_MAYMOVE) = 0x7f8e92d5a000 mremap(0x7f8e92d5a000, 495616, 987136, MREMAP_MAYMOVE) = 0x7f8e92c69000 mremap(0x7f8e92c69000, 987136, 1970176, MREMAP_MAYMOVE) = 0x7f8e92a88000 mremap(0x7f8e92a88000, 1970176, 3936256, MREMAP_MAYMOVE) = 0x7f8e926c7000 mremap(0x7f8e926c7000, 3936256, 7868416, MREMAP_MAYMOVE) = 0x7f8e91f46000 mremap(0x7f8e91f46000, 7868416, 15732736, MREMAP_MAYMOVE) = 0x7f8e91045000 mremap(0x7f8e91045000, 15732736, 31461376, MREMAP_MAYMOVE) = 0x7f8e8a8af000 mremap(0x7f8e8a8af000, 31461376, 62918656, MREMAP_MAYMOVE) = 0x7f8e86cae000 mremap(0x7f8e86cae000, 62918656, 125833216, MREMAP_MAYMOVE) = 0x7f8e7f4ad000 mremap(0x7f8e7f4ad000, 125833216, 251662336, MREMAP_MAYMOVE) = 0x7f8e704ac000 mremap(0x7f8e704ac000, 251662336, 503320576, MREMAP_MAYMOVE) = 0x7f8e524ab000 mremap(0x7f8e524ab000, 503320576, 1006637056, MREMAP_MAYMOVE) = 0x7f8e164aa000 mremap(0x7f8e164aa000, 1006637056, 2013270016, MREMAP_MAYMOVE) = 0x7f8d9e4a9000 mremap(0x7f8d9e4a9000, 2013270016, 4026535936, MREMAP_MAYMOVE) = 0x7f8cae4a8000 mmap(NULL, 12500992, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f8e92224000 munmap(0x7f8cae4a8000, 4026535936) = 0 munmap(0x7f8e92224000, 12500992) = 0 writev(1, [{iov_base="#<MatchData \"Accept-Language:\">", iov_len=31}, {iov_base="\n", iov_len=1}], 2) = 32 [...] ``` We can see that the memory retained by Ruby is low (71.6 MB), but `content.match(reg)` used 2_423_108 kbytes (~2.4 GB) in the regex engine. # Findings To reproduce we need: - a greedy quantifier on any character (`.*`) - a union with at least two members (`foo|bar`). More members won't increase the memory consumption further - to perform the match on a long string The memory seems to increase linearly against the string size: |String size |Memory consumed | |--|--| | 11.9 MB | 629_988 kbytes | | 23.8 MB | 1_235_976 kbytes | | 47.7 MB | 2_423_076 kbytes | I was able to reproduce it on old Ruby versions as well. @byroot helped me to put all this together and according to his investigation the memory allocation would be coming from `regexec.c:4100` (https://github.com/ruby/ruby/blob/85092ecd6f5c4d12d0cb1d6dfa7040337a4f558b/r...): ``` size_t match_cache_buf_length = (num_match_cache_points >> 3) + (num_match_cache_points & 7 ? 1 : 0) + 1; uint8_t* match_cache_buf = (uint8_t*)xmalloc(match_cache_buf_length * sizeof(uint8_t)); ``` It seems surprising to use that much memory compared to the string size. Is it expected? -- https://bugs.ruby-lang.org/
participants (3)
-
Eregon (Benoit Daloze)
-
FabienChaynes (Fabien Chaynes)
-
mame (Yusuke Endoh)