[ruby-core:116941] [Ruby master Bug#20301] `Set#add?` does two hash look-ups

Issue #20301 has been reported by AMomchilov (Alexander Momchilov). ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#update_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.update_value(o, true) end ``` Here's a PR: https://github.com/ruby/ruby/pull/10093 How much of a benefit this has depends on two things: 1. How much `#hash` is called, which depends on how many new objects are added to the set. * If every object is new, then `#hash` is called twice on every `#add?`. This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement * Every other case lies somewhere in between those two. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. Here is a summary of the benchmark results: | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | -- https://bugs.ruby-lang.org/

Issue #20301 has been updated by Dan0042 (Daniel DeLorme). Now I understand why you proposed #20300 Hash#update_value However I'd like to suggest an alternative approach for your consideration: ```ruby def add?(k) added = nil @hash.add(k){ added = true } #call block only if k not in @hash; return existing or added value self if added end ``` This is likely to be a bit less efficient than your approach, however `Hash#add` is a method I've been missing from ruby for a long time, and would find infinitely more useful than `Hash#update_value` ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301#change-106990 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#update_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.update_value(o, true) end ``` Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 # Theory How much of a benefit this has depends on 2 factors: 1. How much `#hash` is called, which depends on how many **new** objects are added to the set. * If every object is new, then `#hash` used to be called twice on every `#add?`. * This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. * Every other case lies somewhere in between those two, depending on the % of objects which are new. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. # Benchmark summary | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093 -- https://bugs.ruby-lang.org/

Issue #20301 has been updated by Eregon (Benoit Daloze). @Dan0042 That's related to #17342 then, and also known as `compute_if_absent` in [concurrent-ruby](https://ruby-concurrency.github.io/concurrent-ruby/1.1.5/Concurrent/Map.html...). ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301#change-106994 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#update_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.update_value(o, true) end ``` Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 # Theory How much of a benefit this has depends on 2 factors: 1. How much `#hash` is called, which depends on how many **new** objects are added to the set. * If every object is new, then `#hash` used to be called twice on every `#add?`. * This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. * Every other case lies somewhere in between those two, depending on the % of objects which are new. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. # Benchmark summary | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093 -- https://bugs.ruby-lang.org/

Issue #20301 has been updated by AMomchilov (Alexander Momchilov). I don't mind it @Dan0042, but that's a secondary issue IMO. The block call defeats the benefit of this optimization. It'll even slow down the case where you're looking up pre-existing objects (that's currently net-even perf after these changes), and that's a big no-no. ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301#change-107002 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#update_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.update_value(o, true) end ``` Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 # Theory How much of a benefit this has depends on 2 factors: 1. How much `#hash` is called, which depends on how many **new** objects are added to the set. * If every object is new, then `#hash` used to be called twice on every `#add?`. * This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. * Every other case lies somewhere in between those two, depending on the % of objects which are new. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. # Benchmark summary | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093 -- https://bugs.ruby-lang.org/

Issue #20301 has been updated by shyouhei (Shyouhei Urabe). Why not: ```ruby def add?(o) n = size add(o) m = size return n == m ? self : nil end ``` This implementation involves only one hash lookup. ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301#change-107217 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#exchange_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.exchange_value(o, true) end ``` Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 # Theory How much of a benefit this has depends on 2 factors: 1. How much `#hash` is called, which depends on how many **new** objects are added to the set. * If every object is new, then `#hash` used to be called twice on every `#add?`. * This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. * Every other case lies somewhere in between those two, depending on the % of objects which are new. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. # Benchmark summary | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093 -- https://bugs.ruby-lang.org/

Issue #20301 has been updated by nobu (Nobuyoshi Nakada). shyouhei (Shyouhei Urabe) wrote in #note-8:
```ruby return n == m ? self : nil ```
The return value is inverse. ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301#change-107220 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#exchange_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.exchange_value(o, true) end ``` Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 # Theory How much of a benefit this has depends on 2 factors: 1. How much `#hash` is called, which depends on how many **new** objects are added to the set. * If every object is new, then `#hash` used to be called twice on every `#add?`. * This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. * Every other case lies somewhere in between those two, depending on the % of objects which are new. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. # Benchmark summary | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093 -- https://bugs.ruby-lang.org/

Issue #20301 has been updated by shyouhei (Shyouhei Urabe). nobu (Nobuyoshi Nakada) wrote in #note-9:
shyouhei (Shyouhei Urabe) wrote in #note-8:
```ruby return n == m ? self : nil ```
The return value is inverse.
My bad. Thank you for correction. ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301#change-107221 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#exchange_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.exchange_value(o, true) end ``` Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 # Theory How much of a benefit this has depends on 2 factors: 1. How much `#hash` is called, which depends on how many **new** objects are added to the set. * If every object is new, then `#hash` used to be called twice on every `#add?`. * This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. * Every other case lies somewhere in between those two, depending on the % of objects which are new. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. # Benchmark summary | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093 -- https://bugs.ruby-lang.org/

Issue #20301 has been updated by Eregon (Benoit Daloze). That implementation using `size` is not thread-safe, even on CRuby AFAIK. For example, if T2 calls `add?` with a new element while T1 calls `add?` with an existing element. If T1 is just before `m = size` when T2 executes `add(o)`, then both threads return "element added" but T1 did not add an element (incorrect result). The original code with two lookups does not have that race condition. However it can have the race condition that two threads adding the same new element both return "element added". `Hash#exchange_value` would fix that. ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301#change-107267 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#exchange_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.exchange_value(o, true) end ``` Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 # Theory How much of a benefit this has depends on 2 factors: 1. How much `#hash` is called, which depends on how many **new** objects are added to the set. * If every object is new, then `#hash` used to be called twice on every `#add?`. * This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. * Every other case lies somewhere in between those two, depending on the % of objects which are new. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. # Benchmark summary | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093 -- https://bugs.ruby-lang.org/

Issue #20301 has been updated by shyouhei (Shyouhei Urabe). Yes. `add(o) unless include?(o)` isn't thread safe already. My implementation just doesn't care to improve that. ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301#change-107279 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#exchange_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.exchange_value(o, true) end ``` Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 # Theory How much of a benefit this has depends on 2 factors: 1. How much `#hash` is called, which depends on how many **new** objects are added to the set. * If every object is new, then `#hash` used to be called twice on every `#add?`. * This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. * Every other case lies somewhere in between those two, depending on the % of objects which are new. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. # Benchmark summary | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093 -- https://bugs.ruby-lang.org/

Issue #20301 has been updated by AMomchilov (Alexander Momchilov). shyouhei (Shyouhei Urabe) wrote in #note-8:
Why not:
Because I didn't think of that :) I would be okay with it, but I think the thread safety issue is also worth solving. The implementation I'm proposing solves both the performance and safety problems. ---------------------------------------- Bug #20301: `Set#add?` does two hash look-ups https://bugs.ruby-lang.org/issues/20301#change-107281 * Author: AMomchilov (Alexander Momchilov) * Status: Open * Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN ---------------------------------------- A common usage of `Set`s is to keep track of seen objects, and do something different whenever an object is seen for the first time, e.g.: ```ruby SEEN_VALUES = Set.new def receive_value(value) if SEEN_VALUES.add?(value) puts "Saw #{value} for the first time." else puts "Already seen #{value}, ignoring." end end receive_value(1) # Saw 1 for the first time. receive_value(2) # Saw 2 for the first time. receive_value(3) # Saw 3 for the first time. receive_value(1) # Already seen 1, ignoring. ``` Readers might reasonably assume that `add?` is only looking up into the set a single time, but it's actually doing two separate look-ups! ([source](https://github.com/ruby/ruby/blob/c976cb5/lib/set.rb#L517-L525)) ```rb class Set def add?(o # 1. `include?(o)` looks up into `@hash` # 2. if the value isn't there, `add(o)` does a second look-up into `@hash` add(o) unless include?(o) end end ``` This gets especially expensive if the values are large hash/arrays/objects, whose `#hash` is expensive to compute. We can optimize this if it was possible to set a value in hash, *and* retrieve the value that was already there, in a single go. I propose adding `Hash#exchange_value` to do exactly that. If that existed, we can re-implement `#add?` as: ```rb class Set def add?(o) # Only requires a single look-up into `@hash`! self unless @hash.exchange_value(o, true) end ``` Here's a proof-of-concept implementation: https://github.com/ruby/ruby/pull/10093 # Theory How much of a benefit this has depends on 2 factors: 1. How much `#hash` is called, which depends on how many **new** objects are added to the set. * If every object is new, then `#hash` used to be called twice on every `#add?`. * This is where this improvement makes the biggest (2x!) change. * If every object has already been seen, then `#hash` was never being called twice before anyway, so there would be no improvement. * It's important to not regress in this case, because many use cases of sets don't deal with many distinct objects, but just need to do quick checks against an existing set. * Every other case lies somewhere in between those two, depending on the % of objects which are new. 2. How slow `#hash` is to compute for the key * If the hash is slow to compute, this change will make a bigger improvement * If the hash value is fast to compute, then it won't matter as much. Even if we called it half as much, it's a minority of the total time, so it won't have much net impact. # Benchmark summary | | All objects are new | All objects are preexisting | |---------------------------|-------:|------:| | objects with slow `#hash` | 100.0% | ~0.0% | | objects with fast `#hash` | 24.5% | 4.6% | As we see, this change makes a huge improvement the cases where it helps, and crucially, doesn't slow down the cases where it can't. For the complete benchmark source code and results, see the PR: https://github.com/ruby/ruby/pull/10093 -- https://bugs.ruby-lang.org/
participants (5)
-
AMomchilov (Alexander Momchilov)
-
Dan0042 (Daniel DeLorme)
-
Eregon (Benoit Daloze)
-
nobu (Nobuyoshi Nakada)
-
shyouhei (Shyouhei Urabe)