
Issue #21518 has been updated by Eregon (Benoit Daloze).

mrkn (Kenta Murata) wrote in #note-12:
> In general, adding only `mean` (I prefer `mean` over `average`, see below) and `median` won't cover real-world statistical needs. When a sample mean is required, variance or standard deviation usually follow; where a sample median is used, quantiles or percentiles typically follow. Truly “median-only” scenarios are rare in my experience.

I think `mean` and `median` are frequently needed (at least I have reimplemented them many times) and would be worth adding to Array. I'm not sure there is value in adding them to Enumerable instead of Array (an implementation on Enumerable would be much slower).

I typically use the [median absolute deviation](https://en.wikipedia.org/wiki/Median_absolute_deviation) as a robust measure of variability when using the median, and that can be trivially implemented on top of `#median`. So for that case, only `median` is enough.

Variance and standard deviation are not robust and are over-influenced by outliers, so I think it would make sense not to provide them, because they are often no longer recommended.

mrkn (Kenta Murata) wrote in #note-12:
> avoid full sort in favor of selection algorithms such as quickselect.

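
For illustration only, here is a minimal sketch of what a quickselect-based median could look like in plain Ruby. The method names `quickselect` and `quickselect_median` are hypothetical and are not part of Ruby core or of the proposal. The sketch allocates intermediate arrays for clarity, so it shows the idea (expected linear time, no full sort) rather than a tuned implementation.

```ruby
# Hypothetical helper (not in Ruby core): returns the k-th smallest
# element (0-based) of a non-empty array of mutually comparable numbers.
# Expected O(n) time thanks to the random pivot.
def quickselect(values, k)
  pivot = values.sample
  smaller = values.select { |x| x < pivot }
  larger  = values.select { |x| x > pivot }
  equal_count = values.size - smaller.size - larger.size

  if k < smaller.size
    quickselect(smaller, k)                  # answer is left of the pivot
  elsif k < smaller.size + equal_count
    pivot                                    # answer equals the pivot
  else
    quickselect(larger, k - smaller.size - equal_count)
  end
end

# Hypothetical median built on quickselect; returns nil for an empty array.
def quickselect_median(values)
  return nil if values.empty?

  mid = values.size / 2
  upper = quickselect(values, mid)
  return upper if values.size.odd?

  lower = quickselect(values, mid - 1)
  (lower + upper) / 2.0
end

quickselect_median([1, 3, 2])     # => 2
quickselect_median([1, 2, 3, 10]) # => 2.5
```

Even this simplified version needs a pivot strategy, duplicate handling, and a second selection for even-sized input, which is noticeably more than a one-liner.
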
That seems like one good reason to add it in core: the optimal algorithm is actually non-trivial and cannot easily be written as a Ruby one-liner for `median`.
`mean` is trivial but would still be nice to provide given that it's so frequently used (also, `data.sum / data.size.to_f` is not so pretty).

Percentiles would be nice, especially if there is a more efficient algorithm for them than just sorting + indexing.
Percentiles are frequently useful, e.g. to characterize response time/latency, and also for boxplots.
They are also a more robust way to measure variability than the standard deviation (e.g. with quartiles, so just the 25th and 75th percentiles).

----------------------------------------
Feature #21518: Statistical helpers to `Enumerable`
https://bugs.ruby-lang.org/issues/21518#change-114374

* Author: Amitleshed (Amit Leshed)
* Status: Open
----------------------------------------
**Summary**

I'd like to add two statistical helpers to `Enumerable`:

- `Enumerable#average` (arithmetic mean)
- `Enumerable#median`

Both are small, well-defined operations that many Rubyists re-implement in apps and gems. Providing them in core avoids repeated, ad-hoc code and aligns with `Enumerable#sum`, which Ruby already ships.

**Motivation**

- These are among the most common “roll-your-own” helpers for arrays/ranges of numbers.
- They are conceptually simple and universally useful beyond web/Rails.
- Similar to `sum`, they’re primitives for quick data analysis, ETL scripts, CLI tooling, etc.
- Including them encourages consistent semantics (what to do with empty sets, mixed numerics, etc.).

## Proposed API & Semantics

```ruby
Enumerable#average -> Float or nil
Enumerable#median  -> Numeric or nil
```

```ruby
[1, 2, 3, 4].average   # => 2.5
(1..4).average         # => 2.5
[].average             # => nil

[1, 3, 2].median       # => 2
[1, 2, 3, 10].median   # => 2.5
(1..6).median          # => 3.5
[].median              # => nil
```

Ruby implementation

```ruby
module Enumerable
  def average
    count = 0
    total = 0.0

    each do |x|
      raise TypeError, "non-numeric value for average" unless x.is_a?(Numeric)
      total += x
      count += 1
    end

    count.zero? ? nil : total / count
  end

  def median
    arr = to_a
    return nil if arr.empty?

    arr.each { |x| raise TypeError, "non-numeric value for median" unless x.is_a?(Numeric) }
    # Sort a copy: Array#to_a returns self, so sort! would mutate the receiver.
    sorted = arr.sort
    mid = sorted.length / 2
    sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
  end
end
```

**Upon approval I'm more than willing to implement spec and code in C.**

-- 
https://bugs.ruby-lang.org/