Closed
Now that we have masked implementations of sum, prod, min and max for the nullable integer array (first PR #30982, now living at https://github.com/pandas-dev/pandas/blob/master/pandas/core/array_algos/masked_reductions.py), we can add one for the mean reduction as well.
A very rough check gives a nice speed-up (about 3.5x on this example, 7.26 ms vs 2.08 ms):
In [27]: arr = pd.array(np.random.randint(0, 1000, 1_000_000), dtype="Int64")
In [28]: arr[np.random.randint(0, 1_000_000, 1000)] = pd.NA
In [30]: arr._reduce("mean")
Out[30]: 499.27095868772903
In [31]: %timeit arr._reduce("mean")
7.26 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [32]: arr._data.sum(where=~arr._mask, dtype="float64") / (~arr._mask).sum()
Out[32]: 499.27095868772903
In [33]: %timeit arr._data.sum(where=~arr._mask, dtype="float64") / (~arr._mask).sum()
2.08 ms ± 6.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The nanmean version lives here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/nanops.py#L517
As a reference, numpy is also adding a version that accepts a mask: numpy/numpy#15852 (which could be used in the future, and can serve as inspiration for the implementation now).
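
For illustration, here is a minimal sketch of what such a masked mean could look like, built on the same `sum(where=...)` expression as in the timing above. The name `masked_mean` and its `skipna` parameter are hypothetical and not taken from `masked_reductions.py`, and the sketch returns `np.nan` as a stand-in where pandas would presumably return `pd.NA`. It assumes a NumPy version in which `ndarray.sum` accepts `where` (1.17 or later).

```python
import numpy as np


def masked_mean(values, mask, skipna=True):
    """Hypothetical helper: mean of `values`, ignoring entries where `mask` is True.

    values : 1-D ndarray holding the data (e.g. arr._data)
    mask   : 1-D boolean ndarray, True marks a missing value (e.g. arr._mask)
    """
    if not skipna and mask.any():
        # Assumption: a real implementation would return pd.NA here.
        return np.nan

    valid = ~mask
    count = valid.sum()
    if count == 0:
        # All values missing (or empty input): no mean to compute.
        return np.nan

    # Accumulate in float64 so integer inputs yield a float result.
    return values.sum(where=valid, dtype="float64") / count


# Small usage example with one masked entry.
data = np.array([1, 2, 3, 4], dtype="int64")
missing = np.array([False, True, False, False])
print(masked_mean(data, missing))
```

The early return for `count == 0` avoids a division by zero when every value is masked. Whether that case should give NA or raise is a design choice left open here.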