ENH: add masked algorithm for mean() #34754

Closed
@jorisvandenbossche

Similar to the masked implementations we now have for sum, prod, min and max on the nullable integer array (first PR #30982, now lives at https://github.com/pandas-dev/pandas/blob/master/pandas/core/array_algos/masked_reductions.py), we can add one for the mean reduction as well.

Very rough check gives a nice speed-up:

In [27]: arr = pd.array(np.random.randint(0, 1000, 1_000_000), dtype="Int64") 

In [28]: arr[np.random.randint(0, 1_000_000, 1000)] = pd.NA 

In [30]: arr._reduce("mean") 
Out[30]: 499.27095868772903

In [31]: %timeit arr._reduce("mean") 
7.26 ms ± 335 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [32]: arr._data.sum(where=~arr._mask, dtype="float64") / (~arr._mask).sum() 
Out[32]: 499.27095868772903

In [33]: %timeit arr._data.sum(where=~arr._mask, dtype="float64") / (~arr._mask).sum()  
2.08 ms ± 6.89 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
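
For illustration, the one-liner timed above could be wrapped in a small helper along these lines. This is only a sketch: the name `masked_mean`, its signature, and the choice to return `np.nan` for the all-masked and `skipna=False` cases are assumptions made here, not the existing pandas API (a real implementation would presumably return `pd.NA`). It relies on the `where=` argument of `ndarray.sum`, so it needs NumPy 1.17 or newer.

```python
import numpy as np


def masked_mean(values: np.ndarray, mask: np.ndarray, skipna: bool = True) -> float:
    """Hypothetical helper: mean of `values`, skipping entries where `mask` is True."""
    if not skipna and mask.any():
        # Assumption: any missing value propagates when skipna=False.
        return np.nan
    valid = ~mask
    count = valid.sum()
    if count == 0:
        # Assumption: an all-missing array has no defined mean.
        return np.nan
    # Sum only the valid entries, accumulating in float64 as in the timing above.
    total = values.sum(where=valid, dtype="float64")
    return total / count


values = np.array([1, 2, 3, 4], dtype="int64")
mask = np.array([False, True, False, False])
result = masked_mean(values, mask)  # mean of 1, 3 and 4
```

Working directly on the data and mask avoids first converting to a float array with NaN, which is likely where most of the speed-up in the timings above comes from.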

The nanmean version lives here: https://github.com/pandas-dev/pandas/blob/master/pandas/core/nanops.py#L517
For reference, numpy is also adding a version that accepts a mask: numpy/numpy#15852 (which could be used in the future, and can serve as inspiration for the implementation now).
