Skip to content

BUG: df.agg(sum, axis=1) gives wrong result when Nan value is in frame #21134

Closed
@topper-123

Description

@topper-123

Using the built-in sum function gives correct result in df.agg(sum, axis=0), but wrong result in df.agg(sum, axis=1).

>>> df = pd.DataFrame([[np.nan, 2], [3, 4]])
>>> df.agg(sum, axis=0)
0    3.0
1    6.0
dtype: float64
>>> df.agg(sum, axis=1)
0    NaN
1    7.0
dtype: float64

The NaN in the last example should be 2.0.

Also, operation using the builtin sum in agg with axis=1 are very slow:

>>> n = 1_000
>>> df = pd.DataFrame({'a': range(n), 'b': range(n)})
>>> %timeit df.agg(sum, axis=1)
16.5 ms ± 25.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %timeit df.T.agg(sum)  # correct result *and* faster
312 µs ± 3.17 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Problem description

Currently, df.agg(func, axis=1) defers to df.apply(func, axis=1). This is not done for axis=0, and the operation may therefore give unexpected results and slow the operation down (because df.apply can be very slow).

Expected Output

The expected output is:

>>> df.agg(sum, axis=1)
0    2.0
1    7.0
dtype: float64

Solution proposal

I'm thinking about putting in df.T.agg(func, axis=0) rather than df.apply(func, axis=1) in a few strategic places. This should ensure both getting correct results and faster operations. will report back if this succeeds.

Metadata

Metadata

Assignees

No one assigned

    Labels

    AlgosNon-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diffCompatpandas objects compatability with Numpy or Python functions

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions