DEPR: Special casing of NumPy and Python builtin functions #53425

Closed
@rhshadrach

Currently, various code paths detect whether a user passes e.g. Python's builtin sum or np.sum and replace the operation with the pandas equivalent. I think we shouldn't alias a potentially valid UDF to the pandas version; this prevents users from using the NumPy version of the method if that is indeed what they want. This can lead to some surprising behavior: for example, for std and var, NumPy's ddof defaults to 0 whereas pandas' defaults to 1. We also only do the switch when there are no args/kwargs provided. Thus:

df.groupby("a").agg(np.sum)  # Uses pandas sum with a numeric_only argument
df.groupby("a").agg(np.sum, numeric_only=True)  # Uses np.sum; raises since numeric_only is not an argument
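To make the ddof point concrete, here is a minimal sketch that calls each implementation directly rather than going through agg, so it does not depend on whether the aliasing is in effect. The sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only; any non-constant values show the same gap.
ser = pd.Series([1.0, 2.0, 3.0, 4.0])

# NumPy defaults to ddof=0 (population variance).
numpy_var = np.var(ser.to_numpy())

# pandas defaults to ddof=1 (sample variance).
pandas_var = ser.var()

# Passing ddof explicitly makes the two agree.
pandas_var_ddof0 = ser.var(ddof=0)

print(numpy_var, pandas_var, pandas_var_ddof0)
```

If np.var or np.std passed to agg is silently swapped for the pandas method, the user gets the ddof=1 result even though they asked for the NumPy function.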

It can also lead to different floating point behavior:

import numpy as np
import pandas as pd

np.random.seed(26)
size = 100000
df = pd.DataFrame(
    {
        "a": 0,
        "b": np.random.random(size),
    }
)
gb = df.groupby("a")
print(gb.sum().iloc[0, 0])
# 50150.538337372185
print(df["b"].to_numpy().sum())
# 50150.53833737219
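A difference in the last digit like this is what one would expect when two implementations accumulate the same floats in a different order or with a different summation scheme (NumPy's documentation notes that np.sum may use pairwise summation). The sketch below is not the pandas or NumPy algorithm; it only illustrates, with plain Python floats, that floating point addition is not associative.

```python
import math

# Same three numbers, two groupings.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

print(left_to_right)   # 0.6000000000000001
print(right_to_left)   # 0.6

# The results differ in the last bits but agree to within rounding error.
print(left_to_right == right_to_left)              # False
print(math.isclose(left_to_right, right_to_left))  # True
```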

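However, calling the Python builtin functions directly on DataFrames can itself give some surprising behavior, because iterating over a DataFrame yields its column labels rather than its values: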

ser = pd.Series([1, 1, 2], name="a")
print(max(ser))
# 2

df = pd.DataFrame({'a': [1, 1, 2], 'b': [3, 4, 5]})
print(max(df))
# b

If we remove special casing of these builtins, then apply/agg/transform will also exhibit this behavior. In particular, whether a DataFrame is broken up into Series internally to compute the operation will determine whether we get a result like 2 or b above.
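The two outcomes can be reproduced outside of apply/agg/transform. The sketch below, using the same frame as above, contrasts the builtin applied to the whole DataFrame with the builtin applied to each column as a Series; which of the two a given pandas code path would do internally is not something this example asserts.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

# Iterating a DataFrame yields column labels, so max() compares "a" and "b".
labels = list(df)
frame_result = max(df)

# Applied column by column, max() iterates over each Series' values instead.
column_results = {name: max(df[name]) for name in df.columns}

# The pandas method always reduces over the values.
pandas_result = df.max()

print(labels, frame_result, column_results)
print(pandas_result)
```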

However, I think this is still okay to do as the change for users is straightforward: use the string "max" instead of the builtin max.
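As a sketch of what that migration could look like in user code: the string alias requests the pandas implementation explicitly, and a user who really wants the NumPy function can wrap it in a lambda so that it is treated as an ordinary UDF. The lambda form is an illustration, not something prescribed in this issue.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})
gb = df.groupby("a")["b"]

# Explicitly ask for the pandas implementation via its string alias.
by_alias = gb.agg("max")

# Explicitly ask for NumPy by wrapping the call; to_numpy() ensures
# np.max operates on a plain ndarray of the group's values.
by_numpy = gb.agg(lambda s: np.max(s.to_numpy()))

print(by_alias)
print(by_numpy)
```

Both spellings are unambiguous about which implementation runs, which is the property that passing the bare builtin or np.max lacks while it is being aliased.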

I've opened #53426 to see what impact this would have on our test suite.

cc @topper-123 @mroeschke @jbrockmendel

Labels: Apply (Apply, Aggregate, Transform, Map), Deprecate (Functionality to remove in pandas)