BUG: groupby.agg with udf incorrectly inferring pyarrow timestamp dtype #49163

Closed
@jbrockmendel

Description

# based on test_agg_tzaware_non_datetime_result
import pyarrow as pa
import pandas as pd
import pandas._testing as tm

# tz-aware timestamp column backed by pyarrow
dtype = pd.ArrowDtype(pa.timestamp(unit="ns", tz="UTC"))
dti = pd.date_range("2012-01-01", periods=4, tz="UTC").astype(dtype)
df = pd.DataFrame({"a": [0, 0, 1, 1], "b": dti})

gb = df.groupby("a")

# the UDF returns a plain integer (the year), not a timestamp
result = gb["b"].agg(lambda x: x.iloc[0].year)
expected = pd.Series([2012, 2012], name="b")
expected.index.name = "a"
tm.assert_series_equal(result, expected)  # <- raises

The agg call infers the result dtype by calling maybe_cast_pointwise_result, which has special-casing for pd.DatetimeTZDtype (something we've wanted to get rid of for a while).

AFAICT the way to fix this is to make _from_sequence stricter; xref #33254.
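To illustrate what "stricter" could mean, here is a minimal sketch using the non-pyarrow pd.DatetimeTZDtype as an analogue, so it runs without pyarrow. It shows that lenient construction accepts plain integers as a tz-aware datetime array, reading them as nanoseconds since the epoch. It then contrasts this with a guard that only casts back when every value is datetime-like. The helper cast_back_if_datetimelike is hypothetical, written for illustration only. It is not pandas API and not necessarily how the fix would be implemented.

```python
import datetime as dt

import pandas as pd


def cast_back_if_datetimelike(values, dtype):
    """Hypothetical helper (not pandas API): cast to `dtype` only when
    every value is already datetime-like; otherwise let pandas infer."""
    if all(isinstance(v, dt.datetime) or v is pd.NaT for v in values):
        return pd.array(values, dtype=dtype)
    return pd.array(values)


tz_dtype = pd.DatetimeTZDtype(tz="UTC")

# Lenient path: the integers are accepted and read as nanoseconds since
# the epoch, so the "years" silently become timestamps in 1970.
lenient = pd.array([2012, 2012], dtype=tz_dtype)

# Guarded path: integers are not datetime-like, so no cast back happens.
strict = cast_back_if_datetimelike([2012, 2012], tz_dtype)

# Datetime-like values are still cast back to the original dtype.
kept = cast_back_if_datetimelike([pd.Timestamp("2012-01-01", tz="UTC")], tz_dtype)

print(lenient.dtype, strict.dtype, kept.dtype)
```

With a check along these lines, an integer-valued UDF result would keep an integer dtype rather than being reinterpreted as timestamps.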

Labels: Apply (Apply, Aggregate, Transform, Map); Arrow (pyarrow functionality); Bug; Dtype Conversions (Unexpected or buggy dtype conversions); Groupby
