API: pd.StringDtype.value_counts should return pd.Int64Dtype #59346

Open
@WillAyd

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

From discussion in https://github.com/pandas-dev/pandas/pull/59330/files#r1695250192

The fact that value_counts on a pd.StringDtype(storage="pyarrow") returns an int64[pyarrow] is a mistake in our API. While this is well intentioned for performance reasons, it introduces an inconsistency with our extension types.
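A minimal sketch for observing the behavior described above. The helper name `value_counts_dtype` is hypothetical, and the dtype reported for the pyarrow-backed case depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd


def value_counts_dtype(storage):
    """Return the dtype of value_counts() for a StringDtype with the given storage.

    Hypothetical helper for illustration only; it is not part of pandas.
    """
    ser = pd.Series(["a", "b", "a", None], dtype=pd.StringDtype(storage=storage))
    return ser.value_counts().dtype


print("python :", value_counts_dtype("python"))

try:
    # Constructing the pyarrow-backed dtype raises ImportError without pyarrow.
    print("pyarrow:", value_counts_dtype("pyarrow"))
except ImportError:
    print("pyarrow: not installed, skipping")
```

If the two printed dtypes differ, that is the inconsistency this issue is about.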

If we continue to have pd.StringDtype.value_counts return a pd.Int64Dtype(), we can leave it as an implementation detail whether that Int64Dtype is backed by pyarrow or numpy. The user interface would not need to change at all; users could simply install (or not install) pyarrow and things should continue to work the same. In the long term this would also align better with PDEP-13 (#58455).

Feature Description

Make algorithms for the StringDtype that use pd.NA as the missing value sentinel consistently return other pandas extension types, and leave it as an implementation detail whether those are backed by pyarrow or not.
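From the user's side, the requested behavior could be approximated today with a sketch like the one below. `value_counts_int64` is a hypothetical helper, not a pandas API, and it only mimics the proposed return dtype; it says nothing about how pandas would implement this internally.

```python
import pandas as pd


def value_counts_int64(ser):
    """Count values and normalize the counts to pd.Int64Dtype().

    Hypothetical user-side helper; the cast is a no-op when the counts
    are already Int64.
    """
    counts = ser.value_counts()
    return counts.astype(pd.Int64Dtype())


ser = pd.Series(["a", "b", "a", None], dtype=pd.StringDtype(storage="python"))
counts = value_counts_int64(ser)
print(counts.dtype)
```

With the proposal in place, the extra `astype` would be unnecessary regardless of which storage backs the strings.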

Alternative Solutions

status quo

Additional Context

No response

Metadata

    Labels

    API - Consistency: Internal Consistency of API/Behavior
    API Design
    Arrow: pyarrow functionality
    Dtype Conversions: Unexpected or buggy dtype conversions
    ExtensionArray: Extending pandas with custom dtypes or arrays.
