
DISCUSS/API: setitem-like operations should only update inplace and never fallback with upcast (i.e never change the dtype) #39584

Closed
@jorisvandenbossche


Currently, setitem-like operations (i.e. operations that change values in an existing Series or DataFrame, such as __setitem__ and .loc/.iloc setitem, or filling methods like fillna) first try to update in place. If there is a dtype mismatch, however, pandas upcasts to a common dtype (typically object dtype).

For example, setting a string into an integer Series upcasts to object:

>>> s = pd.Series([1, 2, 3])
>>> s.loc[1] = "B"
>>> s
0    1
1    B
2    3
dtype: object

Similarly, calling fillna with an invalid fill value upcasts instead of raising an error:

>>> s = pd.Series(["2020-01-01", "NaT"], dtype="datetime64[ns]")
>>> s
0   2020-01-01
1          NaT
dtype: datetime64[ns]
>>> s.fillna(1)
0    2020-01-01 00:00:00
1                      1
dtype: object

My general proposal would be that at some point in the future (e.g. pandas 2.0, after a deprecation period), such inherently in-place operations should guarantee that they either happen in place or raise an error, and thus never change the dtype of the original Series/DataFrame.

This is similar to, e.g., NumPy's behaviour, where setitem never changes the dtype. Here is the first example from above in equivalent NumPy code:

>>> arr = np.array([1, 2, 3])
>>> arr[1] = "B"
...
ValueError: invalid literal for int() with base 10: 'B'
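To make the proposed "in place or error" guarantee concrete, here is a minimal sketch of how it could be approximated in user code today. The helper name setitem_strict is hypothetical and not part of pandas. It assumes that trying the assignment on a copy and comparing dtypes is an acceptable (if wasteful) check, and it does not consider setting with enlargement.

```python
import warnings

import pandas as pd


def setitem_strict(ser, key, value):
    """Hypothetical helper (not a pandas API): assign only if the dtype is kept."""
    original_dtype = ser.dtype
    trial = ser.copy()
    try:
        with warnings.catch_warnings():
            # Some pandas versions warn before upcasting; silence that here
            # because the dtype comparison below is the actual check.
            warnings.simplefilter("ignore")
            trial.loc[key] = value
    except (TypeError, ValueError) as exc:
        # Some pandas versions already refuse incompatible values.
        raise TypeError(
            f"cannot set {value!r} into a Series of dtype {original_dtype}"
        ) from exc
    if trial.dtype != original_dtype:
        # The trial assignment upcast the copy, so reject the value.
        raise TypeError(
            f"setting {value!r} would change dtype {original_dtype} to {trial.dtype}"
        )
    # The value is compatible: perform the real assignment on the original.
    ser.loc[key] = value


s = pd.Series([1, 2, 3])
setitem_strict(s, 1, 10)        # compatible value, dtype is preserved
try:
    setitem_strict(s, 1, "B")   # incompatible value, rejected
except TypeError as err:
    print("rejected:", err)
```

The original Series keeps both its dtype and its values when the assignment is rejected, because the upcast only ever happens on the temporary copy.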

Apart from that, I also think this is the cleaner behaviour, with fewer surprises. If a user specifically wants to allow mixed types in a column, they can manually cast to object dtype first.
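As an illustration of that explicit opt-in, the following sketch casts to object dtype before assigning the string, so the assignment itself no longer needs to change the dtype. The variable names are only for illustration.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# Explicitly opt in to mixed types; astype returns a new Series.
s_obj = s.astype(object)

# The Series is already object dtype, so this is a plain in-place update.
s_obj.loc[1] = "B"

print(s_obj.dtype)  # object
print(s.dtype)      # int64 (the original Series is untouched)
```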

On the other hand, this is quite a big change from how permissive we generally are right now, where we easily upcast, and it will certainly impact a fair amount of user code. AFAIK it is perfectly possible, though, to do this with proper deprecation warnings issued in advance for the specific cases that will error in the future.

There are certainly more details that need to be discussed if we want this (e.g. which exact values are regarded as compatible with the dtype: when setting a float in an integer column, should that error or silently round the float?). But what are people's thoughts on the general idea?
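For reference on the float-into-integer question, NumPy's setitem silently truncates the float. The sketch below shows that behaviour, followed by one possible "lossless only" compatibility rule. The function fits_without_loss is a hypothetical illustration of such a rule, not something pandas or NumPy provides, and other rules are equally conceivable.

```python
import numpy as np

arr = np.array([1, 2, 3])
arr[1] = 9.7   # NumPy casts to the array's dtype, dropping the fraction
print(arr)     # [1 9 3]


def fits_without_loss(value, dtype):
    """Hypothetical rule: accept a value only if casting round-trips exactly."""
    scalar_type = np.dtype(dtype).type
    try:
        casted = scalar_type(value)
    except (TypeError, ValueError, OverflowError):
        # The value cannot be converted to this dtype at all.
        return False
    return bool(casted == value)


print(fits_without_loss(2.0, "int64"))  # True: 2.0 is exactly representable
print(fits_without_loss(2.7, "int64"))  # False: the fraction would be lost
print(fits_without_loss("B", "int64"))  # False: not convertible
```

Under a rule like this, setting 2.0 into an integer column would be allowed while 2.7 would raise, which is one of the options the question above leaves open.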

cc @pandas-dev/pandas-core


Labels: API Design; Dtype Conversions (unexpected or buggy dtype conversions); Indexing (related to indexing on series/frames, not to indexes themselves); Needs Discussion (requires discussion from core team before further action)
