
API: setitem copy/view behavior ndarray vs Categorical vs other EA #38896

Closed
@jbrockmendel


xref #33457, which is about a similar issue but goes through different code paths.

In Block.setitem, in cases where we are setting all the values for this block, we have:

        elif exact_match and is_categorical_dtype(arr_value.dtype):
            # GH25495 - If the current dtype is not categorical, we need to create a new categorical block
            values[indexer] = value
            return self.make_block(Categorical(self.values, dtype=arr_value.dtype))

        elif exact_match and is_ea_value:
            # GH#32395 if we're going to replace the values entirely, just substitute in the new array
            return self.make_block(arr_value)

        elif exact_match:
            # We are setting _all_ of the array's values, so can cast to new dtype
            values[indexer] = value

            values = values.astype(arr_value.dtype, copy=False)

So we overwrite the existing values when the new value is categorical or non-EA, but not when it is some other EA. Example:

import pandas as pd

df = pd.DataFrame({
    "A": [.1, .2, .3],
    "B": pd.array([1, 2, None], dtype="Int64"),
    "C": ["a", "b", "c"]
})
orig_df = df[:]

arr_np = df["A"]._values
arr_ea = df["B"]._values
cat = pd.Categorical(df["C"])

# Note: there are many equivalent-looking ways of doing this setitem operation but few of them go through this code path.
df.loc[range(3), "A"] = arr_np[::-1]
df.loc[range(3), "B"] = arr_ea[::-1]
df.loc[range(3), "C"] = cat[::-1]

>>> df
     A     B  C
0  0.3  <NA>  c
1  0.2     2  b
2  0.1     1  a

>>> df.dtypes
A     float64
B       Int64
C    category
dtype: object

>>> orig_df
     A     B  C
0  0.3     1  a
1  0.2     2  b
2  0.1  <NA>  c

>>> orig_df.dtypes
A    float64
B      Int64
C     object
dtype: object
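
The difference between the write-into-the-existing-array branches and the substitute-the-new-array branch can be modelled with plain NumPy arrays. This is only a rough sketch: `set_in_place` and `set_by_replacement` are hypothetical names, not pandas internals, and the arrays stand in for a block's values.

```python
import numpy as np


def set_in_place(values, indexer, new):
    # Models the final `exact_match` branch: write into the existing buffer,
    # then cast without copying when the dtype already matches.
    values[indexer] = new
    return values.astype(new.dtype, copy=False)


def set_by_replacement(values, indexer, new):
    # Models the `is_ea_value` branch: the old buffer is never touched,
    # the new array simply takes its place.
    return new


# In-place: anything that shares the old buffer sees the new values.
old = np.array([0.1, 0.2, 0.3])
alias = old[:]  # stands in for orig_df's view of the same data
result = set_in_place(old, slice(None), np.array([0.3, 0.2, 0.1]))
in_place_shares = np.shares_memory(result, alias)

# Replacement: the alias keeps the old values.
old2 = np.array([0.1, 0.2, 0.3])
alias2 = old2[:]
result2 = set_by_replacement(old2, slice(None), np.array([0.3, 0.2, 0.1]))
replacement_shares = np.shares_memory(result2, alias2)

print("in place:   ", alias.tolist(), "shares memory:", in_place_shares)
print("replacement:", alias2.tolist(), "shares memory:", replacement_shares)
```

In this model the alias changes only in the in-place case, which matches column A versus column B in `orig_df` above.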

The categorical behavior was implemented in #23393, and AFAICT the overwriting behavior was not discussed or intentional. Similarly, the other-EA behavior was implemented in #32479, and I don't see any discussion there of whether to overwrite. I haven't tracked down the origin of the non-EA behavior.

I think all three cases should have the same behavior. We should also have the same behavior for should-be-equivalent setters, e.g. if we used iloc instead of loc, or [:, "A"] instead of [range(3), "A"].
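
One way to check the should-be-equivalent setters against each other is a small harness like the one below. It is a sketch, not a proposed test: the helper names (`make_frame`, `overwrite_report`, the `set_*` functions) are made up for illustration, column C is pinned to object dtype, and the new values are built independently of the frame rather than as reversed views. Which code path each spelling takes, and therefore what gets reported, depends on the pandas version (e.g. whether Copy-on-Write is enabled), so no expected output is given.

```python
import numpy as np
import pandas as pd


def make_frame():
    # Same data as the example above; "C" is pinned to object dtype so the
    # sketch does not depend on the version's default string dtype.
    return pd.DataFrame({
        "A": [0.1, 0.2, 0.3],
        "B": pd.array([1, 2, None], dtype="Int64"),
        "C": pd.Series(["a", "b", "c"], dtype=object),
    })


# Replacement values per column: ndarray, non-categorical EA, Categorical.
NEW_VALUES = {
    "A": lambda: np.array([0.3, 0.2, 0.1]),
    "B": lambda: pd.array([None, 2, 1], dtype="Int64"),
    "C": lambda: pd.Categorical(["c", "b", "a"]),
}


def set_loc_range(df, col, value):
    df.loc[range(3), col] = value


def set_loc_slice(df, col, value):
    df.loc[:, col] = value


def set_iloc_slice(df, col, value):
    df.iloc[:, df.columns.get_loc(col)] = value


SETTERS = {
    "loc[range(3), col]": set_loc_range,
    "loc[:, col]": set_loc_slice,
    "iloc[:, i]": set_iloc_slice,
}


def overwrite_report():
    # For each setter/column pair, report whether a df[:] slice taken
    # before the assignment sees different data afterwards.
    report = {}
    for name, setter in SETTERS.items():
        for col, make_value in NEW_VALUES.items():
            df = make_frame()
            orig_df = df[:]
            before = orig_df[col].copy(deep=True)
            setter(df, col, make_value())
            report[(name, col)] = not bool(orig_df[col].equals(before))
    return report


for key, changed in overwrite_report().items():
    print(key, "orig_df changed:", changed)
```

If the three cases and the three spellings were consistent, every row would report the same value.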

I think I agree with @TomAugspurger's comment that these should always be in-place, but I'm not sure ATM whether that can be done without breaking consistency elsewhere.
