API: DataFrame setitem: setting columns with a DataFrame RHS doesn't align column names? #46974

Open
@jorisvandenbossche

The case being considered here is when setting multiple columns into a DataFrame (using __setitem__, df[[..]] = ..), using a DataFrame right-hand-side value. So a simple, unambiguous example is:

>>> df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
>>> df2 = pd.DataFrame(np.arange(6).reshape(3, 2) * 2, columns=['a', 'b'])
>>> df1[['a', 'b']] = df2
>>> df1
   a   b
0  0   2
1  4   6
2  8  10

However, we set the columns one by one in order, ignoring any mismatch between the key and the column names of the value:

>>> df1[['a', 'b']] = df2[['b', 'a']]
>>> df1
    a  b
0   2  0
1   6  4
2  10  8

I think this is "expected" behaviour, meaning that it seems to be intentional and long-standing. I personally find it surprising, though, especially because using loc instead of plain setitem, i.e. df1.loc[:, ['a', 'b']] = df2[['b', 'a']], does align the column names:

>>> df1.loc[:, ['a', 'b']] = df2[['b', 'a']]
>>> df1
   a   b
0  0   2
1  4   6
2  8  10
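As a side note, here is a minimal sketch of how a caller could get name-based pairing with plain setitem today: reorder the right-hand side by the key before assigning. The variable names (key, rhs, by_position, by_name) are only for illustration; this is a workaround, not a proposed fix.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
df2 = pd.DataFrame(np.arange(6).reshape(3, 2) * 2, columns=['a', 'b'])

key = ['a', 'b']
rhs = df2[['b', 'a']]  # same data, columns in swapped order

# Plain setitem pairs key[i] with the i-th column of rhs, ignoring names.
by_position = df1.copy()
by_position[key] = rhs

# Reordering rhs by the key first makes the positional pairing
# coincide with the column names.
by_name = df1.copy()
by_name[key] = rhs[key]

print(by_position)
print(by_name)
```

This only helps when the column names of the value are unique and cover the key.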

I didn't directly find an issue about this, only a PR that touched the code that handles this but in case of duplicate columns (#39403), and a comment at https://github.com/pandas-dev/pandas/pull/39341/files#r563895152 about column names being irrelevant for setitem (cc @phofl @jbrockmendel)

But because we ignore alignment of column names and then do the actual setting by name (and not by position):

pandas/pandas/core/frame.py

Lines 3747 to 3750 in dd6869f

if isinstance(value, DataFrame):
    check_key_length(self.columns, key, value)
    for k1, k2 in zip(key, value.columns):
        self[k1] = value[k2]

you get inconsistent results with duplicate column names.

For example, in this case the second column of df2 is assigned to both "b" columns of df1:

>>> df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'b'])
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'c'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   2
1   6   8   8
2  12  14  14

On the other hand, if I change the column names in df2 to also have duplicate columns, but in a different order, depending on the exact order you get an error or a "working" example:

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'b'])
>>> df1[['a', 'b']] = df2
...
ValueError: Columns must be same length as key

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'a'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   4
1   6   8  10
2  12  14  16
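Both outcomes above come down to what value[k2] returns inside the loop when the label k2 is duplicated. A small sketch (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'b'])

unique_pick = df2['a']      # unique label: a single column
duplicate_pick = df2['b']   # duplicated label: every column with that name

print(type(unique_pick).__name__)
print(type(duplicate_pick).__name__, duplicate_pick.shape)
```

So with columns ['b', 'a', 'b'] the first loop step tries to assign a two-column frame to the single column "a" and raises, while with ['b', 'a', 'a'] the two-column frame for "a" happens to fit the two "b" columns of df1.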

And if the order of the column names matches exactly, the columns are set "correctly" as well:

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['a', 'b', 'b'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   4
1   6   8  10
2  12  14  16

So in general, in those examples, the column names do matter.


General questions:

  • Are we OK with __setitem__ (df[key] = value) with a DataFrame value ignoring the value's column names (i.e. not aligning key and value.columns)? And are we OK with this being different from .loc[]?
  • If we keep the current behaviour, should we set those columns by position instead of by column name, so that you don't get such inconsistent results with duplicate column names either?
    (But how do we change this? It's a breaking change. Maybe we should deprecate/disallow such setitem with duplicate column names?)
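To make the second question concrete, here is a rough sketch of what "set by position" could look like from the outside. set_columns_by_position is a hypothetical helper, not a pandas API and not a proposed implementation. It assumes pandas >= 1.5 for DataFrame.isetitem and assumes the value has one column for each target column matched by the key.

```python
import numpy as np
import pandas as pd


def set_columns_by_position(target, key, value):
    """Hypothetical helper: pair the columns selected by `key` with the
    columns of `value` purely by position, ignoring value's column names."""
    # Integer positions in `target` of every column whose label is in `key`.
    positions, missing = target.columns.get_indexer_non_unique(key)
    if len(missing) > 0:
        raise KeyError("some labels in key are not columns of target")
    if len(positions) != value.shape[1]:
        raise ValueError("Columns must be same length as key")
    for i, pos in enumerate(positions):
        # Write the i-th column of value into the column at position pos.
        target.isetitem(int(pos), value.iloc[:, i])


df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'b'])
data = np.arange(9).reshape(3, 3) * 2

results = []
for names in (['b', 'a', 'c'], ['b', 'a', 'b'], ['a', 'b', 'b']):
    out = df1.copy()
    set_columns_by_position(out, ['a', 'b'], pd.DataFrame(data, columns=names))
    results.append(out.to_numpy().tolist())

print(results[0])
```

With this pairing, the three df2 variants from the examples above all produce the same result, because the column names of the value no longer take part in the assignment.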
