API: DataFrame setitem: setting columns with a DataFrame RHS doesn't align column names?

The case being considered here is when setting multiple columns into a DataFrame (using `__setitem__`, `df[[..]] = ..`), using a DataFrame right-hand-side value. So a simple, unambiguous example is:

```python
>>> df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['a', 'b'])
>>> df2 = pd.DataFrame(np.arange(6).reshape(3, 2) * 2, columns=['a', 'b'])
>>> df1[['a', 'b']] = df2
>>> df1
   a   b
0  0   2
1  4   6
2  8  10
```

However, we set the columns one by one _in order_, ignoring any mismatch in column names:

```python
>>> df1[['a', 'b']] = df2[['b', 'a']]
>>> df1
    a  b
0   2  0
1   6  4
2  10  8
```

I _think_ this is "expected" behaviour, in the sense that it seems to be intentional and long-standing. Personally I find it surprising, though, especially because using `loc` instead of plain setitem, i.e. `df1.loc[:, ['a', 'b']] = df2[['b', 'a']]`, _does_ align the column names:

```python
>>> df1.loc[:, ['a', 'b']] = df2[['b', 'a']]
>>> df1
   a   b
0  0   2
1  4   6
2  8  10
```
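
For anyone who wants name alignment while still using plain setitem, one possible workaround (a minimal sketch, not something pandas does for you) is to select the right-hand side columns in the same order as the key, so that the positional pairing happens to coincide with the names. The variable names `key` and `shuffled` are only for illustration:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])
df2 = pd.DataFrame(np.arange(6).reshape(3, 2) * 2, columns=["a", "b"])

key = ["a", "b"]
shuffled = df2[["b", "a"]]  # same data as df2, columns in a different order

# Reordering the RHS to match `key` means column "a" is paired with "a"
# and "b" with "b", even though setitem pairs the columns by position.
df1[key] = shuffled[key]
```

This only works when every label in `key` exists exactly once in the right-hand side frame.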

I didn't directly find an issue about this, only a PR that touched the code that handles this, but for the case of duplicate columns (https://github.com/pandas-dev/pandas/pull/39403), and a comment at https://github.com/pandas-dev/pandas/pull/39341/files#r563895152 about column names being irrelevant for setitem (cc @phofl @jbrockmendel).

But because we ignore alignment of column names and then _do_ the setting by name (and not by position):

https://github.com/pandas-dev/pandas/blob/dd6869f77d1623eea177c27bdfc873698b241ac6/pandas/core/frame.py#L3747-L3750

you get inconsistent results with duplicate column names.

For example, in this case the second column of `df2` is set to both "b" columns of `df1`:
```
>>> df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['a', 'b', 'b'])
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'c'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   2
1   6   8   8
2  12  14  14
```
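
The duplication in the example above seems to come from the per-column step: assigning a single column by label writes to every column carrying that label. A small standalone sketch of just that step (the frame and values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]], columns=["a", "b", "b"])

# Label-based assignment targets every column named "b",
# so the same values end up in both of them.
df["b"] = pd.Series([10, 20, 30])

first_b = df.iloc[:, 1].tolist()
second_b = df.iloc[:, 2].tolist()
```

Both `first_b` and `second_b` hold the assigned values, which is why a single right-hand side column can end up in two target columns.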

On the other hand, if I change the column names in `df2` so that it also has duplicate columns, but in a different order, then depending on the exact order you get either an error or a "working" example:

```
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'b'])
>>> df1[['a', 'b']] = df2
...
ValueError: Columns must be same length as key

>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['b', 'a', 'a'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   4
1   6   8  10
2  12  14  16
```

And if the column name order matches exactly, the columns are set "correctly" as well:
```
>>> df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=['a', 'b', 'b'])
>>> df1[['a', 'b']] = df2
>>> df1
    a   b   b
0   0   2   4
1   6   8  10
2  12  14  16
```

So in general, in those examples, the column names _do_ matter. 

---

General questions:

* Are we OK with `__setitem__` (`df[key] = value`) with a dataframe `value` ignoring the value's column names (i.e. not aligning `key` and `value.columns`)? And are we OK with this being different from `.loc[]`?
* If we keep the current behaviour, should we set those columns _by position_ instead of by column name, so that you don't get such inconsistent results with duplicate column names either?
  (But how do we change this? It's a breaking change. Maybe we should deprecate/disallow such setitem with duplicate column names?)
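
To make the "by position" option in the second question concrete, here is a rough sketch of what it could look like. `set_columns_by_position` is a hypothetical helper, not an existing pandas function, and it assumes both frames share the same row index (it does no row alignment):

```python
import numpy as np
import pandas as pd


def set_columns_by_position(df, key, value):
    """Hypothetical helper: pair target positions with RHS columns in order."""
    # Positions of all columns matching the key, including duplicates.
    positions = df.columns.get_indexer_for(key)
    if (positions == -1).any():
        raise KeyError("some labels in key are not columns of df")
    if len(positions) != value.shape[1]:
        raise ValueError("Columns must be same length as key")
    for i, pos in enumerate(positions):
        # Purely positional on both sides, so the RHS names never matter.
        df.iloc[:, pos] = value.iloc[:, i].to_numpy()


df1 = pd.DataFrame(np.arange(9).reshape(3, 3), columns=["a", "b", "b"])
df2 = pd.DataFrame(np.arange(9).reshape(3, 3) * 2, columns=["b", "a", "c"])
set_columns_by_position(df1, ["a", "b"], df2)
```

With this pairing, the three `df2` variants from the examples above (`['b', 'a', 'c']`, `['b', 'a', 'b']`, `['a', 'b', 'b']`) would all produce the same result, since only the order of the columns is used.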


	if isinstance(value, DataFrame):
	    check_key_length(self.columns, key, value)
	    for k1, k2 in zip(key, value.columns):
	        self[k1] = value[k2]
