API: concatting of Series/DataFrame - handling (not skipping) of empty objects

Follow-up on https://github.com/pandas-dev/pandas/pull/38843 and https://github.com/pandas-dev/pandas/pull/39035. 

Currently, we generally do not drop empty objects when concatenating DataFrames (there are some exceptions, which I think can be considered bugs), but we do explicitly drop empties when concatenating Series (in `dtypes/concat.py::concat_compat`, for `axis==0`).

We should make this consistent throughout pandas, and generally I would argue for not skipping empties: when empty objects are not skipped, the resulting dtype of a concat operation depends only on the input *dtypes*, and not on the exact content (the exact values, or how many values there are, i.e. the shape). In general we want to get rid of value-dependent behaviour. In the past we discussed this in the context of certain values (e.g. the presence of NaNs or not), but I think the shape should not matter either (e.g. when slicing DataFrames before concatenating, you can get empties or not depending on the *values*).
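To make the argument concrete, here is a toy model in plain Python (not pandas' actual implementation). The names `common_dtype` and `result_dtype` are hypothetical, and the promotion rule is deliberately simplified to "identical dtypes are kept, anything else becomes object". It only illustrates that skipping empties makes the result depend on the lengths of the inputs, while not skipping makes it depend on the dtypes alone:

```python
from typing import List, Sequence, Tuple

# A toy "array" is just (dtype name, number of elements).
Operand = Tuple[str, int]


def common_dtype(dtypes: Sequence[str]) -> str:
    """Simplified promotion: identical dtypes are kept, a mix becomes object."""
    unique = set(dtypes)
    return unique.pop() if len(unique) == 1 else "object"


def result_dtype(operands: Sequence[Operand], skip_empty: bool) -> str:
    """Resolve the dtype of a hypothetical concat under one of the two policies."""
    considered: List[str] = [
        dtype for dtype, length in operands if length > 0 or not skip_empty
    ]
    if not considered:
        # Everything was empty: fall back to looking at all dtypes.
        considered = [dtype for dtype, _ in operands]
    return common_dtype(considered)


full = [("int64", 3), ("object", 3)]
sliced = [("int64", 3), ("object", 0)]  # same dtypes, second operand is empty

for skip in (True, False):
    print(skip, result_dtype(full, skip), result_dtype(sliced, skip))
```

With `skip_empty=True` the two inputs resolve to different dtypes although their dtypes are identical; with `skip_empty=False` they resolve to the same one.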

*If* people agree on not skipping empties in `concat` (and `append`, and friends), there are a few different areas of work:

* **DataFrames**: in general we already do not skip empty objects when determining the resulting dtype (except for DataFrames with EA columns, see below). However:
  * [ ] Add a bunch of explicit tests for this behaviour.
  * [ ] There are still some bugs, see e.g. https://github.com/pandas-dev/pandas/issues/32934 about M8[ns] + float64 giving M8[ns] when empty and object otherwise.
* **DataFrame/Series with new (nullable) extension dtypes**: since those are still experimental, I think we can still just make a breaking change here?
  * [ ] Update `concat_compat` to not skip empty nullable EAs
* **Series** (and DataFrame with EA columns): those currently skip empties. Can we "just" change this? It's long-standing behaviour, and changing it can certainly break code (e.g. if you suddenly no longer have a datetime column but an object dtype column). Can we deprecate it? (That doesn't sound easy.) Or leave it as a breaking change for pandas 2.0? (That would mean keeping the inconsistency for a while longer.)

So IMO it's mainly the last bullet point (Series/DataFrame with longer-existing EAs) that requires some more discussion on *how* we want to change it.

---

Some illustrative examples:

```python
>>> import pandas as pd
# dataframe with int64 (default consolidated dtype) and period (EA dtype) column
>>> df1 = pd.DataFrame({'a': [1, 2, 3], 'b': pd.period_range("2012", freq="D", periods=3)})
# dataframe with object dtype columns
>>> df2 = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['a', 'b', 'c']})
```

For Series with a basic dtype (int64), int64 + object dtype results in object dtype, but not when the object dtype Series is empty:

```python
>>> pd.concat([df1['a'], df2['a']]).dtype
dtype('O')
>>> pd.concat([df1['a'], df2['a'][:0]]).dtype
dtype('int64')
```

For DataFrames, you can see that int64 + object always gives object (even when one of them is empty), but for the period dtype, the empty object dtype column gets ignored:

```python
>>> pd.concat([df1, df2]).dtypes
a    object
b    object
dtype: object
>>> pd.concat([df1, df2[:0]]).dtypes
a       object
b    period[D]
dtype: object
```

cc @pandas-dev/pandas-core 
