API: concatting of Series/DataFrame - handling (not skipping) of empty objects #39122

Closed
@jorisvandenbossche

Description

Follow-up on #38843 and #39035.

Currently, we generally do not drop empty objects when concatting DataFrames (some exceptions can be considered bugs, I think), but we do explicitly drop empties when concatting Series (in dtypes/concat.py::concat_compat, for axis==0).

We should make this consistent throughout pandas, and generally I would argue for not skipping empties: when empty objects are not skipped, the resulting dtype of a concat operation depends only on the input dtypes, and not on the exact content (the exact values, or how many values there are, ie the shape). In general we want to get rid of value-dependent behaviour. In the past we discussed this in the context of certain values (eg the presence of NaNs or not), but I think the shape should not matter either (eg when slicing DataFrames before concatting, you can get empties or not depending on the values).
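To make the argument concrete, here is a minimal toy model of the two policies. It is not pandas code: the `(dtype name, length)` tuples, `common_dtype` and `result_dtype` are hypothetical stand-ins, and the promotion rule (identical dtypes are kept, anything mixed becomes object) is a deliberate simplification of what pandas does.

```python
from typing import List, Tuple

# Hypothetical stand-in for a Series: (dtype name, number of rows).
Input = Tuple[str, int]


def common_dtype(dtypes: List[str]) -> str:
    """Toy promotion rule: identical dtypes are kept, mixed dtypes become object."""
    return dtypes[0] if len(set(dtypes)) == 1 else "object"


def result_dtype(inputs: List[Input], skip_empty: bool) -> str:
    """Resulting dtype of a concat under the 'skip' or 'do not skip' policy."""
    dtypes = [dtype for dtype, length in inputs if length > 0 or not skip_empty]
    if not dtypes:
        # Every input was empty and skipped: fall back to all input dtypes.
        dtypes = [dtype for dtype, _ in inputs]
    return common_dtype(dtypes)


full = [("int64", 3), ("object", 3)]
sliced = [("int64", 3), ("object", 0)]  # same dtypes, second input sliced to empty

print(result_dtype(full, skip_empty=True))     # object
print(result_dtype(sliced, skip_empty=True))   # int64
print(result_dtype(full, skip_empty=False))    # object
print(result_dtype(sliced, skip_empty=False))  # object
```

With `skip_empty=True` the answer changes when an input happens to be empty; with `skip_empty=False` it is a function of the input dtypes alone.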

If people agree on going the way of not skipping empties in concat (and append, and friends), some different areas of work:

  • DataFrames: in general we already do not skip empty objects when determining the resulting dtype (except for DataFrames with EA columns, see below). However:
  • DataFrame/Series with new (nullable) extension dtypes: since those are still experimental, I think we can still just make a breaking change here?
    • Update concat_compat to not skip empty nullable EAs
  • Series (and DataFrame with EA columns): those now skip empties. Can we "just" change this? It's long-standing behaviour, and changing it can certainly break code (eg if you suddenly no longer have a datetime column but an object dtype column). Can we deprecate it? (doesn't sound easy) Or leave it as a breaking change for pandas 2.0? (which means keeping the inconsistency for a while longer ..)

So IMO it's mainly the last bullet point (Series/DataFrame with longer-existing EAs) that requires some more discussion on how we want to change it.


Some illustrative examples:

# dataframe with int64 (default consolidated dtype) and period (EA dtype) column
>>> df1 = pd.DataFrame({'a': [1, 2, 3], 'b': pd.period_range("2012", freq="D", periods=3)})
# dataframe with object dtype columns
>>> df2 = pd.DataFrame({'a': ['a', 'b', 'c'], 'b': ['a', 'b', 'c']})

For Series with basic dtype (int64), int64 + object dtype results in object dtype, but not when the object dtype Series is empty:

>>> pd.concat([df1['a'], df2['a']]).dtype
dtype('O')
>>> pd.concat([df1['a'], df2['a'][:0]]).dtype
dtype('int64')

For DataFrame, you can see that int64 + object always gives object (even when one of the two is empty), but for the period dtype, the empty object dtype column gets ignored:

>>> pd.concat([df1, df2]).dtypes
a    object
b    object
dtype: object
>>> pd.concat([df1, df2[:0]]).dtypes
a       object
b    period[D]
dtype: object
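Since the outputs above are tied to the pandas version they were run on, a small check like the following can tell whether a given installation still has the value-dependent behaviour for Series. This is only a sketch: `dtype_depends_on_emptiness` is a hypothetical helper name, not a pandas function, and it only uses the public `pd.concat`.

```python
import warnings

import pandas as pd


def dtype_depends_on_emptiness(left: pd.Series, right: pd.Series) -> bool:
    """Return True if emptying `right` changes the dtype of concat([left, right])."""
    with warnings.catch_warnings():
        # Silence any FutureWarning a given pandas release may emit here.
        warnings.simplefilter("ignore", FutureWarning)
        full = pd.concat([left, right]).dtype
        with_empty = pd.concat([left, right.iloc[:0]]).dtype
    return bool(full != with_empty)


ints = pd.Series([1, 2, 3])
strs = pd.Series(["a", "b", "c"], dtype=object)

# The result depends on the installed pandas version.
print(pd.__version__, dtype_depends_on_emptiness(ints, strs))
```

A result of `True` means the empty object-dtype Series was skipped when determining the dtype, as in the Series example above.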

cc @pandas-dev/pandas-core
