ENH: Infer dtype when using df.explode() #34923

Open
@hamzahiqb

Description

Is your feature request related to a problem?

Yes. Currently, df.explode always returns object dtype for the column being exploded, so the dtype information of the exploded values is lost.

E.g.

import pandas as pd

s = pd.Series([1, 2, 3])  # <- dtype('int64')
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1})
df.explode("A").dtypes
# A    object
# B     int64
# dtype: object

It would be great if pandas could return the underlying dtype when it is consistent across all rows, or otherwise fall back to the best common dtype (int -> float -> object).
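To illustrate the "best dtype" fallback, here is a minimal sketch of how a common dtype could be computed from the cells before exploding. `common_cell_dtype` is a hypothetical helper, not part of pandas, and it assumes every cell is a Series with a NumPy dtype (extension dtypes are not handled).

```python
import numpy as np
import pandas as pd


def common_cell_dtype(column):
    """Hypothetical helper: promote the dtypes of all Series cells in a column."""
    dtypes = [cell.dtype for cell in column if isinstance(cell, pd.Series)]
    if not dtypes:
        # Nothing to promote, so keep the current behaviour.
        return np.dtype(object)
    # NumPy promotion rules: int64 + float64 -> float64, anything + object -> object.
    return np.result_type(*dtypes)


ints = pd.Series([1, 2, 3])           # dtype('int64')
floats = pd.Series([1.0, None, 3.0])  # dtype('float64'), None is stored as NaN
df = pd.DataFrame({"A": [ints, floats], "B": 1})

target = common_cell_dtype(df["A"])
exploded = df.explode("A")
exploded["A"] = exploded["A"].astype(target)
```

With one int64 row and one float64 row, the promoted dtype is float64, which matches the int -> float step of the fallback chain described above.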

Describe the solution you'd like

  1. Solution 1: In the best case, pandas would infer the dtype directly whenever it is consistent (ignoring NaNs) across all rows.
s = pd.Series([1, None, 3])  # <- dtype('float64'); None is stored as NaN
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1})
df.explode("A").dtypes
# desired output:
# A    float64
# B      int64
  2. Solution 2: Provide an argument that forces dtype inference:
s = pd.Series([1, None, 3])  # <- dtype('float64'); None is stored as NaN
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1})
df.explode("A", infer_type=True).dtypes  # infer_type is the proposed argument
# desired output:
# A    float64
# B      int64

Describe alternatives you've considered

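Currently, I use the following workaround, which reads the dtype from the first cell and casts the exploded column back to it:

import pandas as pd

s = pd.Series([1, None, 3])  # <- dtype('float64'); None is stored as NaN
df = pd.DataFrame({'A': [s, s, s, s], 'B': 1})

d = df["A"].iloc[0].dtype  # dtype of the first cell; assumes all rows share it
df2 = df.explode("A")
df2["A"] = df2["A"].astype(d)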
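A variant of this workaround that does not need to inspect the first cell is to let pandas soft-convert the exploded column with `infer_objects()`. This is only a sketch: `explode_infer` is a hypothetical wrapper name, and `infer_objects()` leaves the column as object whenever the values are genuinely mixed (for example, numbers and strings).

```python
import pandas as pd


def explode_infer(df, column):
    """Hypothetical wrapper: explode a column, then soft-convert its dtype."""
    exploded = df.explode(column)
    # infer_objects() picks a better dtype for an object column when the
    # values allow it, and otherwise returns the column unchanged.
    exploded[column] = exploded[column].infer_objects()
    return exploded


s = pd.Series([1, None, 3])  # dtype('float64'), None is stored as NaN
df = pd.DataFrame({"A": [s, s, s, s], "B": 1})

result = explode_infer(df, "A")
print(result.dtypes)
```

The wrapper converts only the exploded column, so any other object columns in the frame keep their dtype.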

API breaking implications

Not sure.

Labels

API - Consistency (Internal Consistency of API/Behavior); Dtype Conversions (Unexpected or buggy dtype conversions); Enhancement; Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
