
ENH: Support explode for ArrowDtype  #53373

Closed
@csubhodeep

Description

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish that an exception were raised for operations that are not yet supported. For instance, in the snippet below I try to use the explode method on two DataFrames, one with pyarrow-backed dtypes and the other with numpy-backed dtypes.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

pydict = {'x': [[1, 2], [3]], 'y': ['a', 'b']}
table = pa.Table.from_pydict(pydict)
print(table.schema)
# x: list<item: int64>
#   child 0, item: int64
# y: string
pq.write_table(table, 'test.parquet')
df_pa = pd.read_parquet('test.parquet', dtype_backend='pyarrow')
df = pd.read_parquet('test.parquet')

I build the DataFrames by reading from a parquet file because, when I construct them in memory, the convert_dtypes method casts column x to object. (A minimal in-memory alternative is sketched right below.)
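For reference, a pyarrow-backed frame can probably also be built directly in memory with pd.ArrowDtype (assuming pandas >= 2.0); this is only a sketch, and the outputs below still come from the parquet round-trip:

import pandas as pd
import pyarrow as pa

# Sketch only: build the list column with an explicit ArrowDtype
# instead of round-tripping through parquet.
df_pa_mem = pd.DataFrame({
    'x': pd.array([[1, 2], [3]], dtype=pd.ArrowDtype(pa.list_(pa.int64()))),
    'y': pd.array(['a', 'b'], dtype=pd.ArrowDtype(pa.string())),
})
print(df_pa_mem.dtypes)  # x should show as list<item: int64>[pyarrow]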

If I check the schemas:

df_pa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype                     
---  ------  --------------  -----                     
 0   x       2 non-null      list<item: int64>[pyarrow]
 1   y       2 non-null      string[pyarrow]           
dtypes: list<item: int64>[pyarrow](1), string[pyarrow](1)
memory usage: 170.0 bytes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       2 non-null      object
 1   y       2 non-null      object
dtypes: object(2)
memory usage: 160.0+ bytes

If I try to extract the first row from each:

df_pa.iloc[0].values
array([list([1, 2]), 'a'], dtype=object)
df.iloc[0].values
array([array([1, 2]), 'a'], dtype=object)

I do note that the values come back differently in the two cases (a Python list vs. a numpy array).

Finally, if I call the explode method on:

  1. The numpy-backed dtype:
df.explode('x')

Output

   x  y
0  1  a
0  2  a
1  3  b
  2. The pyarrow-backed dtype:
df_pa.explode('x')

Output (without any error)

       x  y
0  [1 2]  a
1    [3]  b

So basically, this was not the result I was expecting; the output I did expect is sketched below.
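The result I expected is roughly what an Arrow-level explode would produce. The following is only a rough sketch using pyarrow compute functions (pc.list_flatten and pc.list_parent_indices), to illustrate the expected output rather than to propose an implementation:

import pyarrow as pa
import pyarrow.compute as pc

# Sketch of the output I expected, built directly on the Arrow table.
table = pa.Table.from_pydict({'x': [[1, 2], [3]], 'y': ['a', 'b']})
indices = pc.list_parent_indices(table['x'])  # row each flattened value came from
flat_x = pc.list_flatten(table['x'])          # [1, 2, 3]
exploded = table.take(indices).set_column(
    table.schema.get_field_index('x'), 'x', flat_x
)
print(exploded.to_pandas())
#    x  y
# 0  1  a
# 1  2  a
# 2  3  b

Index handling aside, this mirrors the numpy-backed result above.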

P.S. I was not sure whether this should rather be filed as a bug.

Platform details

Platform:    macOS-13.3.1-arm64-arm-64bit
Python:      3.9.16 (main, May 16 2023, 20:00:19) 
[Clang 14.0.3 (clang-1403.0.22.14.1)]
numpy:       1.24.3
pandas:      2.0.1
pyarrow:     10.0.1

Feature Description

I don't know enough about the internals of pandas to propose anything here.

Alternative Solutions

For functions that are not natively supported for pyarrow-backed DataFrames, could we internally convert them to the numpy-backed representation, perform the operation, and convert back? In that case a warning message would make sense (a rough sketch follows below).

But again, since I don't know whether such a makeshift solution meets the project's standards with respect to performance, API design, etc., I cannot make a firm suggestion here.
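To make the idea concrete, here is a rough hand-written sketch of such a fallback. The helper name and the cast back to the list's element type are my own assumptions, not existing pandas API:

import pandas as pd

def explode_with_fallback(df: pd.DataFrame, column: str) -> pd.DataFrame:
    # Hypothetical helper: cast the Arrow-backed list column to object,
    # reuse the existing numpy explode path, then cast the result back.
    original_dtype = df[column].dtype
    exploded = df.assign(**{column: df[column].astype(object)}).explode(column)
    if isinstance(original_dtype, pd.ArrowDtype):
        # Cast the scalars back to the list's element type (int64 here).
        exploded[column] = exploded[column].astype(
            pd.ArrowDtype(original_dtype.pyarrow_dtype.value_type)
        )
    return exploded

# explode_with_fallback(df_pa, 'x') would then match df.explode('x'),
# up to the dtype of column 'x'.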

Additional Context

No response

Labels

Arrow (pyarrow functionality), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
