Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I would like an exception to be raised when a function is not yet supported for pyarrow-backed data. For instance, in the snippet below I call the explode method on two DataFrames, one with pyarrow-backed dtypes and the other with numpy-backed dtypes.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
pydict = {'x': [[1,2], [3]], 'y': ['a', 'b']}
table = pa.Table.from_pydict(pydict)
print(table.schema)
x: list<item: int64>
child 0, item: int64
y: string
pq.write_table(table, 'test.parquet')
df_pa = pd.read_parquet('test.parquet', dtype_backend='pyarrow')
df = pd.read_parquet('test.parquet')
I create the DataFrames by reading from a parquet file because, when I build them in memory and call the convert_dtypes method, column x is cast to object.
If I check the schema of each:
df_pa.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 x 2 non-null list<item: int64>[pyarrow]
1 y 2 non-null string[pyarrow]
dtypes: list<item: int64>[pyarrow](1), string[pyarrow](1)
memory usage: 170.0 bytes
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 x 2 non-null object
1 y 2 non-null object
dtypes: object(2)
memory usage: 160.0+ bytes
If I extract the first row from each:
df_pa.iloc[0].values
array([list([1, 2]), 'a'], dtype=object)
df.iloc[0].values
array([array([1, 2]), 'a'], dtype=object)
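Note that the element type of x differs between the two: a Python list in the pyarrow-backed DataFrame and a numpy array in the numpy-backed one.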
Finally, if I call the explode method on:
- The numpy-backed DataFrame
df.explode('x')
Output
x y
0 1 a
0 2 a
1 3 b
- The pyarrow-backed DataFrame
df_pa.explode('x')
Output (without any error)
x y
0 [1 2] a
1 [3] b
This is not the result I expected: the pyarrow-backed DataFrame comes back with its lists intact, and no error or warning is raised.
P.S. I was not sure whether this should instead be filed as a bug.
Platform details
Platform: macOS-13.3.1-arm64-arm-64bit
Python: 3.9.16 (main, May 16 2023, 20:00:19)
[Clang 14.0.3 (clang-1403.0.22.14.1)]
numpy: 1.24.3
pandas: 2.0.1
pyarrow: 10.0.1
Feature Description
I don't know enough about the internals of pandas to propose anything here.
Alternative Solutions
For functions that are not natively supported on pyarrow-backed DataFrames, could pandas internally convert the data to the numpy-backed equivalent, perform the operation, and convert the result back? In that case a warning message would make sense.
That said, I don't know whether such a makeshift solution meets the standards for performance, API design, etc., so I cannot make a concrete suggestion here.
Additional Context
No response