Provide a way to convert Arrow tables to Arrow-backed dataframes #51760

Closed
@datapythonista

As far as I can see, there is no easy way, given a PyArrow table, to get a DataFrame with PyArrow-backed dtypes.

I'd expect at least one of these idioms to work:

import pyarrow
import pandas

arrow_u8 = pyarrow.array([1, 2, 3], type=pyarrow.uint8())
arrow_f64 = pyarrow.array([1., 2., 3.], type=pyarrow.float64())
table = pyarrow.table([arrow_u8, arrow_f64], names=['u8', 'f64'])

# Using the PyArrow `to_pandas` method will use NumPy backed data
df = table.to_pandas()

# Using the constructor with a PyArrow table raises: ValueError: DataFrame constructor not properly called!
df = pandas.DataFrame(table)

# This is not implemented (the method doesn't exist)
df = pandas.DataFrame.from_arrow(table)

# Creating a dataframe column by column naively from the arrow array will use NumPy dtypes
df = pandas.DataFrame({'u8': arrow_u8,
                       'f64': arrow_f64})

I think the easiest way to do the conversion at the moment is something like this:

df = pandas.DataFrame({name: pandas.Series(array,
                                           dtype=pandas.ArrowDtype(array.type))
                       for array, name
                       in zip(table.columns, table.column_names)})

@pandas-dev/pandas-core Given that Arrow dtypes are one of the highlights of pandas 2.0, shouldn't we provide at least one easy way to convert before the release?

Labels: Arrow (pyarrow functionality)
