
BUG: inferring the future string dtype doesn't work for pyarrow's "large_string" type #54798

Open
@jorisvandenbossche

Create a Parquet file with both string and large string type (note both are the same in Parquet, but pyarrow will restore the large_string type when reading back):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    'col1': pa.array(['a', 'b', None], pa.string()),
    'col2': pa.array([None, 'b', 'c'], pa.large_string()),
})
pq.write_table(table, "test_large_string.parquet")

Reading this file with pd.read_parquet and the future string dtype enabled (#54792):

In [27]: pd.options.future.infer_string = True

In [28]: pd.read_parquet("test_large_string.parquet").dtypes
Out[28]: 
col1    string[pyarrow_numpy]
col2                   object
dtype: object

Only the first column has the proper dtype; the column that uses pa.large_string() on the pyarrow side still falls back to object dtype.

For example, polars users can easily end up with a large string type in Arrow, since polars uses large strings under the hood for its string dtype.

Longer term, there might be a question of whether we want to enable zero-copy support for large strings as well (right now you can already use ArrowDtype(pa.large_string()) for that). But short term (for the infer_string option and for 3.0), I think we will want to convert large strings to normal strings and correctly use StringDtype("pyarrow_numpy"). That's not fully zero-copy (the offsets have to be cast from int64 to int32), but the current default fallback to object dtype is much worse in that respect.

One potential issue is when the large string pyarrow array cannot be cast to string because it exceeds the size that can be stored in the string type (int32 offsets limit a single array to about 2 GiB of string data). A simple cast will raise an error in that case. A workaround is to split the array into a chunked array with multiple chunks. That's not an issue for our StringDtype, because we store the data as a chunked array anyway. But I am not sure there is an automated way to do this with pyarrow APIs.

Labels: Arrow (pyarrow functionality), Bug, Strings (String extension data type and string data)