Plan for a native string dtype #35169

Closed
@TomAugspurger

Apache Arrow supports natively storing UTF-8 data, and work is ongoing to add
kernels (e.g. str.isupper()) that operate on that data. This issue is to
discuss how we can expose the native string dtype to pandas' users.

There are several things to discuss:

  1. How do users opt into this behavior?
  2. What is the fallback mode for kernels that are not implemented?

How do users opt into Arrow-backed StringArray?

The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).

StringArray is marked as experimental, so our usual restrictions on
API-breaking changes don't apply. But we want to do this in a way that's not
too disruptive.

There are three ways to get an array with StringDtype today:

  1. Infer: pd.array(['a', 'b', None])
  2. Explicit dtype=pd.StringDtype()
  3. String alias dtype="string"
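
As a quick illustration of those three paths, here is a minimal sketch, assuming pandas 1.0 or later is installed. It only checks that each path ends up with a StringDtype and treats None as missing; it says nothing about which memory backs the array.

```python
import pandas as pd

values = ["a", "b", None]

inferred = pd.array(values)                          # 1. infer from the data
explicit = pd.array(values, dtype=pd.StringDtype())  # 2. explicit dtype object
aliased = pd.array(values, dtype="string")           # 3. string alias

for arr in (inferred, explicit, aliased):
    # Every path should yield the string extension dtype ...
    assert isinstance(arr.dtype, pd.StringDtype)
    # ... and treat None as a missing value.
    assert arr.isna().tolist() == [False, False, True]
```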

My preference is for all of these to stay consistent. Either they all give a
StringArray backed by an object-dtype ndarray, or they all give a StringArray
backed by Arrow memory.

I also have a preference for not keeping around our old implementation for too
long. So I don't think we want something like pd.PythonStringDtype() as a
way to get the StringArray backed by an object-dtype ndarray.

The easiest way to support this is, I think, an option.

>>> pd.options.mode.use_arrow_string_dtype = True

Then all three of the ways listed above would create an Arrow-backed StringArray.
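
To make the "stay consistent" point concrete, here is a toy, pure-Python sketch of one flag being consulted in a single place, so that the inferred, explicit, and alias paths cannot disagree. Everything in it (OPTIONS, the toy StringDtype, array, _backend) is hypothetical and stands in for pandas internals; it is not how pandas implements this.

```python
# Hypothetical stand-in for the proposed pd.options.mode.use_arrow_string_dtype.
OPTIONS = {"use_arrow_string_dtype": False}


class StringDtype:
    """Toy dtype for this sketch; the real pandas.StringDtype is more involved."""


def _backend():
    # The single place where the option is consulted.
    return "arrow" if OPTIONS["use_arrow_string_dtype"] else "object"


def array(values, dtype=None):
    """Toy constructor covering the three ways of asking for a string array."""
    values = list(values)
    is_inferred = dtype is None and all(
        v is None or isinstance(v, str) for v in values
    )
    is_explicit = isinstance(dtype, StringDtype)
    is_alias = dtype == "string"
    if not (is_inferred or is_explicit or is_alias):
        raise TypeError("this sketch only handles string data")
    # Whichever path was taken, the backend comes from the same option.
    return {"backend": _backend(), "values": values}
```

Because the backend is resolved in one helper rather than per constructor path, flipping the flag changes all three paths together.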

Fallback Mode

It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does

>>> pd.Series(['a', 'b'], dtype="string").str.normalize("NFC")  # no arrow kernel

we have a few options:

  1. Raise, stating that there's no kernel for normalize.
  2. Emit a PerformanceWarning, astype to object, do the operation, and convert back.

I'm not sure which is best. My preference for now is probably to raise, but I could see doing either.
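
The two options can be compared side by side in a small pure-Python sketch. All names here (NATIVE_KERNELS, OBJECT_FALLBACKS, call_str_method, the on_missing parameter, and the local PerformanceWarning class) are hypothetical, and plain lists stand in for both the Arrow-backed and the object-backed storage.

```python
import unicodedata
import warnings


class PerformanceWarning(Warning):
    """Local stand-in for a performance warning, to keep the sketch self-contained."""


def _map_skip_missing(func, values):
    # Apply func to each string, passing None (missing) through unchanged.
    return [None if v is None else func(v) for v in values]


# Hypothetical: operations that have a fast native kernel.
NATIVE_KERNELS = {
    "isupper": lambda values: _map_skip_missing(str.isupper, values),
}

# Hypothetical: slow object-based implementations, used only as a fallback.
OBJECT_FALLBACKS = {
    "normalize": lambda values: _map_skip_missing(
        lambda s: unicodedata.normalize("NFC", s), values
    ),
}


def call_str_method(values, name, on_missing="raise"):
    """Dispatch to a native kernel, or apply the chosen policy if there is none."""
    if name in NATIVE_KERNELS:
        return NATIVE_KERNELS[name](values)
    if on_missing == "raise":
        # Option 1: refuse, and say which kernel is missing.
        raise NotImplementedError(f"no native kernel for {name!r}")
    if on_missing == "warn":
        # Option 2: warn, then run the slow object-based implementation.
        warnings.warn(
            f"no native kernel for {name!r}; falling back to object dtype",
            PerformanceWarning,
            stacklevel=2,
        )
        return OBJECT_FALLBACKS[name](values)
    raise ValueError(f"unknown on_missing policy: {on_missing!r}")
```

With on_missing="raise" the caller finds out immediately that the operation is unsupported; with on_missing="warn" the call succeeds but the cost of the round trip is only visible through the warning.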

Labels: Enhancement, Strings (String extension data type and string data)