Apache Arrow has support for natively storing UTF-8 data, and work is ongoing to add kernels (e.g. `str.isupper()`) that operate on that data. This issue
is to discuss how we can expose the native string dtype to pandas users.
There are several things to discuss:
- How do users opt into this behavior?
- A fallback mode for kernels that are not yet implemented.
How do users opt into Arrow-backed StringArray?
The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).
`StringArray` is marked as experimental, so our usual restrictions on API-breaking
changes don't apply. But we want to do this in a way that's not too disruptive.
There are three ways to get a `StringDtype`-dtype array today:
- Infer: `pd.array(['a', 'b', None])`
- Explicit dtype: `dtype=pd.StringDtype()`
- String alias: `dtype="string"`
My preference is for all of these to stay consistent: they should all give the
same thing, either a `StringArray` backed by an object-dtype ndarray or a
`StringArray` backed by Arrow memory.
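For concreteness, here's a quick check using today's public API that the three construction paths already agree on the dtype; the goal is that they keep agreeing once an Arrow backend exists:

```python
import pandas as pd

# All three paths should produce the same StringDtype array, whichever
# backend (object-dtype ndarray today, Arrow memory later) is in use.
inferred = pd.array(["a", "b", None])                          # inference
explicit = pd.array(["a", "b", None], dtype=pd.StringDtype())  # explicit dtype
aliased = pd.array(["a", "b", None], dtype="string")           # string alias

assert inferred.dtype == explicit.dtype == aliased.dtype == pd.StringDtype()
```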
I also have a preference for not keeping our old implementation around for too
long, so I don't think we want something like `pd.PythonStringDtype()` as a
way to get the `StringArray` backed by an object-dtype ndarray.
The easiest way to support this is, I think, an option:

```python
>>> pd.options.mode.use_arrow_string_dtype = True
```

Then all three construction paths would create an Arrow-backed `StringArray`. A rough sketch of what the option could control follows.
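Here is a minimal sketch of option-based dispatch. Everything in it is hypothetical: the module-level flag stands in for the proposed `pd.options.mode.use_arrow_string_dtype`, and both classes are placeholders for the real array implementations.

```python
# Hypothetical sketch only: the flag stands in for the proposed
# pd.options.mode.use_arrow_string_dtype option.
USE_ARROW_STRING_DTYPE = False

class ObjectBackedStringArray:
    """Placeholder for today's StringArray (object-dtype ndarray storage)."""
    def __init__(self, values):
        self._data = list(values)

class ArrowBackedStringArray:
    """Placeholder for a StringArray backed by Arrow memory."""
    def __init__(self, values):
        self._data = list(values)  # a real version would hold a pyarrow array

def make_string_array(values):
    # Inference, dtype=pd.StringDtype(), and dtype="string" would all funnel
    # through a single check like this, keeping the three paths consistent.
    cls = ArrowBackedStringArray if USE_ARROW_STRING_DTYPE else ObjectBackedStringArray
    return cls(values)
```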
Fallback Mode
It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does

```python
>>> Series(['a', 'b'], dtype="string").str.normalize()  # no Arrow kernel
```

we have a few options:
- Raise, stating that there's no Arrow kernel for `normalize`.
- Issue a `PerformanceWarning`, astype to object, do the operation, and convert back.

I'm not sure which is best. My preference for now is probably to raise, but I could see doing either; a sketch of the warn-and-fall-back path is below.
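For the second option, here is a hedged sketch of what the fallback path could look like. The kernel registry and dispatch helper are hypothetical (the real dispatch would live inside the string accessor), but the round-trip through object dtype is the mechanism described above.

```python
import warnings
import pandas as pd
from pandas.errors import PerformanceWarning

# Hypothetical: pretend only these str methods have native Arrow kernels.
ARROW_KERNELS = {"isupper", "upper", "lower"}

def call_str_method(s: pd.Series, name: str, *args, **kwargs):
    """Dispatch a .str method, falling back to object dtype when no
    Arrow kernel exists (option 2 above)."""
    if name in ARROW_KERNELS:
        # The real implementation would invoke the native Arrow kernel here.
        return getattr(s.str, name)(*args, **kwargs)
    warnings.warn(
        f"str.{name} has no Arrow kernel; falling back to object dtype",
        PerformanceWarning,
    )
    obj = s.astype(object)                            # copy out of Arrow memory
    result = getattr(obj.str, name)(*args, **kwargs)  # Python-level path
    if result.dtype == object:                        # string results: convert back
        result = result.astype("string")
    return result

# e.g. call_str_method(pd.Series(["é"], dtype="string"), "normalize", "NFC")
```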