Apache Arrow has support for natively storing UTF-8 data, and work is ongoing to add kernels (e.g. `str.isupper()`) that operate on that data. This issue
is to discuss how we can expose the native string dtype to pandas users.
There are several things to discuss:
- How do users opt into this behavior?
- A fallback mode for kernels that are not yet implemented.
How do users opt into Arrow-backed StringArray?
The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).
`StringArray` is marked as experimental, so our usual restrictions on API-breaking
changes don't apply. But we want to do this in a way that's not too disruptive.
There are three ways to get a `StringDtype`-dtype array today:
- Infer: `pd.array(['a', 'b', None])`
- Explicit dtype: `dtype=pd.StringDtype()`
- String alias: `dtype="string"`
My preference is for all of these to stay consistent: they should all give the
same thing, either a `StringArray` backed by an object-dtype ndarray or a
`StringArray` backed by Arrow memory.
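For concreteness, here's a quick check using today's public API that the three construction paths already agree on the dtype; the goal is that they keep agreeing once an Arrow backend exists:

```python
import pandas as pd

# All three paths should produce the same StringDtype array, whichever
# backend (object-dtype ndarray today, Arrow memory later) is in use.
inferred = pd.array(["a", "b", None])                          # inference
explicit = pd.array(["a", "b", None], dtype=pd.StringDtype())  # explicit dtype
aliased = pd.array(["a", "b", None], dtype="string")           # string alias

assert inferred.dtype == explicit.dtype == aliased.dtype == pd.StringDtype()
```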
I also have a preference for not keeping our old implementation around for too
long, so I don't think we want something like `pd.PythonStringDtype()` as a
way to get the `StringArray` backed by an object-dtype ndarray.
The easiest way to support this is, I think, an option:

```python
>>> pd.options.mode.use_arrow_string_dtype = True
```

Then all three construction paths would create an Arrow-backed `StringArray`. A rough sketch of what the option could control follows.
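Here is a minimal sketch of option-based dispatch. Everything in it is hypothetical: the module-level flag stands in for the proposed `pd.options.mode.use_arrow_string_dtype`, and both classes are placeholders for the real array implementations.

```python
# Hypothetical sketch only: the flag stands in for the proposed
# pd.options.mode.use_arrow_string_dtype option.
USE_ARROW_STRING_DTYPE = False

class ObjectBackedStringArray:
    """Placeholder for today's StringArray (object-dtype ndarray storage)."""
    def __init__(self, values):
        self._data = list(values)

class ArrowBackedStringArray:
    """Placeholder for a StringArray backed by Arrow memory."""
    def __init__(self, values):
        self._data = list(values)  # a real version would hold a pyarrow array

def make_string_array(values):
    # Inference, dtype=pd.StringDtype(), and dtype="string" would all funnel
    # through a single check like this, keeping the three paths consistent.
    cls = ArrowBackedStringArray if USE_ARROW_STRING_DTYPE else ObjectBackedStringArray
    return cls(values)
```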
Fallback Mode
It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does

```python
>>> Series(['a', 'b'], dtype="string").str.normalize()  # no Arrow kernel
```

we have a few options:
- Raise, stating that there's no Arrow kernel for `normalize`.
- Issue a `PerformanceWarning`, astype to object, do the operation, and convert back.

I'm not sure which is best. My preference for now is probably to raise, but I could see doing either; a sketch of the warn-and-fall-back path is below.
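For the second option, here is a hedged sketch of what the fallback path could look like. The kernel registry and dispatch helper are hypothetical (the real dispatch would live inside the string accessor), but the round-trip through object dtype is the mechanism described above.

```python
import warnings
import pandas as pd
from pandas.errors import PerformanceWarning

# Hypothetical: pretend only these str methods have native Arrow kernels.
ARROW_KERNELS = {"isupper", "upper", "lower"}

def call_str_method(s: pd.Series, name: str, *args, **kwargs):
    """Dispatch a .str method, falling back to object dtype when no
    Arrow kernel exists (option 2 above)."""
    if name in ARROW_KERNELS:
        # The real implementation would invoke the native Arrow kernel here.
        return getattr(s.str, name)(*args, **kwargs)
    warnings.warn(
        f"str.{name} has no Arrow kernel; falling back to object dtype",
        PerformanceWarning,
    )
    obj = s.astype(object)                            # copy out of Arrow memory
    result = getattr(obj.str, name)(*args, **kwargs)  # Python-level path
    if result.dtype == object:                        # string results: convert back
        result = result.astype("string")
    return result

# e.g. call_str_method(pd.Series(["é"], dtype="string"), "normalize", "NFC")
```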