Pandas string dtype needs from NumPy - prototyping & plan of attack

The purpose of this issue is to discuss a plan of attack for improving string dtypes in NumPy to better suit Pandas. 

### Context

1. @seberg has spent a ton of effort improving the infrastructure NumPy offers for implementing dtypes, both in NumPy itself and as third-party dtypes. See [NEP 40-43](https://numpy.org/neps/nep-0040-legacy-datatype-impl.html). It's far enough along now that it makes sense to start using it, even if some things may still be missing. String dtypes were explicitly considered in that design.
2. Pandas is a main potential consumer of new/improved string dtypes. There are currently two ways to do strings in Pandas: via `object` (no longer recommended) and via `StringDtype` (which, it looks like, can have multiple implementations): https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-types.
3. There's bandwidth available to work on a prototype for string dtypes that improve on what NumPy offers today, focused on Pandas needs. This is thanks to a cross-project NASA grant for the next 2.5 years; @jreback was involved in adding this topic to the grant from the Pandas side. The prototype can live in a separate repo / a Pandas fork for quite a while.
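
For readers less familiar with the Pandas side, here is a minimal sketch of the two options mentioned in point 2. It only uses the public `pd.Series` constructor and `pd.StringDtype`; the sample values are made up for illustration, and the exact storage backend behind `StringDtype` depends on the Pandas version and on what is installed.

```python
import pandas as pd

values = ["a", "bb", None]  # made-up sample data

# Option 1: plain object dtype, each element is an arbitrary Python object.
s_obj = pd.Series(values, dtype=object)

# Option 2: the dedicated extension dtype, requested via the "string" alias.
s_str = pd.Series(values, dtype="string")

print(s_obj.dtype)
print(s_str.dtype)

# Both flag the missing entry, but only the second is a string-specific dtype.
print(s_obj.isna().tolist())
print(s_str.isna().tolist())
print(isinstance(s_str.dtype, pd.StringDtype))
```

Neither option is backed by a NumPy string dtype, which is the gap this issue is about.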

There's a ton of relevant threads and issues for both NumPy and Pandas; I'm not going to try to link them all here.

### Proposed way of approaching this

There are folks with an interest in this from Pandas (I think at least @jreback, @jbrockmendel and @jorisvandenbossche), NumPy (@seberg, @mattip), and the NASA grant (@peytondmurray, who will do some of the heavy lifting here on the prototype; Cc @dharhas as PI). It's probably also relevant for other dataframe libraries; what Arrow provides is relevant, and the dataframe interchange protocol probably is too. In short: many potentially interested people and projects. So I'd suggest we add comments, new ideas and concerns on this issue, and then also have a call next week with whoever is interested, to have a higher-bandwidth conversation on how to get started.

### A few thoughts on what to do

1. A true variable-length string dtype for NumPy is probably most interesting (more so than, for example, reimplementing the fixed-length dtypes in the new dtype framework). Such a variable-length dtype is also mentioned on the [NumPy Roadmap](https://numpy.org/neps/roadmap.html#extensibility). So it's best to focus only on that first.
2. Start working in a separate repo for this, and link it from here. I'll also note that @seberg has a bunch of example dtypes (including a string one) in https://github.com/seberg/experimental_user_dtypes.
3. Collect Pandas wishes, needs and pain points in this issue. Cross-link to other issues as appropriate (I apologize for not digging through the Pandas issue tracker to make a start - I figured that Pandas devs may know already what is most relevant here, and I don't).
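
To make point 1 a bit more concrete, here is a small sketch of how NumPy's existing fixed-length unicode dtype behaves today, next to an `object` array. The sample strings are made up; this only illustrates current behavior and says nothing about what a new variable-length dtype would look like.

```python
import numpy as np

words = ["a", "numpy", "a considerably longer string"]  # made-up sample data

# Fixed-length unicode: the width is set by the longest element, and every
# element occupies that many characters (4 bytes per character).
fixed = np.array(words)
print(fixed.dtype, fixed.itemsize)

# Assigning a longer string into a fixed-length array truncates it silently.
small = np.array(["ab", "cd"])
small[0] = "abcdef"
print(small[0])

# The object fallback stores references to Python str objects instead, so
# lengths can vary, but the array holds arbitrary Python objects.
obj = np.array(words, dtype=object)
obj[0] = "any length fits here"
print(obj.dtype, obj[0])
```

The padding and truncation in the fixed-length case are the kind of behavior that makes a true variable-length dtype the more interesting target.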


