Description
The purpose of this issue is to discuss a plan of attack for improving string dtypes in NumPy to better suit Pandas.
Context
- @seberg has spent a ton of effort improving the infrastructure NumPy offers for implementing dtypes, both in NumPy itself and as third-party dtypes. See NEPs 40-43. It's far enough along now that it makes sense to start using it, even if some things may still be missing. String dtypes were explicitly thought about in that design.
- Pandas is a main potential consumer of new/improved string dtypes. There are currently two ways to do strings in Pandas: via `object` (no longer recommended) and via `StringDtype` (which can have multiple implementations, it looks like): https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-types.
- There's bandwidth available (thanks to a cross-project NASA grant for the next 2.5 years; @jreback was involved in adding this topic to the grant from the Pandas side) to work on a prototype for string dtypes that improve on what NumPy offers today, focused on Pandas needs. This can live in a separate repo / a Pandas fork for quite a while.
There are a ton of relevant threads and issues for both NumPy and Pandas; I'm not going to try to link them all here.
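For anyone less familiar with the Pandas side, here is a minimal sketch of the two string representations mentioned above. The variable names are just for illustration, and the exact behavior of `StringDtype` (storage backend, missing-value sentinel) depends on the Pandas version and options in use.

```python
import pandas as pd

# Option 1: object dtype. Each element is an arbitrary Python object,
# so nothing enforces that the column only holds strings.
as_object = pd.Series(["a", "bc", None], dtype=object)
mixed = pd.Series(["a", 1], dtype=object)  # allowed, which is part of the problem

# Option 2: the dedicated extension dtype, requested via the "string" alias.
as_string = pd.Series(["a", "bc", None], dtype="string")

# String methods work on both; the dedicated dtype keeps missing values
# as missing instead of mixing None into the results.
lengths = as_string.str.len()

print(as_object.dtype)   # object
print(as_string.dtype)   # string
print(lengths.tolist())
```

Neither option is backed by a native NumPy string dtype today, which is the gap this issue is about.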
Proposed way of approaching this
There are folks from Pandas (I think at least @jreback, @jbrockmendel and @jorisvandenbossche), NumPy (@seberg, @mattip), and the NASA grant (@peytondmurray, who will do some of the heavy lifting here on the prototype, Cc @dharhas as PI) with an interest in this. It's probably also relevant for other dataframe libraries; what Arrow provides is relevant; the dataframe interchange protocol probably too. In short: many potentially interested people and projects. So I'd suggest we add comments, new ideas, and concerns on this issue - and then also have a call next week with whoever is interested, to have a higher-bandwidth conversation on how to get started.
A few thoughts on what to do
- A true variable-length string dtype for NumPy is probably most interesting (more so than, for example, reimplementing the fixed-length dtypes in the new dtype framework). Such a variable-length dtype is also mentioned on the NumPy Roadmap. So best to only focus on that first.
- Start working in a separate repo for this, and link it from here. I'll also note that @seberg has a bunch of example dtypes (including one string one) in https://github.com/seberg/experimental_user_dtypes.
- Collect Pandas wishes, needs and pain points in this issue. Cross-link to other issues as appropriate (I apologize for not digging through the Pandas issue tracker to make a start - I figured that Pandas devs may know already what is most relevant here, and I don't).
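To make the first point above concrete, here is a small sketch of why the existing fixed-length dtype is a poor fit and why `object` is the usual fallback. It only uses current NumPy behavior; the variable names are illustrative.

```python
import numpy as np

# Fixed-width unicode: the dtype is sized for the longest element at creation,
# and every element reserves that much space (4 bytes per character).
fixed = np.array(["a", "abc"])
print(fixed.dtype.kind, fixed.dtype.itemsize)  # U 12

# Assigning a longer string silently truncates to the fixed width.
fixed[0] = "abcdef"
truncated = str(fixed[0])
print(truncated)  # abc

# The usual workaround is object dtype: no truncation, but each element is a
# pointer to a Python str, with no type guarantee for the array as a whole.
obj = np.array(["a", "abc"], dtype=object)
obj[0] = "abcdef"
print(obj[0])  # abcdef
```

A true variable-length string dtype would aim to avoid both the truncation/padding cost of the first approach and the lack of type information of the second.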