Refactor string methods for StringArray + return IntegerArray for numeric results #29640

TomAugspurger · 2019-11-15T17:06:59Z

Intended as an alternative to #29637.

This is a much smaller change. It only changes the codepath for StringDtype.
I think this is OK since someday (well down the road) we'll want to deprecate
the .str accessor on object-dtype Series. When we enforce that, we can just
delete the entire old implementation.

The API change is limited to always returning Int64Dtype for numeric outputs, rather than int64 when there are no NAs and float64 when there are.
When BoolArray is done, we'll make the same change for the boolean-returning methods.
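
A small sketch of the dtype behavior described above may help. It assumes a pandas version that ships StringDtype (1.0 or later); the variable names are illustrative only and do not come from the PR.

```python
import pandas as pd

# Object-dtype strings: the numeric result dtype depends on whether NAs are present.
obj_no_na = pd.Series(["a", "ab"], dtype=object)
obj_with_na = pd.Series(["a", None], dtype=object)
print(obj_no_na.str.count("a").dtype)    # int64
print(obj_with_na.str.count("a").dtype)  # float64

# StringDtype: the numeric result is a nullable Int64 either way.
str_no_na = pd.Series(["a", "ab"], dtype="string")
str_with_na = pd.Series(["a", None], dtype="string")
print(str_no_na.str.count("a").dtype)    # Int64
print(str_with_na.str.count("a").dtype)  # Int64
```

With the nullable result, the missing entry stays missing (NA) rather than forcing the whole column to float.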

As a side benefit, we get a nice perf boost: because the output dtypes are deterministic, we
can skip an object-dtype allocation.

In [2]: s = pd.Series(['a'] * 100_000 + [None], dtype="string")

In [3]: t = s.astype(object)

In [4]: %timeit s.str.count("a")
50.7 ms ± 2.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [5]: %timeit t.str.count("a")
62.7 ms ± 742 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

jorisvandenbossche · 2019-11-16T09:45:19Z

> As a side benefit, we get a nice perf boost

Maybe a copy paste error, but the example doesn't show much performance boost :) (less than a percent)

jorisvandenbossche · 2019-11-16T09:46:00Z

This looks good to me! I suppose this is simpler than your other PR

jbrockmendel · 2019-11-16T16:44:04Z

Is this time-sensitive or can i wait and look at it on Monday?

TomAugspurger · 2019-11-16T19:57:03Z

No rush on this.

I did fail on the copy paste. The speed up is about 15% IIRC.

jreback · 2019-11-17T13:45:22Z

pandas/core/strings.py

@@ -109,9 +118,51 @@ def cat_safe(list_of_columns: List, sep: str):

 def _na_map(f, arr, na_result=np.nan, dtype=object):
    # should really _check_ for NA
+    if is_extension_array_dtype(arr.dtype):


Can you sync the signatures up with _map? Also, rename _map -> _map_arraylike (or similar).

Sorry, don't follow this. I think they do match, aside from @jbrockmendel's request to call it func rather than f.

I guess _na_map calls it na_result while _map calls it na_value.

I am really -1 on 2 different branches here. If they have exactly the same signature, I'm a little less negative. Again, I would rename _map to be more informative here.

What do you mean with "-1 on 2 different branches here" ?
The whole purpose of this PR is to add a separate branch in case of StringArray (because we can be more efficient and want to be more specific in the result dtype)

That's why my initial PR was so much more complex, since it tried to handle both cases similarly. I think that was more complex than this.

As Joris says, the main point of divergence is that for StringArray we usually know the result dtype exactly. It doesn't depend on the presence of NAs. Additionally:

We're still using map_infer_dtype for both, so the core implementation is the same.

We'll eventually deprecate .str on object-dtype, so we will end up with just this implementation.
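
The divergence discussed here can be sketched in plain Python. This is a toy model, not the pandas implementation: `map_object` and `map_masked` are hypothetical names, lists stand in for arrays, and the mapped function is assumed to return integers.

```python
from typing import Callable, List, Optional, Tuple


def map_object(
    func: Callable[[str], int], values: List[Optional[str]]
) -> Tuple[list, str]:
    """Object-style mapping: the result dtype is inferred after the fact."""
    out = [None if v is None else func(v) for v in values]
    # Inference step: one missing value forces the integers to become floats.
    if any(v is None for v in out):
        return [float("nan") if v is None else float(v) for v in out], "float64"
    return out, "int64"


def map_masked(
    func: Callable[[str], int], values: List[Optional[str]], fill: int = 0
) -> Tuple[List[int], List[bool], str]:
    """Masked-style mapping: data plus a boolean mask, with a fixed dtype."""
    mask = [v is None for v in values]
    data = [fill if m else func(v) for v, m in zip(values, mask)]
    return data, mask, "Int64"


def count_a(s: str) -> int:
    return s.count("a")


print(map_object(count_a, ["a", "aa"]))     # ([1, 2], 'int64')
print(map_object(count_a, ["a", None])[1])  # float64
print(map_masked(count_a, ["a", None]))     # ([1, 0], [False, True], 'Int64')
```

In the masked version the dtype label never depends on the data, which is why that path needs no inference pass over the results.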

OK, at the very least the signatures for what you call _map and _stringarray_map should be exactly the same,
and _map -> _map_object and _stringarray_map -> _map_stringarray.

I think this is crucial for not adding technical debt.

@jreback is your suggestion to add na_mask to the _stringarray_map signature and have it just not be used? I think this relates to the "this should be a StringArray method" discussion

_map and _na_map already have inconsistent signatures. I'm not sure why it's that way on master, but I'm a bit against adding unused arguments in this case.

What's the technical debt we're adding here? By definition, we need to handle StringArray differently, since its result type doesn't depend on the presence of NAs.

> _map and _na_map already have inconsistent signatures. I'm not sure why it's that way on master, but I'm a bit against adding unused arguments in this case.

And there is also a good reason for that, as _map has an additional argument na_mask that is used internally in _map (for a recursive call).
I think refactoring _map is outside of the scope of this PR.

Can you create an issue to clean this up (_map and _na_map), and/or refactor this, after this PR?

TomAugspurger · 2019-11-18T12:37:15Z

Just the clipboard failure: #29676

jreback · 2019-11-18T13:45:41Z

pandas/core/strings.py

@@ -109,9 +118,51 @@ def cat_safe(list_of_columns: List, sep: str):

 def _na_map(f, arr, na_result=np.nan, dtype=object):
    # should really _check_ for NA
+    if is_extension_array_dtype(arr.dtype):


I am really -1 on 2 different branches here. If they have exactly the same signature, I'm a little less negative. Again, I would rename _map to be more informative here.

TomAugspurger · 2019-11-18T16:16:13Z

CI failure is from #29514

jbrockmendel · 2019-11-18T17:13:38Z

pandas/core/strings.py

    return _map(f, arr, na_mask=True, na_value=na_result, dtype=dtype)


+def _stringarray_map(
+    func: Callable[[str], Any], arr: "StringArray", na_value: Any, dtype: Dtype


na_value restricted to scalar?

No, I don't think so.

jbrockmendel · 2019-11-18T17:14:40Z

pandas/core/strings.py

+    na_value : Any
+        The value to use for missing values. By default, this is
+        the original value (NA).
+    dtype : Dtype


Elsewhere we say "np.dtype or ExtensionDtype". Is this the new policy?

Not sure if we have a policy. If it matters, this is an internal docstring, so I'm OK to use things from pandas._typing that we wouldn't have in a public docstring yet.

TomAugspurger · 2019-11-18T17:57:50Z

Thanks for the review. Addressed all the comments I think.

After this, I'll have a small change at #29597 to get it using the new methods (there are some now-unnecessary changes in #29597 that need to be removed).

TomAugspurger · 2019-11-18T18:52:53Z

Just the clipboard CI failure again.

TomAugspurger · 2019-11-19T13:16:20Z

@jreback thoughts on the conversation around the signatures? I think this is slightly blocking the pd.NA PR now.

jreback

very minor comments, rebase and ping on green.

jreback · 2019-11-19T13:32:26Z

doc/source/whatsnew/v1.0.0.rst

@@ -63,7 +63,7 @@ Previously, strings were typically stored in object-dtype NumPy arrays.
   ``StringDtype`` is currently considered experimental. The implementation
   and parts of the API may change without warning.

-The text extension type solves several issues with object-dtype NumPy arrays:
+The ``'string'`` extension type solves several issues with object-dtype NumPy arrays:


Does this need any updating and/or a reference to the new section that you added in text.rst?

@TomAugspurger you can probably update the sentence "The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype." 20 lines below this line, to include that it can also return IntegerDtype in certain cases.

jreback · 2019-11-19T13:33:27Z

pandas/core/strings.py

@@ -109,9 +118,51 @@ def cat_safe(list_of_columns: List, sep: str):

 def _na_map(f, arr, na_result=np.nan, dtype=object):
    # should really _check_ for NA
+    if is_extension_array_dtype(arr.dtype):


Can you create an issue to clean this up (_map and _na_map), and/or refactor this, after this PR?

jreback · 2019-11-19T13:33:44Z

pandas/core/strings.py

+    if is_extension_array_dtype(arr.dtype):
+        # just StringDtype
+        arr = extract_array(arr)
+        return _stringarray_map(f, arr, na_value=na_result, dtype=dtype)


I would rename these as indicated above.

> I would rename these as indicated above.

Can you be explicit in what exact names you want to propose? (I am getting a bit lost in all the comments)
Tom already renamed _ea_map to _stringarray_map

Right, we should change _map -> _object_map (though I actually prefer the opposite, _map_object and _map_stringarray), but either is OK.

Thanks, makes sense. I renamed to _map_object and _map_stringarray.
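
For readers following the naming discussion, the branch selection in the diff above can be illustrated with the public `pandas.api.types.is_extension_array_dtype` check. `pick_branch` is a hypothetical helper written for this illustration; it only reports which of the two renamed functions a given Series would be routed to and does not call pandas internals.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype


def pick_branch(values: pd.Series) -> str:
    """Hypothetical helper: name the mapping branch a Series would take."""
    if is_extension_array_dtype(values.dtype):
        # In this PR the only extension dtype reaching this point is StringDtype.
        return "_map_stringarray"
    return "_map_object"


print(pick_branch(pd.Series(["a", None], dtype="string")))  # _map_stringarray
print(pick_branch(pd.Series(["a", None], dtype=object)))    # _map_object
```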

TomAugspurger · 2019-11-19T14:01:00Z

Merged master, and writing up an issue for the followup.

TomAugspurger · 2019-11-19T14:48:36Z

#29710 for followup.

TomAugspurger · 2019-11-19T16:22:29Z

Thanks all!

…eric results (pandas-dev#29640)

take 2

14ca804

TomAugspurger mentioned this pull request Nov 15, 2019

REF/API: String methods refactor #29637

Closed

TomAugspurger added the Performance (memory or execution speed performance) and Strings (string extension data type and string data) labels Nov 15, 2019

TomAugspurger added this to the 1.0 milestone Nov 15, 2019

jbrockmendel reviewed Nov 15, 2019

updates

07fc53d

jorisvandenbossche reviewed Nov 16, 2019

TomAugspurger added 2 commits November 16, 2019 16:33

Merge remote-tracking branch 'upstream/master' into masked-string-2

ddef378

updates

a147081

WillAyd requested changes Nov 17, 2019

jreback requested changes Nov 17, 2019

TomAugspurger added 3 commits November 17, 2019 09:54

type callable

36b1bf7

wip

d01ddfe

more dtypes

22a2c79

jorisvandenbossche approved these changes Nov 18, 2019

jreback requested changes Nov 18, 2019

TomAugspurger added 2 commits November 18, 2019 09:30

Merge remote-tracking branch 'upstream/master' into masked-string-2

686de87

doc

0ea2d6b