Skip to content

Refactor string methods for StringArray + return IntegerArray for numeric results #29640

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 15 commits into from
Nov 19, 2019
Merged
36 changes: 34 additions & 2 deletions doc/source/user_guide/text.rst
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,7 @@ Text Data Types

.. versionadded:: 1.0.0

There are two main ways to store text data
There are two ways to store text data in pandas:

1. ``object`` -dtype NumPy array.
2. :class:`StringDtype` extension type.
Expand Down Expand Up @@ -63,7 +63,39 @@ Or ``astype`` after the ``Series`` or ``DataFrame`` is created
s
s.astype("string")

Everything that follows in the rest of this document applies equally to
.. _text.differences:

Behavior differences
^^^^^^^^^^^^^^^^^^^^

These are places where the behavior of ``StringDtype`` objects differ from
``object`` dtype

l. For ``StringDtype``, :ref:`String accessor methods<api.series.str>`
that return **numeric** output will always return a nullable integer dtype,
rather either int or float dtype, depending on the presence of NA values.

.. ipython:: python

s = pd.Series(["a", None, "b"], dtype="string")
s.str.count("a")
s.dropna().str.count("a")

Both outputs are ``Int64`` dtype. Compare that with object-dtype

.. ipython:: python

s.astype(object).str.count("a")
s.astype(object).dropna().str.count("a")

When NA values are present, the output dtype is float64.

2. Some string methods, like :meth:`Series.str.decode` are not available
on ``StringArray`` because ``StringArray`` only holds strings, not
bytes.


Everything else that follows in the rest of this document applies equally to
``string`` and ``object`` dtype.

.. _text.string_methods:
Expand Down
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ Previously, strings were typically stored in object-dtype NumPy arrays.
``StringDtype`` is currently considered experimental. The implementation
and parts of the API may change without warning.

The text extension type solves several issues with object-dtype NumPy arrays:
The ``'string'`` extension type solves several issues with object-dtype NumPy arrays:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need any updating an/or reference to your new section that you added in text.rst?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TomAugspurger you can probably update the sentence "The usual string accessor methods work. Where appropriate, the return type of the Series or columns of a DataFrame will also have string dtype." 20 lines below this line, to include that it can also return IntegerDtype in certain cases.


1. You can accidentally store a *mixture* of strings and non-strings in an
``object`` dtype array. A ``StringArray`` can only store strings.
Expand Down
15 changes: 11 additions & 4 deletions pandas/_libs/lib.pyx
Original file line number Diff line number Diff line change
Expand Up @@ -2208,9 +2208,13 @@ def maybe_convert_objects(ndarray[object] objects, bint try_float=0,
return objects


_no_default = object()


@cython.boundscheck(False)
@cython.wraparound(False)
def map_infer_mask(ndarray arr, object f, const uint8_t[:] mask, bint convert=1):
def map_infer_mask(ndarray arr, object f, const uint8_t[:] mask, bint convert=1,
object na_value=_no_default, object dtype=object):
"""
Substitute for np.vectorize with pandas-friendly dtype inference

Expand All @@ -2225,14 +2229,17 @@ def map_infer_mask(ndarray arr, object f, const uint8_t[:] mask, bint convert=1)
"""
cdef:
Py_ssize_t i, n
ndarray[object] result
ndarray result
object val

n = len(arr)
result = np.empty(n, dtype=object)
result = np.empty(n, dtype=dtype)
for i in range(n):
if mask[i]:
val = arr[i]
if na_value is _no_default:
val = arr[i]
else:
val = na_value
else:
val = f(arr[i])

Expand Down
50 changes: 49 additions & 1 deletion pandas/core/strings.py
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@
from functools import wraps
import re
import textwrap
from typing import Dict, List
from typing import TYPE_CHECKING, Any, Callable, Dict, List
import warnings

import numpy as np
Expand All @@ -15,10 +15,14 @@
ensure_object,
is_bool_dtype,
is_categorical_dtype,
is_extension_array_dtype,
is_integer,
is_integer_dtype,
is_list_like,
is_object_dtype,
is_re,
is_scalar,
is_string_dtype,
)
from pandas.core.dtypes.generic import (
ABCDataFrame,
Expand All @@ -28,9 +32,14 @@
)
from pandas.core.dtypes.missing import isna

from pandas._typing import ArrayLike, Dtype
from pandas.core.algorithms import take_1d
from pandas.core.base import NoNewAttributesMixin
import pandas.core.common as com
from pandas.core.construction import extract_array

if TYPE_CHECKING:
from pandas.arrays import StringArray

_cpython_optimized_encoders = (
"utf-8",
Expand Down Expand Up @@ -109,9 +118,48 @@ def cat_safe(list_of_columns: List, sep: str):

def _na_map(f, arr, na_result=np.nan, dtype=object):
# should really _check_ for NA
if is_extension_array_dtype(arr.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you sync the signatures up with _map, also rename _map -> _map_arraylike (or similar)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, don't follow this. I think they do match, aside from @jbrockmendel's request to call it func rather than f.

I guess na_map calls it na_result while _map calls it na_value.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am really -1 on 2 different branches here. If they have exactly the same signature a little less negative. again I would rename _map to be more informative here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with "-1 on 2 different branches here" ?
The whole purpose of this PR is to add a separate branch in case of StringArray (because we can be more efficient and want to be more specific in the result dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why my initial PR was so much more complex, since it tried to handle both cases similarly. I think that was more complex than this.

As Joris says, the main point of divergence is that for StringArray we usually know the result dtype exactly. It doesn't depend on the presence of NAs. Additionally,

  1. We're still using map_infer_dtype for both, so the core implementation is the same.
  2. We'll eventually deprecate .str on object-dtype, so we will end up with just this implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, at the very least these signatures for what you call _map and _stringarray_map should be exactly the same.
and _map -> _map_object and _stringarray_map -> _map_stringarray.

I think this is crucial for not adding technical debt.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback is your suggestion to add na_mask to the _stringarray_map signature and have it just not be used? I think this relates to the "this should be a StringArray method" discussion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_map and _na_map already have inconsistent signatures. I'm not sure why it's that way on master, but I'm a bit against adding unused arguments in this case.

What's the technical debt we're adding here? By definition, we need to handle StringArray differently, since its result type doesn't depend on the presence of NAs.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_map and _na_map already have inconsistent signatures. I'm not sure why it's that way on master, but I'm a bit against adding unused arguments in this case.

And there is also a good reason for that, as _map has an additional argument na_mask that is used internally in _map (for a recursive call).
I think refactoring _map is outside of the scope of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you create an issue to clean this up (_map and _na_map), and/or refactor of this, post this PR.

arr = extract_array(arr, extract_numpy=True)
return _map_ea(f, arr, na_value=na_result, dtype=dtype)
return _map(f, arr, na_mask=True, na_value=na_result, dtype=dtype)


def _map_ea(
func: Callable, arr: "StringArray", na_value: Any, dtype: Dtype
) -> ArrayLike:
from pandas.arrays import IntegerArray, StringArray

mask = isna(arr)

assert isinstance(arr, StringArray)
arr = arr._ndarray

if is_integer_dtype(dtype):
na_value_is_na = isna(na_value)
if na_value_is_na:
na_value = 1
result = lib.map_infer_mask(
arr,
func,
mask.view("uint8"),
convert=False,
na_value=na_value,
dtype=np.dtype("int64"),
)

if not na_value_is_na:
mask[:] = False

return IntegerArray(result, mask)

elif is_string_dtype(dtype) and not is_object_dtype(dtype):
result = lib.map_infer_mask(arr, func, mask.view("uint8"), na_value=na_value)
return StringArray(result)
# TODO: BooleanArray
else:
return lib.map_infer_mask(arr, func, mask.view("uint8"))


def _map(f, arr, na_mask=False, na_value=np.nan, dtype=object):
if not len(arr):
return np.ndarray(0, dtype=dtype)
Expand Down
45 changes: 39 additions & 6 deletions pandas/tests/test_strings.py
Original file line number Diff line number Diff line change
Expand Up @@ -731,7 +731,10 @@ def test_count(self):
tm.assert_series_equal(result, exp)

# mixed
mixed = ["a", np.nan, "b", True, datetime.today(), "foo", None, 1, 2.0]
mixed = np.array(
["a", np.nan, "b", True, datetime.today(), "foo", None, 1, 2.0],
dtype=object,
)
rs = strings.str_count(mixed, "a")
xp = np.array([1, np.nan, 0, np.nan, np.nan, 0, np.nan, np.nan, np.nan])
tm.assert_numpy_array_equal(rs, xp)
Expand All @@ -755,14 +758,14 @@ def test_contains(self):
expected = np.array([False, np.nan, False, False, True], dtype=np.object_)
tm.assert_numpy_array_equal(result, expected)

values = ["foo", "xyz", "fooommm__foo", "mmm_"]
values = np.array(["foo", "xyz", "fooommm__foo", "mmm_"], dtype=object)
result = strings.str_contains(values, pat)
expected = np.array([False, False, True, True])
assert result.dtype == np.bool_
tm.assert_numpy_array_equal(result, expected)

# case insensitive using regex
values = ["Foo", "xYz", "fOOomMm__fOo", "MMM_"]
values = np.array(["Foo", "xYz", "fOOomMm__fOo", "MMM_"], dtype=object)
result = strings.str_contains(values, "FOO|mmm", case=False)
expected = np.array([True, False, True, True])
tm.assert_numpy_array_equal(result, expected)
Expand All @@ -773,7 +776,10 @@ def test_contains(self):
tm.assert_numpy_array_equal(result, expected)

# mixed
mixed = ["a", np.nan, "b", True, datetime.today(), "foo", None, 1, 2.0]
mixed = np.array(
["a", np.nan, "b", True, datetime.today(), "foo", None, 1, 2.0],
dtype=object,
)
rs = strings.str_contains(mixed, "o")
xp = np.array(
[False, np.nan, False, np.nan, np.nan, True, np.nan, np.nan, np.nan],
Expand Down Expand Up @@ -869,7 +875,10 @@ def test_endswith(self):
tm.assert_series_equal(result, exp.fillna(False).astype(bool))

# mixed
mixed = ["a", np.nan, "b", True, datetime.today(), "foo", None, 1, 2.0]
mixed = np.array(
["a", np.nan, "b", True, datetime.today(), "foo", None, 1, 2.0],
dtype=object,
)
rs = strings.str_endswith(mixed, "f")
xp = np.array(
[False, np.nan, False, np.nan, np.nan, False, np.nan, np.nan, np.nan],
Expand Down Expand Up @@ -3488,10 +3497,13 @@ def test_casefold(self):


def test_string_array(any_string_method):
method_name, args, kwargs = any_string_method
if method_name == "decode":
pytest.skip("decode requires bytes.")

data = ["a", "bb", np.nan, "ccc"]
a = Series(data, dtype=object)
b = Series(data, dtype="string")
method_name, args, kwargs = any_string_method

expected = getattr(a.str, method_name)(*args, **kwargs)
result = getattr(b.str, method_name)(*args, **kwargs)
Expand All @@ -3502,8 +3514,29 @@ def test_string_array(any_string_method):
):
assert result.dtype == "string"
result = result.astype(object)

elif expected.dtype == "float" and expected.isna().any():
assert result.dtype == "Int64"
result = result.astype("float")

elif isinstance(expected, DataFrame):
columns = expected.select_dtypes(include="object").columns
assert all(result[columns].dtypes == "string")
result[columns] = result[columns].astype(object)
tm.assert_equal(result, expected)


@pytest.mark.parametrize(
"method,args,expected",
[
("count", ("a",), [2, None]),
("find", ("a",), [0, None]),
("index", ("a",), [0, None]),
("rindex", ("a",), [2, None]),
],
)
def test_string_array_numeric_integer_array(method, args, expected):
s = Series(["aba", None], dtype="string")
result = getattr(s.str, method)(*args)
expected = Series(expected, dtype="Int64")
tm.assert_series_equal(result, expected)