Skip to content

ENH: Implement _from_sequence_of_strings for BooleanArray #31159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Jan 23, 2020
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -144,7 +144,7 @@ type dedicated to boolean data that can hold missing values. The default
``bool`` data type based on a bool-dtype NumPy array, the column can only hold
``True`` or ``False``, and not missing values. This new :class:`~arrays.BooleanArray`
can store missing values as well by keeping track of this in a separate mask.
(:issue:`29555`, :issue:`30095`)
(:issue:`29555`, :issue:`30095`, :issue:`31131`)

.. ipython:: python

Expand Down
13 changes: 13 additions & 0 deletions pandas/core/arrays/boolean.py
Original file line number Diff line number Diff line change
Expand Up @@ -286,6 +286,19 @@ def _from_sequence(cls, scalars, dtype=None, copy: bool = False):
values, mask = coerce_to_array(scalars, copy=copy)
return BooleanArray(values, mask)

@classmethod
def _from_sequence_of_strings(cls, strings, dtype=None, copy=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type strings (List[str]), dtype and copy

def map_string(s):
if s in ["True", "true", "1"]:
return True
elif s in ["False", "false", "0"]:
return False
else:
return s
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At some point, we should probably make a more performant implementation for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, any thoughts on what such an implementation would look like?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, perf isn't worth worrying about too much until we can push the actual parsing down to the C reader. As is, we've already built up a list of Python strings in memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually not too hard to do in parsers.pyx; this is generically handled in python code rather than in cython.


scalars = [map_string(x) for x in strings]
return cls._from_sequence(scalars, dtype, copy)

def _values_for_factorize(self) -> Tuple[np.ndarray, Any]:
data = self._data.astype("int8")
data[self._mask] = -1
Expand Down
17 changes: 17 additions & 0 deletions pandas/tests/arrays/test_boolean.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,4 @@
import io
import operator

import numpy as np
Expand Down Expand Up @@ -251,6 +252,22 @@ def test_coerce_to_numpy_array():
np.array(arr, dtype="bool")


def test_to_boolean_array_from_strings():
result = BooleanArray._from_sequence_of_strings(["True", "False"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add in null values e.g. '' and NaN i think would work (as strings).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this specific method breaks when passed other kinds of strings. I think (correct me if I'm wrong) the null strings get handled upstream by read_csv according to na_values.

I could add in a None if you think that makes sense, or add additional null strings to the read_csv test (could parametrize over different NA strings)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should handle nulls as well (they will be converted in the Boolean constructor), if not, then this is broken too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, the csv parser already handles the different null string representations. So testing here just with np.nan or None is fine.

expected = BooleanArray(np.array([True, False]), np.array([False, False]))

tm.assert_extension_array_equal(result, expected)


def test_boolean_from_csv():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to go in the pandas/tests/io/parsers/test_dtypes.py

pls follow the existing patterns

input_string = "a\nTrue\nFalse\nNA\n"

result = pd.read_csv(io.StringIO(input_string), dtype="boolean")
expected = pd.DataFrame({"a": pd.array([True, False, None], dtype="boolean")})

tm.assert_frame_equal(result, expected)


def test_repr():
df = pd.DataFrame({"A": pd.array([True, False, None], dtype="boolean")})
expected = " A\n0 True\n1 False\n2 <NA>"
Expand Down