[ArrowStringArray] PERF: implement ArrowStringArray._str_split #41085

Closed. 20 commits were proposed for merge.

32 changes: 32 additions & 0 deletions pandas/core/arrays/string_arrow.py
@@ -763,3 +763,35 @@ def _str_lower(self):

    def _str_upper(self):
        return type(self)(pc.utf8_upper(self._data))

    def _str_split(self, pat=None, n=-1, expand=False):
        if pat is None:
            if hasattr(pc, "utf8_split_whitespace"):
                if n is None or n == 0:
                    n = -1
                result = pc.utf8_split_whitespace(self._data, max_splits=n)
            else:
                return super()._str_split(pat=pat, n=n, expand=expand)
        else:
            if len(pat) == 1 and hasattr(pc, "split_pattern"):
                if n is None or n == 0:
                    n = -1
                result = pc.split_pattern(self._data, pattern=pat, max_splits=n)
            else:
                return super()._str_split(pat=pat, n=n, expand=expand)

        if result.null_count:
            is_valid = np.array(result.is_valid())
            result = np.array(result)
            result[~is_valid] = self.dtype.na_value
            valid = result[is_valid]
            # we need to loop through to avoid numpy indexing assignment errors when
            # the result is not a ragged array and interpreted as a 2 dimensional
Member

Can you show a small example of the problem? Quickly testing, I see no 2D array:

In [49]: np.array(pa.array([[1, 2], [3, 4]]))
Out[49]: array([array([1, 2]), array([3, 4])], dtype=object)
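
One way the failure described in the code comment could arise is on the NumPy assignment side rather than in the pyarrow conversion: when equal-length Python lists are assigned through a boolean mask, NumPy first coerces the right-hand side into a 2D array. A minimal sketch with invented data and NumPy only (the names `splits` and `mask_assignment_failed` are illustrative and do not come from the PR):

```python
import numpy as np

# Illustrative data: two equal-length splits and one missing value.
result = np.empty(3, dtype=object)
is_valid = np.array([True, False, True])
splits = [["a", "b"], ["c", "d"]]  # not ragged, so NumPy sees shape (2, 2)

# Assigning through the mask coerces `splits` to a 2D array and fails.
try:
    result[is_valid] = splits
    mask_assignment_failed = False
except (ValueError, TypeError):
    mask_assignment_failed = True

# Assigning one element at a time stores each list as a single object.
for i, val in zip(np.flatnonzero(is_valid), splits):
    result[int(i)] = val

print(mask_assignment_failed)
print(result)
```

With ragged input such as `[["a"], ["b", "c"]]` the masked assignment would not hit this path, which may be why a quick test does not reproduce it.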

            # array
            for i, val in enumerate(valid):
                valid[i] = val.tolist()
        else:
            result = np.array(result)
            for i, val in enumerate(result):
                result[i] = val.tolist()
Member

In general, I would maybe leave out this conversion to lists for now (it's opt-in, so there can be some change in behaviour where needed; and since the main reason to use this arrow dtype is performance, it could make sense to keep the arrays).
Or does that break many tests?
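
To make the trade-off concrete, here is a small sketch of the two element types under discussion. It assumes the pre-conversion result is an object array whose elements are NumPy arrays, as in the `In [49]` output earlier in this thread; the data and the names `as_arrays` and `as_lists` are invented for illustration.

```python
import numpy as np

# Stand-in for np.array(<pyarrow list array>): an object array of ndarrays.
as_arrays = np.empty(2, dtype=object)
as_arrays[0] = np.array(["a", "b"], dtype=object)
as_arrays[1] = np.array(["c"], dtype=object)

# The conversion step from the diff: replace each ndarray with a Python list.
as_lists = as_arrays.copy()
for i, val in enumerate(as_lists):
    as_lists[i] = val.tolist()

print(type(as_arrays[0]).__name__)
print(type(as_lists[0]).__name__)
```

Leaving out the loop keeps the elements as arrays, which is the behaviour change being weighed here.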

Member Author

No tests break with expand=True, and only one test (which we can change) breaks with expand=False.

But that is only part of the perf issue. I have added a benchmark and will circle back to this shortly (pushing changes to origin triggers a notification, but this is not ready for review yet).

Member Author

@simonjayhawkins simonjayhawkins May 19, 2021

The additional time in DataFrame construction from an array of arrays is more than the time taken to convert to lists.

convert to lists

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   100000    0.051    0.000    0.051    0.000 {method 'tolist' of 'numpy.ndarray' objects}
       23    0.018    0.001    0.018    0.001 {built-in method numpy.array}
       10    0.015    0.001    0.015    0.001 {pyarrow.lib.array}
        2    0.012    0.006    0.012    0.006 {method 'call' of 'pyarrow._compute.Function' objects}
        1    0.012    0.012    0.015    0.015 accessor.py:290(<listcomp>)
        1    0.011    0.011    0.072    0.072 {pandas._libs.lib.map_infer_mask}
        1    0.010    0.010    0.027    0.027 accessor.py:286(<listcomp>)
   100000    0.010    0.000    0.061    0.000 string_arrow.py:923(<lambda>)
   100000    0.009    0.000    0.016    0.000 accessor.py:280(cons_row)
   100001    0.009    0.000    0.012    0.000 accessor.py:289(<genexpr>)
       10    0.009    0.001    0.009    0.001 {built-in method pandas._libs.lib.ensure_string_array}
   100032    0.007    0.000    0.007    0.000 {pandas._libs.lib.is_list_like}
200095/200071    0.007    0.000    0.007    0.000 {built-in method builtins.len}

not converting to lists

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.070    0.070    0.070    0.070 construction.py:793(<listcomp>)
       21    0.021    0.001    0.021    0.001 {built-in method numpy.array}
        1    0.015    0.015    0.019    0.019 accessor.py:290(<listcomp>)
       10    0.015    0.002    0.015    0.002 {pyarrow.lib.array}
        1    0.012    0.012    0.012    0.012 {method 'call' of 'pyarrow._compute.Function' objects}
        1    0.010    0.010    0.025    0.025 accessor.py:286(<listcomp>)
       10    0.009    0.001    0.009    0.001 {built-in method pandas._libs.lib.ensure_string_array}
   100001    0.009    0.000    0.013    0.000 accessor.py:289(<genexpr>)
   100000    0.008    0.000    0.015    0.000 accessor.py:280(cons_row)
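
Tables in this layout are what Python's `cProfile` and `pstats` modules print. A minimal sketch of producing one, sorted by `tottime`, follows; the profiled function `convert_to_lists` and its input are hypothetical stand-ins, not the workload measured in this thread.

```python
import cProfile
import io
import pstats


def convert_to_lists(rows):
    # Hypothetical stand-in workload: copy each row into a new list.
    return [list(row) for row in rows]


rows = [("a", "b", "c")] * 100_000

profiler = cProfile.Profile()
profiler.enable()
converted = convert_to_lists(rows)
profiler.disable()

# Render the stats sorted by time spent inside each function (tottime).
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("tottime").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by `tottime` rather than `cumtime` surfaces the functions where time is spent directly, which is how the single `construction.py` list comprehension stands out in the second table.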

        return result