-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
[ArrowStringArray] PERF: implement ArrowStringArray._str_split #41085
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
34df9e5
7ad0269
427eff7
09ad85e
39dd30a
c9511d9
8678af7
5c2ab24
12407fb
24d2395
f43c61c
3d9297d
a574ccb
c7ba99c
9fc0144
8855100
ad3480f
70677c4
af58055
2abc508
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -763,3 +763,35 @@ def _str_lower(self): | |
|
||
def _str_upper(self): | ||
return type(self)(pc.utf8_upper(self._data)) | ||
|
||
def _str_split(self, pat=None, n=-1, expand=False): | ||
if pat is None: | ||
if hasattr(pc, "utf8_split_whitespace"): | ||
if n is None or n == 0: | ||
n = -1 | ||
result = pc.utf8_split_whitespace(self._data, max_splits=n) | ||
else: | ||
return super()._str_split(pat=pat, n=n, expand=expand) | ||
else: | ||
if len(pat) == 1 and hasattr(pc, "split_pattern"): | ||
if n is None or n == 0: | ||
n = -1 | ||
result = pc.split_pattern(self._data, pattern=pat, max_splits=n) | ||
else: | ||
return super()._str_split(pat=pat, n=n, expand=expand) | ||
|
||
if result.null_count: | ||
is_valid = np.array(result.is_valid()) | ||
result = np.array(result) | ||
result[~is_valid] = self.dtype.na_value | ||
valid = result[is_valid] | ||
# we need to loop through to avoid numpy indexing assignment errors when | ||
# the result is not a ragged array and interpreted as a 2 dimensional | ||
# array | ||
for i, val in enumerate(valid): | ||
valid[i] = val.tolist() | ||
else: | ||
result = np.array(result) | ||
for i, val in enumerate(result): | ||
result[i] = val.tolist() | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. In general, I would maybe leave out this conversion to lists for now (it's opt-in, so there can be some change in behaviour where needed; and since the main reason to use this arrow dtype is performance, it could make sense to keep the arrays). There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. no tests break with expand=True and only one test (that we can change) for expand=False. but that is only part of the perf issue. have added benchmark but will circle back to this shortly (pushing changes to origin trigger a notification, but not ready for review yet) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The additional time in DataFrame construction from an array of arrays is more than the time taken to convert to lists. convert to lists
not converting to lists
|
||
return result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you show a small example of the problem? Quickly testing I see no 2D array: