-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
CLN/ERR: str.cat internals #22725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLN/ERR: str.cat internals #22725
Changes from 1 commit
5f8890c
285a1f7
28e7859
807f18e
ed27c66
0d3c6d2
36c6240
a97fe67
e58ec9d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,7 +18,6 @@ | |
import pandas.core.common as com | ||
from pandas.core.algorithms import take_1d | ||
import pandas.compat as compat | ||
from pandas.compat.numpy import _np_version_under1p11 | ||
from pandas.core.base import NoNewAttributesMixin | ||
from pandas.util._decorators import Appender | ||
import re | ||
|
@@ -38,15 +37,15 @@ | |
_shared_docs = dict() | ||
|
||
|
||
def cat_core(all_cols, sep): | ||
def cat_core(list_of_columns, sep): | ||
""" | ||
Auxiliary function for :meth:`str.cat` | ||
|
||
Parameters | ||
---------- | ||
all_cols : two-dimensional numpy array | ||
array of columns to be concatenated with sep; | ||
this array may not contain NaNs! | ||
list_of_columns : list of numpy arrays | ||
List of arrays to be concatenated with sep; | ||
these arrays may not contain NaNs! | ||
sep : string | ||
The separator string for concatenating the columns | ||
|
||
|
@@ -55,15 +54,9 @@ def cat_core(all_cols, sep): | |
nd.array | ||
The concatenation of list_of_columns with sep | ||
""" | ||
list_of_columns = np.split(all_cols, all_cols.shape[1], axis=1) | ||
list_with_sep = [sep] * (2 * len(list_of_columns) - 1) | ||
list_with_sep[::2] = list_of_columns | ||
res = np.sum(list_with_sep, axis=0) | ||
if not (_np_version_under1p11 and len(res) == 0): | ||
# np.split splits into arrays of shape (N, 1); NOT (N,) | ||
# need to reduce dimensionality of result | ||
res = res[:, 0] | ||
return res | ||
return np.sum(list_with_sep, axis=0) | ||
|
||
|
||
def _na_map(f, arr, na_result=np.nan, dtype=object): | ||
|
@@ -2246,21 +2239,21 @@ def cat(self, others=None, sep=None, na_rep=None, join=None): | |
"'outer'|'inner'|'right'`. The future default will " | ||
"be `join='left'`.", FutureWarning, stacklevel=2) | ||
|
||
# concatenate others into DataFrame; need to add keys for uniqueness in | ||
# case of duplicate columns (for join is None, all indexes are already | ||
# the same after _get_series_list, which forces alignment in this case) | ||
others = concat(others, axis=1, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @WillAyd, this is the main reason for the slow-down. For working with a |
||
join=(join if join == 'inner' else 'outer'), | ||
keys=range(len(others)), copy=False) | ||
# if join is None, _get_series_list already force-aligned indexes | ||
join = 'left' if join is None else join | ||
|
||
# align if required | ||
if not data.index.equals(others.index): | ||
if any(not data.index.equals(x.index) for x in others): | ||
# Need to add keys for uniqueness in case of duplicate columns | ||
others = concat(others, axis=1, | ||
join=(join if join == 'inner' else 'outer'), | ||
keys=range(len(others)), copy=False) | ||
data, others = data.align(others, join=join) | ||
others = [others[x] for x in others] # again list of Series | ||
|
||
# collect all columns | ||
all_cols = ensure_object(concat([data, others], axis=1, copy=False)) | ||
na_masks = isna(all_cols) | ||
union_mask = np.logical_or.reduce(na_masks, axis=1) | ||
all_cols = [ensure_object(x) for x in [data] + others] | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Assuming the index is aligned here can we alternately just concat the columns together and call |
||
na_masks = np.array([isna(x) for x in all_cols]) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Similar comment above but should be able to use |
||
union_mask = np.logical_or.reduce(na_masks, axis=0) | ||
|
||
if na_rep is None and union_mask.any(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. comment on these cases There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added comments |
||
# no na_rep means NaNs for all rows where any column has a NaN | ||
|
@@ -2269,10 +2262,13 @@ def cat(self, others=None, sep=None, na_rep=None, join=None): | |
np.putmask(result, union_mask, np.nan) | ||
|
||
not_masked = ~union_mask | ||
result[not_masked] = cat_core(all_cols[not_masked], sep) | ||
result[not_masked] = cat_core([x[not_masked] for x in all_cols], | ||
sep) | ||
elif na_rep is not None and union_mask.any(): | ||
# fill NaNs with na_rep in case there are actually any NaNs | ||
result = cat_core(np.where(na_masks, na_rep, all_cols), sep) | ||
all_cols = [np.where(nm, na_rep, col) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Readability nit but can you change |
||
for nm, col in zip(na_masks, all_cols)] | ||
result = cat_core(all_cols, sep) | ||
else: | ||
# no NaNs - can just concatenate | ||
result = cat_core(all_cols, sep) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is with this documented limitation? Believe in master that
NaN
is valid and returnsNaN
when passed inThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd
the new helper function
cat_core
has nothing to do with the existing (essentially fully-fledged up to index-handling)str_cat
. It's an internal function to avoid lots of copied code, and (as it just wrapsnp.sum
) is not nan-safe for string values.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd
Does that answer your question? what would you like to see for this docstring?