Skip to content

CLN/ERR: str.cat internals #22725

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Oct 14, 2018
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
46 changes: 21 additions & 25 deletions pandas/core/strings.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,6 @@
import pandas.core.common as com
from pandas.core.algorithms import take_1d
import pandas.compat as compat
from pandas.compat.numpy import _np_version_under1p11
from pandas.core.base import NoNewAttributesMixin
from pandas.util._decorators import Appender
import re
Expand All @@ -38,15 +37,15 @@
_shared_docs = dict()


def cat_core(all_cols, sep):
def cat_core(list_of_columns, sep):
"""
Auxiliary function for :meth:`str.cat`

Parameters
----------
all_cols : two-dimensional numpy array
array of columns to be concatenated with sep;
this array may not contain NaNs!
list_of_columns : list of numpy arrays
List of arrays to be concatenated with sep;
these arrays may not contain NaNs!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is with this documented limitation? Believe in master that NaN is valid and returns NaN when passed in

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd
the new helper function cat_core has nothing to do with the existing (essentially fully-fledged up to index-handling) str_cat. It's an internal function to avoid lots of copied code, and (as it just wraps np.sum) is not nan-safe for string values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd
Does that answer your question? what would you like to see for this docstring?

sep : string
The separator string for concatenating the columns

Expand All @@ -55,15 +54,9 @@ def cat_core(all_cols, sep):
nd.array
The concatenation of list_of_columns with sep
"""
list_of_columns = np.split(all_cols, all_cols.shape[1], axis=1)
list_with_sep = [sep] * (2 * len(list_of_columns) - 1)
list_with_sep[::2] = list_of_columns
res = np.sum(list_with_sep, axis=0)
if not (_np_version_under1p11 and len(res) == 0):
# np.split splits into arrays of shape (N, 1); NOT (N,)
# need to reduce dimensionality of result
res = res[:, 0]
return res
return np.sum(list_with_sep, axis=0)


def _na_map(f, arr, na_result=np.nan, dtype=object):
Expand Down Expand Up @@ -2246,21 +2239,21 @@ def cat(self, others=None, sep=None, na_rep=None, join=None):
"'outer'|'inner'|'right'`. The future default will "
"be `join='left'`.", FutureWarning, stacklevel=2)

# concatenate others into DataFrame; need to add keys for uniqueness in
# case of duplicate columns (for join is None, all indexes are already
# the same after _get_series_list, which forces alignment in this case)
others = concat(others, axis=1,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd, this is the main reason for the slow-down. For working with a DataFrame below (as you wished), we first need to create it with pd.concat (expensive). Before, we were only using pd.concat if the indices need to be aligned (which they don't in the benchmarks).

join=(join if join == 'inner' else 'outer'),
keys=range(len(others)), copy=False)
# if join is None, _get_series_list already force-aligned indexes
join = 'left' if join is None else join

# align if required
if not data.index.equals(others.index):
if any(not data.index.equals(x.index) for x in others):
# Need to add keys for uniqueness in case of duplicate columns
others = concat(others, axis=1,
join=(join if join == 'inner' else 'outer'),
keys=range(len(others)), copy=False)
data, others = data.align(others, join=join)
others = [others[x] for x in others] # again list of Series

# collect all columns
all_cols = ensure_object(concat([data, others], axis=1, copy=False))
na_masks = isna(all_cols)
union_mask = np.logical_or.reduce(na_masks, axis=1)
all_cols = [ensure_object(x) for x in [data] + others]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the index is aligned here can we alternately just concat the columns together and call ensure_object on the entire frame instead of using a list comprehension?

na_masks = np.array([isna(x) for x in all_cols])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment above but should be able to use pd.isnull instead of a list comp with isna

union_mask = np.logical_or.reduce(na_masks, axis=0)

if na_rep is None and union_mask.any():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment on these cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments

# no na_rep means NaNs for all rows where any column has a NaN
Expand All @@ -2269,10 +2262,13 @@ def cat(self, others=None, sep=None, na_rep=None, join=None):
np.putmask(result, union_mask, np.nan)

not_masked = ~union_mask
result[not_masked] = cat_core(all_cols[not_masked], sep)
result[not_masked] = cat_core([x[not_masked] for x in all_cols],
sep)
elif na_rep is not None and union_mask.any():
# fill NaNs with na_rep in case there are actually any NaNs
result = cat_core(np.where(na_masks, na_rep, all_cols), sep)
all_cols = [np.where(nm, na_rep, col)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readability nit but can you change nm to na_mask? Makes it clearer what is going on (on initial glance misread nm to be something name related)

for nm, col in zip(na_masks, all_cols)]
result = cat_core(all_cols, sep)
else:
# no NaNs - can just concatenate
result = cat_core(all_cols, sep)
Expand Down