str.cat does not align on index? #18657

Closed
@h-vetinari

The implicit index-matching of pandas for operations between different DataFrame/Series is great, and most of the time it just works. It does so consistently enough that my expectation is that different Series will be aligned before an operation is performed.

For some reason, str.cat does not seem to do so.

import pandas as pd # 0.21.0
import numpy as np # 1.13.3
col = pd.Series(['a','b','c','d','e','f','g','h','i','j'])

# choose random subsets
ss1 = [8, 1, 2, 0, 6] # list(col.sample(5).index) 
ss2 = [4, 0, 9, 2, 6] # list(col.sample(5).index)

# perform str.cat
col.loc[ss1].str.cat(col.loc[ss2], sep = '').sort_index()
# 0    ac <-- UNMATCHED!
# 1    ba <-- UNMATCHED!
# 2    cj <-- UNMATCHED!
# 6    gg <-- correct by sheer luck
# 8    ie <-- UNMATCHED!

# compare this, for example, with Boolean operations on unmatched Series
# (which match indices and return a Series with the union of both indices):
# str.cat is inconsistent with that behaviour!
b = col.loc[ss1].astype(bool) & col.loc[ss2].astype(bool)
b
# 0     True
# 1    False
# 2     True
# 4    False
# 6     True
# 8    False
# 9    False

# if we manually align the Series
# (easy here by masking from the Series we just subsampled, hard in practice),
# then the NaNs are handled as expected:
m = col.where(np.isin(col.index, ss1)).str.cat(col.where(np.isin(col.index, ss2)), sep = '')
m
# 0     aa
# 1    NaN
# 2     cc
# 3    NaN
# 4    NaN
# 5    NaN
# 6     gg
# 7    NaN
# 8    NaN
# 9    NaN

# based on the normal pandas behaviour for unmatched Series
# (for example the Boolean "and" above), the following would be
# the expected result of col.loc[ss1].str.cat(col.loc[ss2], sep = '').sort_index():
m.loc[b.index]
# 0     aa <-- MATCHED!
# 1    NaN
# 2     cc <-- MATCHED!
# 4    NaN
# 6     gg <-- MATCHED!
# 8    NaN
# 9    NaN
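
A possible workaround, sketched here under the assumption that aligning by hand before concatenating is acceptable: `Series.align` with `join='outer'` reindexes both subsets to the union of their indices, so `str.cat` then pairs equal labels. The names `left`, `right` and `aligned` are only illustrative.

```python
import pandas as pd

col = pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
ss1 = [8, 1, 2, 0, 6]
ss2 = [4, 0, 9, 2, 6]

# Align both subsets on the union of their indices; labels missing
# from one side are filled with NaN on that side.
left, right = col.loc[ss1].align(col.loc[ss2], join='outer')

# Both operands now share one index, so concatenation pairs equal labels.
# Rows where either side is NaN come out as NaN (na_rep is left unset).
aligned = left.str.cat(right, sep='').sort_index()
print(aligned)
```

This should give the same labels and values as `m.loc[b.index]` above, without having to go back to the full `col` to build the masks.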

Labels: API Design, Strings
