BUG(string dtype): Arithmetic operations between Series with string dtype index #61425

Open
@rhshadrach

Similar to #61099, but concerning lhs + rhs. Alignment in general is heavily involved here as well. One thing to note is that, unlike in comparison operations, in arithmetic operations the lhs.index dtype is favored, assuming no coercion is necessary.

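import numpy as np
import pandas as pd
import pyarrow as pa

# Candidate index dtypes; pick any two by position below.
dtypes = [
    np.dtype(object),
    pd.StringDtype("pyarrow", na_value=np.nan),
    pd.StringDtype("python", na_value=np.nan),
    pd.StringDtype("pyarrow", na_value=pd.NA),
    pd.StringDtype("python", na_value=pd.NA),
    pd.ArrowDtype(pa.string()),
]
# Same labels, including one missing value, stored with two different dtypes.
idx1 = pd.Series(["a", np.nan, "b"], dtype=dtypes[1])
idx2 = pd.Series(["a", np.nan, "b"], dtype=dtypes[3])
df1 = pd.DataFrame({"idx": idx1, "value": [1, 2, 3]}).set_index("idx")
df2 = pd.DataFrame({"idx": idx2, "value": [1, 2, 3]}).set_index("idx")
# Run both operand orders to compare alignment and the resulting index dtype.
print(df1["value"] + df2["value"])
print(df2["value"] + df1["value"])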

For string dtypes, I've observed the following:

  • NaN vs NA generally aligns; the value propagated is always NA
  • NaN vs NA does not align when the NA arises from ArrowExtensionArray
  • NaN vs None (object) aligns; the value propagated is from lhs
  • NA vs None does not align
  • PyArrow-NA + ArrowExtensionArray results in object dtype (NAs do align)
  • Python-NA + PyArrow-NA results in PyArrow-NA; contrary to the left being preferred
  • Python-NA + PyArrow-NA results in object type (NAs do align)
  • When lhs and rhs have indices that are both object dtype:
    • NaN vs None aligns and propagates the lhs value.
    • NA vs None does not align
    • NA vs NaN does not align

I think the main two things we need to decide are:

  1. How should NA, NaN, and None align?
  2. When they do align, which value should be propagated?

A few properties I think are crucial:

  • Alignment should only depend on value and left-vs-right operand, not storage.
  • Alignment should be transitive.
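
To make the transitivity concern concrete, here is a minimal sketch that models the string-dtype observations listed earlier as a plain relation and checks it for transitivity. It does not call pandas; the names `OBSERVED_ALIGNS`, `aligns`, and `transitivity_violations` are hypothetical, and treating the observed pairs as symmetric and as one combined relation across dtype combinations is an assumption of the sketch.

```python
from itertools import permutations

# Pairs reported above as aligning for string dtypes. Treating them as
# symmetric, and as one relation across dtypes, is an assumption of this sketch.
OBSERVED_ALIGNS = {
    frozenset({"NaN", "NA"}),    # "NaN vs NA generally aligns"
    frozenset({"NaN", "None"}),  # "NaN vs None (object) aligns"
}
VALUES = ("None", "NaN", "NA")


def aligns(a, b):
    """Return True if the two missing-value kinds are matched as one label."""
    return a == b or frozenset({a, b}) in OBSERVED_ALIGNS


def transitivity_violations():
    """List (a, b, c) where a~b and b~c hold but a~c does not."""
    return [
        (a, b, c)
        for a, b, c in permutations(VALUES, 3)
        if aligns(a, b) and aligns(b, c) and not aligns(a, c)
    ]


print(transitivity_violations())
```

Under these assumptions the check reports the chain through NaN: None aligns with NaN and NaN aligns with NA, yet NA vs None does not align.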

If we do decide on aligning between different values, a natural order is None < NaN < NA. However, the most backwards-compatible option would be to have None vs NaN be operand-dependent, with NA always propagating when present.
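
The two propagation policies above can be compared side by side in a small sketch. This is a pure-Python model, not pandas behavior: `RANK`, `propagate_by_rank`, and `propagate_backwards_compatible` are hypothetical names, and reading "operand dependent" as "keep the lhs value" is an assumption based on the observation that NaN vs None currently propagates from lhs.

```python
# Rank for the order None < NaN < NA.
RANK = {"None": 0, "NaN": 1, "NA": 2}


def propagate_by_rank(lhs, rhs):
    """Propagate whichever missing value ranks higher, regardless of side."""
    return lhs if RANK[lhs] >= RANK[rhs] else rhs


def propagate_backwards_compatible(lhs, rhs):
    """NA wins whenever present; otherwise keep the lhs value (assumed)."""
    if "NA" in (lhs, rhs):
        return "NA"
    return lhs


for lhs, rhs in [("None", "NaN"), ("NaN", "None"), ("None", "NA")]:
    print(
        f"{lhs} + {rhs}:",
        "rank ->", propagate_by_rank(lhs, rhs),
        "| backwards-compatible ->", propagate_backwards_compatible(lhs, rhs),
    )
```

The policies differ only when None is on the left and NaN on the right: the ranked order yields NaN, while the backwards-compatible variant keeps None.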

Metadata

Labels

  • API - Consistency: Internal Consistency of API/Behavior
  • Bug
  • Needs Discussion: Requires discussion from core team before further action
  • Strings: String extension data type and string data
