
Fix MultiIndex.difference not working with PyArrow timestamps (#61382), and some formatting fixes #61388


Closed
wants to merge 1 commit into from


NEREUScode

Problem

The MultiIndex.difference method fails to remove entries when the index contains PyArrow-backed timestamps (timestamp[ns][pyarrow]). This occurs because direct tuple comparisons with PyArrow scalar types are unreliable during membership checks, causing entries to remain unexpectedly.

Example:

# PyArrow timestamp index
df = DataFrame(...).astype({"date": "timestamp[ns][pyarrow]"}).set_index(["id", "date"])
idx_val = df.index[0]
new_index = df.index.difference([idx_val])  # Fails to remove idx_val

Solution
Code Conversion: Map the values in other (the argument passed to difference) to integer codes compatible with the original index's levels.

Engine Validation: Use the MultiIndex's internal engine for membership checks, ensuring accurate handling of PyArrow types.

Mask-Based Exclusion: Create a boolean mask to filter out matched entries, then reconstruct the index.
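
The three steps above can be sketched with public pandas APIs. This is a minimal illustration, not the code from this PR: the helper name `difference_via_codes` is hypothetical, it compares codes with NumPy broadcasting instead of the MultiIndex's internal engine, and it uses plain `datetime64` values so that it runs without PyArrow. Unlike `MultiIndex.difference`, it keeps the original order and does not drop duplicates.

```python
import numpy as np
import pandas as pd


def difference_via_codes(mi: pd.MultiIndex, other) -> pd.MultiIndex:
    """Hypothetical helper: drop the entries of `mi` that also appear in `other`."""
    other = pd.MultiIndex.from_tuples(list(other), names=mi.names)

    # Step 1 (code conversion): express each level of `other` in terms of
    # the integer codes that `mi` uses for that level (-1 means "not found").
    other_codes = np.column_stack(
        [
            level.get_indexer(other.get_level_values(i))
            for i, level in enumerate(mi.levels)
        ]
    )
    # A row with any -1 cannot match an entry of `mi`, so discard it.
    other_codes = other_codes[(other_codes >= 0).all(axis=1)]

    # Step 2 (membership check): compare code rows instead of raw scalars.
    # The PR describes using the internal engine here; broadcasting is a
    # simpler stand-in that is only suitable for small inputs.
    own_codes = np.column_stack([np.asarray(codes) for codes in mi.codes])
    matched = (
        (own_codes[:, None, :] == other_codes[None, :, :]).all(axis=2).any(axis=1)
    )

    # Step 3 (mask-based exclusion): keep only the unmatched entries.
    return mi[~matched]


mi = pd.MultiIndex.from_arrays(
    [[1, 1, 2], pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"])],
    names=["id", "date"],
)
remaining = difference_via_codes(mi, [mi[0]])
```

Comparing integer codes sidesteps scalar equality entirely, which is why the approach does not depend on how the underlying timestamp type implements comparison.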

Testing
Added a test in pandas/tests/indexes/multi/test_setops.py that:

Creates a MultiIndex with PyArrow timestamps.

Validates that difference correctly removes entries.

Skips the test if PyArrow is not installed.
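
A test along these lines might look like the sketch below. The actual test added in this PR is not shown in the description, so the test name and the sample data are assumptions; only the `timestamp[ns][pyarrow]` dtype, the `["id", "date"]` index, and the `difference([idx_val])` call come from the example above. `pytest.importorskip` is used here for the skip; the PR may use a different mechanism.

```python
import pandas as pd
import pytest


def test_difference_with_pyarrow_timestamps():
    # Skip when the optional PyArrow dependency is not installed.
    pytest.importorskip("pyarrow")

    # Hypothetical sample data (the PR description elides the real frame).
    df = pd.DataFrame(
        {
            "id": [1, 1, 2],
            "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-01"]),
            "value": [10, 20, 30],
        }
    ).astype({"date": "timestamp[ns][pyarrow]"})
    mi = df.set_index(["id", "date"]).index

    idx_val = mi[0]
    result = mi.difference([idx_val])

    # Exactly one entry should be removed; on versions affected by #61382
    # this assertion is expected to fail.
    assert len(result) == len(mi) - 1
```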

Use Case Impact
Fixes scenarios where users filter hierarchical datasets with PyArrow timestamps, such as:

# Remove specific timestamps from a time-series index
clean_index = raw_index.difference(unwanted_timestamps)

Closes #61382.

NEREUScode closed this on May 2, 2025
Development

Successfully merging this pull request may close these issues.

BUG: Multindex difference not working on columns with type Timestamp[ns][pyarrow]