Fix MultiIndex.difference not working with PyArrow timestamps (#61382), and some formatting fixes #61388
Problem
The MultiIndex.difference method fails to remove entries when the index contains PyArrow-backed timestamps (timestamp[ns][pyarrow]). This occurs because direct tuple comparisons with PyArrow scalar types are unreliable during membership checks, causing entries to remain unexpectedly.
Solution
Code Conversion: Map the values of other (the index being subtracted) to integer codes compatible with the original index's levels.
Engine Validation: Use the MultiIndex's internal engine for membership checks, ensuring accurate handling of PyArrow types.
Mask-Based Exclusion: Create a boolean mask to filter out matched entries, then reconstruct the index.
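The three steps above can be sketched with public pandas APIs only. This is not the code from the pull request: the function name difference_via_codes is hypothetical, a plain Python set of code tuples stands in for the MultiIndex's internal engine, missing values are not handled, and the usage example uses ordinary datetime64 values so that it runs without PyArrow.

```python
import numpy as np
import pandas as pd


def difference_via_codes(left: pd.MultiIndex, other: pd.MultiIndex) -> pd.MultiIndex:
    """Hypothetical helper: drop from `left` every entry that occurs in `other`."""
    # Step 1: map each level of `other` to integer codes in `left`'s levels.
    # get_indexer returns -1 for values that the level does not contain.
    other_codes = [
        left.levels[i].get_indexer(other.get_level_values(i))
        for i in range(left.nlevels)
    ]
    # A row of `other` with any -1 code cannot occur in `left`.
    valid = np.all([codes >= 0 for codes in other_codes], axis=0)

    # Step 2: membership check on integer code tuples instead of on scalars
    # (a set is used here in place of the internal engine).
    other_keys = {tuple(row) for row in np.column_stack(other_codes)[valid].tolist()}

    # Step 3: boolean mask that keeps only the unmatched entries of `left`.
    left_rows = np.column_stack(left.codes).tolist()
    mask = np.array([tuple(row) not in other_keys for row in left_rows], dtype=bool)
    return left[mask]


left = pd.MultiIndex.from_arrays(
    [pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"]), [1, 2, 3]],
    names=["date", "id"],
)
result = difference_via_codes(left, left[:1])
print(result.tolist())
```

Comparing integer codes avoids comparing the level values themselves, which is the property the description relies on for PyArrow scalars.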
Testing
Added a test in pandas/tests/indexes/multi/test_setops.py that:
Creates a MultiIndex with PyArrow timestamps.
Validates difference correctly removes entries.
Skips the test if PyArrow is not installed.
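A test following the outline above might look roughly like this. The test name and the data are assumptions and do not reproduce the test added in the pull request; pytest.importorskip provides the skip when PyArrow is missing, and pandas._testing is the helper module the pandas test suite imports as tm.

```python
import pandas as pd
import pandas._testing as tm
import pytest


def test_difference_with_pyarrow_timestamps():
    # Hypothetical test name; skipped automatically when pyarrow is absent.
    pytest.importorskip("pyarrow")

    # Build a MultiIndex whose first level is timestamp[ns][pyarrow].
    dates = pd.Series(
        pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-03"])
    ).astype("timestamp[ns][pyarrow]")
    mi = pd.MultiIndex.from_arrays([dates, [1, 2, 3]], names=["date", "id"])

    # Removing the first entry should leave the remaining two.
    result = mi.difference(mi[:1])
    expected = mi[1:]
    tm.assert_index_equal(result, expected)
```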
Use Case Impact
Fixes scenarios where users filter hierarchical datasets indexed by PyArrow timestamps.
Closes #61382.