Skip to content

[SPARK-43241][PS] MultiIndex.append not checking names for equality #42787

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

itholic
Copy link
Contributor

@itholic itholic commented Sep 4, 2023

What changes were proposed in this pull request?

This PR proposes to fix the behavior of MultiIndex.append to do not checking names.

Why are the changes needed?

To match the behavior with pandas according to pandas-dev/pandas#48288

Does this PR introduce any user-facing change?

Yes, the behavior is changed to match with pandas:

Testing data

>>> psmidx1
MultiIndex([('a', 'x', 1),
            ('b', 'y', 2),
            ('c', 'z', 3)],
           names=['x', 'y', 'z'])
>>> psmidx2
MultiIndex([('a', 'x', 1),
            ('b', 'y', 2),
            ('c', 'z', 3)],
           names=['p', 'q', 'r'])

Before

>>> psmidx1.append(psmidx2)
MultiIndex([('a', 'x', 1),
            ('b', 'y', 2),
            ('c', 'z', 3),
            ('a', 'x', 1),
            ('b', 'y', 2),
            ('c', 'z', 3)],
           names=['x', 'y', 'z'])

After

>>> psmidx1.append(psmidx2)
MultiIndex([('a', 'x', 1),
            ('b', 'y', 2),
            ('c', 'z', 3),
            ('a', 'x', 1),
            ('b', 'y', 2),
            ('c', 'z', 3)],
           )

How was this patch tested?

Fix the existing UTs.

Was this patch authored or co-authored using generative AI tooling?

No.

internal = InternalFrame(
spark_frame=sdf_appended,
index_spark_columns=[
scol_for(sdf_appended, col) for col in self._internal.index_spark_column_names
],
index_names=index_names,
index_names=None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simply set the index_names to None to follow the behavior of Pandas, since Pandas doesn't keep the name of MultiIndex when computing the append from Pandas 2.0.0. (See pandas-dev/pandas#48288 more detail)

cc @zhengruifeng @HyukjinKwon as CI passed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a bug in Pandas that might be fixed in the future?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's an intentional behavior since they mentioned this in "Bug fixes" section in their release note?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so I thought it was a bug in Pandas that is fixed in the Pandas 2.0.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we also mention this in our migration doc?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I misunderstood the Pandas PR. LTGM

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we also mention this in our migration doc?

Hmm.. I didn't mention this as a behavior change since it's a bug fix, but on second thought maybe we'd better to mention in the migration guide anyway.

Let me create a follow-up for updating the migration guide.

@zhengruifeng
Copy link
Contributor

merged to master

zhengruifeng pushed a commit that referenced this pull request Sep 5, 2023
### What changes were proposed in this pull request?

This follow-ups for #42787 to update the migration guide.

### Why are the changes needed?

We should mention all the behavior changes in migration guide.

### Does this PR introduce _any_ user-facing change?

No. it's documentation update

### How was this patch tested?

The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #42811 from itholic/43241-migration.

Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
@itholic itholic deleted the SPARK-43241 branch November 20, 2023 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants