Skip to content

BUG: Check for duplicate names columns and index in crosstab #28474

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 41 commits into from

Conversation

cuchoi
Copy link

@cuchoi cuchoi commented Sep 17, 2019

Problem: Currently if you give two Series with the same name to pd.crosstab it will ignore one of the columns. By minimally modifying a current test, it no longer passes.

Replicable example of the issue (more examples in the tests of this PR):

import pandas as pd
s1 = pd.Series(range(3), name='foo')
s2 = s1 + 1

expected = pd.crosstab(s1, s2.rename("bar")) # If we rename one of the series we get the correct result
result = pd.crosstab(s1, s2)

pd.testing.assert_frame_equal(result, expected)

Proposed solution (implemented in this PR):

  • Raise an exception if the user tries to use duplicate column names, duplicated index, or a name is shared between the index and columns.

Updated (implemented) proposal:
- Find which names are duplicated and add a number to them
- If a name is shared between rows and columns, add a "_row" and "_col" to them.

Alternative solutions:

  • Append a number without checking to each name.
    • Pro: Simpler code (don't have to look for duplicates nor know how many times a name is duplicated).
    • Con: Might be confusing for the user to see an error in column "column_name_3", specially if their names already had numbers in them.

@WillAyd
Copy link
Member

WillAyd commented Sep 17, 2019

Hmm I might be overlooking it but why don't we want to allow this? One of the names is for the index and one for the columns, no?

@WillAyd WillAyd added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Sep 17, 2019
@WillAyd
Copy link
Member

WillAyd commented Sep 17, 2019

By the way not sure if we want to do this...but in any case this is a good first PR!

@cuchoi
Copy link
Author

cuchoi commented Sep 17, 2019

Can also make it support the behaviour if that's what we want (there is a dictionary key being overwritten when using duplicate names).

Do you prefer a solution that appends an _x, _y or one that just works our of the box?

@TomAugspurger
Copy link
Contributor

IIUC, the problem was us overwriting a key in some internal dictionary we're building up the results in.

So if we wanted to use custom keys in that dictionary like "index" and "columns", that'd be fine. Then we build up the resulting DataFrame for the user and ensure that the names are correct (not using "index" and "columns" anymore).

@cuchoi
Copy link
Author

cuchoi commented Sep 18, 2019

Yes, that's the issue. Although adding an index_ and columns_ names is not enough because row names might also be duplicated (and column names as well).

Proposal

  1. If the column/row name is a string concat a counter_id to make them unique.
  2. Send the transformed DataFrame to pivot_table()*.
  3. Remove the ids.

*: A quick test showed me that pivot table doesn't support duplicated column names.

Aside from strings and tuples, what other column name types are supported? Any inmutable object? Any ideas of what to do in those cases? One "solution" is to convert them to strings, but it doesn't seem right.

@TomAugspurger
Copy link
Contributor

I think our only requirement on labels is that they're hashable.

Is it possible to just use our own names when index or columns is passed as an array with a name? Sorry, I'm not too familiar with this section of the code.

@cuchoi
Copy link
Author

cuchoi commented Sep 20, 2019

That's a good idea. So we create a mapping from our names to the names the user uses, send the DataFrame with our names to pivot_table and when it comes back we use the new names.

Our names could be the string representation of user's names (+ a dedup string) so if an error occurs they can check it.

Will give it a shot.

@cuchoi
Copy link
Author

cuchoi commented Sep 20, 2019

Built a solution that achieves the following:

  • Adds a _0, _1, etc. to the duplicated names in the order the user provided it if the names is duplicated within rows or within columns.
  • Adds _row or _col if the name is shared between columns and rows.
  • Returns back the names of the columns the user provided.
  • Only modifies the columns that are duplicated within rows/columns or shared across rows/columns. The objective of this is that if pivot_table() raises an error referring to a column or row, the user will see (unless duplicated) the names they gave us.

Notes:

  • _get_duplicate_count() could be replaced by Counter() with some small refactoring. Wasn't sure is there is a preference to avoid importing it. Only saw Counter() being used in the tests.
  • Added tests for each case and assert to check that is returning the original column names.
  • If this solution makes sense I can add tests for the helper functions.

Happy to do any changes needed.

@cuchoi
Copy link
Author

cuchoi commented Sep 25, 2019

Anything I can do to move this PR forward?

@WillAyd
Copy link
Member

WillAyd commented Nov 7, 2019

@cuchoi if you merge master and repush will take another look

@jbrockmendel
Copy link
Member

@cuchoi can you merge master

@cuchoi
Copy link
Author

cuchoi commented Nov 18, 2019

Done, thanks for taking the time to review!

@cuchoi
Copy link
Author

cuchoi commented May 30, 2020

Green

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a whatsnew note in reshaping bug fix section for 1.1

assert result.index.names == ["foo", "foo"]
assert result.columns.names == ["bar_col"]

# Column names duplicated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you split this into a separate test (and rename both appropriately)

Copy link
Author

@cuchoi cuchoi Jul 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want "duplicated column names" as a separate test? In that case, I think it makes more sense to do three tests:

  • One for when the name is shared between rows and columns
  • Duplicated row names
  • Duplicated column names

Makes sense?

The other alternative is just renaming the test to test_crosstab_duplicated_row_and_col_names

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the test. Let me know if you prefer it split into 3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes prefer this to be split


Parameters
----------
names : list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of creating uniques names, can we simply map input names -> position (which by definition are unique); still returning a dict and renaming is all ok.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can. My only issue with that if there are any errors when creating the DataFrame (line 618) or calling df.pivot_table (line 628) then the user would see unrecognizable column names for those errors.

@jreback
Copy link
Contributor

jreback commented Jun 9, 2020

@cuchoi pretty close on this, just a couple of comments.

@jreback
Copy link
Contributor

jreback commented Jul 17, 2020

can u merge master and resolve conflicts

@simonjayhawkins
Copy link
Member

@jreback rebased + green if you wanted to get this in. I've had a brief look here, but having a long history and quite a bit of code added not in a position to say if this is ready.

@jreback
Copy link
Contributor

jreback commented Jul 17, 2020

thanks @simonjayhawkins yeah needs a look again will delay to 1.2

@jreback jreback added this to the 1.2 milestone Jul 17, 2020
@simonjayhawkins
Copy link
Member

@cuchoi can you move release note to 1.2

@cuchoi
Copy link
Author

cuchoi commented Aug 2, 2020

Moved to 1.2. Anything else that I can do?

@WillAyd
Copy link
Member

WillAyd commented Sep 10, 2020

@cuchoi can you fix the merge conflict and see if you can get CI green?

@cuchoi
Copy link
Author

cuchoi commented Sep 11, 2020

I merged and I am getting an error on "pandas.tests.io.test_parquet.TestParquetPyArrow"
AssertionError: DataFrame are different DataFrame shape mismatch [left]: (3, 2) [right]: (0, 0)

Not sure, but I don't think it is related to any changes I made. Is there a way to re run a test?

@jbrockmendel
Copy link
Member

That test failure is unrelated, dont worry about it

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls merge master as well

assert result.index.names == ["foo", "foo"]
assert result.columns.names == ["bar_col"]

# Column names duplicated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes prefer this to be split

# to prevent issues with duplicate columns/row names. GH Issue: #22529
shared_col_row_names = set(rownames).intersection(set(colnames))
row_names_mapper, unique_row_names = _build_names_mapper(
rownames, shared_col_row_names, "row"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we hae the _build_names take rownames, colnams and 'row/col' ? i think a bit easier to understand (do the intersection inside), OR maybe just return all 4 elements (row_names_mapper, unique_row_names, col_names_mapper, uniques). either way, whatever is simpler.

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cuchoi if you address @jreback comments & merge master it sounds like this is close to going in

@simonjayhawkins
Copy link
Member

@cuchoi if you address @jreback comments & merge master it sounds like this is close to going in

@jreback jreback removed this from the 1.2 milestone Nov 18, 2020
@jreback
Copy link
Contributor

jreback commented Nov 18, 2020

nice PR, but moving off 1.2 as it needs to be updated for comments.

@arw2019
Copy link
Member

arw2019 commented Nov 22, 2020

Closing in favor of #37997

@arw2019 arw2019 closed this Nov 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Crosstab Not Working with Duplicate Column Labels
8 participants