-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
BUG: Check for duplicate names columns and index in crosstab #28474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hmm I might be overlooking it but why don't we want to allow this? One of the names is for the index and one for the columns, no? |
By the way not sure if we want to do this...but in any case this is a good first PR! |
Can also make it support the behaviour if that's what we want (there is a dictionary key being overwritten when using duplicate names). Do you prefer a solution that appends an _x, _y or one that just works our of the box? |
IIUC, the problem was us overwriting a key in some internal dictionary we're building up the results in. So if we wanted to use custom keys in that dictionary like "index" and "columns", that'd be fine. Then we build up the resulting DataFrame for the user and ensure that the names are correct (not using "index" and "columns" anymore). |
Yes, that's the issue. Although adding an index_ and columns_ names is not enough because row names might also be duplicated (and column names as well). Proposal
*: A quick test showed me that pivot table doesn't support duplicated column names. Aside from strings and tuples, what other column name types are supported? Any inmutable object? Any ideas of what to do in those cases? One "solution" is to convert them to strings, but it doesn't seem right. |
I think our only requirement on labels is that they're hashable. Is it possible to just use our own names when |
That's a good idea. So we create a mapping from our names to the names the user uses, send the DataFrame with our names to pivot_table and when it comes back we use the new names. Our names could be the string representation of user's names (+ a dedup string) so if an error occurs they can check it. Will give it a shot. |
Built a solution that achieves the following:
Notes:
Happy to do any changes needed. |
Anything I can do to move this PR forward? |
@cuchoi if you merge master and repush will take another look |
@cuchoi can you merge master |
Done, thanks for taking the time to review! |
Green |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a whatsnew note in reshaping bug fix section for 1.1
assert result.index.names == ["foo", "foo"] | ||
assert result.columns.names == ["bar_col"] | ||
|
||
# Column names duplicated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you split this into a separate test (and rename both appropriately)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want "duplicated column names" as a separate test? In that case, I think it makes more sense to do three tests:
- One for when the name is shared between rows and columns
- Duplicated row names
- Duplicated column names
Makes sense?
The other alternative is just renaming the test to test_crosstab_duplicated_row_and_col_names
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed the test. Let me know if you prefer it split into 3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes prefer this to be split
|
||
Parameters | ||
---------- | ||
names : list |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of creating uniques names, can we simply map input names -> position (which by definition are unique); still returning a dict and renaming is all ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can. My only issue with that if there are any errors when creating the DataFrame (line 618) or calling df.pivot_table (line 628) then the user would see unrecognizable column names for those errors.
@cuchoi pretty close on this, just a couple of comments. |
can u merge master and resolve conflicts |
@jreback rebased + green if you wanted to get this in. I've had a brief look here, but having a long history and quite a bit of code added not in a position to say if this is ready. |
thanks @simonjayhawkins yeah needs a look again will delay to 1.2 |
@cuchoi can you move release note to 1.2 |
Moved to 1.2. Anything else that I can do? |
@cuchoi can you fix the merge conflict and see if you can get CI green? |
I merged and I am getting an error on "pandas.tests.io.test_parquet.TestParquetPyArrow" Not sure, but I don't think it is related to any changes I made. Is there a way to re run a test? |
That test failure is unrelated, dont worry about it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls merge master as well
assert result.index.names == ["foo", "foo"] | ||
assert result.columns.names == ["bar_col"] | ||
|
||
# Column names duplicated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes prefer this to be split
# to prevent issues with duplicate columns/row names. GH Issue: #22529 | ||
shared_col_row_names = set(rownames).intersection(set(colnames)) | ||
row_names_mapper, unique_row_names = _build_names_mapper( | ||
rownames, shared_col_row_names, "row" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we hae the _build_names take rownames, colnams and 'row/col' ? i think a bit easier to understand (do the intersection inside), OR maybe just return all 4 elements (row_names_mapper, unique_row_names, col_names_mapper, uniques). either way, whatever is simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice PR, but moving off 1.2 as it needs to be updated for comments. |
Closing in favor of #37997 |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Problem: Currently if you give two Series with the same name to pd.crosstab it will ignore one of the columns. By minimally modifying a current test, it no longer passes.
Replicable example of the issue (more examples in the tests of this PR):
Proposed solution (implemented in this PR):
Raise an exception if the user tries to use duplicate column names, duplicated index, or a name is shared between the index and columns.Updated (implemented) proposal:
- Find which names are duplicated and add a number to them
- If a name is shared between rows and columns, add a "_row" and "_col" to them.
Alternative solutions: