BUG: Check for duplicate names columns and index in crosstab #28474

cuchoi · 2019-09-17T01:32:07Z

- closes Crosstab Not Working with Duplicate Column Labels #22529
- tests added / passed
- passes black pandas
- passes git diff upstream/master -u -- "*.py" | flake8 --diff
- whatsnew entry

Problem: Currently, if you pass two Series with the same name to pd.crosstab, it ignores one of them. An existing test stops passing after a minimal modification that gives its inputs duplicate names.

Replicable example of the issue (more examples in the tests of this PR):

import pandas as pd
s1 = pd.Series(range(3), name='foo')
s2 = s1 + 1

expected = pd.crosstab(s1, s2.rename("bar")) # If we rename one of the series we get the correct result
result = pd.crosstab(s1, s2)

pd.testing.assert_frame_equal(result, expected)

Proposed solution (implemented in this PR):

~~Raise an exception if the user tries to use duplicate column names, duplicated index, or a name is shared between the index and columns~~.

Updated (implemented) proposal:
- Find which names are duplicated and add a number to them
- If a name is shared between rows and columns, add a "_row" and "_col" to them.

Alternative solutions:

Append a number without checking to each name.
- Pro: Simpler code (don't have to look for duplicates nor know how many times a name is duplicated).
- Con: It might be confusing for the user to see an error in column "column_name_3", especially if their names already had numbers in them.

WillAyd · 2019-09-17T04:15:22Z

Hmm I might be overlooking it but why don't we want to allow this? One of the names is for the index and one for the columns, no?

WillAyd · 2019-09-17T04:16:10Z

By the way not sure if we want to do this...but in any case this is a good first PR!

cuchoi · 2019-09-17T16:31:34Z

Can also make it support the behaviour if that's what we want (there is a dictionary key being overwritten when using duplicate names).

Do you prefer a solution that appends an _x, _y or one that just works out of the box?

TomAugspurger · 2019-09-17T16:49:27Z

IIUC, the problem was us overwriting a key in some internal dictionary we're building up the results in.

So if we wanted to use custom keys in that dictionary like "index" and "columns", that'd be fine. Then we build up the resulting DataFrame for the user and ensure that the names are correct (not using "index" and "columns" anymore).
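A minimal sketch of the overwrite described above, using plain dictionaries. The helpers `collect_by_name` and `collect_by_role` are hypothetical stand-ins for illustration and are not taken from the pandas source; they only show why keying by name alone loses an input and how role-based keys avoid that for a name shared between rows and columns.

```python
# Hypothetical stand-ins for illustration; this is not the pandas source.

def collect_by_name(inputs):
    """Store each (name, values) pair under its name alone."""
    data = {}
    for name, values in inputs:
        data[name] = values  # a repeated name silently replaces the earlier entry
    return data


def collect_by_role(index_inputs, column_inputs):
    """Store each pair under a (role, name) key so rows and columns cannot clash."""
    data = {}
    for role, inputs in (("index", index_inputs), ("columns", column_inputs)):
        for name, values in inputs:
            data[(role, name)] = values
    return data


rows = [("foo", [0, 1, 2])]
cols = [("foo", [1, 2, 3])]

by_name = collect_by_name(rows + cols)
by_role = collect_by_role(rows, cols)

print(len(by_name))  # 1: the row values were overwritten by the column values
print(len(by_role))  # 2: both inputs survive
```

With this keying, two inputs that share a name within the same role would still collide, so role keys alone only cover the row/column case.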

cuchoi · 2019-09-18T04:35:55Z

Yes, that's the issue. However, adding index_ and columns_ prefixes to the names is not enough, because row names might themselves be duplicated (and so might column names).

Proposal

- If a column/row name is a string, concatenate a counter id to it so that every name is unique.
- Send the transformed DataFrame to pivot_table()*.
- Remove the ids.

*: A quick test showed me that pivot table doesn't support duplicated column names.

Aside from strings and tuples, what other column name types are supported? Any immutable object? Any ideas of what to do in those cases? One "solution" is to convert them to strings, but it doesn't seem right.

TomAugspurger · 2019-09-19T21:39:53Z

I think our only requirement on labels is that they're hashable.

Is it possible to just use our own names when index or columns is passed as an array with a name? Sorry, I'm not too familiar with this section of the code.
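On the hashability requirement mentioned above, a small sketch of what "hashable" admits in plain Python. The helper `is_hashable_label` is a hypothetical name used only here; it relies on the built-in `hash()` raising `TypeError` for unhashable objects, which also catches tuples that contain unhashable elements.

```python
def is_hashable_label(obj):
    """Return True if obj can be hashed, and so used as a dict key."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True


candidates = [
    "foo",              # str
    3,                  # int
    ("a", 1),           # tuple of hashables
    None,
    frozenset({"x"}),
    ["a", 1],           # list: unhashable
    {"a": 1},           # dict: unhashable
    (["a"], 1),         # tuple containing a list: unhashable
]
results = [is_hashable_label(c) for c in candidates]
```

So a string-only renaming scheme would not cover every label a user could pass, which is the concern raised about converting names to strings.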

cuchoi · 2019-09-20T00:02:10Z

That's a good idea. So we create a mapping from our names to the names the user uses, send the DataFrame with our names to pivot_table and when it comes back we use the new names.

Our names could be the string representation of user's names (+ a dedup string) so if an error occurs they can check it.

Will give it a shot.
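The round trip described above can be sketched from the outside with public pandas calls. This is only an illustration under stated assumptions: the placeholder names `row_0` and `col_0` and the `internal_to_user` dictionary are hypothetical, and the renaming is done by the caller here rather than inside `pd.crosstab`.

```python
import pandas as pd

s1 = pd.Series(range(3), name="foo")
s2 = s1 + 1  # the result keeps the name "foo"

# Hypothetical placeholder names mapped back to the names the user supplied.
internal_to_user = {"row_0": s1.name, "col_0": s2.name}

# Build the table using the placeholder names, which cannot collide.
internal = pd.crosstab(s1.rename("row_0"), s2.rename("col_0"))

# Restore the user's names on the finished table.
result = internal.rename_axis(
    index=internal_to_user["row_0"], columns=internal_to_user["col_0"]
)
```

Here `result` has both its index and its columns named "foo", while the intermediate table never held two entries with the same name.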

cuchoi · 2019-09-20T03:40:20Z

Built a solution that achieves the following:

- Adds _0, _1, etc. to duplicated names, in the order the user provided them, if a name is duplicated within rows or within columns.
- Adds _row or _col if a name is shared between columns and rows.
- Returns the result with the names the user provided.
- Only modifies names that are duplicated within rows/columns or shared across rows and columns. The goal is that if pivot_table() raises an error referring to a column or row, the user will see (unless duplicated) the names they gave us.
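The renaming rules listed above can be sketched as follows. This is a hypothetical reimplementation written for illustration, not the code in this PR: the function name `build_names_mapper`, its return shape, and the use of `str()` on each name are assumptions.

```python
from collections import Counter


def build_names_mapper(names, shared, suffix):
    """Return ({unique_name: original_name}, [unique_name, ...]) for one axis.

    names  : names in the order the user gave them
    shared : names that appear on both rows and columns
    suffix : "row" or "col"
    """
    counts = Counter(names)  # how often each name occurs on this axis
    seen = Counter()         # how many copies of each name were already renamed
    unique_names = []
    for name in names:
        new_name = str(name)
        if counts[name] > 1:
            # duplicated within this axis: number the copies in input order
            new_name = f"{new_name}_{seen[name]}"
            seen[name] += 1
        if name in shared:
            # shared between rows and columns: tag with the axis
            new_name = f"{new_name}_{suffix}"
        unique_names.append(new_name)
    return dict(zip(unique_names, names)), unique_names


rownames = ["foo", "foo", "bar"]
colnames = ["bar", "baz"]
shared = set(rownames).intersection(colnames)

row_mapper, unique_rows = build_names_mapper(rownames, shared, "row")
col_mapper, unique_cols = build_names_mapper(colnames, shared, "col")
```

Names that are neither duplicated nor shared ("baz" here) pass through unchanged. This sketch does not guard against a user name that already looks like a generated one, such as "foo_0".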

Notes:

- _get_duplicate_count() could be replaced by Counter() with some small refactoring. I wasn't sure if there is a preference to avoid importing it; I only saw Counter() being used in the tests.
- Added tests for each case and asserts to check that the original column names are returned.
- If this solution makes sense I can add tests for the helper functions.

Happy to do any changes needed.

pandas/core/reshape/pivot.py

cuchoi · 2019-09-25T21:24:04Z

Anything I can do to move this PR forward?

WillAyd · 2019-11-07T21:03:29Z

@cuchoi if you merge master and repush will take another look

jbrockmendel · 2019-11-18T21:26:53Z

@cuchoi can you merge master

cuchoi · 2019-11-18T22:13:30Z

Done, thanks for taking the time to review!

pandas/tests/reshape/test_pivot.py

pandas/core/reshape/pivot.py

cuchoi · 2020-05-30T01:02:54Z

Green

jreback

can you add a whatsnew note in reshaping bug fix section for 1.1

jreback · 2020-05-31T23:05:35Z

pandas/tests/reshape/test_crosstab.py

+        assert result.index.names == ["foo", "foo"]
+        assert result.columns.names == ["bar_col"]
+
+        # Column names duplicated


can you split this into a separate test (and rename both appropriately)

Do you want "duplicated column names" as a separate test? In that case, I think it makes more sense to do three tests:

One for when the name is shared between rows and columns

Duplicated row names

Duplicated column names

Makes sense?

The other alternative is just renaming the test to test_crosstab_duplicated_row_and_col_names

Renamed the test. Let me know if you prefer it split into 3.

yes prefer this to be split

pandas/core/reshape/pivot.py

jreback · 2020-05-31T23:08:44Z

pandas/core/reshape/pivot.py

+
+    Parameters
+    ----------
+    names : list


instead of creating uniques names, can we simply map input names -> position (which by definition are unique); still returning a dict and renaming is all ok.

We can. My only issue with that is that if there are any errors when creating the DataFrame (line 618) or calling df.pivot_table (line 628), the user would see unrecognizable column names in those errors.
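The position-based alternative discussed in this thread could look roughly like the sketch below. The helper `positional_names` and the `row_`/`col_` prefixes are assumptions made for illustration; the thread only specifies mapping input names to positions and returning a dict.

```python
def positional_names(rownames, colnames):
    """Map placeholder names built from positions back to the user's names."""
    row_mapper = {f"row_{i}": name for i, name in enumerate(rownames)}
    col_mapper = {f"col_{i}": name for i, name in enumerate(colnames)}
    return row_mapper, col_mapper


# Positions are unique by construction, so duplicates and shared names are fine,
# and non-string labels such as tuples need no special handling.
row_mapper, col_mapper = positional_names(["foo", "foo"], ["foo", ("a", 1)])

restored_rows = list(row_mapper.values())
restored_cols = list(col_mapper.values())
```

The trade-off raised in the reply is visible here: a key like "row_1" carries no trace of the user's "foo", so an error raised while the placeholders are in use would mention names the user never wrote.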

jreback · 2020-06-09T22:51:32Z

@cuchoi pretty close on this, just a couple of comments.

jreback · 2020-07-17T11:28:39Z

can u merge master and resolve conflicts

simonjayhawkins · 2020-07-17T14:02:01Z

@jreback rebased + green if you wanted to get this in. I've had a brief look here, but given the long history and quite a bit of added code, I'm not in a position to say if this is ready.

jreback · 2020-07-17T14:04:37Z

thanks @simonjayhawkins yeah needs a look again will delay to 1.2

simonjayhawkins · 2020-08-01T13:54:18Z

@cuchoi can you move release note to 1.2

cuchoi · 2020-08-02T21:17:26Z

Moved to 1.2. Anything else that I can do?

WillAyd · 2020-09-10T18:59:20Z

@cuchoi can you fix the merge conflict and see if you can get CI green?

cuchoi · 2020-09-11T18:04:08Z

I merged and I am getting an error on "pandas.tests.io.test_parquet.TestParquetPyArrow"
AssertionError: DataFrame are different DataFrame shape mismatch [left]: (3, 2) [right]: (0, 0)

Not sure, but I don't think it is related to any changes I made. Is there a way to re run a test?

jbrockmendel · 2020-09-11T18:05:42Z

That test failure is unrelated, don't worry about it

jreback

pls merge master as well

jreback · 2020-09-19T21:15:12Z

pandas/tests/reshape/test_crosstab.py

+        assert result.index.names == ["foo", "foo"]
+        assert result.columns.names == ["bar_col"]
+
+        # Column names duplicated


yes prefer this to be split

jreback · 2020-09-19T21:17:44Z

pandas/core/reshape/pivot.py

+    # to prevent issues with duplicate columns/row names. GH Issue: #22529
+    shared_col_row_names = set(rownames).intersection(set(colnames))
+    row_names_mapper, unique_row_names = _build_names_mapper(
+        rownames, shared_col_row_names, "row"


why don't we have _build_names take rownames, colnames and 'row/col'? i think that's a bit easier to understand (do the intersection inside), OR maybe just return all 4 elements (row_names_mapper, unique_row_names, col_names_mapper, uniques). either way, whatever is simpler.

arw2019

@cuchoi if you address @jreback comments & merge master it sounds like this is close to going in

simonjayhawkins · 2020-11-12T14:32:40Z

@cuchoi if you address @jreback comments & merge master it sounds like this is close to going in

jreback · 2020-11-18T18:34:03Z

nice PR, but moving off 1.2 as it needs to be updated for comments.

arw2019 · 2020-11-22T05:14:07Z

Closing in favor of #37997

cuchoi added 4 commits September 16, 2019 00:13

BUG: Check for duplicate names in columns and index when calling cros…

a748ccd

…stab

Updated test for duplicated names in crosstab

c5430ab

Flake8 compliance

5762bb6

Black formatting

989bd5a

cuchoi mentioned this pull request Sep 17, 2019

BUG: Check for duplicate names columns and index in crosstab #26717

Closed

4 tasks

WillAyd added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Sep 17, 2019

cuchoi added 4 commits September 19, 2019 23:28

Creates name mapping to overcome duplicated columns issues in pd.cros…

672bd4b

…stab()

Removed debug prints

1b67514

Removed empty line

4f2fa86

Improved comments

fbff182

String manipulation compatible with <3.6

6071db5

jreback requested changes Sep 20, 2019

pandas/core/reshape/pivot.py

cuchoi added 3 commits September 22, 2019 13:19

Replaced custom method with _value_counts_arraylike

3d9c632

Resort imports

d27233f

Merge branch 'master' of https://github.com/pandas-dev/pandas into cr…

478d8dc

…osstab_dup_names

Merged master

1828a36

jbrockmendel reviewed Nov 19, 2019

pandas/tests/reshape/test_pivot.py (outdated)

jbrockmendel reviewed Nov 19, 2019

pandas/core/reshape/pivot.py (outdated)

cuchoi added 2 commits May 29, 2020 17:54

Merged to master

d475edd

Added Set import from typing

85e010b

jreback requested changes May 31, 2020

cuchoi added 3 commits July 12, 2020 23:23

Merge remote-tracking branch 'upstream/master' into crosstab_dup_names

558493f

Renamed duplicated row/col names test

3ed74d9

Added what's new entry

e5c6cd9

Merge remote-tracking branch 'upstream/master' into crosstab_dup_names

d16b840

jreback added this to the 1.2 milestone Jul 17, 2020

Moved release note to 1.2

32aa475

cuchoi added 2 commits September 11, 2020 12:15

Merged master

820945e

Updated reference to value_counts_arraylike

dc1d5c5

Merge branch 'master' into crosstab_dup_names

0b002da

jreback requested changes Sep 19, 2020

arw2019 reviewed Oct 31, 2020

jreback removed this from the 1.2 milestone Nov 18, 2020

arw2019 mentioned this pull request Nov 22, 2020

BUG: crosstab with duplicate column or index labels #37997

Merged

5 tasks

arw2019 closed this Nov 22, 2020
