Skip to content

BUG: Check for duplicate names columns and index in crosstab #28474

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 41 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
41 commits
Select commit Hold shift + click to select a range
a748ccd
BUG: Check for duplicate names in columns and index when calling cros…
cuchoi Jun 7, 2019
c5430ab
Updated test for duplicated names in crosstab
cuchoi Jun 9, 2019
5762bb6
Flake8 compliance
cuchoi Jun 9, 2019
989bd5a
Black formatting
cuchoi Sep 17, 2019
672bd4b
Creates name mapping to overcome duplicated columns issues in pd.cros…
cuchoi Sep 20, 2019
1b67514
Removed debug prints
cuchoi Sep 20, 2019
4f2fa86
Removed empty line
cuchoi Sep 20, 2019
fbff182
Improved comments
cuchoi Sep 20, 2019
6071db5
String manipulation compatible with <3.6
cuchoi Sep 20, 2019
3d9c632
Replaced custom method with _value_counts_arraylike
cuchoi Sep 22, 2019
d27233f
Resort imports
cuchoi Sep 24, 2019
478d8dc
Merge branch 'master' of https://github.com/pandas-dev/pandas into cr…
cuchoi Sep 24, 2019
1828a36
Merged master
cuchoi Nov 18, 2019
3e9333d
Added docstring and type hints to _build_names_mapper
cuchoi Nov 19, 2019
fe6f7fd
Sorted imports
cuchoi Nov 19, 2019
9f4a00a
Added types to unique_names
cuchoi Nov 19, 2019
788ac2b
Updated tests
cuchoi Nov 21, 2019
354a180
Merge branch 'master' into crosstab_dup_names
cuchoi Nov 21, 2019
d938a96
Sorted imports
cuchoi Nov 21, 2019
d60308c
Sorted imports
cuchoi Nov 21, 2019
b759073
Merged master
cuchoi Nov 21, 2019
cf63d4c
Merge remote-tracking branch 'upstream/master' into crosstab_dup_names
cuchoi Jan 2, 2020
94e8175
Merged master
cuchoi Jan 6, 2020
c6ccab1
Merge remote-tracking branch 'upstream/master' into crosstab_dup_names
cuchoi Jan 9, 2020
142ade0
Merged master
cuchoi Jan 26, 2020
710ab63
Merged master
cuchoi Feb 12, 2020
cad2dea
Improved comment
cuchoi Apr 11, 2020
f2741db
Merged master
cuchoi Apr 11, 2020
f0261e9
Updated tests
cuchoi Apr 11, 2020
2733c17
Removed pd. reference in tests
cuchoi Apr 11, 2020
f1b19a6
Removed pd. reference in tests
cuchoi Apr 11, 2020
d475edd
Merged to master
cuchoi May 29, 2020
85e010b
Added Set import from typing
cuchoi May 29, 2020
558493f
Merge remote-tracking branch 'upstream/master' into crosstab_dup_names
cuchoi Jul 13, 2020
3ed74d9
Renamed duplicated row/col names test
cuchoi Jul 13, 2020
e5c6cd9
Added what's new entry
cuchoi Jul 13, 2020
d16b840
Merge remote-tracking branch 'upstream/master' into crosstab_dup_names
simonjayhawkins Jul 17, 2020
32aa475
Moved release note to 1.2
cuchoi Aug 2, 2020
820945e
Merged master
cuchoi Sep 11, 2020
dc1d5c5
Updated reference to value_counts_arraylike
cuchoi Sep 11, 2020
0b002da
Merge branch 'master' into crosstab_dup_names
cuchoi Sep 14, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -341,11 +341,11 @@ Groupby/resample/rolling
Reshaping
^^^^^^^^^

- Bug in :meth:`DataFrame.crosstab` when duplicated row or column names were used (:issue:`22529`)
- Bug in :meth:`DataFrame.pivot_table` with ``aggfunc='count'`` or ``aggfunc='sum'`` returning ``NaN`` for missing categories when pivoted on a ``Categorical``. Now returning ``0`` (:issue:`31422`)
- Bug in :func:`union_indexes` where input index names are not preserved in some cases. Affects :func:`concat` and :class:`DataFrame` constructor (:issue:`13475`)
- Bug in func :meth:`crosstab` when using multiple columns with ``margins=True`` and ``normalize=True`` (:issue:`35144`)
- Bug in :meth:`DataFrame.agg` with ``func={'name':<FUNC>}`` incorrectly raising ``TypeError`` when ``DataFrame.columns==['Name']`` (:issue:`36212`)
-

Sparse
^^^^^^
Expand Down
95 changes: 82 additions & 13 deletions pandas/core/reshape/pivot.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,7 @@
List,
Optional,
Sequence,
Set,
Tuple,
Union,
cast,
Expand All @@ -19,6 +20,7 @@
from pandas.core.dtypes.common import is_integer_dtype, is_list_like, is_scalar
from pandas.core.dtypes.generic import ABCDataFrame, ABCSeries

import pandas.core.algorithms as algos
import pandas.core.common as com
from pandas.core.frame import _shared_docs
from pandas.core.groupby import Grouper
Expand Down Expand Up @@ -580,29 +582,39 @@ def crosstab(
b 0 1 0
c 0 0 0
"""
if values is None and aggfunc is not None:
raise ValueError("aggfunc cannot be used without values.")

if values is not None and aggfunc is None:
raise ValueError("values cannot be used without an aggfunc.")

index = com.maybe_make_list(index)
columns = com.maybe_make_list(columns)

rownames = _get_names(index, rownames, prefix="row")
colnames = _get_names(columns, colnames, prefix="col")

common_idx = None
pass_objs = [x for x in index + columns if isinstance(x, (ABCSeries, ABCDataFrame))]
if pass_objs:
common_idx = get_objs_combined_axis(pass_objs, intersect=True, sort=False)

data: Dict = {}
data.update(zip(rownames, index))
data.update(zip(colnames, columns))

if values is None and aggfunc is not None:
raise ValueError("aggfunc cannot be used without values.")
rownames = _get_names(index, rownames, prefix="row")
colnames = _get_names(columns, colnames, prefix="col")

if values is not None and aggfunc is None:
raise ValueError("values cannot be used without an aggfunc.")
# We create our own mapping of row and columns names
# to prevent issues with duplicate columns/row names. GH Issue: #22529
shared_col_row_names = set(rownames).intersection(set(colnames))
row_names_mapper, unique_row_names = _build_names_mapper(
rownames, shared_col_row_names, "row"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we hae the _build_names take rownames, colnams and 'row/col' ? i think a bit easier to understand (do the intersection inside), OR maybe just return all 4 elements (row_names_mapper, unique_row_names, col_names_mapper, uniques). either way, whatever is simpler.

)
col_names_mapper, unique_col_names = _build_names_mapper(
colnames, shared_col_row_names, "col"
)

from pandas import DataFrame

data = {
**dict(zip(unique_row_names, index)),
**dict(zip(unique_col_names, columns)),
}
df = DataFrame(data, index=common_idx)
original_df_cols = df.columns

Expand All @@ -615,8 +627,8 @@ def crosstab(

table = df.pivot_table(
["__dummy__"],
index=rownames,
columns=colnames,
index=unique_row_names,
columns=unique_col_names,
margins=margins,
margins_name=margins_name,
dropna=dropna,
Expand All @@ -635,6 +647,9 @@ def crosstab(
table, normalize=normalize, margins=margins, margins_name=margins_name
)

table = table.rename_axis(index=row_names_mapper, axis=0)
table = table.rename_axis(columns=col_names_mapper, axis=1)

return table


Expand Down Expand Up @@ -733,3 +748,57 @@ def _get_names(arrs, names, prefix: str = "row"):
names = list(names)

return names


def _build_names_mapper(
names: List[str], shared_col_row_names: Set[str], suffix: str
) -> Tuple[Dict[str, str], List[str]]:
"""
Given a list of row or column names, creates a mapper of unique names to
column/row names.

Parameters
----------
names : list
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of creating uniques names, can we simply map input names -> position (which by definition are unique); still returning a dict and renaming is all ok.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can. My only issue with that if there are any errors when creating the DataFrame (line 618) or calling df.pivot_table (line 628) then the user would see unrecognizable column names for those errors.

Names to be deduplicated.
shared_col_row_names : set or list
Values used both in rows and columns, so need additional deduplication.
suffix : str
Suffix to deduplicate values in shared_col_row_names

Returns
-------
names_mapper: dict
The keys are the unique names and the values are the original names.
unique_names: list
Unique names in the same order that names came in
"""
keys, counts = algos.value_counts_arraylike(names, dropna=False)
names_count = dict(zip(keys, counts))

names_mapper = {}
unique_names: List[str] = []
# We reverse the names so the numbers are in the order given by the user
for name in reversed(names):
mapped_name = name
name_count = names_count[name]

# Adds a number if name is duplicated within columns/rows
if name_count > 1:
mapped_name = "{mapped_name}_{name_count}".format(
mapped_name=mapped_name, name_count=name_count
)
names_count[name] -= 1

# Add suffix name if column is shared between column and rows
if name in shared_col_row_names:
mapped_name = "{mapped_name}_{suffix}".format(
mapped_name=mapped_name, suffix=suffix
)

names_mapper[mapped_name] = name

# Creates a list of the new names in the original order
unique_names.insert(0, mapped_name)

return names_mapper, unique_names
48 changes: 41 additions & 7 deletions pandas/tests/reshape/test_crosstab.py
Original file line number Diff line number Diff line change
Expand Up @@ -533,17 +533,51 @@ def test_crosstab_with_numpy_size(self):
)
tm.assert_frame_equal(result, expected)

def test_crosstab_dup_index_names(self):
# GH 13279
s = Series(range(3), name="foo")
def test_crosstab_duplicated_row_and_col_names(self):
# We test that duplicated row or column names do not produce issues
# GH Issue: #22529, GH#13279

result = crosstab(s, s)
expected_index = Index(range(3), name="foo")
expected = DataFrame(
np.eye(3, dtype=np.int64), index=expected_index, columns=expected_index
# Same name in both rows and columns
s1 = Series(range(3), name="foo")
s2 = s1 + 1
expected = crosstab(s1, s2.rename("bar")).rename_axis(
columns={"bar": "foo"}, axis=1
)
result = crosstab(s1, s2)
tm.assert_frame_equal(result, expected)
assert result.index.names == ["foo"]
assert result.columns.names == ["foo"]

# Row names duplicated
s1 = Series(range(3), name="foo")
s2 = s1 + 1
s3 = Series(range(3), name="bar_col")

expected = crosstab([s1, s2.rename("bar")], s3).rename_axis(
index={"bar": "foo"}, axis=0
)
result = crosstab([s1, s2], s3)

tm.assert_frame_equal(result, expected)

assert result.index.names == ["foo", "foo"]
assert result.columns.names == ["bar_col"]

# Column names duplicated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you split this into a separate test (and rename both appropriately)

Copy link
Author

@cuchoi cuchoi Jul 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want "duplicated column names" as a separate test? In that case, I think it makes more sense to do three tests:

  • One for when the name is shared between rows and columns
  • Duplicated row names
  • Duplicated column names

Makes sense?

The other alternative is just renaming the test to test_crosstab_duplicated_row_and_col_names

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the test. Let me know if you prefer it split into 3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes prefer this to be split

s1 = Series(range(3), name="foo")
s2 = s1 + 1
s3 = Series(range(3), name="bar_row")

expected = crosstab(s3, [s1, s2.rename("bar")]).rename_axis(
columns={"bar": "foo"}, axis=1
)
result = crosstab(s3, [s1, s2])

tm.assert_frame_equal(result, expected)

assert result.index.names == ["bar_row"]
assert result.columns.names == ["foo", "foo"]

@pytest.mark.parametrize("names", [["a", ("b", "c")], [("a", "b"), "c"]])
def test_crosstab_tuple_name(self, names):
s1 = Series(range(3), name=names[0])
Expand Down