Skip to content

ENH: Allow keep="all" in duplicated method (#23251) #23252

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.24.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -194,6 +194,7 @@ Other Enhancements
- :meth:`Index.to_frame` now supports overriding column name(s) (:issue:`22580`).
- New attribute :attr:`__git_version__` will return git commit sha of current build (:issue:`21295`).
- Compatibility with Matplotlib 3.0 (:issue:`22790`).
- :meth:`DataFrame.duplicated`, :meth:`Series.duplicated`, :meth:`Index.duplicated`, :meth:`MultiIndex.duplicated` now accept ``keep='all'`` to keep all duplicated values (:issue:`23251`)

.. _whatsnew_0240.api_breaking:

Expand Down
4 changes: 2 additions & 2 deletions pandas/_libs/hashtable_func_helper.pxi.in
Original file line number Diff line number Diff line change
Expand Up @@ -145,8 +145,8 @@ def duplicated_{{dtype}}({{scalar}}[:] values, object keep='first'):

kh_resize_{{ttype}}(table, min(n, _SIZE_HINT_LIMIT))

if keep not in ('last', 'first', False):
raise ValueError('keep must be either "first", "last" or False')
if keep not in ('last', 'first', 'all', False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change the RHS of this statement to be a set instead of a tuple? Also can you check if we are testing this raises somewhere already?

raise ValueError('keep must be either "first", "last", "all" or False')

if keep == 'last':
{{if dtype == 'object'}}
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -786,7 +786,7 @@ def duplicated(values, keep='first'):
occurrence.
- ``last`` : Mark duplicates as ``True`` except for the last
occurrence.
- False : Mark all duplicates as ``True``.
- ``all``, False : Mark all duplicates as ``True``.

Returns
-------
Expand Down
3 changes: 3 additions & 0 deletions pandas/core/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -1252,6 +1252,9 @@ def drop_duplicates(self, keep='first', inplace=False):
if self.is_unique:
return self._shallow_copy()

if keep not in ['first', 'last', False]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw your comment in the OP but having a hard time understanding here - what's the reason you need to add these checks and cannot include all as part of them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is breaking the symmetry between drop_duplicates and duplicated. Since False has the opposite effect in drop_duplicates (drop all dups) and duplicated (keep all dups). So the 'all' keyword has been blocked from the drop_duplicates because it wouldn't do what you expect. Lower down in the hashtable helpers the functions can then keep their symmetry if all is blocked at the top level, which is what this piece of code is supposed to do.

To differentiate, could raise a specific error relating to keep='all'. That probably be clearer as to why the statement was there.

raise ValueError('keep must be either "first", "last" or False')

duplicated = self.duplicated(keep=keep)
result = self[np.logical_not(duplicated)]
if inplace:
Expand Down
5 changes: 4 additions & 1 deletion pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -4414,6 +4414,9 @@ def drop_duplicates(self, subset=None, keep='first', inplace=False):
if self.empty:
return self.copy()

if keep not in ['first', 'last', False]:
raise ValueError('keep must be either "first", "last" or False')

inplace = validate_bool_kwarg(inplace, 'inplace')
duplicated = self.duplicated(subset, keep=keep)

Expand All @@ -4439,7 +4442,7 @@ def duplicated(self, subset=None, keep='first'):
first occurrence.
- ``last`` : Mark duplicates as ``True`` except for the
last occurrence.
- False : Mark all duplicates as ``True``.
- ``all``, False : Mark all duplicates as ``True``.

Returns
-------
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/indexes/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -4592,7 +4592,7 @@ def duplicated(self, keep='first'):
occurrence.
- 'last' : Mark duplicates as ``True`` except for the last
occurrence.
- ``False`` : Mark all duplicates as ``True``.
- 'all', ``False`` : Mark all duplicates as ``True``.

Examples
--------
Expand Down
2 changes: 1 addition & 1 deletion pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -1600,7 +1600,7 @@ def duplicated(self, keep='first'):
occurrence.
- 'last' : Mark duplicates as ``True`` except for the last
occurrence.
- ``False`` : Mark all duplicates as ``True``.
- 'all', ``False`` : Mark all duplicates as ``True``.

Examples
--------
Expand Down
2 changes: 2 additions & 0 deletions pandas/tests/frame/test_duplicates.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ def test_duplicated_do_not_fail_on_wide_dataframes():
@pytest.mark.parametrize('keep, expected', [
('first', Series([False, False, True, False, True])),
('last', Series([True, True, False, False, False])),
('all', Series([True, True, True, False, True])),
(False, Series([True, True, True, False, True]))
])
def test_duplicated_keep(keep, expected):
Expand All @@ -60,6 +61,7 @@ def test_duplicated_keep(keep, expected):
@pytest.mark.parametrize('keep, expected', [
('first', Series([False, False, True, False, True])),
('last', Series([True, True, False, False, False])),
('all', Series([True, True, True, False, True])),
(False, Series([True, True, True, False, True]))
])
def test_duplicated_nan_none(keep, expected):
Expand Down
3 changes: 2 additions & 1 deletion pandas/tests/indexes/multi/test_duplicates.py
Original file line number Diff line number Diff line change
Expand Up @@ -214,14 +214,15 @@ def f(a):
@pytest.mark.parametrize('keep, expected', [
('first', np.array([False, False, False, True, True, False])),
('last', np.array([False, True, True, False, False, False])),
('all', np.array([False, True, True, True, True, False])),
(False, np.array([False, True, True, True, True, False]))
])
def test_duplicated(idx_dup, keep, expected):
result = idx_dup.duplicated(keep=keep)
tm.assert_numpy_array_equal(result, expected)


@pytest.mark.parametrize('keep', ['first', 'last', False])
@pytest.mark.parametrize('keep', ['first', 'last', 'all', False])
def test_duplicated_large(keep):
# GH 9125
n, k = 200, 5000
Expand Down
2 changes: 2 additions & 0 deletions pandas/tests/series/test_duplicates.py
Original file line number Diff line number Diff line change
Expand Up @@ -119,6 +119,7 @@ def test_drop_duplicates_bool(keep, expected):
@pytest.mark.parametrize('keep, expected', [
('first', Series([False, False, True, False, True], name='name')),
('last', Series([True, True, False, False, False], name='name')),
('all', Series([True, True, True, False, True], name='name')),
(False, Series([True, True, True, False, True], name='name'))
])
def test_duplicated_keep(keep, expected):
Expand All @@ -131,6 +132,7 @@ def test_duplicated_keep(keep, expected):
@pytest.mark.parametrize('keep, expected', [
('first', Series([False, False, True, False, True])),
('last', Series([True, True, False, False, False])),
('all', Series([True, True, True, False, True])),
(False, Series([True, True, True, False, True]))
])
def test_duplicated_nan_none(keep, expected):
Expand Down