BUG: duplicated() returns different results on consecutive runs! #46864

Open
@emsi

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

# Unfortunately I have to provide a downloadable DataFrame example (51 MB).
# I guess either the size or the encoding of the data contributes to the problem.
# The dataset can be downloaded like this:
# !wget https://github.com/emsi/artifacts/raw/master/sources.jl

import joblib as jl
import pandas as pd

df = jl.load("sources.jl")

# Multiple calls to duplicated() yield different results!
pd.Series([len(df[df.duplicated()]) for _ in range(500)]).describe()

Issue Description

When I call duplicated() on a given DataFrame, it returns different results from one call to the next, as if it had some stochasticity built in. The number of duplicates found differs: sometimes it's 5507, sometimes it's 5509. I reproduced this on another computer as well, because I initially suspected a CPU/RAM problem.

>>> pd.Series([len(df[df.duplicated()]) for _ in range(500)]).describe()

count     500.000000
mean     5507.884000
std         0.957276
min      5507.000000
25%      5507.000000
50%      5507.000000
75%      5509.000000
max      5509.000000
dtype: float64
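
To narrow this down, it may help to look at which rows change their flag between calls rather than only at the total count. The following is a minimal sketch, not something taken from pandas itself: `unstable_duplicate_rows` is a hypothetical helper name, and the small `demo` frame is only a stand-in for the real data loaded from sources.jl.

```python
import numpy as np
import pandas as pd


def unstable_duplicate_rows(df: pd.DataFrame, runs: int = 20) -> pd.DataFrame:
    """Return the rows whose duplicated() flag differs between repeated calls.

    Hypothetical helper for illustration; `runs` must be at least 1.
    """
    # Stack one boolean mask per call: shape (runs, len(df)).
    masks = np.vstack([df.duplicated().to_numpy() for _ in range(runs)])
    # A row is unstable if it was flagged in some calls but not in all of them.
    flips = masks.any(axis=0) & ~masks.all(axis=0)
    return df[flips]


# Small stand-in frame; on the real data this would be jl.load("sources.jl").
demo = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": ["x", "x", "y", "z", "x"]})
unstable = unstable_duplicate_rows(demo)
print(len(unstable))
```

On a frame where duplicated() is deterministic the result is empty. On the dataset above, any rows it returns would be the ones worth inspecting (dtype, encoding, missing values).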

Expected Behavior

I'd expect duplicated() to produce reproducible results. :)
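
One possible way to get an independent reference count is to hash every row and look for repeated hashes. This is only a sketch under assumptions: `duplicated_via_row_hash` is a hypothetical helper, the `demo` frame stands in for the real data, and agreement with DataFrame.duplicated() is only expected up to hash collisions and the way `pandas.util.hash_pandas_object` treats the column dtypes.

```python
import pandas as pd
from pandas.util import hash_pandas_object


def duplicated_via_row_hash(df: pd.DataFrame) -> pd.Series:
    """Flag repeated rows by hashing each row, then finding repeated hashes.

    Hypothetical cross-check for illustration, not the pandas implementation.
    """
    # One uint64 hash per row, computed from the column values only.
    row_hashes = hash_pandas_object(df, index=False)
    # Series.duplicated() marks every occurrence after the first as True.
    return row_hashes.duplicated()


# Small stand-in frame; on the real data this would be jl.load("sources.jl").
demo = pd.DataFrame({"a": [1, 1, 2, 2, 3], "b": ["x", "x", "y", "z", "x"]})
reference = duplicated_via_row_hash(demo)
print(int(reference.sum()), int(demo.duplicated().sum()))
```

If the hash-based count stays constant across repeated calls while df.duplicated() does not, that would point at the multi-column duplicate detection rather than at the data itself.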

Installed Versions

INSTALLED VERSIONS

commit : 4bfe3d0
python : 3.9.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-100-generic
Version : #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.2
numpy : 1.19.2
pytz : 2021.3
dateutil : 2.8.2
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.01.0
gcsfs : None
markupsafe : 2.0.1
matplotlib : 3.5.1
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
snappy : None
sqlalchemy : None
tables : 3.7.0
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None

Labels

Bug, duplicated
