Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# Unfortunately I have to provide a downloadable DataFrame example (51 MB).
# I guess either the size or the encoding of the data contributes to the problem.
# The dataset can be downloaded like this:
# !wget https://github.com/emsi/artifacts/raw/master/sources.jl
import joblib as jl
import pandas as pd

df = jl.load("sources.jl")
# Multiple calls to duplicated() yield different results!
pd.Series([len(df[df.duplicated()]) for _ in range(500)]).describe()
Issue Description
When I call duplicated() on a given DataFrame, it returns different results from call to call, as if it had some stochasticity built in. The number of duplicates found differs: sometimes it's 5507, sometimes it's 5509. I even replicated this on another computer, as I suspected a CPU/RAM problem.
>>> pd.Series([len(df[df.duplicated()]) for _ in range(500)]).describe()
count 500.000000
mean 5507.884000
std 0.957276
min 5507.000000
25% 5507.000000
50% 5507.000000
75% 5509.000000
max 5509.000000
dtype: float64
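The counts alone do not show which rows flip between runs. A minimal sketch of how that could be narrowed down is below. The helper name `unstable_duplicated_rows` and the `n_runs` parameter are made up for illustration and are not part of pandas. The small `demo` frame is synthetic and is not expected to reproduce the problem; it only shows the call.

```python
import numpy as np
import pandas as pd


def unstable_duplicated_rows(df: pd.DataFrame, n_runs: int = 20) -> pd.Index:
    """Return index labels whose duplicated() flag differs between repeated calls.

    Hypothetical helper, not a pandas API.
    """
    # The first call serves as the reference mask.
    baseline = df.duplicated().to_numpy()
    changed = np.zeros(len(df), dtype=bool)
    for _ in range(n_runs):
        # Mark every row whose flag disagrees with the reference mask.
        changed |= df.duplicated().to_numpy() != baseline
    return df.index[changed]


# Synthetic frame for illustration only (assumption: it behaves deterministically).
demo = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "y", "z"]})
flipping = unstable_duplicated_rows(demo)
print(len(flipping))
```

On the real data, `df.loc[unstable_duplicated_rows(df)]` would list the rows whose flag is not stable, which might help to build a much smaller reproducer than the 51 MB file.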
Expected Behavior
I'd expect duplicated() to produce reproducible results. :)
Installed Versions
INSTALLED VERSIONS
commit : 4bfe3d0
python : 3.9.0.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-100-generic
Version : #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.4.2
numpy : 1.19.2
pytz : 2021.3
dateutil : 2.8.2
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.7.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 8.0.1
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.01.0
gcsfs : None
markupsafe : 2.0.1
matplotlib : 3.5.1
numba : None
numexpr : 2.8.1
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.7.3
snappy : None
sqlalchemy : None
tables : 3.7.0
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : None
zstandard : None