Description
Code Sample
from random import randint

import pandas as pd

def bug_report(n=2000000, idmax=22750, prodmax=3414341):
    ids = [randint(1, idmax) for _ in range(n)]
    r = lambda: randint(1, prodmax)
    # Two identical tuples guarantee at least one product with count 2.
    prods = [(-1,-2,-3), (-1,-2,-3)] + [(r(), r(), r()) for _ in range(n-2)]
    df = pd.DataFrame({'ids': ids, 'products': prods})
    counts = df['products'].value_counts()
    counts_idxs = counts[counts >= 2].index
    # Expected: True for every row whose product appears at least twice.
    idxs = df['products'].isin(counts_idxs)
    return df[idxs]
Problem description
There are several ways to trigger the bug, all of them resulting in isin returning all False where some entries should be True.
Take the example above: the tuple (-1,-2,-3) is repeated twice, and it can be checked that the filtered counts and counts_idxs contain 2 and (-1,-2,-3), respectively. Then, independently of the rest of the products, the dataset obtained by selecting with the idxs from isin should have at least 2 rows. Calling the function as is, it does not. Explanation, causes and possible solutions below:
Manually importing from pandas.core.algorithms import isin and setting idxs = isin(df['products'], counts[counts >= 2].index) results in the exact same behaviour.
I've tried to reproduce this behaviour without using tuples at all, and I have not been able to.
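For reference, the mask that isin is expected to return can be computed without pandas. This is only a minimal sketch of the expected semantics, not pandas' implementation; the helper name expected_mask and the small prods list are hypothetical and chosen for illustration.

```python
from collections import Counter

def expected_mask(prods, min_count=2):
    """Return one bool per item: True if the item occurs at least min_count times."""
    counts = Counter(prods)  # plays the role of value_counts()
    repeated = {p for p, c in counts.items() if c >= min_count}  # plays the role of counts_idxs
    return [p in repeated for p in prods]  # what isin should return

prods = [(-1, -2, -3), (-1, -2, -3), (4, 5, 6), (7, 8, 9)]
mask = expected_mask(prods)
print(mask)  # [True, True, False, False]
```

Whatever the remaining products are, the first two entries of the mask must be True, so the filtered dataframe cannot be empty.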
Proposed solution
This seems to be a regression in 0.20.x, as the latest 0.19.x (0.19.2) works perfectly fine. Indeed, manually copying isin from 0.19.x and using it instead of the 0.20.x version works. One can see that a particular if was reversed/erased between
https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L414
and
https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/algorithms.py#L144
https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/algorithms.py#L161
As a result, 0.20.x relies on numpy.in1d, whereas 0.19.x used lib.ismember, which is equivalent to htable.ismember_object in 0.20.x. One can confirm this because:
import numpy as np
import pandas

htable = pandas._libs.hashtable
idxs = htable.ismember_object(df['products'].values, np.asarray(counts[counts >= 2].index))
df[idxs]
works fine, whereas
idxs = np.in1d(df['products'].values, np.asarray(counts[counts >= 2].index))
df[idxs]
silently fails.
Now, either this is temporarily fixed in pandas by not relying on in1d, or an issue is submitted to numpy (which I will do once I can take a look at in1d and see what's happening). Alternatively, one can work around it by not using tuples at all, for example by applying hash beforehand.
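The hash workaround could look roughly like the following. This is a sketch under assumptions, not a proposed patch: the function name rows_with_repeated_products and its parameters are hypothetical, and it relies on Python's built-in hash, so two distinct tuples may in principle collide.

```python
import pandas as pd

def rows_with_repeated_products(df, column="products", min_count=2):
    # Replace each tuple by its integer hash so that isin only compares scalars.
    # Distinct tuples can share a hash, so treat this as a stopgap only.
    keys = df[column].map(hash)
    counts = keys.value_counts()
    repeated = counts[counts >= min_count].index
    return df[keys.isin(repeated)]

df = pd.DataFrame({
    "ids": [1, 2, 3],
    "products": [(-1, -2, -3), (-1, -2, -3), (4, 5, 6)],
})
result = rows_with_repeated_products(df)
print(len(result))
```

Because the comparison happens on integers rather than on tuples, the code path that mishandles tuples should not be reached, at the cost of possible false positives on hash collisions.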
I've narrowed the problem down a bit more, and it is related not only to n but also to prodmax:
Any combination with n > 1000000 and prodmax > 1986 produces an empty dataframe:
bug_report(n=1000001, prodmax=1987)
bug_report(n=1000001)
bug_report()
Whereas having n <= 1000000 or prodmax <= 1986 works just fine. Parameter values have been deduced from:
n: from https://github.com/pandas-dev/pandas/blob/master/pandas/core/algorithms.py#L414
prodmax: by binary search:
def narrow():
    # Binary search for the largest prodmax that still returns a non-empty result.
    start = 256
    end = 2048
    while start + 1 < end:
        print(start, end)
        df = bug_report(n=1000001, prodmax=(start + end) // 2)
        if df.empty:
            end = (start + end) // 2
        else:
            start = (start + end) // 2
    return start, df.empty
narrow()
# (1896, False)
Output of pd.show_versions()
pandas: 0.20.3
pytest: 3.0.5
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.1.0
sphinx: 1.5.1
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.1
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.2
bs4: 4.5.3
html5lib: None
sqlalchemy: 1.1.5
pymysql: None
psycopg2: None
jinja2: 2.9.4
s3fs: None
pandas_gbq: None
pandas_datareader: None
This has been confirmed and tested on multiple PCs and environments, always with Python 3.x.