Description
Pandas version checks
-
I have checked that this issue has not already been reported.
-
I have confirmed this bug exists on the latest version of pandas.
-
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import os
# fetch the data, manually fixing data types and rounding errors. 2500 rows, 4 columns, 291kB
df = pd.read_json('https://pastebin.com/raw/3LMwkLTa')
df.DataDate = pd.to_datetime(df.DataDate * 1e6)
df.Expiration = pd.to_datetime(df.Expiration * 1e6)
df.Mid = round(df.Mid * 1000) / 1000
dfcols = list(df.columns.values)
# clean out the temp file if re-running
fn = 'testjson.h5'
try:
h5json.close()
except NameError:
pass
try:
os.unlink(fn)
except FileNotFoundError:
pass
h5json = pd.HDFStore(fn)
# write dataframe in existing order
df.to_hdf(h5json, 'df1', format='t', append=False, data_columns=True)
# write dataframe in different row order
df.sample(frac=1.0, random_state=4).to_hdf(h5json, 'df2', format='t', append=False, data_columns=True)
# GOOD: verify that the dataframes are indeed the same, once sorted deterministically
df1 = pd.read_hdf(h5json, 'df1')
df2 = pd.read_hdf(h5json, 'df2')
assert df1.sort_values(dfcols).equals(df2.sort_values(dfcols)), "The two dataframes, once sorted deterministically, are not identical"
# variables for constructing the search terms
keydate1 = "2005-04-01"
keydate2 = "2005-11-12"
where1=f'(UnderlyingSymbol="AAPL") & (DataDate <="{keydate1}") & (Expiration >= "{keydate1}")'
where2=f'(UnderlyingSymbol="AAPL") & (DataDate <="{keydate2}") & (Expiration >= "{keydate2}")'
# GOOD: using the first where term, the query results are the same
df1 = pd.read_hdf(h5json, 'df1', where=where1)
df2 = pd.read_hdf(h5json, 'df2', where=where1)
assert df1.sort_values(dfcols).equals(df2.sort_values(dfcols)), "The two dataframes selected on where1 are not identical"
# GOOD: using the second where term, the query results are the same
df1 = pd.read_hdf(h5json, 'df1', where=where2)
df2 = pd.read_hdf(h5json, 'df2', where=where2)
assert df1.sort_values(dfcols).equals(df2.sort_values(dfcols)), "The two dataframes selected on where2 are not identical"
# BAD: using the where terms ORed together, the results are DIFFERENT
where1or2 = f'({where1})|({where2})'
df1 = pd.read_hdf(h5json, 'df1', where=where1or2)
df2 = pd.read_hdf(h5json, 'df2', where=where1or2)
print(f'df1.shape: {df1.shape}, df2.shape: {df2.shape}')
assert df1.sort_values(dfcols).equals(df2.sort_values(dfcols)), "The two dataframes selected on (where1) OR (where2) are not identical"
Issue Description
When doing nested queries of the form
read_hdf(..., where='( (cond1) & (cond2) & (cond3) ) | ( (cond4) & (cond5) & (cond6) )')
the results can be different depending on the order in which the data is written.
There are some preconditions for this bug:
- it depends on the data. I cannot reproduce it with made up data (though I tried) but I can reliably reproduce it with the reduced real world data set used in the example. The bug seems sensitive to maybe the data type (in this case, dates, strings and floats) and the original order
- it only manifests when doing nested queries, i.e. ( term AND term AND term ) OR ( term AND term AND term)
Apologies that this is vague and the dataset is relatively large for a test case. I did try lots of things to reproduce it with a small dataset, but I could not.
The data included is a subset of some historical options data. The smallest I could get that was reproducible was 2500 rows x 4 columns. To make this bug copy-paste-able, the data set is loaded from a JSON representation of a data set hosted on pastebin. For posterity. it is also uploaded here: 3LMwkLTa.zip
For background, this bug arises because queries can be slow over large datasets, so I try combining them for performance reasons. It's not feasible to read the whole dataset into memory in order to deterministically sort it.
Expected Behavior
The search results should be the same, once deterministically sorted. The data is the same (as verified by .sort_values().equals()
) and so the same search terms over the same data should return the same results.
Installed Versions
INSTALLED VERSIONS
commit : 8dab54d
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-110-generic
Version : #124-Ubuntu SMP Thu Apr 14 19:46:19 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.5.2
numpy : 1.23.3
pytz : 2022.2.1
dateutil : 2.8.2
setuptools : 59.6.0
pip : 22.0.2
Cython : 0.29.32
pytest : None
hypothesis : None
sphinx : 5.3.0
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.1
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 7.34.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : None
brotli : None
fastparquet : 2022.12.0
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.6.2
numba : None
numexpr : 2.8.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : 2022.11.0
scipy : 1.9.1
snappy : None
sqlalchemy : None
tables : 3.8.0
tabulate : None
xarray : 2022.12.0
xlrd : None
xlwt : None
zstandard : None
tzdata : None