Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
caleb ~ $ cat sample_read_wrong.tsv
CAS Chemical A B C D E
7440-63-3 Xenon -0.0000023692 0.000000098454 -0.000000000048314 0.00000000000001953 -3.42E-018
7439-90-9 Krypton -0.000000792 0.000000102624 -0.000000000055428 0.00000000000002187 -3.69E-018
7440-37-1 Argon 0.0000016196 0.000000081279 -0.000000000041263 0.00000000000001668 -2.76E-018
7440-01-9 Neon 0.0000023014 0.000000122527 -0.000000000097141 0.00000000000005386 -1.103E-017
In [2]: pd.read_csv('sample_read_wrong.tsv',delimiter='\t')
Out[2]:
CAS Chemical A ... C D E
0 7440-63-3 Xenon -2.369200e-06 ... -4.831400e-11 1.950000e-14 -3.420000e-18
1 7439-90-9 Krypton -7.920000e-07 ... -5.542800e-11 2.180000e-14 -3.690000e-18
2 7440-37-1 Argon 1.619600e-06 ... -4.126300e-11 1.660000e-14 -2.760000e-18
3 7440-01-9 Neon 2.301400e-06 ... -9.714100e-11 5.380000e-14 -1.103000e-17
[4 rows x 7 columns]
Problem description
pandas no longer correctly reads a CSV file containing numbers that look like "0.00000000000001953". In this case, pandas parses that value as 1.950000e-14 instead of 1.953000e-14, a clear loss of precision. This appears to be a bug in the new "high precision" floating-point parser that was made the default in pandas 1.2.0.
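To reproduce this without the file on disk, here is a minimal sketch that inlines the same rows and compares column D across the `float_precision` settings that `read_csv` accepts. The `TSV` constant, the `read_d` helper, and the `expected` list are names made up for this example. The reference values come from Python's own `float()`, and the `round_trip` parser should match them exactly. Whether the default parser matches depends on the installed pandas version, so no output is shown.

```python
import io

import pandas as pd

# Same rows as sample_read_wrong.tsv, inlined so the snippet needs no file.
TSV = (
    "CAS\tChemical\tA\tB\tC\tD\tE\n"
    "7440-63-3\tXenon\t-0.0000023692\t0.000000098454\t-0.000000000048314\t0.00000000000001953\t-3.42E-018\n"
    "7439-90-9\tKrypton\t-0.000000792\t0.000000102624\t-0.000000000055428\t0.00000000000002187\t-3.69E-018\n"
    "7440-37-1\tArgon\t0.0000016196\t0.000000081279\t-0.000000000041263\t0.00000000000001668\t-2.76E-018\n"
    "7440-01-9\tNeon\t0.0000023014\t0.000000122527\t-0.000000000097141\t0.00000000000005386\t-1.103E-017\n"
)


def read_d(float_precision):
    """Parse the sample with the given float_precision and return column D."""
    df = pd.read_csv(
        io.StringIO(TSV), delimiter="\t", float_precision=float_precision
    )
    return df["D"].tolist()


# Reference values: the column D tokens converted by Python's float().
expected = [
    float(token)
    for token in (
        "0.00000000000001953",
        "0.00000000000002187",
        "0.00000000000001668",
        "0.00000000000005386",
    )
]

# None selects the default parser; "legacy" and "round_trip" are the alternatives.
results = {name: read_d(name) for name in (None, "legacy", "round_trip")}

for name, values in results.items():
    status = "matches float()" if values == expected else "differs from float()"
    print(f"float_precision={name!r}: {values} ({status})")
```

An exact `==` comparison is strict on purpose: the `legacy` parser may differ from `float()` in the last bit even when it is working as intended, whereas the loss reported here is in the third significant digit.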
Expected Output
0 7440-63-3 Xenon -2.369200e-06 ... -4.831400e-11 1.953000e-14 -3.420000e-18
1 7439-90-9 Krypton -7.920000e-07 ... -5.542800e-11 2.187000e-14 -3.690000e-18
2 7440-37-1 Argon 1.619600e-06 ... -4.126300e-11 1.668000e-14 -2.760000e-18
3 7440-01-9 Neon 2.301400e-06 ... -9.714100e-11 5.386000e-14 -1.103000e-17
[4 rows x 7 columns]
Output of pd.show_versions()
INSTALLED VERSIONS
commit : 9d598a5
python : 3.8.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.7.0-2-amd64
Version : #1 SMP Debian 5.7.10-1 (2020-07-26)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8
pandas : 1.2.1
numpy : 1.19.2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.1.1
setuptools : 51.3.3
Cython : 0.29.21
pytest : 6.1.2
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.52.0
I was able to work around this by passing the {'float_precision': 'legacy'} option to read_csv, but this is not great behavior: old versions of the library I wrote that hit this issue will silently break.
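Because the failure is silent, a guard in downstream code may help. The sketch below applies the `float_precision='legacy'` workaround, then re-reads the same text as strings and checks one column against Python's `float()`. `read_csv_checked`, `SAMPLE`, and the `rel_tol` default are hypothetical names and values chosen for this illustration, not part of pandas. It assumes the column has no missing values and that the caller does not pass `dtype` itself.

```python
import io
import math

import pandas as pd


def read_csv_checked(text, column, rel_tol=1e-9, **read_csv_kwargs):
    """Hypothetical helper: parse CSV text, then verify one float column.

    The text is parsed twice: once normally and once with dtype=str so the
    original tokens are available for comparison against float().
    """
    parsed = pd.read_csv(io.StringIO(text), **read_csv_kwargs)
    raw = pd.read_csv(io.StringIO(text), dtype=str, **read_csv_kwargs)

    for got, token in zip(parsed[column], raw[column]):
        want = float(token)
        # Tolerate last-bit differences, but reject lost significant digits.
        if not math.isclose(got, want, rel_tol=rel_tol, abs_tol=0.0):
            raise ValueError(
                f"column {column!r}: parsed {got!r} from {token!r}, "
                f"expected {want!r}"
            )
    return parsed


# Two rows from the report, cut down to the affected column.
SAMPLE = (
    "Chemical\tD\n"
    "Xenon\t0.00000000000001953\n"
    "Krypton\t0.00000000000002187\n"
)

df = read_csv_checked(SAMPLE, "D", delimiter="\t", float_precision="legacy")
print(df)
```

Passing `float_precision='round_trip'` instead of `'legacy'` is another option; it is generally slower but uses Python's own string-to-float conversion.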