BUG: Reading CSV files with numbers with multiple leading zeros loses a lot of precision #39514

Open
@CalebBell

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

caleb ~ $ cat sample_read_wrong.tsv
CAS	Chemical	A	B	C	D	E
7440-63-3	Xenon	-0.0000023692	0.000000098454	-0.000000000048314	0.00000000000001953	-3.42E-018
7439-90-9	Krypton	-0.000000792	0.000000102624	-0.000000000055428	0.00000000000002187	-3.69E-018
7440-37-1	Argon	0.0000016196	0.000000081279	-0.000000000041263	0.00000000000001668	-2.76E-018
7440-01-9	Neon	0.0000023014	0.000000122527	-0.000000000097141	0.00000000000005386	-1.103E-017

In [1]: import pandas as pd

In [2]: pd.read_csv('sample_read_wrong.tsv', delimiter='\t')
Out[2]: 
         CAS Chemical             A  ...             C             D             E
0  7440-63-3    Xenon -2.369200e-06  ... -4.831400e-11  1.950000e-14 -3.420000e-18
1  7439-90-9  Krypton -7.920000e-07  ... -5.542800e-11  2.180000e-14 -3.690000e-18
2  7440-37-1    Argon  1.619600e-06  ... -4.126300e-11  1.660000e-14 -2.760000e-18
3  7440-01-9     Neon  2.301400e-06  ... -9.714100e-11  5.380000e-14 -1.103000e-17

[4 rows x 7 columns]

Problem description

Pandas no longer correctly reads a CSV file containing numbers such as "0.00000000000001953". In this case, Pandas reads the value as 1.950000e-14 instead of 1.953000e-14, a clear loss of precision: every value in column D is truncated to three significant digits. This appears to be a bug in the new "high precision" floating point parsing engine that was made the default in Pandas 1.2.0.
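The same check can be run without a file on disk. The following is a minimal sketch, not part of the original report: it parses one of the column D values from an in-memory string with each `float_precision` setting and compares the result against Python's built-in `float()`. The helper name `parse_with` and the use of `float()` as the reference value are assumptions made for illustration. The `'legacy'` option requires pandas 1.2.0 or later.

```python
import io

import pandas as pd

# One of the values from column D of the sample file above.
TEXT = "0.00000000000001953"
REFERENCE = float(TEXT)  # Python's own parse, used here as the baseline


def parse_with(float_precision):
    """Parse TEXT through read_csv's C engine with the given float_precision."""
    df = pd.read_csv(
        io.StringIO(f"D\n{TEXT}\n"),
        engine="c",
        float_precision=float_precision,
    )
    return float(df["D"].iloc[0])


# None selects whatever parser is the default in the installed pandas version.
results = {name: parse_with(name) for name in (None, "high", "legacy", "round_trip")}

for name, value in results.items():
    rel_err = abs(value - REFERENCE) / REFERENCE
    print(f"float_precision={name!r:14} parsed={value!r:24} rel_err={rel_err:.3e}")
```

On an affected version, the rows for `None` and `'high'` should show a relative error far above floating point rounding noise, while `'legacy'` and `'round_trip'` stay close to the reference.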

Expected Output

0  7440-63-3    Xenon -2.369200e-06  ... -4.831400e-11  1.953000e-14 -3.420000e-18
1  7439-90-9  Krypton -7.920000e-07  ... -5.542800e-11  2.187000e-14 -3.690000e-18
2  7440-37-1    Argon  1.619600e-06  ... -4.126300e-11  1.668000e-14 -2.760000e-18
3  7440-01-9     Neon  2.301400e-06  ... -9.714100e-11  5.386000e-14 -1.103000e-17

[4 rows x 7 columns]

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 9d598a5
python : 3.8.7.final.0
python-bits : 64
OS : Linux
OS-release : 5.7.0-2-amd64
Version : #1 SMP Debian 5.7.10-1 (2020-07-26)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_CA.UTF-8
LOCALE : en_CA.UTF-8

pandas : 1.2.1
numpy : 1.19.2
pytz : 2020.5
dateutil : 2.8.1
pip : 20.1.1
setuptools : 51.3.3
Cython : 0.29.21
pytest : 6.1.2
hypothesis : None
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.1
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.5 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.52.0

I was able to work around this by passing the {'float_precision': 'legacy'} option to read_csv, but this is not great behavior: already-released versions of the library I wrote that hit this issue will silently break.
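A sketch of how that workaround might be applied, assuming pandas 1.2.0 or later (where `'legacy'` is an accepted value for `float_precision`). `SAMPLE` is a trimmed copy of two rows and one numeric column from the file above, and the names `SAMPLE` and `read_kwargs` are illustrative only.

```python
import io

import pandas as pd

# Trimmed subset of the sample file: two rows, column D only.
SAMPLE = (
    "CAS\tChemical\tD\n"
    "7440-63-3\tXenon\t0.00000000000001953\n"
    "7439-90-9\tKrypton\t0.00000000000002187\n"
)

# Ask the C engine for the pre-1.2.0 float parser explicitly.
read_kwargs = {"float_precision": "legacy"}
df = pd.read_csv(io.StringIO(SAMPLE), delimiter="\t", **read_kwargs)

# Print with full repr so truncated digits would be visible.
for chemical, value in zip(df["Chemical"], df["D"]):
    print(chemical, repr(value))
```

Passing `float_precision='round_trip'` is another documented option that avoids the default parser, at some cost in parsing speed.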

Labels

Bug; IO CSV (read_csv, to_csv); Regression (functionality that used to work in a prior pandas version)