BUG: Pandas 1.1.3 read_csv raises a TypeError when dtype and index_col are provided and the file has >1M rows  #37094

Closed
@mgeplf

Description

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import numpy as np

ROWS = 1000001  #  <--------- with 1000000, it works

with open('out.dat', 'w') as fd:
    for i in range(ROWS):
        fd.write('%d\n' % i)

df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])

Problem description

When ROWS = 1000001, I get the following traceback:

Traceback (most recent call last):
  File "try.py", line 10, in <module>
    df = pd.read_csv('out.dat', names=['a'], dtype={'a': np.float64}, index_col=['a'])
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 686, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 458, in _read
    data = parser.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1196, in read
    ret = self._engine.read(nrows)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 2231, in read
    index, names = self._make_index(data, alldata, names)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1677, in _make_index
    index = self._agg_index(index)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1770, in _agg_index
    arr, _ = self._infer_types(arr, col_na_values | col_na_fvalues)
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/io/parsers.py", line 1871, in _infer_types
    mask = algorithms.isin(values, list(na_values))
  File "/tmp/new_pandas/lib64/python3.6/site-packages/pandas/core/algorithms.py", line 443, in isin
    if np.isnan(values).any():
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
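
The final frame of the traceback can be reproduced in isolation with NumPy alone. The sketch below assumes (this is not confirmed in the report) that the array reaching np.isnan holds the NA marker strings built from na_values; the markers used here are hypothetical stand-ins. It only shows that np.isnan rejects non-numeric object arrays with this kind of TypeError, while a float array is accepted.

```python
import numpy as np

# Hypothetical stand-in for NA marker strings such as those passed via
# list(na_values); the exact values pandas uses are an assumption here.
na_like = np.array(["NA", "NaN", ""], dtype=object)


def isnan_raises(arr):
    """Return True if np.isnan raises TypeError for this array."""
    try:
        np.isnan(arr)
    except TypeError:
        return True
    return False


# Object array of strings: np.isnan has no loop for this input type.
raised_for_strings = isnan_raises(na_like)

# Plain float64 array: np.isnan works as expected.
raised_for_floats = isnan_raises(np.array([0.0, 1.0]))

print(raised_for_strings, raised_for_floats)
```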

Expected Output

No exception. With pandas 1.1.2, or with ROWS = 1000000, the same call works fine.
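
A possible workaround, not verified in this report, is to read the file without index_col and set the index afterwards, so that the index column is not built during parsing. The helper name read_then_index is hypothetical, and the example uses a small in-memory buffer rather than the 1000001-row file from the code sample.

```python
import io

import numpy as np
import pandas as pd


def read_then_index(path_or_buffer):
    """Read column 'a' as float64, then promote it to the index.

    Sketch of a workaround: index_col is deliberately not passed to
    read_csv; set_index is applied to the parsed frame instead.
    """
    df = pd.read_csv(path_or_buffer, names=["a"], dtype={"a": np.float64})
    return df.set_index("a")


# Small in-memory stand-in for 'out.dat' (one integer per line).
buffer = io.StringIO("".join("%d\n" % i for i in range(5)))
df = read_then_index(buffer)
print(df.index)
```

Whether this avoids the error on pandas 1.1.3 with more than 1M rows is an assumption and should be checked against the original reproduction.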

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : db08276
python : 3.6.3.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-957.38.3.el7.x86_64
Version : #1 SMP Mon Nov 11 12:01:33 EST 2019
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.3
setuptools : 50.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 7.16.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Metadata

    Labels

    Algos: Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff
    Regression: Functionality that used to work in a prior pandas version
