Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
# Load csv attached to this issue
import pandas as pd
df = pd.read_csv("pandas-bug-reproducer.csv", header=0, index_col=False)
Issue Description
The read_csv command results in the following message (this is IPython output, but it also happens non-interactively):
<ipython-input-61-2957767dea3a>:1: DtypeWarning: Columns (7) have mixed types. Specify dtype option on import or set low_memory=False.
Column 7 is then imported as strings, not floats.
I can work around this by following the hint in the warning, but this smells like a bug: if I remove any line from the CSV, the issue disappears. If I replace the last line with a copy of the one before it, the warning also goes away.
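For reference, the two workarounds named in the warning look roughly like this. This is only a sketch: the in-memory CSV and the column names `c0`..`c7` are made up so the snippet runs without the attachment, and this tiny input does not trigger the warning by itself.

```python
import io

import pandas as pd

# Hypothetical stand-in for the attached file: eight columns named c0..c7,
# where c7 (position 7) is the one that should hold floats.
csv_text = "c0,c1,c2,c3,c4,c5,c6,c7\n" + "a,a,a,a,a,a,a,1.0\n" * 5

# Workaround 1: declare the dtype of the affected column explicitly.
df_dtype = pd.read_csv(
    io.StringIO(csv_text), header=0, index_col=False, dtype={"c7": "float64"}
)

# Workaround 2: disable chunked parsing so dtypes are inferred from the
# whole file at once.
df_whole = pd.read_csv(
    io.StringIO(csv_text), header=0, index_col=False, low_memory=False
)

print(df_dtype["c7"].dtype)
print(df_whole["c7"].dtype)
```

With the real file, the same keyword arguments would be added to the `read_csv` call from the reproducer, using the actual name of column 7.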
It is quite tricky to create a small reproducer, so I am attaching the file here.
Replacing all text with "a" and values with "1" kept the issue, while making the data anonymous and very compressible:
pandas-bug-reproducer.zip
Expected Behavior
This message should not appear, and the data in column 7 should be imported as floating point values.
Moreover, changing the input csv by adding or removing random lines should not affect pandas's behavior.
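The expected behavior could be checked along these lines. This is a sketch, not an existing pandas test: `read_without_dtype_warning` is a hypothetical helper, and the in-memory CSV is a made-up stand-in so the snippet is self-contained.

```python
import io
import warnings

import pandas as pd


def read_without_dtype_warning(source):
    """Read a CSV as in the reproducer and fail if a DtypeWarning is emitted."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones that were already shown once.
        warnings.simplefilter("always")
        df = pd.read_csv(source, header=0, index_col=False)
    mixed = [w for w in caught if issubclass(w.category, pd.errors.DtypeWarning)]
    assert not mixed, f"unexpected DtypeWarning: {mixed[0].message}"
    return df


# Hypothetical stand-in data; pass "pandas-bug-reproducer.csv" instead to
# run the check against the attached file.
csv_text = "c0,c1,c2,c3,c4,c5,c6,c7\n" + "a,a,a,a,a,a,a,1.0\n" * 5
df = read_without_dtype_warning(io.StringIO(csv_text))

# Column 7 (by position) should come back as a floating point dtype.
assert pd.api.types.is_float_dtype(df.iloc[:, 7])
```

Run against the attached file on the versions listed below, I would expect the first assertion to fail.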
Installed Versions
First version I tried
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.10.4.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-553.16.1.el8_10.x86_64
Version : #1 SMP Thu Aug 1 04:16:12 EDT 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 69.2.0
pip : 24.2
Cython : 3.0.2
pytest : 8.2.2
hypothesis : None
sphinx : 7.3.7
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 4.9.3
html5lib : None
pymysql : 1.0.2
psycopg2 : None
jinja2 : 3.1.4
IPython : 8.4.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : 1.4.0
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.3.1
gcsfs : None
matplotlib : 3.7.3
numba : 0.60.0
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.14.1
sqlalchemy : 2.0.15
tables : None
tabulate : 0.9.0
xarray : None
xlrd : 2.0.1
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None
Second version I tried
INSTALLED VERSIONS
commit : 0691c5c
python : 3.12.2
python-bits : 64
OS : Linux
OS-release : 4.18.0-553.16.1.el8_10.x86_64
Version : #1 SMP Thu Aug 1 04:16:12 EDT 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.3
numpy : 2.1.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.0
Cython : None
sphinx : 8.1.3
IPython : 8.28.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.4
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : None
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : None
pyqt5 : None