Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
from io import BytesIO
import pandas as pd
file_good = b"a;b\na;1,20\nb;22,3\n"
file_bad = file_good + b"c;1.234,56\n"
file_ints = b"a;b\na;1\nb;1.234,56\n"
# OK
df1 = pd.read_csv(BytesIO(file_good), sep=";", decimal=",", dtype={"b": float})
# NOT OK!
# raises: ValueError: could not convert string to float: '1,20'
# should raise: ValueError: could not convert string to float: '1.234,56'
df2 = pd.read_csv(BytesIO(file_bad), sep=";", decimal=",", dtype={"b": float})
# OK, correctly raises ValueError: could not convert string to float: '1.234,56'
df3 = pd.read_csv(BytesIO(file_ints), sep=";", decimal=",", dtype={"b": float})
Issue Description
When reading a CSV file with a comma as the decimal separator but without specifying a thousands separator, read_csv reports the wrong offending value in its error message.
In the example, a number using "." as a thousands separator was added in file_bad. If earlier rows contain valid numbers with a comma as the decimal separator, the later line containing the "." causes a ValueError that points at one of the earlier, valid lines. This should never happen: the offending line is the new one containing the ".". Interestingly, if none of the earlier floats contain a decimal comma (case file_ints), the error message is correct.
In my case, I had a CSV file with 600k lines. The real error was around line 550k, while the ValueError pointed me to a line somewhere around 1k.
My issue was quickly solved by adding thousands=".", but it took me some minutes to find the offending line (which can be hard in large CSV files, which is why accurate ValueErrors matter). See the workaround sketch below.
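
For reference, here is the workaround together with one way to pinpoint the failing row. This is a minimal sketch building on the example above; the coerce-and-filter approach and the hard-coded column name "b" are illustrative, not pandas' own error reporting:

from io import BytesIO
import pandas as pd

file_bad = b"a;b\na;1,20\nb;22,3\nc;1.234,56\n"

# Workaround: declaring "." as the thousands separator makes all rows parse.
df = pd.read_csv(BytesIO(file_bad), sep=";", decimal=",", thousands=".", dtype={"b": float})
print(df["b"].tolist())  # [1.2, 22.3, 1234.56]

# If the stray separator is not yet known, pinpoint the offending rows by
# reading the column as strings and converting value by value.
raw = pd.read_csv(BytesIO(file_bad), sep=";", dtype=str)
converted = pd.to_numeric(raw["b"].str.replace(",", ".", regex=False), errors="coerce")
print(raw.loc[converted.isna(), "b"])  # only the rows that fail to convert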
Expected Behavior
Reading file_bad should raise ValueError: could not convert string to float: '1.234,56', as in the file_ints case (df3).
Installed Versions
Replace this line with the output of pd.show_versions()