Description
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
import pandas as pd
df = pd.DataFrame(data=[ "a"*127 ]*4000)
# Put two-byte utf-8 encoded character at the end of the chunk (the utf-8 encoding of "ą" is b'\xc4\x85')
df.iloc[2047] = "a"*127 + "ą"
df.to_csv("./bugtest.csv", index=False, header=False, encoding="utf-8")
df1 = pd.read_csv("./bugtest.csv", header=None, memory_map=True) # <-- this fails
Traceback (most recent call last):
File "/home/michal/.conda/envs/py39nlp/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
self._reader = parsers.TextReader(self.handles.handle, **kwds)
File "pandas/_libs/parsers.pyx", line 542, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 745, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 1917, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 262143: unexpected end of data
python-BaseException
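The position reported in the traceback can be checked without pandas. The sketch below rebuilds the file contents in memory, assuming `to_csv` writes each row followed by a single `"\n"` (the Linux default) and that the parser requests 256 KB (262144 bytes) at a time, as described in this report. `CHUNK_SIZE` and the other names are illustrative only.

```python
ROW = "a" * 127
CHUNK_SIZE = 256 * 1024  # assumed read size, taken from this report

# Same layout as the reproducible example: 4000 rows, row 2047 ends with "ą".
rows = [ROW] * 4000
rows[2047] = ROW + "ą"

# Assumption: each row is terminated by a single "\n" byte.
data = "".join(row + "\n" for row in rows).encode("utf-8")

# Byte offset of the first byte (0xc4) of "ą".
offset = data.index("ą".encode("utf-8"))

# The first chunk ends exactly between the two bytes of "ą".
first_chunk = data[:CHUNK_SIZE]

try:
    first_chunk.decode("utf-8")
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start

print(offset, first_chunk[-1:], error_position)
```

Rows 0 to 2046 occupy 2047 * 128 = 262016 bytes, and the 127 `a` characters of row 2047 fill bytes 262016 to 262142, so `0xc4` lands on byte 262143, the last byte of the first chunk.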
Issue Description
This bug occurs when the end of the internal 256 KB buffer falls inside a UTF-8 encoded multibyte character. When memory_map=True, the CSV parser uses the _MMapWrapper.read() method defined in common.py (L872):
def read(self, size: int = -1) -> str | bytes:
    # CSV c-engine uses read instead of iterating
    content: bytes = self.mmap.read(size)
    if self.decode:
        # memory mapping is applied before compression. Encoding should
        # be applied to the de-compressed data.
        return content.decode(self.encoding, errors=self.errors)
    return content
Because this function is called with a size of 256 KB (262144 bytes), the content buffer can end in the middle of a multibyte character. When that happens, the utf-8 codec raises an "unexpected end of data" error.
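One way to avoid this kind of failure is to keep decoder state between reads, so that a trailing partial character is held back until the next chunk arrives. The sketch below uses the standard library's incremental decoder on a tiny payload. It is only an illustration of the idea, not the fix adopted in pandas, and `decode_chunks` is a hypothetical helper.

```python
import codecs


def decode_chunks(data: bytes, chunk_size: int, encoding: str = "utf-8") -> list:
    """Decode `data` chunk by chunk without breaking multibyte characters.

    Hypothetical helper for illustration; not part of pandas.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    pieces = []
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        # On the last chunk, final=True makes the decoder report any
        # truly incomplete trailing bytes instead of buffering them.
        is_last = start + chunk_size >= len(data)
        pieces.append(decoder.decode(chunk, final=is_last))
    return pieces


payload = ("aaa" + "ą").encode("utf-8")  # 5 bytes: b'aaa\xc4\x85'

# Naive per-chunk decoding fails when a 4-byte chunk splits "ą".
try:
    payload[:4].decode("utf-8")
    naive_failed = False
except UnicodeDecodeError:
    naive_failed = True

# The incremental decoder buffers b'\xc4' and emits "ą" with the next chunk.
pieces = decode_chunks(payload, chunk_size=4)
print(naive_failed, pieces)
```

Note that with this approach a read of `size` bytes may return slightly fewer characters than expected, because the incomplete tail is carried over to the next call.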
The _MMapWrapper.read() method was added in "REGR: memory_map with non-UTF8 encoding" (#40994), so the bug is present in pandas 1.2.5 and newer versions.
Expected Behavior
read_csv should not raise an exception and should produce the same result as pd.read_csv("./bugtest.csv", header=None, memory_map=False).
Installed Versions
pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None