BUG: REGR: read_csv with memory_map=True raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 262143: unexpected end of data #43540

Closed
@michal-gh

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame(data=[ "a"*127 ]*4000)
# Put two-byte utf-8 encoded character at the end of the chunk (the utf-8 encoding of "ą" is b'\xc4\x85')
df.iloc[2047] = "a"*127 + "ą"
df.to_csv("./bugtest.csv", index=False, header=False, encoding="utf-8")
df1 = pd.read_csv("./bugtest.csv", header=None, memory_map=True) # <-- this fails

Traceback (most recent call last):
  File "/home/michal/.conda/envs/py39nlp/lib/python3.9/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 69, in __init__
    self._reader = parsers.TextReader(self.handles.handle, **kwds)
  File "pandas/_libs/parsers.pyx", line 542, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 745, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 843, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1917, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 262143: unexpected end of data
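The offset in the error message can be checked without pandas. The sketch below rebuilds the file contents in memory and looks at where the first 256 KB chunk ends. It assumes the CSV is written with "\n" line endings and no quoting (what to_csv would produce for this data on Linux); CHUNK_SIZE is a name chosen here for illustration, not a pandas constant.

```python
CHUNK_SIZE = 256 * 1024  # 262144 bytes, the read size described in this issue

# Same rows as the reproducible example: 4000 rows of 127 "a" characters,
# with a two-byte character appended to row 2047.
rows = ["a" * 127] * 4000
rows[2047] = "a" * 127 + "ą"

# Assumption: one "\n" per row and no quoting.
data = "".join(row + "\n" for row in rows).encode("utf-8")

first_chunk = data[:CHUNK_SIZE]
lead_byte_offset = data.index(b"\xc4")  # first byte of the encoded "ą"

print(lead_byte_offset)      # offset of the lead byte 0xc4
print(hex(first_chunk[-1]))  # last byte of the first chunk
```

Each of the first 2047 rows takes 128 bytes (262016 in total), and the 127 "a" characters of row 2047 fill the chunk up to offset 262142, so the lead byte 0xc4 is the last byte of the first chunk and the continuation byte 0x85 is the first byte of the next one.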

Issue Description

This bug occurs when the end of the internal 256 KB buffer falls inside a UTF-8 encoded multibyte character. When memory_map=True, the CSV parser uses the _MMapWrapper.read() method defined in common.py (L872):

def read(self, size: int = -1) -> str | bytes:
    # CSV c-engine uses read instead of iterating
    content: bytes = self.mmap.read(size)
    if self.decode:
        # memory mapping is applied before compression. Encoding should
        # be applied to the de-compressed data.
        return content.decode(self.encoding, errors=self.errors)
    return content

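Because this function is called with size=256KB (262144 bytes), the content buffer can end in the middle of a multibyte character. When that happens, the utf-8 codec raises an "unexpected end of data" error. This matches position 262143 in the traceback above, which is the last byte of the first chunk.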
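The failure mode can be reproduced with the standard library alone. The sketch below uses a tiny chunk size as a stand-in for 256 KB and compares decoding each chunk on its own (as read() does) with Python's incremental decoder, which holds back an incomplete trailing sequence until the next chunk arrives. The function names and CHUNK_SIZE are hypothetical, and this is only one possible way to handle the split, not necessarily the fix pandas adopted.

```python
import codecs

CHUNK_SIZE = 8  # tiny stand-in for the 256 KB read size

# 7 ASCII bytes, then "ą" (b"\xc4\x85"), so 0xc4 is the last byte of chunk 0.
data = ("a" * 7 + "ą" + "b" * 4).encode("utf-8")


def decode_each_chunk(raw: bytes, size: int) -> str:
    """Decode every chunk independently, like the read() shown above."""
    return "".join(
        raw[i:i + size].decode("utf-8") for i in range(0, len(raw), size)
    )


def decode_incrementally(raw: bytes, size: int) -> str:
    """Carry incomplete trailing bytes over to the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(raw[i:i + size]) for i in range(0, len(raw), size)]
    parts.append(decoder.decode(b"", final=True))  # flush; raises if truncated
    return "".join(parts)


try:
    decode_each_chunk(data, CHUNK_SIZE)
    naive_error = None
except UnicodeDecodeError as exc:
    naive_error = exc.reason

result = decode_incrementally(data, CHUNK_SIZE)
print(naive_error)
print(result)
```

One consequence of the incremental approach is that a single read may return fewer characters than the bytes consumed would suggest, because the pending bytes are only emitted with the following chunk.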

The _MMapWrapper.read() method was added in REGR: memory_map with non-UTF8 encoding #40994, so the bug is present in pandas 1.2.5 and newer versions.

Expected Behavior

read_csv does not raise an exception and produces the same result as pd.read_csv("./bugtest.csv", header=None, memory_map=False).

Installed Versions

pandas_datareader: None
bs4 : 4.9.3
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 5.0.0
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

Labels

Bug, IO CSV (read_csv, to_csv), Regression (functionality that used to work in a prior pandas version)