BUG: read_csv with pyarrow engine treats newlines in quotes as newline when they should be ignored #56396

Open
@greerreNFL

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

URL = "https://github.com/nflverse/nflverse-data/releases/download/pbp/play_by_play_2023.csv.gz"

# Default engine (C): works
df = pd.read_csv(URL, compression="gzip")

# Explicit C engine: works
df = pd.read_csv(URL, engine="c", compression="gzip")

# pyarrow engine: fails with a CSV parse error
df = pd.read_csv(URL, engine="pyarrow", compression="gzip")

Issue Description

CSV files that contain line breaks inside quoted fields are read correctly by the default engine (which is the C engine) and by the explicitly selected C engine.

With the pyarrow engine, however, line breaks inside quotes are not treated as part of the field value. They are read as row terminators, so the parser sees rows with differing column counts and raises a CSV parse error:

df = pd.read_csv('https://github.com/nflverse/nflverse-data/releases/download/pbp/play_by_play_2023.csv.gz',engine='pyarrow',compression='gzip')
Traceback (most recent call last):
  File "", line 1, in
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1692, in read
    df = self._engine.read() # type: ignore[attr-defined]
         ^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/pandas/io/parsers/arrow_parser_wrapper.py", line 152, in read
    table = pyarrow_csv.read_csv(
            ^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_csv.pyx", line 1262, in pyarrow._csv.read_csv
  File "pyarrow/_csv.pyx", line 1271, in pyarrow._csv.read_csv
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: CSV parse error: Expected 372 columns, got 347: Penalty on TB-70-R.Hainsey, Offensive Holding, declined.",run,3,1,0,1,0,0,1,,,,,left,end,,,,,3,2 ...
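
To illustrate the kind of input involved, here is a minimal sketch using only Python's standard-library csv module and a made-up three-column sample (the column names and values are hypothetical, not taken from the nflverse file). It shows how a quote-aware parser keeps an embedded line break inside one field, while splitting on line breaks alone produces fragments with the wrong number of columns. It does not claim to reproduce the pyarrow failure itself.

```python
import csv
import io

# Hypothetical sample: the "desc" field of the first data row contains a
# line break inside double quotes.
raw = 'play_id,desc,down\n1,"Penalty on offense\ndeclined.",3\n2,"Run left end",1\n'

# A quote-aware parser treats the embedded line break as part of the field,
# so it yields the header plus two data rows, each with three fields.
rows = list(csv.reader(io.StringIO(raw)))
print(len(rows), [len(r) for r in rows])
print(repr(rows[1][1]))

# Splitting on line breaks alone cuts the quoted field in two, giving
# fragments whose column counts no longer match the header.
naive = [line.split(",") for line in raw.splitlines()]
print([len(fields) for fields in naive])
```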

Expected Behavior

The pyarrow engine should parse the file the same way the default and C engines do, keeping line breaks inside quoted fields as part of the field value.
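
Until that is the case, one possible workaround is to call pyarrow's CSV reader directly, as the traceback shows pandas doing internally, and turn on the `newlines_in_values` parse option. This is a sketch under the assumption that the error stems from that option being off by default. The helper name `read_csv_with_quoted_newlines` is hypothetical and not part of pandas or pyarrow.

```python
def read_csv_with_quoted_newlines(source):
    """Read a CSV with pyarrow, allowing line breaks inside quoted values.

    `source` may be a file path or a binary file-like object.
    Hypothetical helper; not part of pandas or pyarrow.
    """
    # Imported lazily so this module still loads where pyarrow is absent.
    from pyarrow import csv as pyarrow_csv

    # Assumption: the parse error arises because newlines_in_values is
    # False by default, so a quoted line break can be taken as a row end.
    parse_options = pyarrow_csv.ParseOptions(newlines_in_values=True)
    return pyarrow_csv.read_csv(source, parse_options=parse_options)
```

With a locally downloaded copy of the file, `read_csv_with_quoted_newlines("play_by_play_2023.csv.gz").to_pandas()` should give a DataFrame; pyarrow decompresses paths ending in `.gz` automatically. Whether this resolves the error for this particular file has not been verified here.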

Installed Versions

Replace this line with the output of pd.show_versions()

Labels: Arrow (pyarrow functionality), Bug, IO CSV (read_csv, to_csv), Upstream issue (issue related to pandas dependency)