Skip to content

#59009: Added document and a test case for newlines_in_values case. #59754

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 16 commits into from
Closed
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 20 additions & 2 deletions pandas/io/parsers/readers.py
Original file line number Diff line number Diff line change
Expand Up @@ -247,7 +247,11 @@ class _read_shared(TypedDict, Generic[HashableT], total=False):
.. versionadded:: 1.4.0

The 'pyarrow' engine was added as an *experimental* engine, and some features
are unsupported, or may not work correctly, with this engine.
are unsupported, or may not work correctly, with this engine. For example,
the newlines_in_values in the ParseOptions of the pyarrow allows handling the
newline characters within values when parsing csv files. However, this is not
currently supported by Pandas. In this case, the 'csv' module in the pyarrow
should be used instead. For more information, refer to the example.
converters : dict of {{Hashable : Callable}}, optional
Functions for converting values in specified columns. Keys can either
be column labels or column indices.
Expand Down Expand Up @@ -545,12 +549,26 @@ class _read_shared(TypedDict, Generic[HashableT], total=False):
... parse_dates=[1, 2],
... date_format={{'col 2': '%d/%m/%Y', 'col 3': '%a %d %b %Y'}},
... ) # doctest: +SKIP

>>> df.dtypes # doctest: +SKIP
col 1 int64
col 2 datetime64[ns]
col 3 datetime64[ns]
dtype: object

The csv in the pyarrow must be used if the values in the file have
new line characters.

>>> from pyarrow import csv
>>> parse_options = csv.ParseOptions(newlines_in_values=True)
>>> table = csv.read_csv("example.csv", parse_options=parse_options)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
>>> from pyarrow import csv
>>> parse_options = csv.ParseOptions(newlines_in_values=True)
>>> table = csv.read_csv("example.csv", parse_options=parse_options)
>>> import io
>>> from pyarrow import csv
>>> rows = [{"text": "ab\ncd", "idx": idx} for idx in range(1_000_000)]
>>> df = pd.DataFrame(rows)
>>> source = io.BytesIO(df.to_string(index=False).encode())
>>> parse_options = csv.ParseOptions(newlines_in_values=True)
>>> table = csv.read_csv(source, parse_options=parse_options)

The doctest will fail because "example.csv" does not exist. We can change a bit or just skip the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of changing the example codes, I tried to skip the doctest because most of use cases with pyarrow are probably using csv files.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty new to numpy contribution. If the example codes need to be modified as suggested, please let me know.

>>> df = table.to_pandas()
>>> df.head()
text idx
0 ab\ncd 0
1 ab\ncd 1
2 ab\ncd 2
3 ab\ncd 3
4 ab\ncd 4
""" # noqa: E501


Expand Down
15 changes: 15 additions & 0 deletions pandas/tests/io/parser/test_unsupported.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
from pandas.errors import ParserError

import pandas._testing as tm
from pandas.core.frame import DataFrame

from pandas.io.parsers import read_csv
import pandas.io.parsers.readers as parsers
Expand Down Expand Up @@ -150,6 +151,20 @@ def test_pyarrow_engine(self):
with pytest.raises(ValueError, match=msg):
read_csv(StringIO(data), engine="pyarrow", **kwargs)

def test_pyarrow_newlines_in_values(self):
pytest.importorskip("pyarrow")
msg = (
"CSV parser got out of sync with chunker. "
"This can mean the data file contains cell values spanning multiple "
"lines; please consider enabling the option 'newlines_in_values'."
)
rows = [{"text": "ab\ncd", "idx": idx} for idx in range(1_000_000)]
df = DataFrame(rows)
df.to_csv("test.csv", index=False)
with pytest.raises(ValueError, match=msg):
read_csv("test.csv", engine="pyarrow")
os.unlink("test.csv")

def test_on_bad_lines_callable_python_or_pyarrow(self, all_parsers):
# GH 5686
# GH 54643
Expand Down
Loading