#59009: Added document and a test case for newlines_in_values case. #59754

wooseogchoi · 2024-09-09T01:58:27Z

[x] closes BUG: Reading large CSV files with pyarrow when values contain newline character. #59009
[x] Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

yuanx749 · 2024-09-17T11:29:44Z

pandas/io/parsers/readers.py

+>>> from pyarrow import csv
+>>> parse_options = csv.ParseOptions(newlines_in_values=True)
+>>> table = csv.read_csv("example.csv", parse_options=parse_options)


Suggested change

->>> from pyarrow import csv
->>> parse_options = csv.ParseOptions(newlines_in_values=True)
->>> table = csv.read_csv("example.csv", parse_options=parse_options)
+>>> import io
+>>> from pyarrow import csv
+>>> rows = [{"text": "ab\ncd", "idx": idx} for idx in range(1_000_000)]
+>>> df = pd.DataFrame(rows)
+>>> source = io.BytesIO(df.to_string(index=False).encode())
+>>> parse_options = csv.ParseOptions(newlines_in_values=True)
+>>> table = csv.read_csv(source, parse_options=parse_options)

The doctest will fail because "example.csv" does not exist. We can either change the example slightly so it builds its own input, or just skip the doctest.
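As a rough illustration of the "build its own input" option, here is a minimal sketch that writes a small CSV with embedded newlines into memory and reads it back. The names `std_csv`, `pa_csv`, and `csv_bytes` are made up for this example, the row count is reduced to keep it quick, and the pyarrow step is guarded so the snippet still runs where pyarrow is not installed.

```python
import csv as std_csv
import io

# Build a small CSV in memory; the "text" column holds an embedded newline.
buffer = io.StringIO()
writer = std_csv.writer(buffer, lineterminator="\n")
writer.writerow(["text", "idx"])
for idx in range(1_000):
    writer.writerow(["ab\ncd", idx])  # the writer quotes this field automatically
csv_bytes = buffer.getvalue().encode()

# Sanity check with the standard library: quoted newlines stay inside one field.
rows = list(std_csv.reader(io.StringIO(csv_bytes.decode(), newline="")))

# Optional pyarrow step (assumption: pyarrow may or may not be installed).
try:
    from pyarrow import csv as pa_csv
except ImportError:
    pa_csv = None

table = None
if pa_csv is not None:
    parse_options = pa_csv.ParseOptions(newlines_in_values=True)
    table = pa_csv.read_csv(io.BytesIO(csv_bytes), parse_options=parse_options)
```

Because the source is a file-like object rather than a path, nothing has to exist on disk for the example to run.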

Instead of changing the example code, I chose to skip the doctest, because most pyarrow use cases probably read from CSV files.

I am pretty new to contributing to pandas. If the example code needs to be modified as suggested, please let me know.
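For reference, skipping a single doctest example is usually done with the `# doctest: +SKIP` directive. The sketch below is not the PR's actual docstring; `DOCSTRING` and the test name `skip_demo` are hypothetical. It only shows that a skipped example is never executed while the rest of the docstring still runs.

```python
import doctest

# Hypothetical docstring: the first example would fail (no such file),
# but the +SKIP directive tells doctest not to execute it.
DOCSTRING = '''
>>> open("example.csv").read()  # doctest: +SKIP
'never executed'
>>> 1 + 1
2
'''

parser = doctest.DocTestParser()
test = parser.get_doctest(DOCSTRING, globs={}, name="skip_demo", filename=None, lineno=0)

runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)  # TestResults with .failed and .attempted
```

Only the second example counts as attempted, so the missing file never causes a failure.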

WillAyd · 2024-09-26T03:53:32Z

Hi @wooseogchoi, thanks for taking a look at this. Regarding the CI failures: the Minimum Versions failure means that the lowest allowed version of pyarrow doesn't properly fix this issue. Can you check which version of pyarrow this was fixed in?

If you can identify that, you can use the pa_version_underXXpY sentinels from pandas.compat to skip the test on versions where we cannot fix it (you will see them used already throughout the test suite).
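To make the sentinel idea concrete, here is a standard-library sketch of how such a boolean could be derived from a version string. It is not pandas' implementation: `parse_release` and `version_under` are hypothetical helpers, and the 13.0 threshold is only illustrative, since the thread does not establish which pyarrow release contains the fix.

```python
import re


def parse_release(version: str) -> tuple:
    """Extract the leading numeric components, e.g. "14.0.2.dev5" -> (14, 0, 2)."""
    release = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component such as "dev5"
        release.append(int(match.group()))
    return tuple(release)


def version_under(installed: str, minimum: tuple) -> bool:
    """Return True when `installed` is older than `minimum`."""
    return parse_release(installed) < minimum


# Illustrative stand-in for a sentinel; a real one would be computed from the
# installed pyarrow version rather than from a hard-coded string.
pa_version_under13p0 = version_under("12.0.1", (13, 0))

# In a test module, a boolean like this would typically feed a skip marker, e.g.
# @pytest.mark.skipif(pa_version_under13p0, reason="needs a newer pyarrow")
```

Keeping the comparison in one precomputed boolean lets every affected test share the same skip condition.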

The errors on the circleci arm build seem unrelated - they may go away if you merge in the latest changes from main

github-actions · 2024-10-29T00:07:34Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

mroeschke · 2024-10-29T20:37:30Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

wooseogchoi added 4 commits September 8, 2024 21:36

Added document and a test case for newlines_in_values case.

9ebb87b

removed trailing spaces and changed test codes.

ccd14a8

fixed hook error

898365e

fixed unit-test failure.

7fdbf43

yuanx749 reviewed Sep 17, 2024

wooseogchoi added 5 commits September 22, 2024 21:57

Merge branch 'main' of https://github.com/pandas-dev/pandas into issu…

697c6fe

…e_#59009

Added document and a test case for newlines_in_values case.

25c2604

Merge branch 'main' of https://github.com/wooseogchoi/pandas into iss…

7b8e68b

…ue_#59009 fixed errors from PR.

fixed unit test

703654f

skip test if pyarrow cannot be imported.

ca34b24

wooseogchoi mentioned this pull request Sep 26, 2024

BUG: Reading large CSV files with pyarrow when values contain newline character. #59009

Open

3 tasks

wooseogchoi added 7 commits September 26, 2024 21:24

Added pyarrow version check

c4222ad

changed version

ff7dcba

pa_version_13p0

fb87e78

debugging test

84d8b6d

roll-back of compat

18618c7

Merge branch 'main' of https://github.com/wooseogchoi/pandas into iss…

b9bc0cb

…ue_#59009 merged from origin main

run unit test for pyarrow v. 18

fe946a2

github-actions bot added the Stale label Oct 29, 2024

mroeschke closed this Oct 29, 2024


wooseogchoi commented Sep 9, 2024

yuanx749 Sep 17, 2024

wooseogchoi Sep 24, 2024

wooseogchoi Sep 24, 2024

WillAyd commented Sep 26, 2024

github-actions bot commented Oct 29, 2024

mroeschke commented Oct 29, 2024
