#59009: Added document and a test case for newlines_in_values case. #59754

Closed
wooseogchoi wants to merge 16 commits

Conversation

wooseogchoi
Contributor

Comment on lines 561 to 563
>>> from pyarrow import csv
>>> parse_options = csv.ParseOptions(newlines_in_values=True)
>>> table = csv.read_csv("example.csv", parse_options=parse_options)
Contributor

Suggested change
>>> from pyarrow import csv
>>> parse_options = csv.ParseOptions(newlines_in_values=True)
>>> table = csv.read_csv("example.csv", parse_options=parse_options)
>>> import io
>>> import pandas as pd
>>> from pyarrow import csv
>>> rows = [{"text": "ab\ncd", "idx": idx} for idx in range(1_000_000)]
>>> df = pd.DataFrame(rows)
>>> source = io.BytesIO(df.to_csv(index=False).encode())
>>> parse_options = csv.ParseOptions(newlines_in_values=True)
>>> table = csv.read_csv(source, parse_options=parse_options)

The doctest will fail because "example.csv" does not exist. We can either change the example slightly, as suggested above, or skip the doctest.

Contributor Author

Instead of changing the example code, I tried skipping the doctest, because most pyarrow use cases probably involve reading CSV files.
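
For reference, skipping a single docstring example is usually done with the standard `# doctest: +SKIP` directive. The following is a minimal sketch using only the stdlib `doctest` module; the docstring text and the file name `missing_example.csv` are made up for illustration and are not taken from the PR.

```python
import doctest

# Hypothetical docstring: the first example depends on a file that does not
# exist, so it is marked with +SKIP; the second example runs normally.
DOCSTRING = """
>>> data = open("missing_example.csv").read()  # doctest: +SKIP
>>> 1 + 1
2
"""

# Parse the examples out of the string and run them.
parser = doctest.DocTestParser()
test = parser.get_doctest(DOCSTRING, globs={}, name="skip_demo",
                          filename=None, lineno=0)
runner = doctest.DocTestRunner(verbose=False)
results = runner.run(test)

# The skipped line is never executed, so the missing file causes no failure.
print(results.failed)
```

The trade-off is that a skipped example is never verified, whereas an in-memory source keeps the example under test.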

Contributor Author

I am pretty new to numpy contribution. If the example code needs to be modified as suggested, please let me know.

Member

WillAyd commented Sep 26, 2024

Hi @wooseogchoi, thanks for taking a look at this. Regarding the CI failures: the Minimum Versions failure means that the lowest pyarrow version we allow does not include a fix for this issue. Can you check which version of pyarrow this was fixed in?

Once you have identified that version, you can use the pa_version_underXXpY sentinels from pandas.compat to skip the test on versions where the fix is not available (you will see those used throughout the test suite already).

The errors on the circleci arm build seem unrelated - they may go away if you merge in the latest changes from main
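
The version-gated skip described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the installed version string, the `FIXED_IN` tuple, and the names `version_under` and `pa_version_under_fix` are all hypothetical (the thread does not say which pyarrow release contains the fix), and stdlib `unittest` is used so the snippet runs without extra dependencies. In pandas itself the flag would be one of the existing `pa_version_underXXpY` sentinels, applied through the test suite's own skip mechanism.

```python
import unittest

# Assumed values for illustration only; not taken from the PR.
INSTALLED_PYARROW = "10.0.1"  # stand-in for pyarrow.__version__
FIXED_IN = (11, 0)            # hypothetical first release with the fix


def version_under(installed: str, minimum: tuple) -> bool:
    """Return True if the leading components of `installed` are below `minimum`."""
    parts = tuple(int(p) for p in installed.split(".")[: len(minimum)])
    return parts < minimum


# Hypothetical flag playing the role of a pa_version_underXXpY sentinel.
pa_version_under_fix = version_under(INSTALLED_PYARROW, FIXED_IN)


class TestNewlinesInValues(unittest.TestCase):
    @unittest.skipIf(pa_version_under_fix, "pyarrow too old for the fix")
    def test_newlines_in_values(self):
        # Placeholder body; the real test would read a CSV with embedded newlines.
        self.assertTrue(True)


# Run the test case programmatically and collect the outcome.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNewlinesInValues)
result = unittest.TestResult()
suite.run(result)
print(len(result.skipped))
```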

Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

github-actions bot added the Stale label Oct 29, 2024
mroeschke
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

mroeschke closed this Oct 29, 2024
Successfully merging this pull request may close these issues.

BUG: Reading large CSV files with pyarrow when values contain newline character.