Skip to content

read_json engine keyword and pyarrow integration #49249

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 31 commits into from
Feb 10, 2023
Merged

Conversation

abkosar
Copy link
Contributor

@abkosar abkosar commented Oct 22, 2022

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice start

@abkosar
Copy link
Contributor Author

abkosar commented Oct 23, 2022

Thanks, I appreciate it 🙏🏻 Took me 11 days but finally starting to figure it out.

@mroeschke mroeschke added IO JSON read_json, to_json, json_normalize Arrow pyarrow functionality labels Oct 24, 2022
@abkosar
Copy link
Contributor Author

abkosar commented Nov 13, 2022

So I pushed my recent changes but after I successfully synced my fork, I started to have some problems when running the tests:

  1. Independent of the test I run I kept getting the following error:
ImportError while loading conftest '/Users/ardkosar1/Documents/personal_projects/pandas/pandas/conftest.py'.
pandas/conftest.py:595: in <module>
    "timedelta": tm.makeTimedeltaIndex(100),
pandas/_testing/__init__.py:401: in makeTimedeltaIndex
    return pd.timedelta_range(start="1 day", periods=k, freq=freq, name=name, **kwargs)
pandas/core/indexes/timedeltas.py:278: in timedelta_range
    tdarr = TimedeltaArray._generate_range(start, end, periods, freq, closed=closed)
pandas/core/arrays/timedeltas.py:271: in _generate_range
    start = Timedelta(start).as_unit("ns")
E   AttributeError: 'Timedelta' object has no attribute 'as_unit'
  1. I tried commenting out line 595 in pandas/conftest.py to see if I could run the tests like that and I started getting the following error:
ImportError while loading conftest '/Users/ardkosar1/Documents/personal_projects/pandas/pandas/conftest.py'.
pandas/conftest.py:628: in <module>
    idx = Index(pd.array(tm.makeStringIndex(100), dtype="string[pyarrow]"))
pandas/core/construction.py:322: in array
    dtype = registry.find(dtype) or dtype
pandas/core/dtypes/base.py:521: in find
    return dtype_type.construct_from_string(dtype)
pandas/core/arrays/string_.py:157: in construct_from_string
    return cls(storage="pyarrow")
pandas/core/arrays/string_.py:111: in __init__
    raise ImportError(
E   ImportError: pyarrow>=6.0.0 is required for PyArrow backed StringArray.

I think the pyarrow version in the pandas-dev environment is 2.0.0 so I tried upgrading pyarrow and then I was able to run my tests; however, I am not sure if I got reliable results. But I wanted to push my updates anyways just to get feedback and fix any issues.

@WillAyd I successfully added engine fixture to all the tests; however, since pyarrows read_json doesn't support any of the pandas.read_jsons arguments I had to add an xfail decorator to all the tests so if engine="pyarrow" it doesn't cause any errors when running the test suite.

@mroeschke
Copy link
Member

Independent of the test I run I kept getting the following error:

Generally you will need to rebuild your cython code each time you pull in code (python setup.py build_ext -j 4)

@mroeschke
Copy link
Member

Hi @abkosar Just following up to see if you need any further assistance on this PR. I would be interested in following up to this PR with #48957 so let us know if you need any help

@abkosar
Copy link
Contributor Author

abkosar commented Dec 1, 2022

Hey @mroeschke, Sorry for no updates. Work has been really busy the past two weeks, I haven't had the time to sit and concentrate on the issue. I will start working on it tonight. First I will address your latest comments and we can take it from there. Appreciate the support! Thanks!

@abkosar
Copy link
Contributor Author

abkosar commented Dec 3, 2022

And also I don't know why this says there are conflicts because I don't what it shows in my fork.

@mroeschke
Copy link
Member

And also I don't know why this says there are conflicts because I don't what it shows in my fork.

Did you happen to follow these update instructions? https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#updating-your-pull-request

@abkosar
Copy link
Contributor Author

abkosar commented Dec 13, 2022

Yep I checked it out but for me, it didn't work with git merge upstream/main, it worked with rebase.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice job - keep up the good work on this

@abkosar
Copy link
Contributor Author

abkosar commented Dec 22, 2022

I am trying to run the tests, but I keep getting an error, making it hard for me to run tests and update something. After each rebase I also do python setup.py build_ext -j 4 as per @mroeschke 's suggestion; however, I keep getting the following error:

ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --strict-data-files
  inifile: <path>/pandas/pyproject.toml
  rootdir: <path>/pandas

I also tried to recreate the development environment, but it didn't work.

@mroeschke
Copy link
Member

How are you running the tests? And to confirm you created your development environment according to https://pandas.pydata.org/pandas-docs/stable/development/contributing_environment.html?

@mroeschke
Copy link
Member

Hey @abkosar mind if I push some commits to your PR? I would be very interested to see this feature in the next major release tentatively scheduled for next month

@abkosar
Copy link
Contributor Author

abkosar commented Jan 6, 2023

Hey @mroeschke! Please go ahead, but I was going to do some changes so should I hold off?

@abkosar
Copy link
Contributor Author

abkosar commented Feb 2, 2023

Error: /home/runner/work/pandas/pandas/pandas/io/json/_json.py:486:PR03:pandas.read_json:Wrong parameters order. Actual: ('path_or_buf', 'orient', 'typ', 'dtype', 'convert_axes', 'convert_dates', 'keep_default_dates', 'precise_float', 'date_unit', 'encoding', 'encoding_errors', 'lines', 'chunksize', 'compression', 'nrows', 'storage_options', 'use_nullable_dtypes', 'engine'). Documented: ('path_or_buf', 'orient', 'typ', 'dtype', 'convert_axes', 'convert_dates', 'keep_default_dates', 'precise_float', 'date_unit', 'encoding', 'encoding_errors', 'engine', 'lines', 'chunksize', 'compression', 'nrows', 'storage_options', 'use_nullable_dtypes')

This is fixed too.

def test_read_datetime():
def test_read_jsonl_engine_pyarrow(json_dir_path, engine):
result = read_json(
os.path.join(json_dir_path, "line_delimited.json"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
os.path.join(json_dir_path, "line_delimited.json"),
datapath("io", "json", "data", "line_delimited.json")

And then you can replace json_dir_path with datapath and remove json_dir_path from pandas/tests/io/json/conftest.py

def engine(request):
if request.param == "pyarrow":
pytest.importorskip("pyarrow.json")
return request.param
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Instead of the else, could you do

if request.param == "pyarrow":
    pytest.importorskip(...)
return request.param

@mroeschke mroeschke added this to the 2.0 milestone Feb 9, 2023
@mroeschke
Copy link
Member

I pushed a few commits addressing my comments. I think this should be okay for 2.0. Mind taking a look @phofl

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good in general, some small comments.

Can you update https://pandas.pydata.org/docs/user_guide/io.html#reading-json as well?

@phofl
Copy link
Member

phofl commented Feb 9, 2023

We are also documenting all key-words in the user guide I think, right at the top of the section.

Otherwise lgtm

@phofl phofl dismissed WillAyd’s stale review February 10, 2023 12:26

Can't merge without this, the last open comment (documenting the xfails) was addressed

@phofl
Copy link
Member

phofl commented Feb 10, 2023

thx @mroeschke and thx @abkosar

@phofl phofl merged commit 94f9412 into pandas-dev:main Feb 10, 2023
@abkosar
Copy link
Contributor Author

abkosar commented Feb 10, 2023

Thanks @mroeschke for all your patience! Happy to be involved!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Add engine keyword to read_json to enable reading from pyarrow
5 participants