read_json engine keyword and pyarrow integration #49249

abkosar · 2022-10-22T18:21:12Z

- closes ENH: Add engine keyword to read_json to enable reading from pyarrow #48893
- Tests added and passed if fixing a bug or adding a new feature
- All code checks passed.
- Added type annotations to new arguments/methods/functions.
- Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

WillAyd

Nice start

pandas/tests/io/json/test_readlines.py

abkosar · 2022-10-23T03:29:18Z

Thanks, I appreciate it 🙏🏻 Took me 11 days but finally starting to figure it out.

pandas/io/json/arrow_json_parser_wrapper.py

pandas/io/json/_json.py

abkosar · 2022-11-13T07:38:13Z

So I pushed my recent changes but after I successfully synced my fork, I started to have some problems when running the tests:

Independent of the test I run I kept getting the following error:

ImportError while loading conftest '/Users/ardkosar1/Documents/personal_projects/pandas/pandas/conftest.py'.
pandas/conftest.py:595: in <module>
    "timedelta": tm.makeTimedeltaIndex(100),
pandas/_testing/__init__.py:401: in makeTimedeltaIndex
    return pd.timedelta_range(start="1 day", periods=k, freq=freq, name=name, **kwargs)
pandas/core/indexes/timedeltas.py:278: in timedelta_range
    tdarr = TimedeltaArray._generate_range(start, end, periods, freq, closed=closed)
pandas/core/arrays/timedeltas.py:271: in _generate_range
    start = Timedelta(start).as_unit("ns")
E   AttributeError: 'Timedelta' object has no attribute 'as_unit'

I tried commenting out line 595 in pandas/conftest.py to see if I could run the tests like that and I started getting the following error:

ImportError while loading conftest '/Users/ardkosar1/Documents/personal_projects/pandas/pandas/conftest.py'.
pandas/conftest.py:628: in <module>
    idx = Index(pd.array(tm.makeStringIndex(100), dtype="string[pyarrow]"))
pandas/core/construction.py:322: in array
    dtype = registry.find(dtype) or dtype
pandas/core/dtypes/base.py:521: in find
    return dtype_type.construct_from_string(dtype)
pandas/core/arrays/string_.py:157: in construct_from_string
    return cls(storage="pyarrow")
pandas/core/arrays/string_.py:111: in __init__
    raise ImportError(
E   ImportError: pyarrow>=6.0.0 is required for PyArrow backed StringArray.

I think the pyarrow version in the pandas-dev environment is 2.0.0 so I tried upgrading pyarrow and then I was able to run my tests; however, I am not sure if I got reliable results. But I wanted to push my updates anyways just to get feedback and fix any issues.

@WillAyd I successfully added the engine fixture to all the tests; however, since pyarrow's read_json doesn't support any of pandas.read_json's arguments, I had to add an xfail decorator to all the tests so that engine="pyarrow" doesn't cause any errors when running the test suite.

mroeschke · 2022-11-17T01:57:08Z

Independent of the test I run I kept getting the following error:

Generally you will need to rebuild your cython code each time you pull in code (python setup.py build_ext -j 4)

pandas/io/json/_json.py

pandas/io/json/arrow_json_parser_wrapper.py

mroeschke · 2022-12-01T22:10:17Z

Hi @abkosar Just following up to see if you need any further assistance on this PR. I would be interested in following up to this PR with #48957 so let us know if you need any help

abkosar · 2022-12-01T23:09:01Z

Hey @mroeschke, Sorry for no updates. Work has been really busy the past two weeks, I haven't had the time to sit and concentrate on the issue. I will start working on it tonight. First I will address your latest comments and we can take it from there. Appreciate the support! Thanks!

abkosar · 2022-12-03T04:14:36Z

And also I don't know why this says there are conflicts, because I don't see them in my fork.

mroeschke · 2022-12-05T20:01:17Z

And also I don't know why this says there are conflicts, because I don't see them in my fork.

Did you happen to follow these update instructions? https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#updating-your-pull-request

abkosar · 2022-12-13T02:46:10Z

Yep I checked it out but for me, it didn't work with git merge upstream/main, it worked with rebase.

WillAyd

nice job - keep up the good work on this

pandas/io/json/_json.py

pandas/tests/io/json/test_readlines.py

abkosar · 2022-12-22T04:18:24Z

I am trying to run the tests, but I keep getting an error, which makes it hard for me to test and update anything. After each rebase I also run python setup.py build_ext -j 4 as per @mroeschke's suggestion; however, I keep getting the following error:

ERROR: usage: pytest [options] [file_or_dir] [file_or_dir] [...]
pytest: error: unrecognized arguments: --strict-data-files
  inifile: <path>/pandas/pyproject.toml
  rootdir: <path>/pandas

I also tried to recreate the development environment, but it didn't work.

mroeschke · 2022-12-27T22:31:05Z

How are you running the tests? And to confirm, did you create your development environment according to https://pandas.pydata.org/pandas-docs/stable/development/contributing_environment.html?

mroeschke · 2023-01-06T01:04:28Z

Hey @abkosar mind if I push some commits to your PR? I would be very interested to see this feature in the next major release tentatively scheduled for next month

abkosar · 2023-01-06T21:59:29Z

Hey @mroeschke! Please go ahead, but I was going to do some changes so should I hold off?

abkosar · 2023-02-02T14:15:03Z

Error: /home/runner/work/pandas/pandas/pandas/io/json/_json.py:486:PR03:pandas.read_json:Wrong parameters order. Actual: ('path_or_buf', 'orient', 'typ', 'dtype', 'convert_axes', 'convert_dates', 'keep_default_dates', 'precise_float', 'date_unit', 'encoding', 'encoding_errors', 'lines', 'chunksize', 'compression', 'nrows', 'storage_options', 'use_nullable_dtypes', 'engine'). Documented: ('path_or_buf', 'orient', 'typ', 'dtype', 'convert_axes', 'convert_dates', 'keep_default_dates', 'precise_float', 'date_unit', 'encoding', 'encoding_errors', 'engine', 'lines', 'chunksize', 'compression', 'nrows', 'storage_options', 'use_nullable_dtypes')

This is fixed too.
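
For context, the PR03 error quoted above is raised by the docstring validation step when the order of parameters in the function signature differs from the order listed in the docstring. A minimal sketch of that kind of comparison, using only `inspect`; the helper `param_order_mismatch` and the stub `read_stub` are hypothetical names for illustration, not the validator's actual code:

```python
import inspect


def param_order_mismatch(func, documented):
    """Return (actual, documented) if the orders differ, else None."""
    actual = tuple(inspect.signature(func).parameters)
    documented = tuple(documented)
    if actual == documented:
        return None
    return actual, documented


# Hypothetical stand-in: the signature puts `engine` last,
# while the documented order below lists it before `lines`.
def read_stub(path_or_buf, lines=False, engine="ujson"):
    return None


mismatch = param_order_mismatch(read_stub, ["path_or_buf", "engine", "lines"])
if mismatch is not None:
    actual, documented = mismatch
    print(f"Wrong parameters order. Actual: {actual}. Documented: {documented}")
```

Either the signature or the docstring can be reordered to clear the error, as long as the two end up agreeing.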

- Added reason to fail to each pyarrow xfail test.

pandas/tests/io/json/test_readlines.py

- removed xfail fixture from conftest.py

pandas/io/json/_json.py

mroeschke · 2023-02-03T18:44:17Z

pandas/tests/io/json/test_readlines.py

-def test_read_datetime():
+def test_read_jsonl_engine_pyarrow(json_dir_path, engine):
+    result = read_json(
+        os.path.join(json_dir_path, "line_delimited.json"),


Suggested change

-        os.path.join(json_dir_path, "line_delimited.json"),
+        datapath("io", "json", "data", "line_delimited.json")

And then you can replace json_dir_path with datapath and remove json_dir_path from pandas/tests/io/json/conftest.py
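
To illustrate the suggestion above: `datapath` takes path components relative to a test-data root and returns the joined path. A rough stand-in for that behaviour, using only the standard library; `make_datapath` and the temporary directory layout are hypothetical, and this is not the pandas fixture itself:

```python
import os
import tempfile


def make_datapath(base_path):
    """Build a datapath-like callable rooted at base_path (hypothetical helper)."""

    def datapath(*parts):
        path = os.path.join(base_path, *parts)
        if not os.path.exists(path):
            raise ValueError(f"Could not find file {path}")
        return path

    return datapath


# Demo with a throwaway directory that mimics io/json/data/.
with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "io", "json", "data")
    os.makedirs(data_dir)
    with open(os.path.join(data_dir, "line_delimited.json"), "w") as fh:
        fh.write('{"a": 1, "b": 2}\n{"a": 3, "b": 4}\n')

    datapath = make_datapath(root)
    found = datapath("io", "json", "data", "line_delimited.json")
    found_exists = os.path.exists(found)
    print(found)
```

Because the path is built from one shared root, a per-directory helper such as json_dir_path is no longer needed.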

mroeschke · 2023-02-03T18:45:20Z

pandas/tests/io/json/conftest.py

+def engine(request):
+    if request.param == "pyarrow":
+        pytest.importorskip("pyarrow.json")
+        return request.param


Nit: Instead of the else, could you do

    if request.param == "pyarrow":
        pytest.importorskip(...)
    return request.param
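
Written out in full, the fixture with the `else` removed might look like the sketch below. The `params` list is an assumption (the quoted diff does not show the decorator), and `ENGINES` is a name introduced only for illustration:

```python
import pytest

# Assumed engine names; the decorator is not visible in the quoted diff.
ENGINES = ["ujson", "pyarrow"]


@pytest.fixture(params=ENGINES)
def engine(request):
    # Skip the pyarrow-parametrized tests when pyarrow.json cannot be imported.
    if request.param == "pyarrow":
        pytest.importorskip("pyarrow.json")
    # Both branches return the same value, so no else is needed.
    return request.param
```

Any test that accepts `engine` as an argument then runs once per engine, and the pyarrow run is skipped rather than failed when pyarrow is missing.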

mroeschke · 2023-02-09T21:53:47Z

I pushed a few commits addressing my comments. I think this should be okay for 2.0. Mind taking a look @phofl

phofl

looks good in general, some small comments.

Can you update https://pandas.pydata.org/docs/user_guide/io.html#reading-json as well?

pandas/io/json/_json.py

pandas/tests/io/json/test_pandas.py

phofl · 2023-02-09T23:04:06Z

I think we also document all keywords in the user guide, right at the top of the section.

Otherwise lgtm

Can't merge without this, the last open comment (documenting the xfails) was addressed

phofl · 2023-02-10T12:26:42Z

thx @mroeschke and thx @abkosar

abkosar · 2023-02-10T14:50:15Z

Thanks @mroeschke for all your patience! Happy to be involved!

abkosar mentioned this pull request Oct 22, 2022

read_json engine argument integration #49041

Closed

5 tasks

WillAyd reviewed Oct 23, 2022

pandas/tests/io/json/test_readlines.py (outdated, resolved)

twoertwein reviewed Oct 23, 2022

pandas/io/json/arrow_json_parser_wrapper.py (outdated, resolved)

mroeschke added the "IO JSON" (read_json, to_json, json_normalize) and "Arrow" (pyarrow functionality) labels Oct 24, 2022

mroeschke reviewed Oct 24, 2022

pandas/io/json/_json.py (outdated, resolved)

mroeschke reviewed Oct 24, 2022

pandas/io/json/_json.py (resolved)

mroeschke reviewed Oct 24, 2022

pandas/io/json/_json.py (outdated, resolved)

abkosar force-pushed the main branch from 87a2b9a to 4f69b94 Compare November 13, 2022 07:28

mroeschke reviewed Nov 17, 2022

pandas/io/json/_json.py (outdated, resolved)

mroeschke reviewed Nov 17, 2022

pandas/io/json/_json.py (outdated, resolved)

mroeschke reviewed Nov 17, 2022

pandas/io/json/arrow_json_parser_wrapper.py (outdated, resolved)

mroeschke reviewed Nov 17, 2022

pandas/io/json/arrow_json_parser_wrapper.py (outdated, resolved)

abkosar added a commit to abkosar/pandas that referenced this pull request Dec 3, 2022

ENH: read_json engine keyword and pyarrow integration (pandas-dev#49249)

5adf8a3

abkosar force-pushed the main branch from 4f69b94 to 5adf8a3 Compare December 3, 2022 04:11

abkosar force-pushed the main branch from 5adf8a3 to 8011856 Compare December 13, 2022 02:45

abkosar force-pushed the main branch from cb04f60 to b85057e Compare December 13, 2022 07:24

WillAyd previously requested changes Dec 14, 2022

pandas/io/json/_json.py (outdated, resolved)

pandas/tests/io/json/test_readlines.py (outdated, resolved)

pandas/tests/io/json/test_readlines.py (outdated, resolved)

abkosar added 2 commits February 1, 2023 14:52

Merge branch 'pandas-dev:main' into main

569ab9b

Merge branch 'pandas-dev:main' into main

4dc9adc

abkosar added 2 commits February 2, 2023 09:49

- Added the logic for skipping test if pyarrow is not installed.

38bc7db

- Added reason to fail to each pyarrow xfail test.

Added else statement to conftest engine

bed15df

mroeschke reviewed Feb 2, 2023

pandas/tests/io/json/test_readlines.py (outdated, resolved)

abkosar added 5 commits February 2, 2023 13:17

Merge branch 'main' into main

a29e96a

- removed xfail decorators

cdfd747

- removed xfail fixture from conftest.py

Merge branch 'main' into test-fixes

c70c0b4

Merge branch 'main' into main

fe2b3ef

Merge branch 'main' into main

228ca64

mroeschke reviewed Feb 3, 2023

pandas/io/json/_json.py (resolved)

mroeschke reviewed Feb 3, 2023

abkosar and others added 3 commits February 4, 2023 17:38

Merge branch 'pandas-dev:main' into main

d1acc94

Merge remote-tracking branch 'upstream/main' into abkosar/main

0885f07

add whatsnew, address comments

ab7af44

mroeschke added this to the 2.0 milestone Feb 9, 2023

phofl requested changes Feb 9, 2023

pandas/io/json/_json.py (resolved)

pandas/io/json/_json.py (outdated, resolved)

pandas/tests/io/json/test_pandas.py (resolved)

mroeschke added 2 commits February 9, 2023 14:34

Merge remote-tracking branch 'upstream/main' into abkosar/main

c9cde9e

address review

8c96553

mroeschke added 2 commits February 9, 2023 15:25

Add note about param

9cbf598

Add test with lines=false

c59310b

phofl approved these changes Feb 10, 2023

phofl merged commit 94f9412 into pandas-dev:main Feb 10, 2023
