ENH: Adding json line parsing to pd.read_json #9180 #13351
Changes from 13 commits:
47c60a6, a8dd0ef, f71d011, fc865c4, e8f10ea, 3c796a9, 6861a71, c76dafe, b20798a, f547b0d, ae19f04, ac7b687, f7c3bbf, 37252c6, e635318, 32a2f8d
@@ -1466,6 +1466,7 @@ with optional parameters:
- ``force_ascii`` : force encoded string to be ASCII, default True.
- ``date_unit`` : The time unit to encode to; governs timestamp and ISO8601 precision. One of 's', 'ms', 'us' or 'ns' for seconds, milliseconds, microseconds and nanoseconds respectively. Default 'ms'.
- ``default_handler`` : The handler to call if an object cannot otherwise be converted to a suitable format for JSON. Takes a single argument, which is the object to convert, and returns a serializable object.
- ``lines`` : If ``records`` orient, write each record as one line of json. Default False.

Note ``NaN``'s, ``NaT``'s and ``None`` will be converted to ``null`` and ``datetime`` objects will be converted based on the ``date_format`` and ``date_unit`` parameters.
@@ -1656,6 +1657,8 @@ is ``None``. To explicitly force ``Series`` parsing, pass ``typ=series``
  None. By default the timestamp precision will be detected; if this is not desired
  then pass one of 's', 'ms', 'us' or 'ns' to force timestamp precision to
  seconds, milliseconds, microseconds or nanoseconds respectively.
- ``lines`` : reads the file as one json object per line.
- ``encoding`` : The encoding to use to decode py3 bytes.

The parser will raise one of ``ValueError/TypeError/AssertionError`` if the JSON is not parseable.
@@ -1845,6 +1848,25 @@ into a flat table.

  json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])

Line delimited json
'''''''''''''''''''

.. versionadded:: 0.19.0

pandas is able to read and write line-delimited json files that are common in data
processing pipelines using Hadoop or Spark.

.. ipython:: python

   import pandas as pd

[Review comment] don't need the import here
[Reply] ah good catch.

   jsonl = '''
   {"a":1,"b":2}
   {"a":3,"b":4}
   '''
   df = pd.read_json(jsonl, lines=True)
   df
   df.to_json(orient='records', lines=True)
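The example above uses pandas; the mechanics of the line-delimited format itself can be shown with only the standard library. This is a sketch — the variable names and the compact `separators` choice are illustrative, not part of the PR:

```python
import json
from io import StringIO

jsonl = '{"a":1,"b":2}\n{"a":3,"b":4}\n'

# Reading: each line is an independent JSON document.
records = [json.loads(line) for line in StringIO(jsonl) if line.strip()]

# Writing: one json.dumps per record, joined by newlines.
out = '\n'.join(json.dumps(r, separators=(',', ':'), sort_keys=True)
                for r in records)
```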

HTML
----
@@ -254,6 +254,7 @@ Other enhancements

.. _whatsnew_0190.api:


API changes
~~~~~~~~~~~
@@ -271,7 +272,7 @@ API changes
- ``__setitem__`` will no longer apply a callable rhs as a function instead of storing it. Call ``where`` directly to get the previous behavior. (:issue:`13299`)
- Passing ``Period`` with multiple frequencies to normal ``Index`` now returns ``Index`` with ``object`` dtype (:issue:`13664`)
- ``PeriodIndex.fillna`` with ``Period`` has different freq now coerces to ``object`` dtype (:issue:`13664`)
- ``pd.read_json`` and ``DataFrame.to_json`` have gained support for reading and writing json lines with the ``lines`` option (:issue:`9180`)

[Review comment] add a pointer to the new docs
[Reply] will do

.. _whatsnew_0190.api.tolist:
@@ -1016,7 +1016,7 @@ def __setstate__(self, state):

     def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
                 double_precision=10, force_ascii=True, date_unit='ms',
-                default_handler=None):
+                default_handler=None, lines=False):
         """
         Convert the object to a JSON string.
@@ -1064,6 +1064,13 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
            Handler to call if object cannot otherwise be converted to a
            suitable format for JSON. Should receive a single argument which is
            the object to convert and return a serialisable object.
        lines : boolean, defalut False

[Review comment] defalut -> default

            If 'orient' is 'records' write out line delimited json format. Will
            throw ValueError if incorrect 'orient' since others are not list
            like.

            .. versionadded:: 0.19.0

        Returns
        -------
@@ -1076,7 +1083,8 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
                            date_format=date_format,
                            double_precision=double_precision,
                            force_ascii=force_ascii, date_unit=date_unit,
-                           default_handler=default_handler)
+                           default_handler=default_handler,
+                           lines=lines)

     def to_hdf(self, path_or_buf, key, **kwargs):
         """Activate the HDFStore.
@@ -7,10 +7,10 @@

 import pandas.json as _json
 from pandas.tslib import iNaT
-from pandas.compat import long, u
+from pandas.compat import StringIO, long, u
 from pandas import compat, isnull
 from pandas import Series, DataFrame, to_datetime
-from pandas.io.common import get_filepath_or_buffer
+from pandas.io.common import get_filepath_or_buffer, _get_handle
 from pandas.core.common import AbstractMethodError
 from pandas.formats.printing import pprint_thing
@@ -22,7 +22,11 @@

 def to_json(path_or_buf, obj, orient=None, date_format='epoch',
             double_precision=10, force_ascii=True, date_unit='ms',
-            default_handler=None):
+            default_handler=None, lines=False):
+
+    if lines and orient != 'records':
+        raise ValueError(
+            "'lines' keyword only valid when 'orient' is records")

     if isinstance(obj, Series):
         s = SeriesWriter(
@@ -37,6 +41,22 @@ def to_json(path_or_buf, obj, orient=None, date_format='epoch',
     else:
         raise NotImplementedError("'obj' should be a Series or a DataFrame")

+    if lines and s[0] == '[' and s[-1] == ']':  # Determine we have a JSON

[Review comment] can you add a comment on what you are doing / why
[Reply] more so than the comments on the side? (Determining if we have a JSON list to turn to lines.) I guess I could say that if we are given a different json object, then line delimited json doesn't make sense.

+        s = s[1:-1]                      # list to turn to lines
+        num_open_brackets_seen = 0
+        commas_to_replace = []
+        for idx, char in enumerate(s):   # iter through to find all
+            if char == ',':              # commas that should be \n
+                if num_open_brackets_seen == 0:
+                    commas_to_replace.append(idx)
+            elif char == '{':
+                num_open_brackets_seen += 1
+            elif char == '}':
+                num_open_brackets_seen -= 1
+        s_arr = np.array(list(s))        # Turn to an array to set
+        s_arr[commas_to_replace] = '\n'  # all commas at once.
+        s = ''.join(s_arr)
+
     if isinstance(path_or_buf, compat.string_types):
         with open(path_or_buf, 'w') as fh:
             fh.write(s)
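The brace-depth loop above can be exercised in isolation. A minimal stdlib-only sketch of the same technique (the function name is illustrative, and, like the PR code, it assumes no commas or braces appear inside string values):

```python
def records_to_lines(s):
    # Split a serialized JSON array of objects, e.g. '[{...},{...}]',
    # into newline-delimited records by replacing only the commas that
    # sit at brace depth zero, i.e. between top-level objects.
    assert s[0] == '[' and s[-1] == ']'
    body = s[1:-1]  # drop the enclosing brackets
    out = []
    depth = 0
    for ch in body:
        if ch == '{':
            depth += 1
            out.append(ch)
        elif ch == '}':
            depth -= 1
            out.append(ch)
        elif ch == ',' and depth == 0:
            out.append('\n')  # top-level separator becomes a newline
        else:
            out.append(ch)
    return ''.join(out)
```

Commas nested inside objects are left untouched because `depth` is nonzero there, so nested records survive intact.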
@@ -105,7 +125,8 @@ def _format_axes(self):

 def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
               convert_axes=True, convert_dates=True, keep_default_dates=True,
-              numpy=False, precise_float=False, date_unit=None):
+              numpy=False, precise_float=False, date_unit=None, encoding=None,
+              lines=False):
     """
     Convert a JSON string to pandas object
@@ -178,13 +199,23 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
        is to try and detect the correct precision, but if this is not desired
        then pass one of 's', 'ms', 'us' or 'ns' to force parsing only seconds,
        milliseconds, microseconds or nanoseconds respectively.
    lines : boolean, default False

[Review comment] doesn't this also need records format?

        Read the file as a json object per line.

[Review comment] add a versionadded 0.18.2

        .. versionadded:: 0.19.0

    encoding : str, default is 'utf-8'
        The encoding to use to decode py3 bytes.

        .. versionadded:: 0.19.0

    Returns
    -------
    result : Series or DataFrame
    """
-    filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf)
+    filepath_or_buffer, _, _ = get_filepath_or_buffer(path_or_buf,
+                                                      encoding=encoding)
     if isinstance(filepath_or_buffer, compat.string_types):
         try:
             exists = os.path.exists(filepath_or_buffer)

@@ -195,7 +226,7 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
             exists = False

         if exists:
-            with open(filepath_or_buffer, 'r') as fh:
+            with _get_handle(filepath_or_buffer, 'r', encoding=encoding) as fh:
                 json = fh.read()
         else:
             json = filepath_or_buffer
@@ -204,6 +235,12 @@ def read_json(path_or_buf=None, orient=None, typ='frame', dtype=True,
     else:
         json = filepath_or_buffer

     if lines:
         # If given a json lines file, we break the string into lines, add
         # commas and put it in a json list to make a valid json object.
         lines = list(StringIO(json.strip()))
         json = u'[' + u','.join(lines) + u']'

     obj = None
     if typ == 'frame':
         obj = FrameParser(json, orient, dtype, convert_axes, convert_dates,
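The `if lines:` branch above is the inverse of the writer: join one-object-per-line input with commas and wrap it in brackets so the ordinary JSON parser can consume it whole. A standalone sketch of that transformation (the helper name is illustrative):

```python
import json
from io import StringIO

def lines_to_json_array(text):
    # Each input line is one JSON object; keep the newlines (JSON
    # tolerates whitespace), insert commas, and wrap in a list.
    lines = list(StringIO(text.strip()))  # splits on newlines
    return u'[' + u','.join(lines) + u']'

parsed = json.loads(lines_to_json_array('{"a": 1}\n{"a": 2}\n'))
```

Note that trailing newlines on each line are harmless: they end up as whitespace before the inserted commas, which the parser ignores.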
@@ -948,6 +948,58 @@ def test_tz_range_is_utc(self):
         df = DataFrame({'DT': dti})
         self.assertEqual(dfexp, pd.json.dumps(df, iso_dates=True))

     def test_read_jsonl(self):

[Review comment] can you add some tests that assert ValueError if invalid combination of lines=True and orient?

         # GH9180
         result = read_json('{"a": 1, "b": 2}\n{"b":2, "a" :1}\n', lines=True)
         expected = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
         assert_frame_equal(result, expected)

     def test_to_jsonl(self):
         # GH9180
         df = DataFrame([[1, 2], [1, 2]], columns=['a', 'b'])
         result = df.to_json(orient="records", lines=True)
         expected = '{"a":1,"b":2}\n{"a":1,"b":2}'
         self.assertEqual(result, expected)
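The reviewer asks for a ValueError test on the invalid `lines`/`orient` combination. A pandas-free sketch of both the guard and the kind of assertion requested (`to_json_lines` is a hypothetical stand-in mirroring the PR's guard clause, not pandas API):

```python
import json

def to_json_lines(records, orient='records', lines=False):
    # Mirrors the PR's guard: line-delimited output is only well-defined
    # for the list-like 'records' orient.
    if lines and orient != 'records':
        raise ValueError("'lines' keyword only valid when 'orient' is records")
    body = [json.dumps(r, separators=(',', ':'), sort_keys=True)
            for r in records]
    return '\n'.join(body) if lines else '[' + ','.join(body) + ']'

# The invalid combination should raise, as the reviewer requests.
try:
    to_json_lines([{'a': 1}], orient='split', lines=True)
    raised = False
except ValueError:
    raised = True
```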
     def test_latin_encoding(self):
         if compat.PY2:
             self.assertRaisesRegexp(
                 TypeError, '\[unicode\] is not implemented as a table column')
             return

         values = [[b'E\xc9, 17', b'', b'a', b'b', b'c'],
                   [b'E\xc9, 17', b'a', b'b', b'c'],
                   [b'EE, 17', b'', b'a', b'b', b'c'],
                   [b'E\xc9, 17', b'\xf8\xfc', b'a', b'b', b'c'],
                   [b'', b'a', b'b', b'c'],
                   [b'\xf8\xfc', b'a', b'b', b'c'],
                   [b'A\xf8\xfc', b'', b'a', b'b', b'c'],
                   [np.nan, b'', b'b', b'c'],
                   [b'A\xf8\xfc', np.nan, b'', b'b', b'c']]

         def _try_decode(x, encoding='latin-1'):
             try:
                 return x.decode(encoding)
             except AttributeError:
                 return x

         # not sure how to remove latin-1 from code in python 2 and 3
         values = [[_try_decode(x) for x in y] for y in values]

         examples = []
         for dtype in ['category', object]:
             for val in values:
                 examples.append(pandas.Series(val, dtype=dtype))

[Review comment] The pandas here is not defined, and can just be removed I think (reason for travis failure)

         def roundtrip(s, encoding='latin-1'):
             with ensure_clean('test.json') as path:
                 s.to_json(path, encoding=encoding)

[Review comment] I am confused because it is already used here (encoding keyword), while I don't see it in the docstring/signature of to_json
[Reply] that is a good point!

                 retr = read_json(path, encoding=encoding)
                 assert_series_equal(s, retr, check_categorical=False)

         for s in examples:
             roundtrip(s)


 if __name__ == '__main__':
     import nose
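The round-trip idea this test probes can be demonstrated without pandas: write JSON in an explicit non-UTF-8 encoding, read it back with the same one. A stdlib sketch (Python 3; file path and payload are illustrative):

```python
import json
import os
import tempfile

s = 'E\xc9, 17'  # latin-1 representable text, as in the test data
path = os.path.join(tempfile.mkdtemp(), 'test.json')

# ensure_ascii=False makes the file encoding actually matter;
# with the default ensure_ascii=True everything is escaped to ASCII.
with open(path, 'w', encoding='latin-1') as fh:
    json.dump({'val': s}, fh, ensure_ascii=False)

with open(path, 'r', encoding='latin-1') as fh:
    retr = json.load(fh)['val']
```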
[Review comment] I suppose writing should have encoding as well........?
[Reply] nah, encodings just confuse people =P