ENH: Added to_json_schema #14904

Merged
merged 3 commits into from
Mar 4, 2017
Changes from 2 commits
2 changes: 2 additions & 0 deletions ci/requirements-2.7.pip
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,5 @@ pathlib
backports.lzma
py
PyCrypto
mock
ipython
1 change: 1 addition & 0 deletions ci/requirements-3.5.run
Expand Up @@ -18,3 +18,4 @@ pymysql
psycopg2
s3fs
beautifulsoup4
ipython
1 change: 1 addition & 0 deletions ci/requirements-3.6.run
Expand Up @@ -18,3 +18,4 @@ pymysql
beautifulsoup4
s3fs
xarray
ipython
1 change: 1 addition & 0 deletions doc/source/api.rst
Expand Up @@ -60,6 +60,7 @@ JSON
:toctree: generated/

json_normalize
build_table_schema

.. currentmodule:: pandas

Expand Down
120 changes: 120 additions & 0 deletions doc/source/io.rst
Expand Up @@ -2033,6 +2033,126 @@ using Hadoop or Spark.
df
df.to_json(orient='records', lines=True)


Contributor

you may want a ref tag here

.. _io.table_schema:

Table Schema
''''''''''''
Contributor

I'm ❤️ these additions to the docs.


Contributor

versionadded 0.20.0

.. versionadded:: 0.20.0

`Table Schema`_ is a spec for describing tabular datasets as a JSON
Contributor

I think orient=table is just fine, it's not ambiguous at all.

object. The JSON includes information on the field names, types, and
other attributes. You can use the orient ``table`` to build
Contributor

orient='table'

a JSON string with two fields, ``schema`` and ``data``.

.. ipython:: python

df = pd.DataFrame(
{'A': [1, 2, 3],
'B': ['a', 'b', 'c'],
'C': pd.date_range('2016-01-01', freq='d', periods=3),
}, index=pd.Index(range(3), name='idx'))
df
df.to_json(orient='table', date_format="iso")

The ``schema`` field contains the ``fields`` key, which itself contains
a list of column name to type pairs, including the ``Index`` or ``MultiIndex``
(see below for a list of types).
The ``schema`` field also contains a ``primaryKey`` field if the (Multi)index
is unique.

The second field, ``data``, contains the serialized data with the ``records``
orient.
The index is included, and any datetimes are ISO 8601 formatted, as required
by the Table Schema spec.
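The structure described above can be sketched in plain Python. This is an illustrative stand-in, not the pandas implementation: `build_schema_sketch` and its parameters are hypothetical names, and the field types are passed in by hand rather than inferred from dtypes. It only shows the two rules stated in the text: the index is listed alongside the columns in ``fields``, and ``primaryKey`` appears only when the index labels are unique.

```python
import json


def build_schema_sketch(index_name, index_type, index_values, columns):
    """Build a Table-Schema-like dict (hypothetical helper, not pandas code).

    ``columns`` is a list of (name, table_schema_type) pairs.
    """
    # The index is listed first, followed by the columns.
    fields = [{"name": index_name, "type": index_type}]
    fields += [{"name": name, "type": typ} for name, typ in columns]
    schema = {"fields": fields}
    # primaryKey is only emitted when the index labels are unique.
    if len(set(index_values)) == len(index_values):
        schema["primaryKey"] = [index_name]
    return schema


unique = build_schema_sketch("idx", "integer", [0, 1, 2],
                             [("A", "integer"), ("B", "string")])
dupes = build_schema_sketch("idx", "integer", [1, 1], [("A", "integer")])

print(json.dumps(unique, indent=2))
print("primaryKey" in dupes)
```

The second call mirrors the duplicated-index case: the schema is still produced, just without a ``primaryKey`` entry.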

The full list of supported types is described in the Table Schema
spec. This table shows the mapping from pandas types:

=============== =================
Pandas type     Table Schema type
=============== =================
int64           integer
float64         number
bool            boolean
datetime64[ns]  datetime
timedelta64[ns] duration
categorical     any
object          string
=============== =================
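The mapping can also be written as a plain lookup. This is a sketch for illustration only: `PANDAS_TO_TABLE_SCHEMA` and `table_schema_type` are hypothetical names, not part of pandas. The ``object`` entry uses ``string``, the type name that appears in the ``to_json`` docstring example in this PR, and the categorical row is keyed by the dtype name ``category``.

```python
# Hypothetical lookup table mirroring the documented mapping.
PANDAS_TO_TABLE_SCHEMA = {
    "int64": "integer",
    "float64": "number",
    "bool": "boolean",
    "datetime64[ns]": "datetime",
    "timedelta64[ns]": "duration",
    "category": "any",     # listed as "categorical" in the table
    "object": "string",    # as in the to_json docstring example
}


def table_schema_type(dtype_name):
    """Return the Table Schema type for a dtype name.

    Raises KeyError for dtypes the table does not list, rather than
    guessing a fallback.
    """
    return PANDAS_TO_TABLE_SCHEMA[dtype_name]


for name in ("int64", "datetime64[ns]", "object"):
    print(name, "->", table_schema_type(name))
```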

A few notes on the generated table schema:

- The ``schema`` object contains a ``pandas_version`` field. This contains
the version of pandas' dialect of the schema, and will be incremented
with each revision.
- All dates are converted to UTC when serializing, even timezone naïve values,
which are treated as UTC with an offset of 0.

.. ipython:: python

from pandas.io.json import build_table_schema
s = pd.Series(pd.date_range('2016', periods=4))
build_table_schema(s)

- Datetimes with a timezone (before serializing) include an additional field
``tz`` with the time zone name (e.g. ``'US/Central'``).

.. ipython:: python

s_tz = pd.Series(pd.date_range('2016', periods=12,
tz='US/Central'))
build_table_schema(s_tz)

- Periods are converted to timestamps before serialization, and so have the
same behavior of being converted to UTC. In addition, periods will contain
an additional field ``freq`` with the period's frequency, e.g. ``'A-DEC'``.

.. ipython:: python

s_per = pd.Series(1, index=pd.period_range('2016', freq='A-DEC',
periods=4))
build_table_schema(s_per)

- Categoricals use the ``any`` type and an ``enum`` constraint listing
the set of possible values. Additionally, an ``ordered`` field is included.

.. ipython:: python

s_cat = pd.Series(pd.Categorical(['a', 'b', 'a']))
build_table_schema(s_cat)

- A ``primaryKey`` field, containing an array of labels, is included
*if the index is unique*:

.. ipython:: python

s_dupe = pd.Series([1, 2], index=[1, 1])
build_table_schema(s_dupe)

- The ``primaryKey`` behavior is the same with MultiIndexes, but in this
case the ``primaryKey`` is an array:

.. ipython:: python

s_multi = pd.Series(1, index=pd.MultiIndex.from_product([('a', 'b'),
(0, 1)]))
build_table_schema(s_multi)

- The default naming roughly follows these rules:

+ For series, the ``object.name`` is used. If that is ``None``, then the
name is ``values``.
+ For DataFrames, the stringified version of the column name is used
+ For ``Index`` (not ``MultiIndex``), ``index.name`` is used, with a
fallback to ``index`` if that is None.
+ For ``MultiIndex``, ``mi.names`` is used. If any level has no name,
then ``level_<i>`` is used.
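The naming rules in the list above can be sketched as three small functions. The function names are hypothetical and this is not the pandas implementation; it only encodes the fallbacks as stated (``values`` for an unnamed Series, ``index`` for an unnamed ``Index``, ``level_<i>`` for unnamed ``MultiIndex`` levels).

```python
def series_field_name(name):
    """Series: use the name, falling back to 'values' when it is None."""
    return "values" if name is None else name


def column_field_name(label):
    """DataFrame columns: use the stringified column label."""
    return str(label)


def index_field_names(names):
    """Index levels: ``names`` has one entry per level.

    A single-level index falls back to 'index'; a multi-level index
    falls back to 'level_<i>' for each unnamed level.
    """
    if len(names) == 1:
        return ["index" if names[0] is None else names[0]]
    return ["level_{}".format(i) if name is None else name
            for i, name in enumerate(names)]


print(series_field_name(None))
print(column_field_name(0))
print(index_field_names([None]))
print(index_field_names(["a", None]))
```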


.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/

HTML
----

Expand Down
20 changes: 20 additions & 0 deletions doc/source/options.rst
Expand Up @@ -392,6 +392,9 @@ display.width 80 Width of the display in characters.
IPython qtconsole, or IDLE do not run in a
terminal and hence it is not possible
to correctly detect the width.
display.html.table_schema False Whether to publish a Table Schema
representation for frontends that
Contributor

hmm, didn't realize we have a html. namespace here. and of course we have display.notebook.html_repr which should prob also be under html namespace. I am liking this display.html.* namespace.

Member

Note that this introduces a new namespace for just this (as there is an 'html' namespace, but that is not under display, that is just pd.options.html, see the html.border below this).

Maybe we could also call it 'notebook' to have it different than the 'html' (although that is probably a bad name because it is not necessarily restricted to notebook?)

support it.
html.border 1 A ``border=value`` attribute is
inserted in the ``<table>`` tag
for the DataFrame HTML repr.
Expand Down Expand Up @@ -507,3 +510,20 @@ Enabling ``display.unicode.ambiguous_as_wide`` lets pandas to figure these chara

pd.set_option('display.unicode.east_asian_width', False)
pd.set_option('display.unicode.ambiguous_as_wide', False)

.. _options.table_schema:

Table Schema Display
--------------------

.. versionadded:: 0.20.0

``DataFrame`` and ``Series`` can publish a Table Schema representation.
This is disabled by default and can be enabled globally with the
``display.html.table_schema`` option:

.. ipython:: python

pd.set_option('display.html.table_schema', True)

Only the first ``display.max_rows`` rows are serialized and published.
35 changes: 35 additions & 0 deletions doc/source/whatsnew/v0.20.0.txt
Expand Up @@ -12,6 +12,7 @@ Highlights include:
- Building pandas for development now requires ``cython >= 0.23`` (:issue:`14831`)
- The ``.ix`` indexer has been deprecated, see :ref:`here <whatsnew_0200.api_breaking.deprecate_ix>`
- Switched the test framework to `pytest`_ (:issue:`13097`)
- A new orient for JSON serialization, ``orient='table'``, that uses the Table Schema spec, see :ref:`here <whatsnew_0200.enhancements.table_schema>`
Contributor

worthwhile to mention better ipython integration here?


.. _pytest: http://doc.pytest.org/en/latest/

Expand Down Expand Up @@ -154,6 +155,40 @@ New Behavior:

df[df.chromosomes != '1'].groupby('chromosomes', sort=False).sum()

.. _whatsnew_0200.enhancements.table_schema:

Contributor

add a ref tag here

Contributor

can add this to highlights as well

Table Schema Output
^^^^^^^^^^^^^^^^^^^

The new orient ``'table'`` for :meth:`DataFrame.to_json`
will generate a `Table Schema`_ compatible string representation of
the data.

.. ipython:: python

df = pd.DataFrame(
{'A': [1, 2, 3],
'B': ['a', 'b', 'c'],
'C': pd.date_range('2016-01-01', freq='d', periods=3),
}, index=pd.Index(range(3), name='idx'))
df
df.to_json(orient='table')
Member

This will raise an error, as you didn't specify date_format='iso'.
But it seems a bit unfortunate that it should be also passed as well, when using the table orient.

Contributor Author

Yeah, I don't like it either. How about this: I change the default date_format to None in core/generic.py. Then for the table schema writer the default is iso, and in every other writer the default is epoch. I just didn't want to ignore the user's input.

Contributor

see my comment above about this. you can make the default date_format=None then check / validate in the code itself.



See :ref:`IO: Table Schema <io.table_schema>` for more.

Additionally, the repr for ``DataFrame`` and ``Series`` can now publish
this JSON Table Schema representation of the Series or DataFrame if you are
using IPython (or another frontend like `nteract`_ using the Jupyter messaging
protocol).
This gives frontends like the Jupyter notebook and `nteract`_
more flexibility in how they display pandas objects, since they have
more information about the data.
You must enable this by setting the ``display.html.table_schema`` option to ``True``.

.. _Table Schema: http://specs.frictionlessdata.io/json-table-schema/
.. _nteract: http://nteract.io/

.. _whatsnew_0200.enhancements.other:

Other enhancements
Expand Down
9 changes: 9 additions & 0 deletions pandas/core/config_init.py
Expand Up @@ -164,6 +164,13 @@
(default: False)
"""

pc_table_schema_doc = """
: boolean
Whether to publish a Table Schema representation for frontends
that support it.
(default: False)
"""

pc_line_width_deprecation_warning = """\
line_width has been deprecated, use display.width instead (currently both are
identical)
Expand Down Expand Up @@ -339,6 +346,8 @@ def mpl_style_cb(key):
validator=is_bool)
cf.register_option('latex.longtable', False, pc_latex_longtable,
validator=is_bool)
cf.register_option('html.table_schema', False, pc_table_schema_doc,
validator=is_bool)
Contributor

I think you changed the docs, but not this


cf.deprecate_option('display.line_width',
msg=pc_line_width_deprecation_warning,
Expand Down
86 changes: 82 additions & 4 deletions pandas/core/generic.py
Expand Up @@ -4,6 +4,7 @@
import operator
import weakref
import gc
import json

import numpy as np
import pandas.lib as lib
Expand Down Expand Up @@ -129,6 +130,37 @@ def __init__(self, data, axes=None, copy=False, dtype=None,
object.__setattr__(self, '_data', data)
object.__setattr__(self, '_item_cache', {})

def _ipython_display_(self):
try:
from IPython.display import display
except ImportError:
return None

# Series doesn't define _repr_html_ or _repr_latex_
latex = self._repr_latex_() if hasattr(self, '_repr_latex_') else None
html = self._repr_html_() if hasattr(self, '_repr_html_') else None
table_schema = self._repr_table_schema_()
Contributor

Should the config.get_option check happen here so it doesn't end up as:

"application/vnd.dataresource+json": None

in the resulting output?

Oh nevermind, I see the if v in the dict comprehension.

# We need the initial newline since we aren't going through the
# usual __repr__. See
# https://github.com/pandas-dev/pandas/pull/14904#issuecomment-277829277
text = "\n" + repr(self)

reprs = {"text/plain": text, "text/html": html, "text/latex": latex,
"application/vnd.dataresource+json": table_schema}
reprs = {k: v for k, v in reprs.items() if v}
display(reprs, raw=True)
Contributor

Would the best way to test this to be mocking IPython.display.display?

Contributor Author

Contributor

Weird, when I was viewing this the codecov extension was showing this segment as not covered.

Contributor Author

Good catch! We didn't have IPython installed in the build that runs the coverage report, so it was skipped. Just pushed a commit adding it.
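The thread above asks whether mocking ``IPython.display.display`` is the best way to test this. A minimal sketch of that idea follows, assuming the display callable is passed in so the example runs without IPython installed. `publish_reprs` is a hypothetical stand-in for the body of ``_ipython_display_``, not the code under review; it reuses the same dict comprehension to drop empty representations.

```python
from unittest import mock


def publish_reprs(display, text, html=None, latex=None, table_schema=None):
    """Hypothetical stand-in for the bundle-building part of the method."""
    reprs = {"text/plain": text, "text/html": html, "text/latex": latex,
             "application/vnd.dataresource+json": table_schema}
    # Drop representations that are None or empty, as in the diff.
    reprs = {k: v for k, v in reprs.items() if v}
    display(reprs, raw=True)


# A Mock records the call, so the published bundle can be inspected.
fake_display = mock.Mock()
publish_reprs(fake_display, text="\n0    1")

(bundle,), kwargs = fake_display.call_args
print(sorted(bundle))
print(kwargs)
```

With the table schema option off, ``table_schema`` is ``None`` and the ``application/vnd.dataresource+json`` key never reaches the frontend, which is the behavior the reviewer confirmed in the comprehension.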


def _repr_table_schema_(self):
"""
Not a real Jupyter special repr method, but we use the same
naming convention.
Contributor

😄 one step towards general adoption I think. 😉

"""
if config.get_option("display.html.table_schema"):
data = self.head(config.get_option('display.max_rows'))
payload = json.loads(data.to_json(orient='table'),
object_pairs_hook=collections.OrderedDict)
return payload
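The ``object_pairs_hook=collections.OrderedDict`` argument used above makes ``json.loads`` build ordered mappings at every nesting level, so keys keep the order in which they appear in the serialized string. A small standard-library sketch, using a hand-written payload rather than real pandas output:

```python
import collections
import json

# Hand-written example string, shaped like the 'table' orient output.
raw = '{"schema": {"fields": [], "pandas_version": "0.20.0"}, "data": []}'

payload = json.loads(raw, object_pairs_hook=collections.OrderedDict)

print(list(payload))                      # → ['schema', 'data']
print(type(payload["schema"]).__name__)   # → OrderedDict
```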

def _validate_dtype(self, dtype):
""" validate the passed dtype """

Expand Down Expand Up @@ -1094,7 +1126,7 @@ def __setstate__(self, state):
strings before writing.
"""

def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
def to_json(self, path_or_buf=None, orient=None, date_format=None,
double_precision=10, force_ascii=True, date_unit='ms',
default_handler=None, lines=False):
"""
Expand Down Expand Up @@ -1129,10 +1161,17 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',
- index : dict like {index -> {column -> value}}
- columns : dict like {column -> {index -> value}}
- values : just the values array
- table : dict like {'schema': {schema}, 'data': {data}}
describing the data, and the data component is
like ``orient='records'``.

date_format : {'epoch', 'iso'}
.. versionchanged:: 0.20.0
Member

The indentation is not fully correct I think, should line up with 'like'


date_format : {None, 'epoch', 'iso'}
Type of date conversion. `epoch` = epoch milliseconds,
`iso`` = ISO8601, default is epoch.
`iso` = ISO8601. The default depends on the `orient`. For
`orient='table'`, the default is `'iso'`. For all other orients,
the default is `'epoch'`.
double_precision : The number of decimal places to use when encoding
floating point values, default 10.
force_ascii : force encoded string to be ASCII, default True.
Expand All @@ -1151,14 +1190,53 @@ def to_json(self, path_or_buf=None, orient=None, date_format='epoch',

.. versionadded:: 0.19.0


Returns
-------
same type as input object with filtered info axis

See Also
--------
pd.read_json

Examples
--------

>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
... index=['row 1', 'row 2'],
... columns=['col 1', 'col 2'])
>>> df.to_json(orient='split')
'{"columns":["col 1","col 2"],
"index":["row 1","row 2"],
"data":[["a","b"],["c","d"]]}'

Encoding/decoding a DataFrame using ``'index'`` formatted JSON:

>>> df.to_json(orient='index')
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'

Encoding/decoding a DataFrame using ``'records'`` formatted JSON.
Note that index labels are not preserved with this encoding.

>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
Contributor

nice examples!


Encoding with Table Schema

>>> df.to_json(orient='table')
'{"schema": {"fields": [{"name": "index", "type": "string"},
{"name": "col 1", "type": "string"},
Contributor

shouldn't the version be here?

{"name": "col 2", "type": "string"}],
Contributor

need to regen this example (with orient='table')

"primaryKey": "index",
"pandas_version": "0.20.0"},
Contributor

is primaryKey standard here? (as opposed to primary_key), are any other keys camelCase?

"data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
{"index": "row 2", "col 1": "c", "col 2": "d"}]}'
"""

from pandas.io import json
if date_format is None and orient == 'table':
date_format = 'iso'
elif date_format is None:
date_format = 'epoch'
return json.to_json(path_or_buf=path_or_buf, obj=self, orient=orient,
date_format=date_format,
double_precision=double_precision,
Expand Down
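The default resolution added to ``to_json`` in this hunk can be isolated into a small function for illustration. `resolve_date_format` is a hypothetical name; the branching is copied from the diff, with an explicit pass-through for user-supplied values.

```python
def resolve_date_format(date_format, orient):
    """Pick the effective date_format the way the new to_json code does."""
    if date_format is None and orient == 'table':
        # Table Schema requires ISO 8601 dates.
        return 'iso'
    if date_format is None:
        # Every other orient keeps the previous default.
        return 'epoch'
    # An explicit value from the caller is never overridden here.
    return date_format


print(resolve_date_format(None, 'table'))
print(resolve_date_format(None, 'records'))
print(resolve_date_format('epoch', 'table'))
```

Using ``None`` as the sentinel is what lets the function tell "the caller did not choose" apart from "the caller chose ``'epoch'``", which is the concern raised in the review thread.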
3 changes: 2 additions & 1 deletion pandas/io/json/__init__.py
@@ -1,4 +1,5 @@
from .json import to_json, read_json, loads, dumps # noqa
from .normalize import json_normalize # noqa
from .table_schema import build_table_schema # noqa

del json, normalize # noqa
del json, normalize, table_schema # noqa