Commit d3ec8b5 (1 parent: decffcf)

doc updates

6 files changed (+53, -35 lines)

ci/requirements-3.6_DOC.sh (+1, -1)

@@ -6,6 +6,6 @@ echo "[install DOC_BUILD deps]"
 
 pip install pandas-gbq
 
-conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc
+conda install -n pandas -c conda-forge feather-format pyarrow nbsphinx pandoc fastparquet
 
 conda install -n pandas -c r r rpy2 --yes

doc/source/io.rst (+18, -18)

@@ -213,7 +213,7 @@ buffer_lines : int, default None
 .. deprecated:: 0.19.0
 
    Argument removed because its value is not respected by the parser
-
+
 compact_ints : boolean, default False
 
 .. deprecated:: 0.19.0

@@ -4093,7 +4093,7 @@ control compression: ``complevel`` and ``complib``.
 ``complevel`` specifies if and how hard data is to be compressed.
 ``complevel=0`` and ``complevel=None`` disables
 compression and ``0<complevel<10`` enables compression.
-
+
 ``complib`` specifies which compression library to use. If nothing is
 specified the default library ``zlib`` is used. A
 compression library usually optimizes for either good

@@ -4108,9 +4108,9 @@ control compression: ``complevel`` and ``complib``.
 - `blosc <http://www.blosc.org/>`_: Fast compression and decompression.
 
 .. versionadded:: 0.20.2
-
+
    Support for alternative blosc compressors:
-
+
    - `blosc:blosclz <http://www.blosc.org/>`_ This is the
      default compressor for ``blosc``
    - `blosc:lz4
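
The ``complevel``/``complib`` options this hunk documents are ordinary ``to_hdf``/``HDFStore`` keywords; a minimal usage sketch (file names illustrative, assumes PyTables is installed):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.randn(10, 2), columns=list('AB'))

    # default zlib library at a moderate level
    df.to_hdf('store_zlib.h5', 'df', complevel=5, complib='zlib')

    # blosc, the fast compressor described above
    df.to_hdf('store_blosc.h5', 'df', complevel=9, complib='blosc')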
@@ -4559,28 +4559,30 @@ Parquet
 
 .. versionadded:: 0.21.0
 
-Parquet provides a sharded binary columnar serialization for data frames. It is designed to make reading and writing data
-frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a
-variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
+`Parquet <https://parquet.apache.org/>`__ provides a partitioned binary columnar serialization for data frames. It is designed to
+make reading and writing data frames efficient, and to make sharing data across data analysis
+languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible
+while still maintaining good read performance.
 
-Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas
-dtypes, including extension dtypes such as categorical and datetime with tz.
+Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
+dtypes, including extension dtypes such as datetime with tz.
 
 Several caveats.
 
 - The format will NOT write an ``Index``, or ``MultiIndex`` for the ``DataFrame`` and will raise an
-  error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index.
+  error if a non-default one is provided. You can simply ``.reset_index(drop=True)`` in order to store the index.
 - Duplicate column names and non-string column names are not supported
+- Categorical dtypes are currently not supported (for ``pyarrow``).
 - Unsupported types include ``Period`` and actual python object types. These will raise a helpful error message
   on an attempt at serialization.
 
+You can specify an ``engine`` to direct the serialization, defaulting to ``pyarrow`` and controlled by the option ``io.parquet.engine``.
 See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__
 
 .. note::
 
    These engines are very similar and should read/write nearly identical parquet format files.
    These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).
-   TODO: differing options to write non-standard columns & null treatment
 
 .. ipython:: python
 
@@ -4589,10 +4591,9 @@ See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and
                      'c': np.arange(3, 6).astype('u1'),
                      'd': np.arange(4.0, 7.0, dtype='float64'),
                      'e': [True, False, True],
-                     'f': pd.Categorical(list('abc')),
-                     'g': pd.date_range('20130101', periods=3),
-                     'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
-                     'i': pd.date_range('20130101', periods=3, freq='ns')})
+                     'f': pd.date_range('20130101', periods=3),
+                     'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+                     'h': pd.date_range('20130101', periods=3, freq='ns')})
 
    df
    df.dtypes

@@ -4608,10 +4609,9 @@ Read from a parquet file.
 
 .. ipython:: python
 
-   result = pd.read_parquet('example_pa.parquet')
-   result = pd.read_parquet('example_fp.parquet')
+   result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
+   result = pd.read_parquet('example_fp.parquet', engine='fastparquet')
 
-   # we preserve dtypes
    result.dtypes
 
 .. ipython:: python
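
Pulling the documented snippets together, a minimal round-trip sketch (assumes pandas >= 0.21.0 with both engines installed; file names mirror the docs):

    import pandas as pd

    df = pd.DataFrame({'a': list('abc'),
                       'b': list(range(1, 4)),
                       'f': pd.date_range('20130101', periods=3)})

    df.to_parquet('example_pa.parquet', engine='pyarrow')
    df.to_parquet('example_fp.parquet', engine='fastparquet')

    # dtypes survive the round trip
    result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
    result.dtypes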

pandas/core/frame.py (+5, -4)

@@ -1601,7 +1601,7 @@ def to_feather(self, fname):
     def to_parquet(self, fname, engine=None, compression='snappy',
                    **kwargs):
         """
-        write out the binary parquet for DataFrames
+        Write a DataFrame to the binary parquet format.
 
         .. versionadded:: 0.21.0
 

@@ -1611,11 +1611,12 @@ def to_parquet(self, fname, engine=None, compression='snappy',
             string file path
         engine : str, optional
             The parquet engine, one of {'pyarrow', 'fastparquet'}
-            if None, will use the option: `io.parquet.engine`
+            If None, will use the option: `io.parquet.engine`, which
+            defaults to 'pyarrow'
         compression : str, optional, default 'snappy'
             compression method, includes {'gzip', 'snappy', 'brotli'}
-        kwargs passed to the engine
-
+        kwargs
+            Additional keyword arguments passed to the engine
         """
         from pandas.io.parquet import to_parquet
         to_parquet(self, fname, engine,
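
A short usage sketch of the signature documented above (output path illustrative):

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3], 'B': 'foo'})

    # engine=None falls back to the io.parquet.engine option ('pyarrow' by default);
    # compression accepts 'snappy' (the default), 'gzip', or 'brotli'
    df.to_parquet('out.parquet', compression='gzip')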

pandas/io/feather_format.py (+2, -2)

@@ -19,7 +19,7 @@ def _try_import():
                           "you can install via conda\n"
                           "conda install feather-format -c conda-forge\n"
                           "or via pip\n"
-                          "pip install feather-format\n")
+                          "pip install -U feather-format\n")
 
     try:
         feather.__version__ >= LooseVersion('0.3.1')

@@ -29,7 +29,7 @@ def _try_import():
                           "you can install via conda\n"
                           "conda install feather-format -c conda-forge"
                           "or via pip\n"
-                          "pip install feather-format\n")
+                          "pip install -U feather-format\n")
 
     return feather
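
The guarded-import pattern in ``_try_import`` generalizes; a hedged sketch with a hypothetical helper (``_require`` is not a pandas function):

    import importlib
    from distutils.version import LooseVersion

    def _require(name, min_version):
        # hypothetical: import lazily, then enforce a minimum version
        try:
            mod = importlib.import_module(name)
        except ImportError:
            raise ImportError("%s is required\n\n"
                              "pip install -U %s\n" % (name, name))
        if LooseVersion(mod.__version__) < LooseVersion(min_version):
            raise ImportError("%s >= %s is required\n\n"
                              "pip install -U %s\n" % (name, min_version, name))
        return mod

    feather = _require('feather', '0.3.1')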

pandas/io/parquet.py (+8, -6)

@@ -36,15 +36,15 @@ def __init__(self):
                               "you can install via conda\n"
                               "conda install pyarrow -c conda-forge\n"
                               "\nor via pip\n"
-                              "pip install pyarrow\n")
+                              "pip install -U pyarrow\n")
 
         if LooseVersion(pyarrow.__version__) < '0.4.1':
             raise ImportError("pyarrow >= 0.4.1 is required for parquet"
                               "support\n\n"
                               "you can install via conda\n"
                               "conda install pyarrow -c conda-forge\n"
                               "\nor via pip\n"
-                              "pip install pyarrow\n")
+                              "pip install -U pyarrow\n")
 
         self.api = pyarrow
 

@@ -72,15 +72,15 @@ def __init__(self):
                               "you can install via conda\n"
                               "conda install fastparquet -c conda-forge\n"
                               "\nor via pip\n"
-                              "pip install fastparquet")
+                              "pip install -U fastparquet")
 
         if LooseVersion(fastparquet.__version__) < '0.1.0':
             raise ImportError("fastparquet >= 0.1.0 is required for parquet "
                               "support\n\n"
                               "you can install via conda\n"
                               "conda install fastparquet -c conda-forge\n"
                               "\nor via pip\n"
-                              "pip install fastparquet")
+                              "pip install -U fastparquet")
 
         self.api = fastparquet
 

@@ -109,10 +109,12 @@ def to_parquet(df, path, engine=None, compression='snappy', **kwargs):
         File path
     engine : str, optional
         The parquet engine, one of {'pyarrow', 'fastparquet'}
-        if None, will use the option: `io.parquet.engine`
+        If None, will use the option: `io.parquet.engine`, which
+        defaults to 'pyarrow'
     compression : str, optional, default 'snappy'
         compression method, includes {'gzip', 'snappy', 'brotli'}
-    kwargs are passed to the engine
+    kwargs
+        Additional keyword arguments passed to the engine
     """
 
     impl = get_engine(engine)
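
The option fallback documented here can be exercised directly; a small sketch (assumes fastparquet is installed and the ``io.parquet.engine`` option this change set documents):

    import pandas as pd

    df = pd.DataFrame({'A': [1, 2, 3]})

    # set the default engine once, then omit engine= in later calls
    pd.set_option('io.parquet.engine', 'fastparquet')
    df.to_parquet('out_fp.parquet')            # engine=None -> uses the option
    result = pd.read_parquet('out_fp.parquet')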

pandas/tests/io/test_parquet.py (+19, -4)

@@ -54,6 +54,20 @@ def df_compat():
     return pd.DataFrame({'A': [1, 2, 3], 'B': 'foo'})
 
 
+@pytest.fixture
+def df_cross_compat():
+    df = pd.DataFrame({'a': list('abc'),
+                       'b': list(range(1, 4)),
+                       'c': np.arange(3, 6).astype('u1'),
+                       'd': np.arange(4.0, 7.0, dtype='float64'),
+                       'e': [True, False, True],
+                       'f': pd.date_range('20130101', periods=3),
+                       'g': pd.date_range('20130101', periods=3,
+                                          tz='US/Eastern'),
+                       'h': pd.date_range('20130101', periods=3, freq='ns')})
+    return df
+
+
 def test_invalid_engine(df_compat):
 
     with pytest.raises(ValueError):

@@ -87,21 +101,22 @@ def test_options_fp(df_compat, fp):
 
 
 @pytest.mark.xfail(reason="fp does not ignore pa index __index_level_0__")
-def test_cross_engine_pa_fp(df_compat, pa, fp):
+def test_cross_engine_pa_fp(df_cross_compat, pa, fp):
     # cross-compat with differing reading/writing engines
 
-    df = df_compat
+    df = df_cross_compat
     with tm.ensure_clean() as path:
         df.to_parquet(path, engine=pa, compression=None)
 
         result = read_parquet(path, engine=fp, compression=None)
         tm.assert_frame_equal(result, df)
 
 
-def test_cross_engine_fp_pa(df_compat, pa, fp):
+@pytest.mark.xfail(reason="pyarrow reading fp in some cases")
+def test_cross_engine_fp_pa(df_cross_compat, pa, fp):
     # cross-compat with differing reading/writing engines
 
-    df = df_compat
+    df = df_cross_compat
     with tm.ensure_clean() as path:
         df.to_parquet(path, engine=fp, compression=None)
 
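For reference, the ``xfail`` markers added here record known cross-engine gaps as expected failures rather than hard errors; a minimal sketch of the mechanism:

    import pytest

    @pytest.mark.xfail(reason="illustrative: known interop gap")
    def test_known_gap():
        # reported as XFAIL when it fails, XPASS if it unexpectedly passes
        assert False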