@@ -213,7 +213,7 @@ buffer_lines : int, default None
     .. deprecated:: 0.19.0
 
        Argument removed because its value is not respected by the parser
-
+
 compact_ints : boolean, default False
 
     .. deprecated:: 0.19.0
@@ -4093,7 +4093,7 @@ control compression: ``complevel`` and ``complib``.
 ``complevel`` specifies if and how hard data is to be compressed.
 ``complevel=0`` and ``complevel=None`` disables
 compression and ``0<complevel<10`` enables compression.
-
+
 ``complib`` specifies which compression library to use. If nothing is
 specified the default library ``zlib`` is used. A
 compression library usually optimizes for either good
@@ -4108,9 +4108,9 @@ control compression: ``complevel`` and ``complib``.
 - `blosc <http://www.blosc.org/>`_: Fast compression and decompression.
 
   .. versionadded:: 0.20.2
-
+
      Support for alternative blosc compressors:
-
+
      - `blosc:blosclz <http://www.blosc.org/>`_ This is the
        default compressor for ``blosc``
      - `blosc:lz4
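The ``complevel``/``complib`` options above can be sketched as follows. This is a minimal illustration, assuming the optional PyTables dependency (with blosc support) is installed; the file name is made up for the example.

```python
import numpy as np
import pandas as pd

# Write an HDF5 file with blosc compression at the maximum level
# (complevel=9), then read it back. Requires the PyTables package;
# "compressed.h5" is an illustrative file name.
df = pd.DataFrame({"a": np.arange(1000.0)})
df.to_hdf("compressed.h5", key="df", complevel=9, complib="blosc")
roundtrip = pd.read_hdf("compressed.h5", "df")
```

Setting ``complevel=0`` (or leaving ``complib`` unset) would write the same data uncompressed with ``zlib`` as the fallback library, per the text above.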
@@ -4559,28 +4559,30 @@ Parquet
 
 .. versionadded:: 0.21.0
 
-Parquet provides a sharded binary columnar serialization for data frames. It is designed to make reading and writing data
-frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a
-variety of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
+`Parquet <https://parquet.apache.org/>`__ provides a partitioned binary columnar serialization for data frames. It is designed to
+make reading and writing data frames efficient, and to make sharing data across data analysis
+languages easy. Parquet can use a variety of compression techniques to shrink the file size as much as possible
+while still maintaining good read performance.
 
-Parquet is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas
-dtypes, including extension dtypes such as categorical and datetime with tz.
+Parquet is designed to faithfully serialize and de-serialize ``DataFrame`` s, supporting all of the pandas
+dtypes, including extension dtypes such as datetime with tz.
 
 Several caveats.
 
 - The format will NOT write an ``Index``, or ``MultiIndex`` for the ``DataFrame`` and will raise an
-  error if a non-default one is provided. You can simply ``.reset_index()`` in order to store the index.
+  error if a non-default one is provided. You can simply ``.reset_index(drop=True)`` in order to ignore the index.
 - Duplicate column names and non-string column names are not supported
+- Categorical dtypes are currently not supported (for ``pyarrow``).
 - Non-supported types include ``Period`` and actual python object types. These will raise a helpful error message
   on an attempt at serialization.
 
+You can specify an ``engine`` to direct the serialization, defaulting to ``pyarrow`` and controlled by the ``pd.options.io.parquet`` options.
 See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and `fastparquet <https://fastparquet.readthedocs.io/en/latest/>`__
 
 .. note::
 
    These engines are very similar and should read/write nearly identical parquet format files.
    These libraries differ by having different underlying dependencies (``fastparquet`` by using ``numba``, while ``pyarrow`` uses a c-library).
-   TODO: differing options to write non-standard columns & null treatment
 
 .. ipython:: python
 
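The index caveat in the hunk above can be sketched with plain pandas; the DataFrame here is a made-up example, not from the docs.

```python
import pandas as pd

# Parquet (per the caveat above) rejects a non-default index, so flatten
# it first. drop=True discards the old index entirely rather than
# keeping it as a new column.
df = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
flat = df.reset_index(drop=True)
print(list(flat.index))  # [0, 1, 2]
```

Calling ``reset_index()`` without ``drop=True`` would instead store the old index as an ``index`` column, which would then be serialized like any other column.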
@@ -4589,10 +4591,9 @@ See the documentation for `pyarrow <http://arrow.apache.org/docs/python/>`__ and
             'c': np.arange(3, 6).astype('u1'),
             'd': np.arange(4.0, 7.0, dtype='float64'),
             'e': [True, False, True],
-            'f': pd.Categorical(list('abc')),
-            'g': pd.date_range('20130101', periods=3),
-            'h': pd.date_range('20130101', periods=3, tz='US/Eastern'),
-            'i': pd.date_range('20130101', periods=3, freq='ns')})
+            'f': pd.date_range('20130101', periods=3),
+            'g': pd.date_range('20130101', periods=3, tz='US/Eastern'),
+            'h': pd.date_range('20130101', periods=3, freq='ns')})
 
    df
    df.dtypes
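The dtype-bearing columns in the hunk above construct as shown below; this sketch keeps only the columns visible in the diff (the ones truncated out of the hunk, such as ``'a'``, are stood in for by a single hypothetical object column) to show the dtypes that ``df.dtypes`` would report.

```python
import numpy as np
import pandas as pd

# 'a' is a hypothetical stand-in for the columns cut off above the hunk;
# 'c', 'e', 'f', 'g' mirror the lines shown in the diff.
df = pd.DataFrame({'a': list('abc'),
                   'c': np.arange(3, 6).astype('u1'),
                   'e': [True, False, True],
                   'f': pd.date_range('20130101', periods=3),
                   'g': pd.date_range('20130101', periods=3,
                                      tz='US/Eastern')})
print(df.dtypes)
```

This mix (unsigned integers, booleans, naive and tz-aware datetimes) is exactly the kind of dtype variety the surrounding text says parquet should round-trip faithfully.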
@@ -4608,10 +4609,9 @@ Read from a parquet file.
 
 .. ipython:: python
 
-   result = pd.read_parquet('example_pa.parquet')
-   result = pd.read_parquet('example_fp.parquet')
+   result = pd.read_parquet('example_pa.parquet', engine='pyarrow')
+   result = pd.read_parquet('example_fp.parquet', engine='fastparquet')
 
-   # we preserve dtypes
    result.dtypes
 
 .. ipython:: python
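Besides the per-call ``engine=`` argument shown above, the default can be set globally. The full option key ``'io.parquet.engine'`` is my guess at what the ``pd.options.io.parquet`` namespace mentioned earlier exposes, and is an assumption, not confirmed by the diff.

```python
import pandas as pd

# 'io.parquet.engine' is an assumed option key (the diff only names the
# pd.options.io.parquet namespace); 'auto' defers to whichever engine
# (pyarrow or fastparquet) is installed.
pd.set_option("io.parquet.engine", "auto")
print(pd.get_option("io.parquet.engine"))  # auto
```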