Commit 6d5e78e

DOC: Add PyArrow user guide (#51371)
* DOC: Add PyArrow user guide
* Address review
* Use code block
1 parent 8a40960 commit 6d5e78e

3 files changed

+166
-0
lines changed

doc/source/reference/arrays.rst

Lines changed: 2 additions & 0 deletions
@@ -113,6 +113,8 @@ values.
113113

114114
ArrowDtype
115115

116+
For more information, please see the :ref:`PyArrow user guide <pyarrow>`.
117+
116118
.. _api.arrays.datetime:
117119

118120
Datetimes

doc/source/user_guide/index.rst

Lines changed: 1 addition & 0 deletions
@@ -64,6 +64,7 @@ Guides
6464
dsintro
6565
basics
6666
io
67+
pyarrow
6768
indexing
6869
advanced
6970
merging

doc/source/user_guide/pyarrow.rst

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
1+
.. _pyarrow:
2+
3+
{{ header }}
4+
5+
*********************
6+
PyArrow Functionality
7+
*********************
8+
9+
pandas can utilize `PyArrow <https://arrow.apache.org/docs/python/index.html>`__ to extend functionality and improve the performance
10+
of various APIs. This includes:
11+
12+
* More extensive `data types <https://arrow.apache.org/docs/python/api/datatypes.html>`__ compared to NumPy
13+
* Missing data support (NA) for all data types
14+
* Performant IO reader integration
15+
* Interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF)
16+
17+
To use this functionality, please ensure you have :ref:`installed the minimum supported PyArrow version <install.optional_dependencies>`.
18+
19+
20+
Data Structure Integration
21+
--------------------------
22+
23+
A :class:`Series`, :class:`Index`, or the columns of a :class:`DataFrame` can be directly backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray`
24+
which is similar to a NumPy array. To construct these from the main pandas data structures, you can pass in a string of the type followed by
25+
``[pyarrow]``, e.g. ``"int64[pyarrow]"``, into the ``dtype`` parameter.
26+
27+
.. ipython:: python
28+
29+
ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
30+
ser
31+
32+
idx = pd.Index([True, None], dtype="bool[pyarrow]")
33+
idx
34+
35+
df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")
36+
df
37+
38+
For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters
39+
into :class:`ArrowDtype` to use in the ``dtype`` parameter.
40+
41+
.. ipython:: python
42+
43+
import pyarrow as pa
44+
list_str_type = pa.list_(pa.string())
45+
ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))
46+
ser
47+
48+
from datetime import time
49+
idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))
50+
idx
51+
52+
from decimal import Decimal
53+
decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))
54+
data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]
55+
df = pd.DataFrame(data, dtype=decimal_type)
56+
df
57+
58+
If you already have an :external+pyarrow:py:class:`pyarrow.Array` or :external+pyarrow:py:class:`pyarrow.ChunkedArray`,
59+
you can pass it into :class:`.arrays.ArrowExtensionArray` to construct the associated :class:`Series`, :class:`Index`
60+
or :class:`DataFrame` object.
61+
62+
.. ipython:: python
63+
64+
pa_array = pa.array([{"1": "2"}, {"10": "20"}, None])
65+
ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
66+
ser
67+
68+
To retrieve a PyArrow :external+pyarrow:py:class:`pyarrow.ChunkedArray` from a :class:`Series` or :class:`Index`, you can call
69+
the PyArrow array constructor (``pa.array`` in the example below) on the :class:`Series` or :class:`Index`.
70+
71+
.. ipython:: python
72+
73+
ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")
74+
pa.array(ser)
75+
76+
idx = pd.Index(ser)
77+
pa.array(idx)
78+
79+
Operations
80+
----------
81+
82+
PyArrow data structure integration is implemented through pandas' :class:`~pandas.api.extensions.ExtensionArray` :ref:`interface <extending.extension-types>`;
83+
therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality
84+
is accelerated with PyArrow `compute functions <https://arrow.apache.org/docs/python/api/compute.html>`__ where available. This includes:
85+
86+
* Numeric aggregations
87+
* Numeric arithmetic
88+
* Numeric rounding
89+
* Logical and comparison functions
90+
* String functionality
91+
* Datetime functionality
92+
93+
The following are just some examples of operations that are accelerated by native PyArrow compute functions.
94+
95+
.. ipython:: python
96+
97+
ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")
98+
ser.mean()
99+
ser + ser
100+
ser > (ser + 1)
101+
102+
ser.dropna()
103+
ser.isna()
104+
ser.fillna(0)
105+
106+
ser_str = pd.Series(["a", "b", None], dtype="string[pyarrow]")
107+
ser_str.str.startswith("a")
108+
109+
from datetime import datetime
110+
pa_type = pd.ArrowDtype(pa.timestamp("ns"))
111+
ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
112+
ser_dt.dt.strftime("%Y-%m")
113+
114+
I/O Reading
115+
-----------
116+
117+
PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers. The following
118+
functions provide an ``engine`` keyword that can dispatch to PyArrow to accelerate reading from an IO source.
119+
120+
* :func:`read_csv`
121+
* :func:`read_json`
122+
* :func:`read_orc`
123+
* :func:`read_feather`
124+
125+
.. ipython:: python
126+
127+
import io
128+
data = io.StringIO("""a,b,c
129+
1,2.5,True
130+
3,4.5,False
131+
""")
132+
df = pd.read_csv(data, engine="pyarrow")
133+
df
134+
135+
By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return
136+
PyArrow-backed data when you pass ``use_nullable_dtypes=True`` **and** set the global configuration option ``"mode.dtype_backend"``
137+
to ``"pyarrow"``. A reader does not need ``engine="pyarrow"`` in order to return PyArrow-backed data.
138+
139+
.. ipython:: python
140+
141+
import io
142+
data = io.StringIO("""a,b,c,d,e,f,g,h,i
143+
1,2.5,True,a,,,,,
144+
3,4.5,False,b,6,7.5,True,a,
145+
""")
146+
with pd.option_context("mode.dtype_backend", "pyarrow"):
147+
df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True)
148+
df_pyarrow.dtypes
149+
150+
To avoid passing ``use_nullable_dtypes=True`` to each function, you can set the global option ``"mode.nullable_dtypes"``
151+
to ``True``. You will still need to set the global configuration option ``"mode.dtype_backend"`` to ``"pyarrow"``.
152+
153+
.. code-block:: ipython
154+
155+
In [1]: pd.set_option("mode.dtype_backend", "pyarrow")
156+
157+
In [2]: pd.options.mode.nullable_dtypes = True
158+
159+
Several non-IO reader functions can also use the ``"mode.dtype_backend"`` option to return PyArrow-backed data, including:
160+
161+
* :func:`to_numeric`
162+
* :meth:`DataFrame.convert_dtypes`
163+
* :meth:`Series.convert_dtypes`
