-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
DOC: Add PyArrow user guide #51371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
DOC: Add PyArrow user guide #51371
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
e6cba28
DOC: Add PyArrow user guide
mroeschke cb7a99e
Merge remote-tracking branch 'upstream/main' into doc/pyarrow/userguide
mroeschke 852be73
Address review
mroeschke baad8c7
Merge remote-tracking branch 'upstream/main' into doc/pyarrow/userguide
mroeschke e1539e6
Use code block
mroeschke File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -64,6 +64,7 @@ Guides | |
dsintro | ||
basics | ||
io | ||
pyarrow | ||
indexing | ||
advanced | ||
merging | ||
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,163 @@ | ||
.. _pyarrow: | ||
|
||
{{ header }} | ||
|
||
********************* | ||
PyArrow Functionality | ||
********************* | ||
|
||
pandas can utilize `PyArrow <https://arrow.apache.org/docs/python/index.html>`__ to extend functionality and improve the performance | ||
of various APIs. This includes: | ||
|
||
* More extensive `data types <https://arrow.apache.org/docs/python/api/datatypes.html>`__ compared to NumPy | ||
* Missing data support (NA) for all data types | ||
* Performant IO reader integration | ||
* Facilitate interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF) | ||
|
||
To use this functionality, please ensure you have :ref:`installed the minimum supported PyArrow version. <install.optional_dependencies>` | ||
|
||
|
||
Data Structure Integration | ||
-------------------------- | ||
|
||
A :class:`Series`, :class:`Index`, or the columns of a :class:`DataFrame` can be directly backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray` | ||
which is similar to a NumPy array. To construct these from the main pandas data structures, you can pass in a string of the type followed by | ||
``[pyarrow]``, e.g. ``"int64[pyarrow]""`` into the ``dtype`` parameter | ||
|
||
.. ipython:: python | ||
|
||
ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]") | ||
ser | ||
|
||
idx = pd.Index([True, None], dtype="bool[pyarrow]") | ||
idx | ||
|
||
df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]") | ||
df | ||
|
||
For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters | ||
into :class:`ArrowDtype` to use in the ``dtype`` parameter. | ||
|
||
.. ipython:: python | ||
|
||
import pyarrow as pa | ||
list_str_type = pa.list_(pa.string()) | ||
ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type)) | ||
ser | ||
|
||
from datetime import time | ||
idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us"))) | ||
idx | ||
|
||
from decimal import Decimal | ||
decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2)) | ||
data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]] | ||
df = pd.DataFrame(data, dtype=decimal_type) | ||
df | ||
|
||
If you already have an :external+pyarrow:py:class:`pyarrow.Array` or :external+pyarrow:py:class:`pyarrow.ChunkedArray`, | ||
you can pass it into :class:`.arrays.ArrowExtensionArray` to construct the associated :class:`Series`, :class:`Index` | ||
or :class:`DataFrame` object. | ||
|
||
.. ipython:: python | ||
|
||
pa_array = pa.array([{"1": "2"}, {"10": "20"}, None]) | ||
ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array)) | ||
ser | ||
|
||
To retrieve a pyarrow :external+pyarrow:py:class:`pyarrow.ChunkedArray` from a :class:`Series` or :class:`Index`, you can call | ||
the pyarrow array constructor on the :class:`Series` or :class:`Index`. | ||
|
||
.. ipython:: python | ||
|
||
ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]") | ||
pa.array(ser) | ||
|
||
idx = pd.Index(ser) | ||
pa.array(idx) | ||
|
||
Operations | ||
---------- | ||
|
||
PyArrow data structure integration is implemented through pandas' :class:`~pandas.api.extensions.ExtensionArray` :ref:`interface <extending.extension-types>`; | ||
therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality | ||
is accelerated with PyArrow `compute functions <https://arrow.apache.org/docs/python/api/compute.html>`__ where available. This includes: | ||
|
||
* Numeric aggregations | ||
* Numeric arithmetic | ||
* Numeric rounding | ||
* Logical and comparison functions | ||
* String functionality | ||
* Datetime functionality | ||
|
||
datapythonista marked this conversation as resolved.
Show resolved
Hide resolved
|
||
The following are just some examples of operations that are accelerated by native PyArrow compute functions. | ||
|
||
.. ipython:: python | ||
|
||
ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]") | ||
ser.mean() | ||
ser + ser | ||
ser > (ser + 1) | ||
|
||
ser.dropna() | ||
ser.isna() | ||
ser.fillna(0) | ||
|
||
ser_str = pd.Series(["a", "b", None], dtype="string[pyarrow]") | ||
ser_str.str.startswith("a") | ||
|
||
from datetime import datetime | ||
pa_type = pd.ArrowDtype(pa.timestamp("ns")) | ||
ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type) | ||
ser_dt.dt.strftime("%Y-%m") | ||
|
||
I/O Reading | ||
----------- | ||
|
||
PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers. The following | ||
functions provide an ``engine`` keyword that can dispatch to PyArrow to accelerate reading from an IO source. | ||
|
||
* :func:`read_csv` | ||
* :func:`read_json` | ||
* :func:`read_orc` | ||
* :func:`read_feather` | ||
|
||
.. ipython:: python | ||
|
||
import io | ||
data = io.StringIO("""a,b,c | ||
1,2.5,True | ||
3,4.5,False | ||
""") | ||
df = pd.read_csv(data, engine="pyarrow") | ||
df | ||
|
||
By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return | ||
PyArrow-backed data by specifying the parameter ``use_nullable_dtypes=True`` **and** the global configuration option ``"mode.dtype_backend"`` | ||
set to ``"pyarrow"``. A reader does not need to set ``engine="pyarrow"`` to necessarily return PyArrow-backed data. | ||
|
||
.. ipython:: python | ||
|
||
import io | ||
data = io.StringIO("""a,b,c,d,e,f,g,h,i | ||
1,2.5,True,a,,,,, | ||
3,4.5,False,b,6,7.5,True,a, | ||
""") | ||
with pd.option_context("mode.dtype_backend", "pyarrow"): | ||
df_pyarrow = pd.read_csv(data, use_nullable_dtypes=True) | ||
df_pyarrow.dtypes | ||
|
||
To simplify specifying ``use_nullable_dtypes=True`` in several functions, you can set a global option ``nullable_dtypes`` | ||
to ``True``. You will still need to set the global configuration option ``"mode.dtype_backend"`` to ``pyarrow``. | ||
|
||
.. code-block:: ipython | ||
|
||
In [1]: pd.set_option("mode.dtype_backend", "pyarrow") | ||
|
||
In [2]: pd.options.mode.nullable_dtypes = True | ||
|
||
Several non-IO reader functions can also use the ``"mode.dtype_backend"`` option to return PyArrow-backed data including: | ||
|
||
* :func:`to_numeric` | ||
* :meth:`DataFrame.convert_dtypes` | ||
* :meth:`Series.convert_dtypes` |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.