Commit e637b42
ENH: Implement pandas.read_iceberg (#61383)
1 parent: 41968a5
19 files changed: +396 -37 lines

ci/deps/actions-310-minimum_versions.yaml (+4, -3)

@@ -28,10 +28,10 @@ dependencies:
   - beautifulsoup4=4.12.3
   - bottleneck=1.3.6
   - fastparquet=2024.2.0
-  - fsspec=2024.2.0
+  - fsspec=2023.12.2
   - html5lib=1.1
   - hypothesis=6.84.0
-  - gcsfs=2024.2.0
+  - gcsfs=2023.12.2
   - jinja2=3.1.3
   - lxml=4.9.2
   - matplotlib=3.8.3
@@ -42,14 +42,15 @@ dependencies:
   - openpyxl=3.1.2
   - psycopg2=2.9.6
   - pyarrow=10.0.1
+  - pyiceberg=0.7.1
   - pymysql=1.1.0
   - pyqt=5.15.9
   - pyreadstat=1.2.6
   - pytables=3.8.0
   - python-calamine=0.1.7
   - pytz=2023.4
   - pyxlsb=1.0.10
-  - s3fs=2024.2.0
+  - s3fs=2023.12.2
   - scipy=1.12.0
   - sqlalchemy=2.0.0
   - tabulate=0.9.0

ci/deps/actions-310.yaml (+4, -3)

@@ -26,10 +26,10 @@ dependencies:
   - beautifulsoup4>=4.12.3
   - bottleneck>=1.3.6
   - fastparquet>=2024.2.0
-  - fsspec>=2024.2.0
+  - fsspec>=2023.12.2
   - html5lib>=1.1
   - hypothesis>=6.84.0
-  - gcsfs>=2024.2.0
+  - gcsfs>=2023.12.2
   - jinja2>=3.1.3
   - lxml>=4.9.2
   - matplotlib>=3.8.3
@@ -40,14 +40,15 @@ dependencies:
   - openpyxl>=3.1.2
   - psycopg2>=2.9.6
   - pyarrow>=10.0.1
+  - pyiceberg>=0.7.1
   - pymysql>=1.1.0
   - pyqt>=5.15.9
   - pyreadstat>=1.2.6
   - pytables>=3.8.0
   - python-calamine>=0.1.7
   - pytz>=2023.4
   - pyxlsb>=1.0.10
-  - s3fs>=2024.2.0
+  - s3fs>=2023.12.2
   - scipy>=1.12.0
   - sqlalchemy>=2.0.0
   - tabulate>=0.9.0

ci/deps/actions-311-downstream_compat.yaml (+4, -3)

@@ -27,10 +27,10 @@ dependencies:
   - beautifulsoup4>=4.12.3
   - bottleneck>=1.3.6
   - fastparquet>=2024.2.0
-  - fsspec>=2024.2.0
+  - fsspec>=2023.12.2
   - html5lib>=1.1
   - hypothesis>=6.84.0
-  - gcsfs>=2024.2.0
+  - gcsfs>=2023.12.2
   - jinja2>=3.1.3
   - lxml>=4.9.2
   - matplotlib>=3.8.3
@@ -41,14 +41,15 @@ dependencies:
   - openpyxl>=3.1.2
   - psycopg2>=2.9.6
   - pyarrow>=10.0.1
+  - pyiceberg>=0.7.1
   - pymysql>=1.1.0
   - pyqt>=5.15.9
   - pyreadstat>=1.2.6
   - pytables>=3.8.0
   - python-calamine>=0.1.7
   - pytz>=2023.4
   - pyxlsb>=1.0.10
-  - s3fs>=2024.2.0
+  - s3fs>=2023.12.2
   - scipy>=1.12.0
   - sqlalchemy>=2.0.0
   - tabulate>=0.9.0

ci/deps/actions-311.yaml (+4, -3)

@@ -26,10 +26,10 @@ dependencies:
   - beautifulsoup4>=4.12.3
   - bottleneck>=1.3.6
   - fastparquet>=2024.2.0
-  - fsspec>=2024.2.0
+  - fsspec>=2023.12.2
   - html5lib>=1.1
   - hypothesis>=6.84.0
-  - gcsfs>=2024.2.0
+  - gcsfs>=2023.12.2
   - jinja2>=3.1.3
   - lxml>=4.9.2
   - matplotlib>=3.8.3
@@ -41,13 +41,14 @@ dependencies:
   - openpyxl>=3.1.2
   - psycopg2>=2.9.6
   - pyarrow>=10.0.1
+  - pyiceberg>=0.7.1
   - pymysql>=1.1.0
   - pyreadstat>=1.2.6
   - pytables>=3.8.0
   - python-calamine>=0.1.7
   - pytz>=2023.4
   - pyxlsb>=1.0.10
-  - s3fs>=2024.2.0
+  - s3fs>=2023.12.2
   - scipy>=1.12.0
   - sqlalchemy>=2.0.0
   - tabulate>=0.9.0

ci/deps/actions-312.yaml (+4, -3)

@@ -26,10 +26,10 @@ dependencies:
   - beautifulsoup4>=4.12.3
   - bottleneck>=1.3.6
   - fastparquet>=2024.2.0
-  - fsspec>=2024.2.0
+  - fsspec>=2023.12.2
   - html5lib>=1.1
   - hypothesis>=6.84.0
-  - gcsfs>=2024.2.0
+  - gcsfs>=2023.12.2
   - jinja2>=3.1.3
   - lxml>=4.9.2
   - matplotlib>=3.8.3
@@ -41,13 +41,14 @@ dependencies:
   - openpyxl>=3.1.2
   - psycopg2>=2.9.6
   - pyarrow>=10.0.1
+  - pyiceberg>=0.7.1
   - pymysql>=1.1.0
   - pyreadstat>=1.2.6
   - pytables>=3.8.0
   - python-calamine>=0.1.7
   - pytz>=2023.4
   - pyxlsb>=1.0.10
-  - s3fs>=2024.2.0
+  - s3fs>=2023.12.2
   - scipy>=1.12.0
   - sqlalchemy>=2.0.0
   - tabulate>=0.9.0

ci/deps/actions-313.yaml (+3, -3)

@@ -27,10 +27,10 @@ dependencies:
   - blosc>=1.21.3
   - bottleneck>=1.3.6
   - fastparquet>=2024.2.0
-  - fsspec>=2024.2.0
+  - fsspec>=2023.12.2
   - html5lib>=1.1
   - hypothesis>=6.84.0
-  - gcsfs>=2024.2.0
+  - gcsfs>=2023.12.2
   - jinja2>=3.1.3
   - lxml>=4.9.2
   - matplotlib>=3.8.3
@@ -48,7 +48,7 @@ dependencies:
   - python-calamine>=0.1.7
   - pytz>=2023.4
   - pyxlsb>=1.0.10
-  - s3fs>=2024.2.0
+  - s3fs>=2023.12.2
   - scipy>=1.12.0
   - sqlalchemy>=2.0.0
   - tabulate>=0.9.0

doc/source/getting_started/install.rst (+5, -4)

@@ -299,7 +299,7 @@ Dependency Minimum Versi
 Other data sources
 ^^^^^^^^^^^^^^^^^^
 
-Installable with ``pip install "pandas[hdf5, parquet, feather, spss, excel]"``
+Installable with ``pip install "pandas[hdf5, parquet, iceberg, feather, spss, excel]"``
 
 ====================================================== ================== ================ ==========================================================
 Dependency                                             Minimum Version    pip extra        Notes
@@ -308,6 +308,7 @@ Dependency Minimum Version pip ex
 `zlib <https://github.com/madler/zlib>`__                                 hdf5             Compression for HDF5
 `fastparquet <https://github.com/dask/fastparquet>`__  2024.2.0           -                Parquet reading / writing (pyarrow is default)
 `pyarrow <https://github.com/apache/arrow>`__          10.0.1             parquet, feather Parquet, ORC, and feather reading / writing
+`PyIceberg <https://py.iceberg.apache.org/>`__         0.7.1              iceberg          Apache Iceberg reading
 `pyreadstat <https://github.com/Roche/pyreadstat>`__   1.2.6              spss             SPSS files (.sav) reading
 `odfpy <https://github.com/eea/odfpy>`__               1.4.1              excel            Open document format (.odf, .ods, .odt) reading / writing
 ====================================================== ================== ================ ==========================================================
@@ -328,10 +329,10 @@ Installable with ``pip install "pandas[fss, aws, gcp]"``
 ============================================ ================== =============== ==========================================================
 Dependency                                   Minimum Version    pip extra       Notes
 ============================================ ================== =============== ==========================================================
-`fsspec <https://github.com/fsspec>`__       2024.2.0           fss, gcp, aws   Handling files aside from simple local and HTTP (required
+`fsspec <https://github.com/fsspec>`__       2023.12.2          fss, gcp, aws   Handling files aside from simple local and HTTP (required
                                                                                 dependency of s3fs, gcsfs).
-`gcsfs <https://github.com/fsspec/gcsfs>`__  2024.2.0           gcp             Google Cloud Storage access
-`s3fs <https://github.com/fsspec/s3fs>`__    2024.2.0           aws             Amazon S3 access
+`gcsfs <https://github.com/fsspec/gcsfs>`__  2023.12.2          gcp             Google Cloud Storage access
+`s3fs <https://github.com/fsspec/s3fs>`__    2023.12.2          aws             Amazon S3 access
 ============================================ ================== =============== ==========================================================
 
 Clipboard

doc/source/reference/io.rst (+9)

@@ -156,6 +156,15 @@ Parquet
    read_parquet
    DataFrame.to_parquet
 
+Iceberg
+~~~~~~~
+.. autosummary::
+   :toctree: api/
+
+   read_iceberg
+
+.. warning:: ``read_iceberg`` is experimental and may change without warning.
+
 ORC
 ~~~
 .. autosummary::

doc/source/user_guide/io.rst (+97)

@@ -29,6 +29,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
     binary,`HDF5 Format <https://support.hdfgroup.org/documentation/hdf5/latest/_intro_h_d_f5.html>`__, :ref:`read_hdf<io.hdf5>`, :ref:`to_hdf<io.hdf5>`
     binary,`Feather Format <https://github.com/wesm/feather>`__, :ref:`read_feather<io.feather>`, :ref:`to_feather<io.feather>`
     binary,`Parquet Format <https://parquet.apache.org/>`__, :ref:`read_parquet<io.parquet>`, :ref:`to_parquet<io.parquet>`
+    binary,`Apache Iceberg <https://iceberg.apache.org/>`__, :ref:`read_iceberg<io.iceberg>` , NA
     binary,`ORC Format <https://orc.apache.org/>`__, :ref:`read_orc<io.orc>`, :ref:`to_orc<io.orc>`
     binary,`Stata <https://en.wikipedia.org/wiki/Stata>`__, :ref:`read_stata<io.stata_reader>`, :ref:`to_stata<io.stata_writer>`
    binary,`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__, :ref:`read_sas<io.sas_reader>` , NA
@@ -5403,6 +5404,102 @@ The above example creates a partitioned dataset that may look like:
     except OSError:
         pass
 
+.. _io.iceberg:
+
+Iceberg
+-------
+
+.. versionadded:: 3.0.0
+
+Apache Iceberg is a high performance open-source format for large analytic tables.
+Iceberg enables the use of SQL tables for big data while making it possible for different
+engines to safely work with the same tables at the same time.
+
+Iceberg supports predicate pushdown and column pruning, which are available to pandas
+users via the ``row_filter`` and ``selected_fields`` parameters of the :func:`~pandas.read_iceberg`
+function. This is convenient for extracting from large tables a subset that fits in memory as a
+pandas ``DataFrame``.
+
+Internally, pandas uses PyIceberg_ to query Iceberg.
+
+.. _PyIceberg: https://py.iceberg.apache.org/
+
+A simple example, loading all data from an Iceberg table ``my_table`` defined in the
+``my_catalog`` catalog:
+
+.. code-block:: python
+
+    df = pd.read_iceberg("my_table", catalog_name="my_catalog")
+
+Catalogs must be defined in the ``.pyiceberg.yaml`` file, usually in the home directory.
+It is possible to change properties of the catalog definition with the
+``catalog_properties`` parameter:
+
+.. code-block:: python
+
+    df = pd.read_iceberg(
+        "my_table",
+        catalog_name="my_catalog",
+        catalog_properties={"s3.secret-access-key": "my_secret"},
+    )
+
+It is also possible to fully specify the catalog in ``catalog_properties`` and not provide
+a ``catalog_name``:
+
+.. code-block:: python
+
+    df = pd.read_iceberg(
+        "my_table",
+        catalog_properties={
+            "uri": "http://127.0.0.1:8181",
+            "s3.endpoint": "http://127.0.0.1:9000",
+        },
+    )
+
+To create the ``DataFrame`` with only a subset of the columns:
+
+.. code-block:: python
+
+    df = pd.read_iceberg(
+        "my_table",
+        catalog_name="my_catalog",
+        selected_fields=["my_column_3", "my_column_7"],
+    )
+
+This makes the call faster, since the other columns are not read, and saves memory,
+since the data from the other columns is never loaded into the underlying memory of
+the ``DataFrame``.
+
+To fetch only a subset of the rows, use the ``limit`` parameter:
+
+.. code-block:: python
+
+    df = pd.read_iceberg(
+        "my_table",
+        catalog_name="my_catalog",
+        limit=100,
+    )
+
+This creates a ``DataFrame`` with 100 rows, assuming the table contains at least that
+many.
+
+To fetch a subset of the rows based on a condition, use the ``row_filter``
+parameter:
+
+.. code-block:: python
+
+    df = pd.read_iceberg(
+        "my_table",
+        catalog_name="my_catalog",
+        row_filter="distance > 10.0",
+    )
+
+Reading a particular snapshot is also possible, by providing its ID via the
+``snapshot_id`` parameter.
+
+More information about the Iceberg format can be found in the `Apache Iceberg official
+page <https://iceberg.apache.org/>`__.
+
 
 .. _io.orc:
 
 ORC
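The user guide above says catalogs must be defined in ``.pyiceberg.yaml`` but does not show one. As a rough sketch only, using PyIceberg's catalog configuration layout and reusing the placeholder endpoint and credential values from the examples in the guide (``my_catalog``, the hypothetical name, is the one the examples read from):

```
# Hypothetical ~/.pyiceberg.yaml for the "my_catalog" catalog used in the examples.
# Keys follow PyIceberg's catalog configuration; all values are placeholders.
catalog:
  my_catalog:
    uri: http://127.0.0.1:8181          # catalog endpoint
    s3.endpoint: http://127.0.0.1:9000  # object store holding the table data
    s3.secret-access-key: my_secret     # can instead be passed via catalog_properties
```

With a file like this in place, ``pd.read_iceberg("my_table", catalog_name="my_catalog")`` needs no ``catalog_properties``; properties passed there override or extend the ones in the file.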

doc/source/whatsnew/v3.0.0.rst (+1)

@@ -78,6 +78,7 @@ Other enhancements
 - :py:class:`frozenset` elements in pandas objects are now natively printed (:issue:`60690`)
 - Add ``"delete_rows"`` option to ``if_exists`` argument in :meth:`DataFrame.to_sql` deleting all records of the table before inserting data (:issue:`37210`).
 - Added half-year offset classes :class:`HalfYearBegin`, :class:`HalfYearEnd`, :class:`BHalfYearBegin` and :class:`BHalfYearEnd` (:issue:`60928`)
+- Added support to read from Apache Iceberg tables with the new :func:`read_iceberg` function (:issue:`61383`)
 - Errors occurring during SQL I/O will now throw a generic :class:`.DatabaseError` instead of the raw Exception type from the underlying driver manager library (:issue:`60748`)
 - Implemented :meth:`Series.str.isascii` and :meth:`Series.str.isascii` (:issue:`59091`)
 - Improved deprecation message for offset aliases (:issue:`60820`)

environment.yml (+4, -3)

@@ -29,10 +29,10 @@ dependencies:
   - beautifulsoup4>=4.12.3
   - bottleneck>=1.3.6
   - fastparquet>=2024.2.0
-  - fsspec>=2024.2.0
+  - fsspec>=2023.12.2
   - html5lib>=1.1
   - hypothesis>=6.84.0
-  - gcsfs>=2024.2.0
+  - gcsfs>=2023.12.2
   - ipython
   - pickleshare  # Needed for IPython Sphinx directive in the docs GH#60429
   - jinja2>=3.1.3
@@ -44,13 +44,14 @@ dependencies:
   - odfpy>=1.4.1
   - psycopg2>=2.9.6
   - pyarrow>=10.0.1
+  - pyiceberg>=0.7.1
   - pymysql>=1.1.0
   - pyreadstat>=1.2.6
   - pytables>=3.8.0
   - python-calamine>=0.1.7
   - pytz>=2023.4
   - pyxlsb>=1.0.10
-  - s3fs>=2024.2.0
+  - s3fs>=2023.12.2
   - scipy>=1.12.0
   - sqlalchemy>=2.0.0
   - tabulate>=0.9.0

pandas/__init__.py (+2)

@@ -164,6 +164,7 @@
     read_stata,
     read_sas,
     read_spss,
+    read_iceberg,
 )
 
 from pandas.io.json._normalize import json_normalize
@@ -319,6 +320,7 @@
     "read_fwf",
     "read_hdf",
     "read_html",
+    "read_iceberg",
     "read_json",
     "read_orc",
     "read_parquet",
