ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata #40884

Closed
wants to merge 13 commits into from
2 changes: 2 additions & 0 deletions doc/source/getting_started/install.rst
@@ -360,6 +360,8 @@ zlib Compression for HDF5
fastparquet 0.4.0 Parquet reading / writing
pyarrow 0.15.0 Parquet, ORC, and feather reading / writing
pyreadstat SPSS files (.sav) reading
pyreadr R files (.RData, .rda, .rds) reading / writing
Contributor

@ParfaitG I didn't follow the entire thread, but we do not want to add these deps generally. IIRC you had a much simpler way (to just link in the C code to read the format); that would be much preferable.

Contributor

This PR relies on a new dependency, pyreadr, (available in conda) for default engine option. Please advise how to add to builds for pytests.

this would be ok. we do not want to add r as a dep for even testing.

Contributor Author

Due to maintenance and licensing, the consensus was to use pyreadr as a soft dep like pyreadstat for read_spss. Understood about not adding R as dep. For rscript engine, I am suggesting we use R via subprocess call similar to backends in io.clipboard. (Can serve as use case for Python/R arrow project). But this PR is set up to easily remove either engine (i.e., separate classes and tests).

However, unless I am mistaken, the CI builds do have R installed. I am getting results and fixing failures in test_rscript.py on the Linux/Windows/Mac builds, which check for Rscript (nothing yet for test_pyreadr.py).

Contributor

pyreadr as a soft dep like pyreadstat for read_spss.

yes this is ok

Contributor

For rscript engine, I am suggesting we use R via subprocess call similar to backends in io.clipboard. (Can serve as use case for Python/R arrow project). But this PR is set up to easily remove either engine (i.e., separate classes and tests).

I really don't want to complicate our CI any more, so I don't want this. Instead, use simple frames for the expected return values.

Contributor Author

Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.

Contributor

Got it. Will adjust code and docs to exclusively run pyreadr. Should I add pyreadr entries to the three yaml files in pandas/tree/master/ci/deps where pyreadstat is also included? Also, because .RData and .rda can potentially have more than one named data frame (unlike .rds), we may have to return a dict of DataFrames like pyreadr does.

Yes, you want to add it by default so it's used, but one build should not have it, so that the tests skip properly when it's not installed.

we may have to return a dict of DataFrames like pyreadr does.

that would be ok, we do this in read_html, just clearly document & make the signature reflect this.
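
A minimal sketch of the skip-when-missing behaviour discussed in this thread, using only the standard library. The helper `has_module` and the test class are hypothetical names for illustration; the PR itself may well rely on `pytest.importorskip` or a pandas testing decorator instead.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Return True when a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Hypothetical test case: it is skipped, not failed, on a build without pyreadr.
@unittest.skipUnless(has_module("pyreadr"), "pyreadr is not installed")
class ReadRDataEngineTests(unittest.TestCase):
    def test_engine_importable(self):
        # Only reached when the soft dependency is present.
        import pyreadr  # noqa: F401
```

On a CI build that omits the dependency, the class is reported as skipped, which is the outcome the reviewer asks for.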

Rscript R files (.RData, .rda, .rds) reading / writing
========================= ================== =============================================================

Access data in the cloud
300 changes: 300 additions & 0 deletions doc/source/user_guide/io.rst
@@ -31,6 +31,7 @@ The pandas I/O API is a set of top level ``reader`` functions accessed like
binary;`Parquet Format <https://parquet.apache.org/>`__;:ref:`read_parquet<io.parquet>`;:ref:`to_parquet<io.parquet>`
binary;`ORC Format <https://orc.apache.org/>`__;:ref:`read_orc<io.orc>`;
binary;`Msgpack <https://msgpack.org/>`__;:ref:`read_msgpack<io.msgpack>`;:ref:`to_msgpack<io.msgpack>`
binary;`R <https://www.r-project.org/>`__;:ref:`read_rdata<io.rdata_reader>`;:ref:`to_rdata<io.rdata_writer>`
binary;`Stata <https://en.wikipedia.org/wiki/Stata>`__;:ref:`read_stata<io.stata_reader>`;:ref:`to_stata<io.stata_writer>`
binary;`SAS <https://en.wikipedia.org/wiki/SAS_(software)>`__;:ref:`read_sas<io.sas_reader>`;
binary;`SPSS <https://en.wikipedia.org/wiki/SPSS>`__;:ref:`read_spss<io.spss_reader>`;
@@ -5903,6 +5904,304 @@ respective functions from ``pandas-gbq``.

Full documentation can be found `here <https://pandas-gbq.readthedocs.io/>`__.


.. _io.rdata:

R data format
-------------

.. _io.rdata_reader:

Reading R data
''''''''''''''

.. versionadded:: 1.3.0

The top-level function ``read_rdata`` reads the native serialization formats
of the R language and environment. For .RData and its synonymous shorthand, .rda,
which can hold multiple R objects, the function returns a ``dict`` of ``DataFrames``.
For .rds files, which contain only a single R object, the function returns a single
Contributor

is there a reference that you can point to for this format (e.g. docs)

``DataFrame``.
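
To make the return-type rule concrete, here is a small sketch that maps a file suffix to the R serialization family and reports whether a mapping of frames would be expected. The names `infer_rdata_format` and `returns_mapping` are hypothetical and are not taken from the PR; they only illustrate the rule described above.

```python
from pathlib import Path

# Assumed suffix-to-format mapping, for illustration only.
_SUFFIX_TO_FORMAT = {".rdata": "rda", ".rda": "rda", ".rds": "rds"}


def infer_rdata_format(path: str) -> str:
    """Guess "rda" or "rds" from the file suffix (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        return _SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(
            f"cannot infer R data format from suffix {suffix!r}"
        ) from None


def returns_mapping(file_format: str) -> bool:
    """rda files may hold several named objects; rds files hold exactly one."""
    return file_format == "rda"
```

When the suffix is missing or misleading (for example a URL with a query string), an explicit ``file_format`` argument would be needed instead of inference.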

.. note::

   Since *any* R object can be saved in these formats, this function only returns
   data.frame objects or objects coercible to data.frames, including matrices,
   tibbles, data.tables, and even 3D arrays. Depending on the engine used,
   non-data.frame objects either raise an error or are ignored.

For example, consider the following data.frames generated in R using samples from
US EPA, UK BGCI, and NOAA public data:

.. code-block:: r

   ghg_df <- data.frame(
     gas = c("Carbon dioxide", "Methane", "Nitrous oxide",
             "Fluorinated gases", "Total"),
     year = c(2018, 2018, 2018, 2018, 2018),
     emissions = c(5424.88150213288, 634.457127078267, 434.528555376666,
                   182.782432461777, 6676.64961704959),
     row.names = c(141:145),
     stringsAsFactors = FALSE
   )

   saveRDS(ghg_df, file="ghg_df.rds")

   plants_df <- data.frame(
     plant_group = c("Pteridophytes", "Pteridophytes", "Pteridophytes",
                     "Pteridophytes", "Pteridophytes"),
     status = c("Data Deficient", "Extinct", "Not Threatened",
                "Possibly Threatened", "Threatened"),
     count = c(398, 65, 1294, 408, 1275),
     row.names = c(16:20),
     stringsAsFactors = FALSE
   )

   saveRDS(plants_df, file="plants_df.rds")

   sea_ice_df <- data.frame(
     year = c(2016, 2017, 2018, 2019, 2020),
     mo = c(12, 12, 12, 12, 12),
     data.type = c("Goddard", "Goddard", "Goddard", "Goddard", "NRTSI-G"),
     region = c("S", "S", "S", "S", "S"),
     extent = c(8.28, 9.48, 9.19, 9.41, 10.44),
     area = c(5.51, 6.23, 5.59, 6.59, 6.5),
     row.names = c(1012:1016),
     stringsAsFactors = FALSE
   )

   saveRDS(sea_ice_df, file="sea_ice_df.rds")

   save(ghg_df, plants_df, sea_ice_df, file="env_data_dfs.rda")

Then in pandas you can read the .rds or .rda files:

.. ipython:: python
   :suppress:

   rel_path = os.path.join("..", "pandas", "tests", "io", "data", "rdata")
   file_path = os.path.abspath(rel_path)

.. ipython:: python

   rds_file = os.path.join(file_path, "ghg_df.rds")
   ghg_df = pd.read_rdata(rds_file).tail()
   ghg_df

   rda_file = os.path.join(file_path, "env_data_dfs.rda")
   env_dfs = pd.read_rdata(rda_file)
   env_dfs

To ignore the row names of a data.frame, use the option ``rownames=False``:

.. ipython:: python

   rds_file = os.path.join(file_path, "plants_df.rds")
   plants_df = pd.read_rdata(rds_file, rownames=False).tail()
   plants_df


To select specific objects in .rda, pass a list of names into ``select_frames``:

.. ipython:: python

   rda_file = os.path.join(file_path, "env_data_dfs.rda")
   env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"])
   env_dfs
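
Conceptually, selecting frames amounts to filtering the name-to-object mapping read from the file. The sketch below uses a hypothetical helper, `select_named`, with plain strings standing in for DataFrames. Raising on unknown names is an assumption made here for illustration, not documented behaviour of the PR.

```python
from typing import Any, Dict, Iterable, Optional


def select_named(
    objects: Dict[str, Any],
    select_frames: Optional[Iterable[str]] = None,
) -> Dict[str, Any]:
    """Keep only the requested names, preserving the requested order."""
    if select_frames is None:
        # No selection requested: return every object in the file.
        return dict(objects)
    wanted = list(select_frames)
    missing = [name for name in wanted if name not in objects]
    if missing:
        # Assumed behaviour: fail loudly rather than silently drop names.
        raise KeyError(f"objects not found in file: {missing}")
    return {name: objects[name] for name in wanted}
```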

To read from a URL, pass the link directly to the function:

.. ipython:: python

   url = ("https://github.com/hadley/nycflights13/"
          "blob/master/data/airlines.rda?raw=true")

   airlines = pd.read_rdata(url, file_format="rda")
   airlines

To read from a file-like object, read its contents and pass the bytes as the
``path_or_buffer`` argument:

.. ipython:: python

   rds_file = os.path.join(file_path, "sea_ice_df.rds")
   with open(rds_file, "rb") as f:
       sea_ice_df = pd.read_rdata(f.read(), file_format="rds")

   sea_ice_df
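
A reader that accepts a path, raw bytes, or an open binary handle usually normalizes its input first. The following is a rough sketch of that step under the assumption that all three input kinds are allowed; `read_all_bytes` is a hypothetical helper, not a function from this PR.

```python
import os
from typing import IO, Union


def read_all_bytes(
    path_or_buffer: Union[str, "os.PathLike[str]", bytes, IO[bytes]]
) -> bytes:
    """Return the full binary content regardless of the input kind."""
    if isinstance(path_or_buffer, (bytes, bytearray)):
        # Raw content was passed directly, e.g. the result of f.read().
        return bytes(path_or_buffer)
    if hasattr(path_or_buffer, "read"):
        # File-like object opened in binary mode.
        return path_or_buffer.read()
    # Otherwise treat the argument as a filesystem path.
    with open(path_or_buffer, "rb") as handle:
        return handle.read()
```

Because raw bytes carry no file name, the format cannot be inferred from a suffix in that case, which is why the example above passes ``file_format="rds"`` explicitly.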

With ``rscript`` as the ``engine``, a direct command-line call to Rscript is run
to read the data natively in R and transfer the content to Python using one of
several ``mode`` options.

.. note::

   If you do not have R installed and attempt to use the ``rscript`` ``engine``,
   an ``ImportError`` is raised. Rscript must be recognized as a top-level
   command on the machine, so R's bin folder must be on the ``PATH`` environment
   variable of the OS. If Rscript is not recognized even though R is installed,
   the same ``ImportError`` is raised.
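
The lookup described in the note can be sketched with the standard library alone. The names `find_rscript` and `run_rscript` are hypothetical, and the injectable `which` parameter exists only so the behaviour can be exercised without R installed; the actual implementation in the PR may differ.

```python
import shutil
import subprocess
from typing import Callable, List, Optional


def find_rscript(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Locate the Rscript executable on PATH or raise ImportError."""
    path = which("Rscript")
    if path is None:
        raise ImportError(
            "Rscript was not found; install R and add its bin folder to PATH"
        )
    return path


def run_rscript(script: str, args: List[str]) -> str:
    """Run an R script through Rscript and return its standard output."""
    cmd = [find_rscript(), script, *args]
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return completed.stdout
```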

- For the ``csv`` mode (default), no other package in R is required.
  Data types are adhered to in this text-based data exchange.

- For the ``feather`` mode, the ``arrow`` package in R must be installed.
  Additionally, the counterpart ``pyarrow`` package in Python must be
  installed. This binary approach allows faster data exchange than the text approach.

- For the ``parquet`` mode, the ``arrow`` package in R and the ``pyarrow``
  package in Python must again be installed. Similarly, this binary approach
  allows faster data exchange than the text approach.

- For the ``sqlite`` mode, the ``RSQLite`` package in R (part of the DBI family of
  database APIs) must be installed, with no additional package needed for Python.
  This database approach ensures data type integrity.
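
The difference between a text transfer and a database transfer can be seen with the standard library alone. This sketch is not the PR's implementation: it only shows that raw CSV fields come back as strings and need type inference on the Python side, while SQLite returns values with their stored types. The table name and rows are illustrative.

```python
import csv
import io
import sqlite3

rows = [("Pteridophytes", "Extinct", 65), ("Pteridophytes", "Threatened", 1275)]

# Text transfer: every field is read back as str, so types must be re-inferred.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
buffer.seek(0)
from_csv = [tuple(record) for record in csv.reader(buffer)]

# Database transfer: declared column types survive the round trip.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE plants (plant_group TEXT, status TEXT, "count" INTEGER)'
)
conn.executemany("INSERT INTO plants VALUES (?, ?, ?)", rows)
from_sqlite = conn.execute("SELECT * FROM plants").fetchall()
conn.close()
```

A CSV parser with type inference recovers most numeric columns, but dates, factors, and integer-versus-float distinctions are where a typed intermediate format tends to help.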

.. ipython:: python

   rds_file = os.path.join(file_path, "plants_df.rds")
   plants_df = pd.read_rdata(rds_file, engine="rscript", mode="csv").tail()
   plants_df

.. note::

   The ``mode`` options above do not produce output files in those formats; they
   are only used under the hood to transfer data on disk between R and Python.


.. _io.rdata_writer:

Writing R data
''''''''''''''

.. versionadded:: 1.3.0

The method :func:`~pandas.core.frame.DataFrame.to_rdata` writes a DataFrame
or multiple DataFrames to R data files (.RData, .rda, and .rds).

For a single object in the rds format:

.. ipython:: python

   plants_df.to_rdata("plants_df.rds")

For multiple objects in the RData or rda formats using the ``rscript`` engine,
use the ``other_frames`` argument and be sure to provide ``rda_names`` for all
DataFrames, including the calling one:

.. ipython:: python

   plants_df.to_rdata(
       "env_dfs.rda",
       engine="rscript",
       other_frames=[ghg_df, sea_ice_df],
       rda_names=["plants_df", "ghg_df", "sea_ice_df"]
   )
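
Since every DataFrame needs a name inside the rda file, a writer would plausibly validate the pairing before doing any work. The sketch below uses a hypothetical helper, `pair_frames_with_names`, with strings standing in for DataFrames. The uniqueness check is an assumption made for illustration rather than documented behaviour.

```python
from typing import Any, Dict, Optional, Sequence


def pair_frames_with_names(
    frame: Any,
    other_frames: Optional[Sequence[Any]],
    rda_names: Sequence[str],
) -> Dict[str, Any]:
    """Map each R object name to its frame, calling frame first."""
    frames = [frame, *(other_frames or [])]
    if len(rda_names) != len(frames):
        raise ValueError(
            f"expected {len(frames)} rda_names, got {len(rda_names)}"
        )
    if len(set(rda_names)) != len(rda_names):
        # Assumed rule: duplicate names would overwrite objects on load in R.
        raise ValueError("rda_names must be unique")
    return dict(zip(rda_names, frames))
```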

With either engine, the pandas index is not mapped to R row names. The default
``index=True`` writes the index as a column, or as multiple columns for a MultiIndex.

.. ipython:: python

   (ghg_df.rename_axis(None)
    .to_rdata("ghg_df.rds", engine="rscript")
   )
   pd.read_rdata("ghg_df.rds").tail()

Otherwise, use ``index=False``:

.. ipython:: python

   (ghg_df.rename_axis(None)
    .to_rdata("ghg_df.rds", engine="rscript", index=False)
   )
   pd.read_rdata("ghg_df.rds").tail()

With both engines, the default compression of R data files is ``gzip``.
Note the different sizes of the compressed and uncompressed files:

.. ipython:: python

   plants_df.to_rdata("plants_df_uncomp.rds", compress=False)

   os.stat("plants_df.rds").st_size
   os.stat("plants_df_uncomp.rds").st_size

The ``rscript`` engine supports all of the listed compression types:
``gzip``, ``bzip2``, and ``xz``.
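
Each of these three compression formats starts with a well-known magic number, so the compression of an existing file can be guessed from its first bytes. The helper `detect_compression` below is a hypothetical illustration using only the standard library, not part of the PR.

```python
import bz2
import gzip
import lzma
from typing import Optional

# Leading magic bytes of each container format.
_MAGIC = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
)


def detect_compression(head: bytes) -> Optional[str]:
    """Return the compression name for the leading bytes, or None."""
    for magic, name in _MAGIC:
        if head.startswith(magic):
            return name
    return None


# Repetitive sample data compresses to a fraction of its original size.
payload = b"Pteridophytes,Threatened,1275\n" * 200
compressed = gzip.compress(payload)
```

Here ``detect_compression(compressed)`` reports gzip, while uncompressed content yields ``None``.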

Additionally, with the ``rscript`` engine, data files can be written in ASCII (text)
rather than the default binary by using the ``ascii`` argument:

.. ipython:: python

   sea_ice_df.to_rdata("sea_ice_df_ascii.rda", engine="rscript",
                       ascii=True, compress=False)

   with open("sea_ice_df_ascii.rda", "r") as f:
       for i in range(10):
           line = next(f).strip()
           print(line)
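
The first bytes of an uncompressed R data file indicate both the container and the encoding. As far as I understand R's serialization, ``save`` writes a header such as ``RDX3`` or ``RDA3`` followed by a newline, while ``saveRDS`` output begins directly with ``X``, ``A``, or ``B`` and a newline. The sketch below encodes that understanding; `describe_r_header` is a hypothetical helper and should be checked against the R documentation before being relied on.

```python
from typing import Tuple

# Assumed header prefixes for files written by save() and saveRDS().
_RDA_PREFIXES = {b"RDX": "xdr", b"RDA": "ascii", b"RDB": "native binary"}
_RDS_PREFIXES = {b"X\n": "xdr", b"A\n": "ascii", b"B\n": "native binary"}


def describe_r_header(head: bytes) -> Tuple[str, str]:
    """Return (container, encoding) guessed from uncompressed leading bytes."""
    is_rda = (
        len(head) >= 5
        and head[:3] in _RDA_PREFIXES
        and head[3:4].isdigit()
        and head[4:5] == b"\n"
    )
    if is_rda:
        return "rda", _RDA_PREFIXES[head[:3]]
    if head[:2] in _RDS_PREFIXES:
        return "rds", _RDS_PREFIXES[head[:2]]
    raise ValueError("unrecognized R serialization header")
```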

.. ipython:: python
   :suppress:

   os.remove("ghg_df.rds")
   os.remove("plants_df.rds")
   os.remove("env_dfs.rda")
   os.remove("plants_df_uncomp.rds")
   os.remove("sea_ice_df_ascii.rda")

Once exported, a single DataFrame can be read back into R with ``readRDS``, and
multiple DataFrames can be loaded with ``load``:

.. code-block:: r

   plants_df <- readRDS("plants_df.rds")
   tail(plants_df, 5)
        plant_group              status count
   16 Pteridophytes      Data Deficient   398
   17 Pteridophytes             Extinct    65
   18 Pteridophytes      Not Threatened  1294
   19 Pteridophytes Possibly Threatened   408
   20 Pteridophytes          Threatened  1275


   load("env_dfs.rda")
   eapply(.GlobalEnv, tail, 5)
   $plants_df
        plant_group              status count
   16 Pteridophytes      Data Deficient   398
   17 Pteridophytes             Extinct    65
   18 Pteridophytes      Not Threatened  1294
   19 Pteridophytes Possibly Threatened   408
   20 Pteridophytes          Threatened  1275

   $sea_ice_df
        year mo data.type region extent area
   1012 2016 12   Goddard      S   8.28 5.51
   1013 2017 12   Goddard      S   9.48 6.23
   1014 2018 12   Goddard      S   9.19 5.59
   1015 2019 12   Goddard      S   9.41 6.59
   1016 2020 12   NRTSI-G      S  10.44 6.50

   $ghg_df
                     gas year emissions
   141    Carbon dioxide 2018 5424.8815
   142           Methane 2018  634.4571
   143     Nitrous oxide 2018  434.5286
   144 Fluorinated gases 2018  182.7824
   145             Total 2018 6676.6496

For more information on the ``pyreadr`` engine, see the main page of the `pyreadr`_
package for further notes on support and limitations. For more information on R
serialization data types, see the docs on `rds`_ and `rda`_ data files.

.. _pyreadr: https://github.com/ofajardo/pyreadr

.. _rds: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/readRDS

.. _rda: https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/save


.. _io.stata:

Stata format
@@ -5958,6 +6257,7 @@ outside of this range, the variable is cast to ``int16``.
115 dta file format. Attempting to write *Stata* dta files with strings
longer than 244 characters raises a ``ValueError``.


.. _io.stata_reader:

Reading from Stata format