ENH: IO support for R data files with C extension #41386

ParfaitG · 2021-05-08T18:46:09Z

follows up on ENH: IO support for R data files with pandas.read_rdata and DataFrame.to_rdata #40884
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry
user_guide/io entry

Proposed RData I/O module interfaces to the C library: librdata.

Overall, this PR includes the following changes:

setup.py: pandas/setup.py (new rdata section at the bottom)
librdata: pandas/_libs/src/librdata (C and header files and iconv scripts)
rdata IO: pandas/io/rdata (Cython and Python scripts)
frame.py: pandas/core/frame.py (for DataFrame.to_rdata)
tests: pandas/tests/io/test_rdata.py
tests data: pandas/tests/io/data/rdata (R data files in gzip compression)
docs: librdata license, user_guide/io.rst, whatsnew/v1.3.0.rst

Note: special handling of iconv, a system resource built into Unix machines, is required:

For Linux, the iconv.h header of the GNU C library is included so that it is found in one place rather than in differing system locations.
For Mac, setup.py points to the system folders /usr/include and /usr/lib, which may differ from users' installs.
For Windows, since iconv is not built in, two counterpart files (a .h header and a .c source file) were added from the win-iconv repo, whose readme indicates the code is placed in the public domain.
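
As a rough illustration of the per-platform handling described above, the build options could be selected along these lines. This is only a sketch, not the code in this PR: the helper name `rdata_build_options`, the `extra_sources` key, and the file name `win_iconv.c` are hypothetical, and the Mac paths are simply the ones named in the note.

```python
import sys


def rdata_build_options(platform=sys.platform):
    """Return hypothetical Extension keyword arguments for the rdata module."""
    opts = {
        "include_dirs": [],
        "library_dirs": [],
        "libraries": [],
        "extra_sources": [],
    }
    if platform.startswith("linux"):
        # iconv is part of the GNU C library, so nothing extra is linked
        # in this sketch.
        pass
    elif platform == "darwin":
        # System locations named in the PR description; a user's install
        # may keep libiconv elsewhere.
        opts["include_dirs"].append("/usr/include")
        opts["library_dirs"].append("/usr/lib")
        opts["libraries"].append("iconv")
    elif platform.startswith("win"):
        # iconv is not built in on Windows, so a bundled win-iconv source
        # file (name assumed here) is compiled with the extension.
        opts["extra_sources"].append("win_iconv.c")
    return opts
```

The returned dictionary would then be merged into whatever arguments the extension definition already uses; keeping the platform choice in one function makes it easy to check each branch without building anything.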

jreback

@ParfaitG is this a different format than pyarrow reads/writes (on the R side)? Sorry, it obviously is; what I mean is, are there utilities / code in the pyarrow R project to read and write this format?

jreback · 2021-05-18T14:46:08Z

doc/source/user_guide/io.rst

+.. ipython:: python
+
+   rda_file = os.path.join(file_path, "env_data_dfs.rda")
+   env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"])


yeah for read_hdf we do this by forcing the user to have a key. are these always an ordered list? or a keyed list?

ParfaitG · 2021-05-18T18:51:18Z

@jreback - These binary formats (RData, rda, rds) are part of base R (the standard library of R) as native serialization types. Methods to create them ship with every installation of R, and the R Core Team maintains these types. AFAIK, while the arrow package of R and pyarrow of Python can read text files like csv and json and compress to gz and bzip2 types, they do not read or write these R data formats. From what I see, the only binary types Arrow supports are parquet and feather. And from the McKinney and Wickham blog posts, I believe these formats serve as alternatives for their read/write speed and language-independent interoperability.

ParfaitG · 2021-06-19T14:46:16Z

What is the status here? I see no further response. I can rebase, but this ENH may not make it into v1.3. I fully understand that this PR, with its new C extension, is pretty involved and that we want to be conservative about large enhancements to the pandas code base.

jreback · 2021-06-20T01:22:29Z

@ParfaitG have been busy but will take a look

bashtage

Some comments on setup.

bashtage · 2021-06-21T08:40:53Z

setup.py

+    if name == "io.rdata._rdata" and is_platform_mac():
+        # non-conda builds must adjust paths to libiconv .h and lib dirs
+        include = [
+            os.path.join(os.environ["CONDA_PREFIX"], "include"),


Should you check if CONDA_PREFIX is defined? If it isn't, presumably this isn't conda.

ParfaitG · 2021-06-22

Thanks for this catch. It turns out this if block is not needed, so I removed it.
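
For reference, the guard the reviewer asks about could look something like the sketch below. It is illustrative only, since the block was removed from the PR; the helper name `conda_iconv_dirs` and the `lib` subdirectory are assumptions.

```python
import os


def conda_iconv_dirs(environ=os.environ):
    """Return (include_dirs, library_dirs) under CONDA_PREFIX, or empty lists."""
    # .get() avoids the KeyError that os.environ["CONDA_PREFIX"] raises
    # when the build is not running inside a conda environment.
    prefix = environ.get("CONDA_PREFIX")
    if not prefix:
        return [], []
    return [os.path.join(prefix, "include")], [os.path.join(prefix, "lib")]
```

Passing the environment as a parameter keeps the lookup easy to exercise with a plain dictionary instead of the real process environment.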

bashtage · 2021-06-21T08:41:43Z

setup.py

@@ -364,6 +370,12 @@ def run(self):
    # https://github.com/pandas-dev/pandas/issues/35559
    extra_compile_args.append("-Wno-error=unreachable-code")

+    # rdata requires system iconv library
+    os.environ["DYLD_LIBRARY_PATH"] = ""


Overriding the user's DYLD_LIBRARY_PATH feels like the wrong solution. Why is this needed? Does this change the path for the duration of the console session, even outside of the setup run?

ParfaitG · 2021-06-22

This line is not needed and has been removed.
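
On the scope question: assigning to os.environ changes the environment of the running Python process and of any child processes it starts afterwards, not the shell that launched setup.py. If an override were ever needed, it could be limited to a block of code, as in this sketch (the helper name `temporary_env` is hypothetical and not part of this PR):

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(name, value):
    """Set an environment variable for the duration of a with-block only."""
    missing = object()  # sentinel: distinguishes "unset" from an empty string
    previous = os.environ.get(name, missing)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the earlier state even if the body raised.
        if previous is missing:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

A build step would wrap only the call that needs the changed value, for example `with temporary_env("DYLD_LIBRARY_PATH", ""): ...`, so the user's setting is back in place once the block exits.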

bashtage · 2021-06-21T08:41:59Z

setup.py

@@ -364,6 +370,12 @@ def run(self):
    # https://github.com/pandas-dev/pandas/issues/35559
    extra_compile_args.append("-Wno-error=unreachable-code")

+    # rdata requires system iconv library
+    os.environ["DYLD_LIBRARY_PATH"] = ""
+    rdata_includes = ["/usr/include"]


Presumably these should not be set on Windows.

ParfaitG · 2021-06-22

Correct. This section is under the if is_platform_mac() condition.

jreback

is it possible to push all of the C code to a separate package? why is this a good (or not good) solution in this case? i am concerned that this is just creating even bigger wheels and more maintenance for all users.

ParfaitG · 2021-06-22T02:31:03Z

If you recall, we explored a Python package for this IO module to read R data files, but it had a restrictive license. See the previous PR (#40884) in my first checked item at the top. That package was a wrapper around the same C library that we interface with directly here, which avoids another soft dependency for pandas.

Regarding size, librdata describes itself as a small, lightweight C library. Its source files, together with the two added files for Windows, total 177 KB (slightly larger than ujson at 175 KB). Also, the only compiled Cython module, _rdata, totals 186 KB for Python 3.9 on Linux, likely about the same for Py 3.7/3.8. By comparison, SAS's compiled Cython module totals 256 KB. The corresponding generated _rdata.c comes to 533 KB (with SAS's _sas.c at 1.0 MB).
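
A size figure like the ones above can be reproduced with a few lines of Python. This is a minimal sketch: the helper name `total_size_kb` is hypothetical, and counting only `.c` and `.h` files is an assumption about what was measured.

```python
from pathlib import Path


def total_size_kb(directory, suffixes=(".c", ".h")):
    """Sum the sizes of matching files under a directory, in KB."""
    total_bytes = sum(
        p.stat().st_size
        for p in Path(directory).rglob("*")
        if p.is_file() and p.suffix in suffixes
    )
    return total_bytes / 1024
```

Pointing it at the librdata source directory listed in the description (pandas/_libs/src/librdata) would give a number comparable to the source totals quoted here.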

github-actions · 2021-08-18T00:02:38Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

ParfaitG added 30 commits April 11, 2021 12:53

ENH: Add IO support for R data files with pandas.read_rdata and DataFrame.to_rdata

d1d3e4f

Fix rebase issues in whatsnew and type style in frame.py

de848dd

Fix skipif logic for test params, move package checks, add to test_api

3379fa1

Refactor from built-in filter, add encoding to subprocess and locale skip

966cb78

Fix tests for OS newline and mypy, mark xfail, use default mode in io.rst

22c7ade

Added needed test skips and fixed io docs ref in whatsnew

8b1aa9c

Merge remote-tracking branch 'upstream/master' into rdata_io

41f817f

Remove rscript implementation from code, tests, and docs

2341dff

Merge remote-tracking branch 'upstream/master' into rdata_io

1f8f033

Fix duplicate entry in ci dep yaml

a5983e0

Refactor to handle binary content, add datetime notes in docs

e78bf6e

Merge remote-tracking branch 'upstream/master' into rdata_io

1475281

Merge remote-tracking branch 'upstream/master' into rdata_io

7e0c152

ENH: IO support for R data files with C extension

bd7dde6

Move C src files to _libs directory

140ea04

Adjust C src files to conform to cpplint

1ef9e9a

Fix C src warnings raised as compiled errors

770b810

Merge remote-tracking branch 'upstream/master' into rdata_c

d2f3746

Remove pyreadr listing in yml files and docs

5ce5c05

Fix C src warnings, syntax, and add unix_iconv.h

f9a23cd

Fix docstring issue and add mac_iconv.h

952889f

Adjust Cython scripts to fix write rdata for Windows, revert Mac iconv

4a0cf89

Merge remote-tracking branch 'upstream/master' into rdata_c

83bc859

Remove quotes in include iconv.h line of C source

09f2005

Add liconv to extra_link_args for Mac OS build

a9da74a

Slight fix to liconv in extra_link_args for Mac OS build

40862c5

Adjust rdata include_dirs for libiconv on Mac OS

6396819

Add library_dirs to find libiconv on Mac OS

749a04e

Merge remote-tracking branch 'upstream/master' into rdata_c

e862057

Resolve rdata extension name for compilation

f5ab7cd

ParfaitG added 6 commits May 16, 2021 18:26

Merge remote-tracking branch 'upstream/master' into rdata_c

ab06b2b

Replace integer for float in timestamps to fit 32-bit limit

7299ee5

Use C long long for large timevalue to work on 32 and 64-bit

6a35bfa

Adjust timestamps in test to work on 32 and 64-bit machines

7b35651

Add skip for 32-bit in dtypes test

0ab02ec

Merge remote-tracking branch 'upstream/master' into rdata_c

fa3dbc1

ParfaitG requested review from bashtage and jreback May 18, 2021 13:58

jreback added the IO Data label May 18, 2021

jreback requested changes May 18, 2021

ParfaitG added 2 commits May 18, 2021 23:05

Adjust rdata section of user_guide/io.rst docs

a51f8de

Merge remote-tracking branch 'upstream/master' into rdata_c

835e998

simonjayhawkins added the Enhancement label May 25, 2021

bashtage reviewed Jun 21, 2021

ParfaitG added 4 commits June 21, 2021 14:58

Fix merge conflicts

0e3dc79

Adjust setup.py per comments

67613aa

Merge remote-tracking branch 'upstream/master' into rdata_c

dd26eb9

Remove conda prefix condition for mac in setup.py

dc56c82

jreback requested changes Jun 21, 2021

ParfaitG added 3 commits June 23, 2021 14:20

Merge remote-tracking branch 'upstream/master' into rdata_c

c0f6c68

Remove extraneous lines

a56cf38

Merge remote-tracking branch 'upstream/master' into rdata_c

dde183b

github-actions bot added the Stale label Aug 18, 2021

ParfaitG closed this Aug 18, 2021

ParfaitG deleted the rdata_c branch August 18, 2021 00:23
