Skip to content

ENH: IO support for R data files with C extension #41386

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 51 commits into from

Conversation

ParfaitG
Copy link
Contributor

@ParfaitG ParfaitG commented May 8, 2021

Proposed RData I/O module interfaces to the C library: librdata.

Overall, this PR includes following changes:

  • setup.py: pandas/setup.py (new rdata section at the bottom)
  • librdata: pandas/_libs/src/librdata (C and header files and iconv scripts)
  • rdata IO: pandas/io/rdata (Cython and Python scripts)
  • frame.py: pandas/core/frame.py (for DataFrame.to_rdata)
  • tests: pandas/tests/io/test_rdata.py
  • tests data: pandas/tests/io/data/rdata (R data files in gzip compression)
  • docs: librdata license, user_guide/io.rst, whatsnew/v1.3.0.rst

Note: special handling of iconv, a system resource built-in to Unix machines, is required:

  • For Linux, to centralize from different locations, the iconv.h header of the GNU C library is included.
  • For Mac, setup.py points to system folders, /usr/include and /usr/lib, which may differ from users installs.
  • For Windows , since iconv is not built-in, two counterpart files (.h and .c script) were added from this repo, win-iconv, where its readme indicates code is placed in the public domain.

ParfaitG added 30 commits April 11, 2021 12:53
@ParfaitG ParfaitG requested review from bashtage and jreback May 18, 2021 13:58
@jreback jreback added the IO Data IO issues that don't fit into a more specific label label May 18, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ParfaitG is this a different format than pyarrow reads/writes? (on the r side) sorry it obviously is, what i mean is, are there utilities / code in the pyarrow r project to read write this format?

.. ipython:: python

rda_file = os.path.join(file_path, "env_data_dfs.rda")
env_dfs = pd.read_rdata(rda_file, select_frames=["sea_ice_df"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah for read_hdf we do this by forcing the user to have a key. are these always an ordered list? or a keyed list?

@ParfaitG
Copy link
Contributor Author

@jreback - These binary formats (RData, rda, rds) are part of base R (or the standard library of R) as native serialization types. Methods to create them ship with every installation of R. The R Core Team maintain these types. AFAIK, while the arrow package of R and pyarrow of Python can read text files like csv and json and compress to gz and bzip2 types, they do not read or write to these R data formats. From what I see, the only binary types Arrow supports are parquet and feather. And from McKinney and Wickham blogs, I believe these formats serve as alternatives for their speed in read/write and language-independent interoperability.

@ParfaitG
Copy link
Contributor Author

What is the status here? I see no further response. I can rebase but this ENH may not make it into v1.3. I fully understand that this PR with new C extension is pretty involved and we want to be conservative on large enhancements in pandas code base.

@jreback
Copy link
Contributor

jreback commented Jun 20, 2021

@ParfaitG have been busy but will take a look

Copy link
Contributor

@bashtage bashtage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments on setup.

setup.py Outdated
if name == "io.rdata._rdata" and is_platform_mac():
# non-conda builds must adjust paths to libiconv .h and lib dirs
include = [
os.path.join(os.environ["CONDA_PREFIX"], "include"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you check if CONDA_PREFIX is defined? If it isn't, presumably this isn't conda.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this catch. Turns out this if block is not needed. I removed it.

setup.py Outdated
@@ -364,6 +370,12 @@ def run(self):
# https://github.com/pandas-dev/pandas/issues/35559
extra_compile_args.append("-Wno-error=unreachable-code")

# rdata requires system iconv library
os.environ["DYLD_LIBRARY_PATH"] = ""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overriding the users DYLC_LIBRARY_PATH feels like the wrong solution. Why is this needed? Does this change this path for the duration of the console session, even outside of the setup run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is not needed and has been removed.

@@ -364,6 +370,12 @@ def run(self):
# https://github.com/pandas-dev/pandas/issues/35559
extra_compile_args.append("-Wno-error=unreachable-code")

# rdata requires system iconv library
os.environ["DYLD_LIBRARY_PATH"] = ""
rdata_includes = ["/usr/include"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably these should be no on Windows

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. This section is under the if is_platform_mac() condition.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to push all of the c-code to a separate package? why is this a good (or not good solution in this case)? i am concerned that this is just creating even bigger wheels and more maintenance for all users.

@ParfaitG
Copy link
Contributor Author

If you recall, we explored a Python package for this IO module to read R data files but it had a restrictive license. See previous PR in my first checked item at top. That package was a wrapper to this same C library that we directly interfaced with here, avoiding another soft dependency for pandas.

Regarding size, this librdata proclaims itself as a small, lightweight C library. Its source files with two added scripts for Windows totals 177 KB (slightly larger than ujson at 175 KB). Also, the only compiled Cython _rdata module totals 186 KB for Python 3.9 on Linux, likely about the same for Py 3.7/3.8. By comparison, SAS's compiled Cython module totals 256 KB. Corresponding _rdata.c comes to 533 KB (with SAS's _sas.c at 1.0 MB).

@github-actions
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Aug 18, 2021
@ParfaitG ParfaitG closed this Aug 18, 2021
@ParfaitG ParfaitG deleted the rdata_c branch August 18, 2021 00:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO Data IO issues that don't fit into a more specific label Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants