Google Summer of Code

Introduction

This is the GSoC'16 ideas page for pandas.

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It has become a centerpiece of the PyData stack.

This page lists ideas for Google Summer of Code projects for pandas and gives potential GSoC students some pointers on how to start contributing and put together an application.

Guidelines & requirements

pandas participates in GSoC 2016 under the umbrella of the Python Software Foundation / NumFOCUS.

PSF student guidelines: http://wiki.python.org/moin/SummerOfCode/Expectations

Advice on writing a proposal (written with the Mailman project in mind, but generally applicable): http://wiki.list.org/DEV/SPAM

We expect students to be at least comfortable with Python (intermediate level). Some projects may also require Cython or C/C++ skills. Knowing how to use Git is also important, though this can be learned before the official start of GSoC if needed.

Advice

Potential candidates should take a look at the guidelines on how to contribute to pandas; see the documentation here. Making a small enhancement, bug fix, or documentation fix to pandas (it does not need to be related to your proposal) before applying for GSoC is a requirement from the PSF; it can also give you an idea of how things would work during GSoC.

Start on your proposal early, post a draft to the mailing list and iterate based on the feedback you receive. This will not only improve the quality of your proposal, but also help you find a suitable mentor.

Projects

IO on R datasets

Difficulty	Involves
Intermediate	python, cython, c, c++
Add `to_r_data_set`, `from_r_data_set`, or similar: a performant interface for reading and writing R datasets directly (to/from disk). Some references are here, and here, and here, with conversations here and here. There are a couple of issues to deal with. `picklr` is not license compatible, though it uses `xdr`, which would be nice as it's a built-in Python module, whereas `ReadStat` would need substantial wrapping. It might be easy to translate the Julia code almost directly to Cython. The mentor will be devasena prasad.

sparse enhancement / fixes

Difficulty	Involves
Intermediate	python, cython
Implement some missing features in sparse; see the open issues here. Sparse needs some TLC in its interactions and compatibility with other pandas objects.

weights support in most stat functions

Difficulty	Involves
Novice	python, cython
See the issue here. This basically means adding a `weights` argument to things like `.mean`. This is a relatively straightforward project and would be suitable for a newcomer to pandas.
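As a rough illustration of the semantics such a `weights` argument might have, here is a plain-Python sketch of a weighted mean. The function name `weighted_mean`, the `skipna` flag, and the choice to drop a value whenever either it or its weight is NaN are assumptions made for this example; they are not taken from the linked issue or from the pandas API.

```python
import math


def weighted_mean(values, weights, skipna=True):
    """Weighted mean: sum(v * w) / sum(w).

    Hypothetical helper, not a pandas function. Pairs where the value
    or the weight is NaN are skipped when ``skipna`` is true; otherwise
    the result is NaN as soon as one is encountered.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")

    total = 0.0
    weight_sum = 0.0
    for v, w in zip(values, weights):
        if math.isnan(v) or math.isnan(w):
            if skipna:
                continue
            return math.nan
        total += v * w
        weight_sum += w

    # No usable weight: the mean is undefined.
    if weight_sum == 0:
        return math.nan
    return total / weight_sum


result = weighted_mean([1.0, 2.0, 4.0], [1, 1, 2])
```

A real implementation would also need to decide how weights interact with `groupby`, `axis`, and the other reductions; the sketch only pins down the arithmetic.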

IO

Difficulty	Involves
Intermediate	python, cython, cyavro, parquet-cpp
Add input/output connectors and `to_` / `from_` support for the binary formats listed below. Existing libraries would do the actual reading and writing; this item covers integration and shipping within pandas. It requires some knowledge of the outside library and a bit of pandas internals. These libraries would allow pandas to interact better with parts of the big data ecosystem. These could be done as multiple independent projects.

avro support
parquet support
BSON?
RData

numba

Difficulty	Involves
Intermediate	python, cython, numba

Construct a general interface (possibly via a numba extension) to allow automatic direction of code to numba via `.apply`.
Use ahead-of-time code generation for groupby (and other algos), rather than direct templating.
These are two separate projects; both require some knowledge of pandas internals as well as familiarity with numba.

period

Difficulty	Involves
Advanced	python, cython

Implement as a subclass of IntervalIndex.
Make an extension dtype and integrate it as a first-class object in pandas.
A pretty deep knowledge of pandas internals is needed. This could actually be two separate, somewhat independent projects.

datetime

Difficulty	Involves
Advanced	python, cython
Support for non-ns dtypes. Deep knowledge of pandas internals is needed. It would be helpful to understand `numpy` dtype mechanisms. A master issue is here.
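One motivation for non-ns dtypes is range: a signed 64-bit count of nanoseconds since the Unix epoch runs out in the year 2262, while coarser units reach much further. The stdlib-only sketch below computes that limit; the helper name `max_timestamp` is hypothetical and the code does not use pandas or `numpy`.

```python
from datetime import datetime, timedelta

INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1)


def max_timestamp(units_per_second):
    """Latest instant representable as an int64 count of a time unit.

    ``units_per_second`` is 10**9 for nanoseconds, 10**6 for
    microseconds, and so on. Returns None when the limit lies beyond
    what ``datetime`` itself can represent (year 9999).
    """
    seconds, remainder = divmod(INT64_MAX, units_per_second)
    # datetime only resolves microseconds, so finer units are truncated.
    micros = remainder * 1_000_000 // units_per_second
    try:
        return EPOCH + timedelta(seconds=seconds, microseconds=micros)
    except OverflowError:
        return None


ns_limit = max_timestamp(10**9)   # nanosecond resolution
us_limit = max_timestamp(10**6)   # microsecond resolution
```

With nanoseconds the limit is 11 April 2262; with microseconds it already exceeds the range of the standard library's `datetime`, which is why the helper returns `None` there.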

`pd.String` dtype

Difficulty	Involves
Advanced	python, cython
Construct a new dtype to support string operations (moved from `object`). An understanding of `Categorical` is required, as well as deep knowledge of pandas internals. It would be helpful to understand `numpy` dtype mechanisms.
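For readers unfamiliar with the idea behind `Categorical`, the sketch below shows dictionary encoding of strings in plain Python: each distinct value is stored once and the data become small integer codes, with -1 marking a missing value. The function `dictionary_encode` is a hypothetical illustration of the concept only; it is not how pandas implements `Categorical`, and nothing here is a proposed design for the new dtype.

```python
def dictionary_encode(values):
    """Split values into integer codes plus a list of unique categories.

    Categories are kept in order of first appearance; ``None`` is
    treated as missing and encoded as -1.
    """
    categories = []
    position = {}   # value -> index into categories
    codes = []
    for value in values:
        if value is None:
            codes.append(-1)
            continue
        if value not in position:
            position[value] = len(categories)
            categories.append(value)
        codes.append(position[value])
    return codes, categories


def dictionary_decode(codes, categories):
    """Inverse of dictionary_encode."""
    return [None if code == -1 else categories[code] for code in codes]


codes, categories = dictionary_encode(["a", "b", "a", None, "c", "b"])
```

String operations can then run once per category rather than once per element, which is one reason the two features are related.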

enhance meta-data propagation

Difficulty	Involves
Advanced	python
Allow more meta-data to be attached to pandas objects, and propagate it in the common cases. A discussion of this is here.

`Panel` / `xarray` integration

Difficulty	Involves
Intermediate	python, pandas, xarray
Flesh out the additional features needed to fully support `Panel` operations and implement them in `xarray`. See whether some selected operations can be ported from `xarray` back to pandas. This would be a relatively straightforward project and would be suitable for a newcomer to pandas.

unit dtype / support

Difficulty	Involves
Intermediate	python, pandas, matplotlib
Deep knowledge of pandas internals is needed. It would be helpful to understand `numpy` dtype mechanisms.
There is possible collaboration with `matplotlib`. More discussion is here.

Novice level bug fixes / enhancements

Difficulty	Involves
Novice	python, pandas
a list of novice level bug fixes / enhancements

Intermediate level bug fixes / enhancements

Difficulty	Involves
Intermediate	python, pandas
a list of intermediate level bug fixes / enhancements

Advanced level bug fixes / enhancements

Difficulty	Involves
Advanced	python, pandas, cython
a list of advanced level bug fixes / enhancements

libpandas refactor

Difficulty	Involves
Advanced	python, pandas, c++
An internal refactor of pandas that preserves the user-facing API and the developer API (numpy). Requires deep knowledge of C++ and pandas internals.
