Add a Roadmap #27478
Merged
965ecd1
added roadmap
TomAugspurger 98656c8
added roadmap
TomAugspurger 12f1f67
update roadmap
TomAugspurger fb844ae
move to development
TomAugspurger c640c73
Merge remote-tracking branch 'upstream' into roadmap
TomAugspurger c310370
indexing
TomAugspurger d5573bb
Merge remote-tracking branch 'upstream' into roadmap
TomAugspurger 4aef936
typos
TomAugspurger 200ac63
numba
TomAugspurger 8dbd981
reword
TomAugspurger 9ac38f0
arrow
TomAugspurger d2883c4
Merge remote-tracking branch 'upstream' into roadmap
TomAugspurger 8c65297
Merge remote-tracking branch 'upstream' into roadmap
TomAugspurger 4e1af82
Intro
TomAugspurger 5702a18
cleanup
TomAugspurger 755a5e4
case
TomAugspurger a549cf7
str
TomAugspurger da01cb4
added evolution
TomAugspurger b52d6b9
typos
TomAugspurger bf1338b
missing function
TomAugspurger fb6980c
scope and ML
TomAugspurger 85cf5ee
Merge remote-tracking branch 'upstream/master' into roadmap
TomAugspurger 65653ee
add note on in / out
TomAugspurger c3b5b5f
Update doc/source/development/roadmap.rst
TomAugspurger d3c9424
Update doc/source/development/roadmap.rst
TomAugspurger a10f78c
Update doc/source/development/roadmap.rst
TomAugspurger 6a05c2b
link to tracker
TomAugspurger ce5a2e0
numba link
TomAugspurger 7ac38b5
Merge remote-tracking branch 'upstream/master' into roadmap
TomAugspurger ecdffeb
fix link
TomAugspurger
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,3 +16,4 @@ Development | |
internals | ||
extending | ||
developer | ||
roadmap |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,193 @@ | ||
.. _roadmap: | ||
|
||
======= | ||
Roadmap | ||
======= | ||
|
||
This page provides an overview of the major themes in pandas' development. Each of | ||
these items requires a relatively large amount of effort to implement. These may | ||
be achieved more quickly with dedicated funding or interest from contributors. | ||
|
||
An item being on the roadmap does not mean that it will *necessarily* happen, even | ||
with unlimited funding. During the implementation period we may discover issues | ||
preventing the adoption of the feature. | ||
|
||
Additionally, an item *not* being on the roadmap does not exclude it from inclusion | ||
in pandas. The roadmap is intended for larger, fundamental changes to the project that | ||
are likely to take months or years of developer time. Smaller-scoped items will continue | ||
to be tracked on our `issue tracker <https://github.com/pandas-dev/pandas/issues>`__. | ||
|
||
See :ref:`roadmap.evolution` for proposing changes to this document. | ||
|
||
Extensibility | ||
------------- | ||
|
||
Pandas :ref:`extending.extension-types` allow for extending NumPy types with custom | ||
data types and array storage. Pandas uses extension types internally, and provides | ||
an interface for 3rd-party libraries to define their own custom data types. | ||
|
||
Many parts of pandas still unintentionally convert data to a NumPy array. | ||
These problems are especially pronounced for nested data. | ||
|
||
We'd like to improve the handling of extension arrays throughout the library, | ||
making their behavior more consistent with the handling of NumPy arrays. We'll do this | ||
by cleaning up pandas' internals and adding new methods to the extension array interface. | ||
|
||
String data type | ||
---------------- | ||
|
||
Currently, pandas stores text data in an ``object``-dtype NumPy array. | ||
The current implementation has two primary drawbacks. First, ``object``-dtype | ||
is not specific to strings: any Python object can be stored in an ``object``-dtype | ||
array, not just strings. Second, it is not efficient: the NumPy memory model | ||
isn't especially well-suited to variable-width text data. | ||
|
||
To solve the first issue, we propose a new extension type for string data. This | ||
will initially be opt-in, with users explicitly requesting ``dtype="string"``. | ||
The array backing this string dtype may initially be the current implementation: | ||
an ``object``-dtype NumPy array of Python strings. | ||
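A minimal sketch of the opt-in behavior described above. It assumes a pandas version in which the proposed ``dtype="string"`` extension type is available; the sample values are illustrative only.

```python
import pandas as pd

# An object-dtype Series accepts any Python object, not just text.
mixed = pd.Series(["pandas", 1, None], dtype=object)

# The dedicated string dtype is opt-in: it must be requested explicitly.
text = pd.Series(["pandas", "arrow", None], dtype="string")

# String methods on the opt-in dtype return the string dtype again.
upper = text.str.upper()
```

Here ``mixed`` silently holds an integer alongside text, which is the first drawback the section names; ``text`` declares its contents as strings.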
|
||
To solve the second issue (performance), we'll explore alternative in-memory | ||
array libraries (for example, Apache Arrow). As part of the work, we may | ||
need to implement certain operations expected by pandas users (for example, | ||
the algorithm used in ``Series.str.upper``). That work may be done outside of | ||
pandas. | ||
|
||
Apache Arrow interoperability | ||
----------------------------- | ||
|
||
`Apache Arrow <https://arrow.apache.org>`__ is a cross-language development | ||
platform for in-memory data. The Arrow logical types are closely aligned with | ||
typical pandas use cases. | ||
|
||
We'd like to provide better-integrated support for Arrow memory and data types | ||
within pandas. This will let us take advantage of its I/O capabilities and | ||
provide for better interoperability with other languages and libraries | ||
using Arrow. | ||
|
||
Block manager rewrite | ||
--------------------- | ||
|
||
We'd like to replace pandas' current internal data structures (a collection of | ||
1 or 2-D arrays) with a simpler collection of 1-D arrays. | ||
|
||
Pandas' internal data model is quite complex. A DataFrame is made up of | ||
one or more 2-dimensional "blocks", with one or more blocks per dtype. This | ||
collection of 2-D arrays is managed by the BlockManager. | ||
|
||
The primary benefit of the BlockManager is improved performance on certain | ||
operations (construction from a 2D array, binary operations, reductions across the columns), | ||
especially for wide DataFrames. However, the BlockManager substantially increases the | ||
complexity and maintenance burden of pandas. | ||
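The difference between the two layouts can be sketched in plain Python. The dictionaries and helper functions below are hypothetical, heavily simplified stand-ins for illustration; they are not pandas' actual internal classes.

```python
# Block-style layout: columns sharing a dtype are consolidated into one
# 2-D block, and a placement list records which frame columns it holds.
blocks = [
    {"dtype": "int64", "placement": [0, 2], "values": [[1, 2, 3], [7, 8, 9]]},
    {"dtype": "float64", "placement": [1], "values": [[0.1, 0.2, 0.3]]},
]


def get_column_from_blocks(blocks, position):
    """Find the block that holds ``position`` and return its row of values."""
    for block in blocks:
        if position in block["placement"]:
            row = block["placement"].index(position)
            return block["values"][row]
    raise IndexError(position)


# Column-style layout: one independent 1-D array per column.
columns = [[1, 2, 3], [0.1, 0.2, 0.3], [7, 8, 9]]


def get_column_from_arrays(columns, position):
    """Column lookup is a direct positional access."""
    return columns[position]
```

Both layouts hold the same data, but the block layout needs the extra placement bookkeeping for every lookup, which hints at where the added complexity comes from.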
|
||
By replacing the BlockManager we hope to achieve | ||
|
||
* Substantially simpler code | ||
* Easier extensibility with new logical types | ||
* Better user control over memory use and layout | ||
* Improved micro-performance | ||
* Option to provide a C / Cython API to pandas' internals | ||
|
||
See `these design documents <https://dev.pandas.io/pandas2/internal-architecture.html#removal-of-blockmanager-new-dataframe-internals>`__ | ||
for more. | ||
|
||
|
||
Decoupling of indexing and internals | ||
------------------------------------ | ||
|
||
The code for getting and setting values in pandas' data structures needs refactoring. | ||
In particular, we must clearly separate code that converts keys (e.g., the argument | ||
to ``DataFrame.loc``) to positions from code that uses these positions to get | ||
or set values. This is related to the proposed BlockManager rewrite. Currently, the | ||
BlockManager sometimes uses label-based, rather than position-based, indexing. | ||
We propose that it should only work with positional indexing, and the translation of keys | ||
to positions should be entirely done at a higher level. | ||
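The proposed separation can be sketched as two layers. The function names ``labels_to_positions`` and ``take`` are hypothetical, and the sketch assumes unique labels; it illustrates the idea rather than pandas' implementation.

```python
def labels_to_positions(labels, key):
    """Higher level: translate a label, list of labels, or label slice
    into a list of integer positions."""
    lookup = {label: i for i, label in enumerate(labels)}
    if isinstance(key, slice):
        start = lookup[key.start] if key.start is not None else 0
        # Label-based slices include the end label.
        stop = lookup[key.stop] + 1 if key.stop is not None else len(labels)
        return list(range(start, stop))
    if isinstance(key, list):
        return [lookup[k] for k in key]
    return [lookup[key]]


def take(values, positions):
    """Lower level: knows nothing about labels, only integer positions."""
    return [values[i] for i in positions]


labels = ["a", "b", "c", "d"]
values = [10, 20, 30, 40]
selected = take(values, labels_to_positions(labels, slice("b", "c")))
```

With this split, the storage layer (``take``) never sees a label, so it could be swapped out without touching the key-handling rules.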
|
||
Indexing is a complicated API with many subtleties. This refactor will require care | ||
and attention. More details are discussed at | ||
https://github.com/pandas-dev/pandas/wiki/(Tentative)-rules-for-restructuring-indexing-code | ||
|
||
Numba-accelerated operations | ||
---------------------------- | ||
|
||
`Numba <https://numba.pydata.org>`__ is a JIT compiler for Python code. We'd like to provide | ||
ways for users to apply their own Numba-jitted functions where pandas accepts user-defined functions | ||
(for example, :meth:`Series.apply`, :meth:`DataFrame.apply`, :meth:`DataFrame.applymap`, | ||
and in groupby and window contexts). This will improve the performance of | ||
user-defined functions in these operations by staying within compiled code. | ||
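As a rough illustration of the kind of user-defined function in question, the sketch below jit-compiles a simple reduction and applies it by hand to a Series' underlying array. The function ``value_range`` is hypothetical, and the fallback decorator exists only so the snippet still runs where Numba is not installed; this is not the integration the roadmap proposes.

```python
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption: without Numba, fall back to the plain Python function.
    def njit(func):
        return func


@njit
def value_range(values):
    """Difference between the largest and smallest value."""
    lo = values[0]
    hi = values[0]
    for v in values:
        if v < lo:
            lo = v
        if v > hi:
            hi = v
    return hi - lo


s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])
result = value_range(s.to_numpy())
```

The goal described above is for pandas to accept such functions directly in ``apply``, groupby, and window operations, so the whole loop over groups or windows stays in compiled code.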
|
||
|
||
Documentation improvements | ||
-------------------------- | ||
|
||
We'd like to improve the content, structure, and presentation of the pandas documentation. | ||
Some specific goals include | ||
|
||
* Overhaul the HTML theme with a modern, responsive design (:issue:`15556`) | ||
* Improve the "Getting Started" documentation, designing and writing learning paths | ||
  for users with different backgrounds (e.g. brand new to programming, familiar with | ||
  other languages like R, already familiar with Python). | ||
* Improve the overall organization of the documentation and specific subsections | ||
  of the documentation to make navigation and finding content easier. | ||
|
||
Package docstring validation | ||
---------------------------- | ||
|
||
To improve the quality and consistency of pandas docstrings, we've developed | ||
tooling to check docstrings in a variety of ways. | ||
https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py | ||
contains the checks. | ||
|
||
Like many other projects, pandas uses the | ||
`numpydoc <https://numpydoc.readthedocs.io/en/latest/>`__ style for writing | ||
docstrings. With the collaboration of the numpydoc maintainers, we'd like to | ||
move the checks to a package other than pandas so that other projects can easily | ||
use them as well. | ||
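To give a sense of what such a check involves, here is a function documented in numpydoc style together with a deliberately crude validator. ``undocumented_parameters`` is a hypothetical helper written for this illustration; it is far simpler than, and not taken from, ``validate_docstrings.py``.

```python
import inspect


def scale(values, factor=2):
    """
    Multiply each value by a constant factor.

    Parameters
    ----------
    values : list of float
        Numbers to scale.
    factor : float, default 2
        Multiplier applied to every value.

    Returns
    -------
    list of float
        The scaled values.
    """
    return [v * factor for v in values]


def undocumented_parameters(func):
    """Return signature parameters with no ``name : type`` docstring entry."""
    doc = inspect.getdoc(func) or ""
    documented = {
        line.split(" : ")[0].strip()
        for line in doc.splitlines()
        if " : " in line and not line.startswith(" ")
    }
    return [
        name
        for name in inspect.signature(func).parameters
        if name not in documented
    ]
```

Calling ``undocumented_parameters(scale)`` returns an empty list because both parameters are documented; a function with a one-line docstring would have its parameters reported.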
|
||
|
||
Performance monitoring | ||
---------------------- | ||
|
||
Pandas uses `airspeed velocity <https://asv.readthedocs.io/en/stable/>`__ to | ||
monitor for performance regressions. ASV itself is a fabulous tool, but requires | ||
some additional work to be integrated into an open source project's workflow. | ||
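For readers unfamiliar with ASV, a benchmark is an ordinary Python class whose ``time_``-prefixed methods are timed, with ``setup`` run beforehand. The class below is a hypothetical example using only the standard library, not one of pandas' actual benchmarks.

```python
class StringUpper:
    """Hypothetical benchmark: upper-casing a list of short strings."""

    def setup(self):
        # Runs before timing, so data creation is excluded from the result.
        self.words = ["pandas"] * 10_000

    def time_upper(self):
        # ASV measures how long this method takes to run.
        [word.upper() for word in self.words]
```

Tracking the timing of such methods across commits is what lets a regression be spotted after a change is merged.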
|
||
The `asv-runner <https://github.com/asv-runner>`__ organization, currently made up | ||
of pandas maintainers, provides tools built on top of ASV. We have a physical | ||
machine for running a number of projects' benchmarks, and tools managing the | ||
benchmark runs and reporting on results. | ||
|
||
We'd like to fund improvements and maintenance of these tools to | ||
|
||
* Be more stable. Currently, they're maintained on the nights and weekends when | ||
  a maintainer has free time. | ||
* Tune the system for benchmarks to improve stability, following | ||
  https://pyperf.readthedocs.io/en/latest/system.html | ||
* Build a GitHub bot to request ASV runs *before* a PR is merged. Currently, the | ||
  benchmarks are only run nightly. | ||
|
||
.. _roadmap.evolution: | ||
|
||
Roadmap Evolution | ||
----------------- | ||
|
||
Pandas continues to evolve. The direction is primarily determined by community | ||
interest. Everyone is welcome to review existing items on the roadmap and | ||
to propose a new item. | ||
|
||
Each item on the roadmap should be a short summary of a larger design proposal. | ||
The proposal should include | ||
|
||
1. Short summary of the changes, which would be appropriate for inclusion in | ||
   the roadmap if accepted. | ||
2. Motivation for the changes. | ||
3. An explanation of why the change is in scope for pandas. | ||
4. Detailed design: Preferably with example usage (even if not implemented yet) | ||
   and API documentation. | ||
5. API Change: Any API changes that may result from the proposal. | ||
|
||
That proposal may then be submitted as a GitHub issue, where the pandas maintainers | ||
can review and comment on the design. The `pandas mailing list <https://mail.python.org/mailman/listinfo/pandas-dev>`__ | ||
should be notified of the proposal. | ||
|
||
When there's agreement that an implementation | ||
would be welcome, the roadmap should be updated to include the summary and a | ||
link to the discussion issue. |