Related to the discussion in #10556, and following up on the mailing list discussion "A case for a simplified (non-consolidating) BlockManager with 1D blocks" (archive).
An initial proof of concept for a non-consolidating "ArrayManager" (storing the columns of a DataFrame as a list of 1D arrays instead of blocks) was merged in #36010.
This issue is meant to track the required follow-up work items to get this to a more feature-complete implementation.
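To make the difference between the two layouts concrete, here is a toy sketch in plain Python. It is not how pandas implements either manager: plain lists stand in for NumPy arrays, and `to_blocks` / `to_arrays` are made-up names for illustration only. The consolidating layout groups same-dtype columns into one 2D block with a placement list, while the non-consolidating layout keeps one 1D array per column.

```python
# Toy illustration only: plain lists stand in for NumPy arrays, and none of
# these names exist in pandas.
from collections import defaultdict

columns = {
    "a": [1, 2, 3],        # int
    "b": [0.5, 1.5, 2.5],  # float
    "c": [4, 5, 6],        # int
}


def to_blocks(columns):
    """Consolidating layout: one 2D block per dtype plus column placements."""
    grouped = defaultdict(list)
    for position, values in enumerate(columns.values()):
        dtype = type(values[0]).__name__
        grouped[dtype].append((position, values))
    return {
        dtype: {
            "placement": [pos for pos, _ in members],
            "values": [list(vals) for _, vals in members],  # one row per column
        }
        for dtype, members in grouped.items()
    }


def to_arrays(columns):
    """Non-consolidating layout: one 1D array per column, in column order."""
    return [list(values) for values in columns.values()]


blocks = to_blocks(columns)
arrays = to_arrays(columns)

# Column "c" sits at position 2. The block layout needs a placement lookup
# inside the int block; the array layout is a direct positional access.
int_block = blocks["int"]
col_c_from_block = int_block["values"][int_block["placement"].index(2)]
col_c_from_arrays = arrays[2]
```

In the toy block layout, finding a column means locating its block and its position inside that block; in the array layout the column position is the list index.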
- Functionality: get all tests passing
  - There are big chunks of tests failing because some larger sets of functionality are not yet implemented or rely on BlockManager internals (this last aspect is also covered in Practical steps towards a simplified BlockManager #34669). Those bigger topics are:
    - quantile / describe related (ArrayManager.quantile is not yet implemented) -> ENH: ArrayManager.quantile #40189
    - equals related (ArrayManager.equals is not yet implemented) -> [ArrayManager] Implement .equals method #39721
    - groupby related tests (there are still a few parts of groupby that directly use the blocks) -> [ArrayManager] GroupBy cython aggregations (no fallback) #39885, [ArrayManager] Remaining GroupBy tests (fix count, pass on libreduction for now) #40050
    - concat related (internals/concat.py only deals with the simple case when no reindexing is needed for ArrayManager at the moment; the full functionality, similar to what concatenate_block_managers / the JoinUnits now cover, still needs to be implemented) -> [ArrayManager] REF: Implement concat with reindexing #39612
    - indexing related (some of the ArrayManager methods like setitem, iset, insert are not yet fully implemented for all corner cases + get indexing tests passing)
    - IO related:
      - JSON (INT: the json C code should not deal with blocks #27164, fixed in Backport PR #41789 on branch 1.2.x (Bug in xs raising KeyError for MultiIndex columns with droplevel False and list indexe) #41809)
      - pytables code still relies on block internals
  - In addition, the ArrayManager currently also uses an "apply_with_block" fallback for things that are right now implemented directly on the Block classes. Long term, all those cases should also be refactored so that the core functionality of the specific function can be shared between ArrayManager and BlockManager, without directly relying on the Block classes.
    - Some examples of this: replace, where, interpolate, shift, diff, downcast, putmask, ... (those could all be refactored one at a time).
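As a rough sketch of what "sharing the core functionality" could look like, the example below writes a shift as a function on a single 1D array and lets both a column-wise and a block-wise container call it. All names here (`shift_1d`, `apply_to_arrays`, `apply_to_block`) are hypothetical and plain lists stand in for arrays; this is not the actual pandas code, just an illustration of the refactoring direction.

```python
# Hypothetical sketch: these helpers are illustrative, not pandas internals.


def shift_1d(values, periods=1, fill_value=None):
    """Core logic on one 1D array: shift by `periods`, pad with `fill_value`."""
    n = len(values)
    if periods == 0 or n == 0:
        return list(values)
    if abs(periods) >= n:
        return [fill_value] * n
    if periods > 0:
        return [fill_value] * periods + list(values[:-periods])
    return list(values[-periods:]) + [fill_value] * (-periods)


def apply_to_arrays(arrays, func, **kwargs):
    """Array-based manager: call the shared function once per column."""
    return [func(arr, **kwargs) for arr in arrays]


def apply_to_block(block_values, func, **kwargs):
    """Block-based manager: call the same function on each row of a 2D block."""
    return [func(row, **kwargs) for row in block_values]


arrays = [[1, 2, 3], [4, 5, 6]]
block_values = [[1, 2, 3], [4, 5, 6]]  # same data stored as one 2D block

shifted_arrays = apply_to_arrays(arrays, shift_1d, periods=1)
shifted_block = apply_to_block(block_values, shift_1d, periods=1)
```

Because the logic lives in a function that only knows about 1D arrays, neither container needs to go through a Block class to use it.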
  - There will also be tests that either are 1) BlockManager specific (using block internals to test) or 2) testing behaviour specific to DataFrames with BlockManager (we will probably want to change some aspects about e.g. copy/view, setitem-like ops, etc. Those changes all need to be discussed separately of course, but might also require some skipped tests initially). Such tests can be skipped with e.g. @td.skip_array_manager_invalid_test.
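For illustration, a skip marker of this kind could be sketched with the standard library's `unittest.skipIf`. This is only a stand-in for the idea: the environment variable name `EXAMPLE_DATA_MANAGER` and the test class are invented for this sketch, and the real decorator in the pandas test utilities may be defined differently.

```python
import os
import unittest

# Assumption: the active manager is selected through an environment variable.
# The variable name is made up for this sketch.
USING_ARRAY_MANAGER = os.environ.get("EXAMPLE_DATA_MANAGER", "block") == "array"

# Stand-in for a marker such as td.skip_array_manager_invalid_test.
skip_array_manager_invalid_test = unittest.skipIf(
    USING_ARRAY_MANAGER,
    "Test relies on BlockManager internals or BlockManager-specific behaviour",
)


class TestInternals(unittest.TestCase):
    @skip_array_manager_invalid_test
    def test_consolidated_block_count(self):
        # Would inspect block internals, so it only makes sense for BlockManager.
        self.assertTrue(True)

    def test_manager_agnostic_behaviour(self):
        # Runs under either manager.
        self.assertEqual(sum([1, 2, 3]), 6)
```

Running the suite with the variable set to "array" would report the first test as skipped while the second one still runs.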
- Design questions:
  - What to do with Series, which is currently backed by a SingleBlockManager inheriting from BlockManager (should we also have a "SingleArrayManager"?) -> [ArrayManager] Add SingleArrayManager to back a Series #40152
  - ... (probably more items will come up) ...
- Performance
  - So far, I have not looked at performance (I only ran a few of the ASV benchmarks, see top post of POC: ArrayManager -- array-based data manager for columnar store #36010). I also think that we should first focus on getting a larger part of the functionality working (which will also make it easier to run benchmarks), but afterwards we will need to identify the different areas where performance improvements are needed.