Skip to content

Internals: concat does not preserve block types / does not take ndim of blocks into account #17283

Closed
@jorisvandenbossche

Description

@jorisvandenbossche

xref #17144

I am trying to get concat (and affiliated functionality) working with the external GeometryBlock (which we are developing in geopandas). This is currently failing, and in such a manner (I think) that it will not be fixable in geopandas.

Currenlty the concat (and the concatenate_join_units) functionality has a lot of hardcoded logic about blocks. Two problems that I encountered (so far):

  • it does not respect the 'dimensionality' of the block.

    • In case of non-consolidatable blocks, the block values are stored in 1D array-likes, while the block is actually 2D. For this, there is some hard-coded logic: in _concat_compat (
      def _concat_compat(to_concat, axis=0):
      , called from concatenate_join_units), eg for tz datetimes it defers to _concat_datetimetz which knows it gets 1D instead of 2D data. The general case however is passed to np.concatenate(to_concat, axis=axis) where the axis does not take the dimensionality of the Block into account (which leads to an error if your block is non-consolidatable with 1D underlying data)
    • A possible solution is to check the dimensionality of the data of the block, and if it is 1D take this into account when concatting the data.
    • However, the problem is then still that np.concatenate does not fully the right thing. So a further possible solution would be that there is eg a class method on the block to concatenate the blocks.
  • In general, Block types are not preserved

    • Eg in the case of concatting two Series objects of the same block type, the values are concatted and the block type is re-inferred. This means that the result can never have the externally defined block type without pandas knowing about this block type. (this does not only happen in concat, but in other places as well)
    • In the specific case of concat, I think we could certainly check if the block types are equal and in that case preserve the block class.
    • More in general, it is difficult to manipulate with BlockManagers while preserving block types. Eg BlockManager.insert/set are not able to add a Block while preserving the block class (they always infer the block type based on the data). Other example is concatenate_block_managers where the blocks are again reconstructed without taking into account the original block class.
    • Would it be desirable/acceptable to more retain the block class? (not sure if this would have impact on pandas itself)

@jreback (cc @mrocklin)

Metadata

Metadata

Assignees

No one assigned

    Labels

    InternalsRelated to non-user accessible pandas implementation

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions