ENH: create BlockManager positional indexer (for easier dupe cols support) #3092

Closed
@jreback

see discussion in #3059, #3095, also see #1943, #3102

This only applies with a non-unique column index.

Currently, if duplicate column names span more than one dtype, there are issues getting the correct block for a given column name.

I think it is possible, though non-trivial, to instead have a positional map from the frame columns to the BlockManager blocks; this would simplify BlockManager.iget.

The primary motivation is that to_csv currently cannot handle these types of lookups.

It should also eliminate the need for _find_block.
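
To make the proposal concrete, here is a minimal pure-Python sketch of such a positional map. It is only an illustration under stated assumptions: the names build_positional_map, iget, blknos and blklocs are hypothetical and are not existing BlockManager API, and blocks are modelled as plain lists of columns grouped by dtype. The point is that the lookup goes by column position only, so duplicate labels never enter into it.

```python
def build_positional_map(column_dtypes):
    """Map each column position to (block number, position within block).

    Hypothetical helper: columns are grouped into blocks by dtype alone.
    Column labels are never consulted, so duplicates cannot be ambiguous.
    """
    block_of_dtype = {}  # dtype -> block number, in order of first appearance
    next_slot = []       # next free position inside each block
    blknos, blklocs = [], []
    for dtype in column_dtypes:
        if dtype not in block_of_dtype:
            block_of_dtype[dtype] = len(next_slot)
            next_slot.append(0)
        blkno = block_of_dtype[dtype]
        blknos.append(blkno)
        blklocs.append(next_slot[blkno])
        next_slot[blkno] += 1
    return blknos, blklocs


def iget(blocks, blknos, blklocs, i):
    """Return the values of the i-th column using positions only."""
    return blocks[blknos[i]][blklocs[i]]


# Toy frame: four columns all labelled 'a', with alternating dtypes.
labels = ["a", "a", "a", "a"]
dtypes = ["float64", "int64", "float64", "int64"]
blknos, blklocs = build_positional_map(dtypes)

# Each block stores its columns in the order they appear in the frame.
blocks = [
    [[0.5, 1.5], [2.5, 3.5]],  # float64 block: frame columns 0 and 2
    [[1, 2], [3, 4]],          # int64 block: frame columns 1 and 3
]

for i, label in enumerate(labels):
    print(i, label, dtypes[i], iget(blocks, blknos, blklocs, i))
```

With a map like this, a writer such as to_csv could walk the columns by position and would not need a label-based search of the blocks.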

In [12]: df = pd.DataFrame(np.random.randn(8,4))

In [13]: df._data.blocks[0].ref_locs
Out[13]: array([0, 1, 2, 3])

In [14]: df = pd.DataFrame(np.random.randn(8,4),columns=['a']*4)

In [15]: df._data.blocks[0].ref_locs
---------------------------------------------------------------------------

/mnt/home/jreback/pandas/pandas/core/internals.py in ref_locs(self)
     52     def ref_locs(self):
     53         if self._ref_locs is None:
---> 54             indexer = self.ref_items.get_indexer(self.items)
     55             indexer = com._ensure_platform_int(indexer)
     56             if (indexer == -1).any():

/mnt/home/jreback/pandas/pandas/core/index.pyc in get_indexer(self, target, method, limit)
    835 
    836         if not self.is_unique:
--> 837             raise Exception('Reindexing only valid with uniquely valued Index '
    838                             'objects')
    839 

Exception: Reindexing only valid with uniquely valued Index objects

This is the root of all evil: the following should raise the same exception as above, but it doesn't, even if I consolidate.

In [16]: df = pd.DataFrame(np.random.randn(8,4))

In [17]: df.columns = ['a']*4

In [18]: df._data.blocks[0].ref_locs
Out[18]: array([0, 1, 2, 3])

Labels

Ideas: Long-Term Enhancement Discussions
Indexing: Related to indexing on series/frames, not to indexes themselves
Refactor: Internal refactoring of code
