DataFrame and DataMatrix column ordering #12

Closed
@hector13


First, thank you for the pandas package -- it's incredibly useful and well done.

I know that one of the fundamental concepts behind the data structures is that column ordering doesn't matter. And, as long as one only uses pandas' data access/manipulation functions (e.g., sum(), ewma(), etc.), this works fine. But often it's useful to access the underlying values as a numpy array for some more complicated data manipulation. Using the values attribute (or the values() method for a Series) does this, but it's not always obvious what order the values come back in.

For example:

In [1]: dm = DataMatrix(np.arange(2*3).reshape(2,3), index=[1,0], columns=['B', 'A', 'C' ])

In [2]: dm 
Out[2]: 
     B              A              C  
1    0              1              2  
0    3              4              5              

In [3]: df = DataFrame(np.arange(2*3).reshape(2,3), index=[1,0], columns=['B', 'A', 'C' ])

In [4]: df 
Out[4]: 
     A              B              C  
1    1              0              2  
0    4              3              5              

In [5]: df.values
Out[5]: 
array([[1, 0, 2],
       [4, 3, 5]])

In [6]: dm.values
Out[6]: 
array([[0, 1, 2],
       [3, 4, 5]])

DataMatrix seems to respect the passed-in ordering of columns, while DataFrame does not. I know this is documented, and it's not the biggest deal in the world, but it does seem to cause quite a bit of confusion for some. Is it possible to have both data types keep the ordering that's passed in? If a user passes in the same column name twice, could this just throw an exception? Something still needs to be done when an operation is performed on two DataFrames (e.g., combining them), but instead of reordering in alphabetical order, how about preserving the column ordering from left to right?
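
To make the suggestion concrete, here is a minimal sketch of the ordering rule I have in mind. It is plain Python and does not touch pandas internals; `check_unique` and `combine_column_order` are hypothetical names used only for illustration, not existing pandas functions.

```python
def check_unique(columns):
    """Return the columns as a list, raising if any name appears twice."""
    seen = set()
    for name in columns:
        if name in seen:
            raise ValueError(f"duplicate column name: {name!r}")
        seen.add(name)
    return list(columns)


def combine_column_order(left, right):
    """Union of two column lists, keeping left-to-right order.

    Columns from `left` come first in their original order, followed by
    any columns from `right` that were not already present.
    """
    combined = check_unique(left)
    known = set(combined)
    for name in check_unique(right):
        if name not in known:
            combined.append(name)
            known.add(name)
    return combined


print(combine_column_order(["B", "A", "C"], ["D", "A"]))
# prints ['B', 'A', 'C', 'D']
```

With this rule the result never depends on alphabetical order, only on the order in which the columns were first passed in.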

Anyway, my bigger concern is actually the following:

In [7]: dm.reindex(columns=['C','B','A']).values
Out[7]: 
array([[2, 0, 1],
       [5, 3, 4]])

In [8]: df.reindex(columns=['C','B','A']).values
Out[8]: 
array([[1, 0, 2],
       [4, 3, 5]])

Regardless of the ordering of the columns after creating a DataFrame/DataMatrix, a naive user (i.e., me) would expect that calling reindex and then values returns an ndarray with the columns in the order that was requested. But it looks like this only happens for DataMatrix objects (and I'm not even sure that's always guaranteed).
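
In the meantime, one way to get an ndarray in a known column order is to pull each column out by name and stack the results, which does not depend on how the container orders its columns internally. The sketch below is only an illustration: `values_in_order` is a hypothetical helper, not part of pandas, and it assumes a recent pandas release in which `Series.to_numpy()` is available (DataMatrix is not used here).

```python
import numpy as np
import pandas as pd


def values_in_order(frame, columns):
    """Return a 2-D ndarray whose columns follow `columns` exactly."""
    # Select each column by label, then place them side by side
    # in the order the caller asked for.
    return np.column_stack([frame[name].to_numpy() for name in columns])


df = pd.DataFrame(
    np.arange(2 * 3).reshape(2, 3), index=[1, 0], columns=["B", "A", "C"]
)

result = values_in_order(df, ["C", "B", "A"])
print(result)
```

For the example data this gives the same array as Out[7] above, i.e. the columns come back as C, B, A.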
