Dataframe loading with duplicated columns and usecols #11823

Closed
@thenx


I'm using pandas 0.17.1

import pandas as pd
pd.__version__

Out: '0.17.1'

When column names are duplicated

cols = ['A', 'A', 'B']
with open('pandas.csv', 'w') as f:
  f.write('1,2,3')

we can still load the dataframe

pd.read_csv('pandas.csv',
            header=None,
            names=cols,
           )

with explainable behaviour

Out:
     A    A    B
0    2    2    3

Then we might want to load some of the columns with the Python engine

pd.read_csv('pandas.csv',
            engine='python',
            header=None,
            names=cols,
            usecols=cols
           )

and get a different but still explainable result

Out:
     A    B
0    1    3

But then we switch back to the C engine

pd.read_csv('pandas.csv',
            engine='c',
            header=None,
            names=cols,
            usecols=cols
           )

and get the following

Out:
     A    A    B
0    2    2    NaN

which is:
(a) different from the Python engine's result (which is not good, in my opinion)
(b) looks like a bug
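
A possible workaround, until both engines agree, is to make the names unique before passing them to names and usecols, so neither engine has to resolve duplicates. The sketch below is only an illustration: make_unique is a hypothetical helper (not part of pandas), and the ".N" suffix scheme is an assumption chosen for readability.

```python
def make_unique(names):
    """Return a copy of `names` where repeated entries get a numeric suffix.

    Hypothetical helper, not a pandas API. A name is suffixed with ".1",
    ".2", ... until it no longer collides with a name already emitted.
    """
    taken = set()
    result = []
    for name in names:
        candidate = name
        suffix = 0
        # Keep bumping the suffix until the candidate is unused.
        while candidate in taken:
            suffix += 1
            candidate = "{}.{}".format(name, suffix)
        taken.add(candidate)
        result.append(candidate)
    return result


cols = ['A', 'A', 'B']
unique_cols = make_unique(cols)
print(unique_cols)

# The unique names could then be passed to read_csv, for example:
# pd.read_csv('pandas.csv', header=None, names=unique_cols, usecols=unique_cols)
```

With unique names, each column in the file maps to exactly one label, so the result should not depend on how a given engine treats duplicates.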
