index_col and usecols do not work reliably together in read_csv #9098

Closed
@awhan

The code below exercises three cases: one raises an exception, one works, and one runs but returns the wrong index.

import pandas as pd
from io import StringIO
import random
import sys

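def fun(s, n, u):
    # Seed the RNG so each case is reproducible.
    random.seed(s)
    # Column names '1' .. 'n-1'; the single data row holds the same values.
    names = [str(e) for e in range(1, n)]
    data = ','.join(names)

    # Pick u columns at random, then pick one of them as the index.
    usecols = random.sample(names, u)
    index_col = random.choice(usecols)
    print('usecols', usecols)
    print('index_col', index_col)

    try:
        # Direct path: usecols and index_col passed together.
        df = pd.read_csv(StringIO(data), names=names, usecols=usecols, index_col=index_col, header=None)
        print(df)
    except Exception:
        # Fallback: read without index_col, then set the index afterwards.
        print(sys.exc_info())
        df = pd.read_csv(StringIO(data), names=names, usecols=usecols, header=None)
        df.set_index(index_col, inplace=True)
        print(df)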

    print('--------------------------------------------------')


fun(123, 10, 4) # exception
fun(123, 10, 5) # works
fun(123, 20, 4) # runs, but the index name and value do not match

Here are the results.
fun(123, 10, 4): an exception occurs, but when index_col is omitted and set_index is called afterwards, it works.

usecols ['1', '5', '9', '4']
index_col 9
(<class 'IndexError'>, IndexError('list index out of range',), <traceback object at 0x7f129001a348>)
   1  4  5
9
9  1  4  5
--------------------------------------------------
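For reference, the fallback path in the first case boils down to the minimal sketch below. It hard-codes the usecols and index_col values printed above instead of drawing them at random, and it assumes that usecols returns columns in file order, which matches the output shown.

```python
import pandas as pd
from io import StringIO

# Same single-row input as fun(123, 10, 4): columns '1' .. '9'.
names = [str(e) for e in range(1, 10)]
data = ','.join(names)

# Values copied from the printed output above, not drawn at random.
usecols = ['1', '5', '9', '4']
index_col = '9'

# Read only the selected columns, without index_col ...
df = pd.read_csv(StringIO(data), names=names, usecols=usecols, header=None)
# ... then promote the chosen column to the index in a separate step.
df = df.set_index(index_col)
print(df)
```

Splitting the read and the set_index call avoids the combination of usecols and index_col that triggers the IndexError.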

fun(123, 10, 5): this works as expected.

usecols ['1', '5', '9', '4', '3']
index_col 1
   3  4  5  9
1
1  3  4  5  9
--------------------------------------------------

fun(123, 20, 4): no exception is raised, but the index is wrong. It is named 3 yet holds the value from column 9, and column 3 stays among the data columns.

usecols ['2', '9', '3', '14']
index_col 3
   2  3  14
3
9  2  3  14
--------------------------------------------------
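Because the third case fails silently, one way to detect it is to compare the direct index_col read against the set_index result. The sketch below does that; compare_index_paths is a hypothetical helper written for this report, not part of pandas, and the status it prints depends on the installed pandas version.

```python
import pandas as pd
from io import StringIO


def compare_index_paths(data, names, usecols, index_col):
    """Return 'match', 'mismatch', or 'error' for the direct index_col read.

    Hypothetical helper for illustration; not a pandas API.
    """
    # Reference result: read without index_col, then set the index explicitly.
    expected = pd.read_csv(StringIO(data), names=names, usecols=usecols, header=None)
    expected = expected.set_index(index_col)

    try:
        # Direct path: usecols and index_col passed together.
        direct = pd.read_csv(StringIO(data), names=names, usecols=usecols,
                             index_col=index_col, header=None)
    except Exception:
        return 'error'

    # Compare index name, index values, and remaining column labels.
    same = (
        direct.index.name == expected.index.name
        and direct.index.tolist() == expected.index.tolist()
        and list(direct.columns) == list(expected.columns)
    )
    return 'match' if same else 'mismatch'


# Inputs copied from the fun(123, 20, 4) output above.
names = [str(e) for e in range(1, 20)]
data = ','.join(names)
status = compare_index_paths(data, names, ['2', '9', '3', '14'], '3')
print(status)
```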

pandas.__version__ is '0.15.2'
64-bit Arch Linux
$ python --version
Python 3.4.2
