
df.filter(like='col_name') 2.975X slower than basic column list comprehension #5657

Closed
@dragoljub

Description


I have found that using the filter method to select columns matching a string pattern is ~3x slower than a basic list comprehension over df.columns. I'm not sure how it's implemented under the hood, but for simple 'in' checks across many columns this can slow you down, depending on how often you filter.
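For context, the `like` path of `DataFrame.filter` behaves roughly like the sketch below: test each label for the substring, then reindex on the matches. This is an illustrative approximation, not the actual pandas implementation; the reindex/indexing machinery is one plausible source of the extra overhead measured below.

```python
import pandas as pd

def filter_like(df, like):
    # Approximation of df.filter(like=...): keep columns whose label
    # contains the substring, then select them via reindex.
    matched = [c for c in df.columns if like in str(c)]
    return df.reindex(columns=matched)

df = pd.DataFrame({'A0': [1], 'TEST': [2], 'TEST2': [3]})
assert list(filter_like(df, 'TEST').columns) == ['TEST', 'TEST2']
```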

import pandas as pd
import numpy as np

# Generate Test DataFrame
NUM_ROWS = 2000
NUM_COLS = 1000
col_names = ['A' + str(i) for i in range(NUM_COLS)]
df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
df['TEST'] = 0
df['TEST2'] = 0

%timeit df.filter(like='TEST')
1000 loops, best of 3: 1.19 ms per loop

%timeit df[[col for col in df.columns if 'TEST' in col]]
1000 loops, best of 3: 400 µs per loop
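For reference, both expressions select the same columns. A boolean-mask alternative via the vectorized `Index.str.contains` accessor is sketched below as well; it is an illustration added here, not part of the original report:

```python
import numpy as np
import pandas as pd

# Small frame mirroring the report's setup
df = pd.DataFrame(np.zeros((4, 3), dtype=np.int64),
                  columns=['A0', 'TEST', 'TEST2'])

via_filter = df.filter(like='TEST')
via_listcomp = df[[col for col in df.columns if 'TEST' in col]]
# Boolean-mask alternative using the vectorized string accessor
via_mask = df.loc[:, df.columns.str.contains('TEST', regex=False)]

# All three select the same two columns
assert list(via_filter.columns) == ['TEST', 'TEST2']
assert via_filter.equals(via_listcomp)
assert via_filter.equals(via_mask)
```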

%time df.filter(like='TEST')
Wall time: 1 ms
Out[4]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 2 columns):
TEST     2000  non-null values
TEST2    2000  non-null values
dtypes: int64(2)

%time df[[col for col in df.columns if 'TEST' in col]]
Wall time: 1 ms
Out[5]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 2 columns):
TEST     2000  non-null values
TEST2    2000  non-null values
dtypes: int64(2)

pd.__version__
Out[7]: '0.12.0'

np.__version__
Out[8]: '1.7.1'
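The interactive %timeit calls above can be reproduced as a plain script with the standard timeit module; a minimal sketch of the same benchmark (absolute numbers will vary by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

# Same setup as the report
NUM_ROWS, NUM_COLS = 2000, 1000
col_names = ['A' + str(i) for i in range(NUM_COLS)]
df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS, NUM_COLS)),
                  dtype=np.int64, columns=col_names)
df['TEST'] = 0
df['TEST2'] = 0

n = 100
t_filter = timeit.timeit(lambda: df.filter(like='TEST'), number=n) / n
t_listcomp = timeit.timeit(
    lambda: df[[c for c in df.columns if 'TEST' in c]], number=n) / n
print('filter:   %.3f ms' % (t_filter * 1e3))
print('listcomp: %.3f ms' % (t_listcomp * 1e3))
```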


    Labels

    Indexing (related to indexing on series/frames, not to indexes themselves), Performance (memory or execution speed performance)
