I have found that using the filter method to select columns matching a string pattern is ~3x slower than a basic list comprehension over df.columns. I'm not sure how it's implemented under the hood (a rough guess is sketched after the timings below), but for simple 'in' checks across many columns this could slow you down, depending on how often you filter.
import pandas as pd
import numpy as np
# Generate Test DataFrame
NUM_ROWS = 2000
NUM_COLS = 1000
col_names = ['A' + str(num) for num in range(NUM_COLS)]
df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
# Two columns whose names contain the 'TEST' pattern
df['TEST'] = 0
df['TEST2'] = 0
%timeit df.filter(like='TEST')
1000 loops, best of 3: 1.19 ms per loop
%timeit df[[col for col in df.columns if 'TEST' in col]]
1000 loops, best of 3: 400 µs per loop
%time df.filter(like='TEST')
Wall time: 1 ms
Out[4]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 2 columns):
TEST 2000 non-null values
TEST2 2000 non-null values
dtypes: int64(2)
%time df[[col for col in df.columns if 'TEST' in col]]
Wall time: 1 ms
Out[5]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 0 to 1999
Data columns (total 2 columns):
TEST 2000 non-null values
TEST2 2000 non-null values
dtypes: int64(2)
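For what it's worth, my guess (I haven't checked the pandas source, so this is only an assumption) is that filter(like=...) tests every label and then routes the result through reindex, roughly like the sketch below, while the list comprehension goes straight through __getitem__:

import numpy as np

def filter_like_sketch(df, like):
    # Assumed internals, NOT the actual pandas code: build a boolean
    # mask over all column labels, then reindex on the matching subset.
    mask = np.asarray([like in str(label) for label in df.columns])
    return df.reindex(columns=df.columns[mask])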
pd.__version__
Out[7]: '0.12.0'
np.__version__
Out[8]: '1.7.1'