Description
Code Sample, a copy-pastable example if possible
```python
# Requires IPython (for the %timeit magic) and the memory_profiler package.
import numpy as np
import pandas as pd
from memory_profiler import memory_usage


# Sets the index of a data frame (wrapped in a function so that its time
# and memory usage can be recorded).
def set_index(data_frame: pd.DataFrame, index_cols: list):
    time_res = %timeit -n 1 -r 1 -o data_frame.set_index(index_cols)
    return time_res


# Appends `col_number` integer columns, then calls set_index on all of them
# while recording memory usage.
def insert_cols_and_index(data_frame: pd.DataFrame, col_number: int, values: np.ndarray):
    idf = data_frame.copy()
    index_cols = [None] * col_number
    for i in range(col_number):
        current_col_name = str(i)
        index_cols[i] = current_col_name
        idf[current_col_name] = values
    # Record memory usage and the time spent on indexing
    return memory_usage((set_index, (idf, index_cols)), interval=1e-7, retval=True)


# Values we'll use as new column content
df_size = int(1e7)
unique_len = int(1e7)
int_array = np.tile(np.arange(unique_len), df_size // unique_len)

# Create an empty pandas data frame
df = pd.DataFrame()

# Iteratively add integer columns, set the index on all data frame columns,
# and record memory usage peaks
max_col_num = 14
min_mem_usages = np.zeros(max_col_num)
max_mem_usages = np.zeros(max_col_num)
time_usages = np.zeros(max_col_num)
for i in range(max_col_num):
    print("{0} iteration(s)".format(i + 1))
    mem, time = insert_cols_and_index(df, i + 1, int_array)
    min_mem_usages[i] = min(mem)
    max_mem_usages[i] = max(mem)
    time_usages[i] = time.average
np.save("min_mem_usages_" + pd.__version__ + "_" + str(unique_len), min_mem_usages)
np.save("max_mem_usages_" + pd.__version__ + "_" + str(unique_len), max_mem_usages)
np.save("time_usages_" + pd.__version__ + "_" + str(unique_len), time_usages)
```
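The `%timeit` assignment above only works inside IPython. For running the benchmark from a plain Python script, the same single-run measurement can be sketched with the standard `timeit` module (the helper name below is hypothetical, not part of the report's code):

```python
import timeit

import numpy as np
import pandas as pd


def time_set_index(data_frame: pd.DataFrame, index_cols: list) -> float:
    """Return the wall-clock seconds spent in a single set_index call."""
    start = timeit.default_timer()
    data_frame.set_index(index_cols)
    return timeit.default_timer() - start


# Small smoke test: two integer columns with repeated values.
small = pd.DataFrame({"0": np.tile(np.arange(10), 10),
                      "1": np.tile(np.arange(5), 20)})
elapsed = time_set_index(small, ["0", "1"])
print("set_index took {:.6f} s".format(elapsed))
```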
Problem description
For a bit of context: I'm working on a mid-range ETL software built on pandas, and we recently decided to upgrade from our current pandas version, 0.20.3, to the latest one, 0.24.2.
Since the upgrade, the set_index method seems responsible for significant performance losses in our software: setting an index now takes roughly twice as long as with the older version, and uses more memory.
After researching the indexing changes between these two versions, I found that indexing in pandas changed as of version 0.23: #19074
I ran a series of tests to understand the cause of this overhead, and one type of test shows a large difference between the two versions.
For this test, I used a data frame of 1e7 rows. The test consists of iteratively adding a new integer column to the data frame and then calling set_index with every column as index columns.
The code is shown above. I ran several variations of it, changing the ratio of unique values to column length.
For each set_index call, I recorded the minimum and maximum memory usage and the time spent on indexing.
First, I concluded that the time and memory usage difference between the two versions grows larger when indexing on many columns.
```python
df_size = int(1e7)
unique_len = int(1e7)
int_array = np.tile(np.arange(unique_len), df_size // unique_len)
```
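A small-scale sketch of the same loop (deliberately tiny sizes, and plain `time.perf_counter` instead of `%timeit`, so it runs anywhere in seconds) showing how the per-call time can be recorded as the number of index columns grows:

```python
import time

import numpy as np
import pandas as pd

# Tiny sizes for illustration only; the report above uses df_size = 1e7.
df_size = 1000
unique_len = 100
int_array = np.tile(np.arange(unique_len), df_size // unique_len)

times = []
df = pd.DataFrame()
for i in range(5):
    df[str(i)] = int_array
    cols = [str(j) for j in range(i + 1)]  # index on all columns so far
    start = time.perf_counter()
    df.set_index(cols)
    times.append(time.perf_counter() - start)

print(times)
```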
Even if this test is not fully representative, we clearly see that the difference increases with the number of columns we index on.
It may also be driven by:
- the number of combinations of unique values across index levels (the product of the per-column unique-value counts)
- the number of unique values in each index level (the sum of the per-column unique-value counts)
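Both candidate drivers can be computed directly from a frame with `nunique` (a sketch with made-up data, not the benchmark frame):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.tile(np.arange(10), 10),  # 10 unique values
    "b": np.tile(np.arange(4), 25),   # 4 unique values
    "c": np.tile(np.arange(5), 20),   # 5 unique values
})

uniques = df.nunique()              # unique-value count per column
combinations = int(uniques.prod())  # product: 10 * 4 * 5 = 200
total_uniques = int(uniques.sum())  # sum: 10 + 4 + 5 = 19

print(combinations, total_uniques)  # → 200 19
```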
If I reduce the number of unique values to 100, the memory usage peaks and time spent (for both versions) are:

```python
df_size = int(1e7)
unique_len = int(1e2)
int_array = np.tile(np.arange(unique_len), df_size // unique_len)
```
The above plot suggests that past a certain point, pandas switches its indexing method, though I'm not certain of that.
So I tried a case more representative of our actual use cases (still with integer columns only):

```python
df_size = int(1e7)
unique_len = int(1e4)
int_array = np.tile(np.arange(unique_len), df_size // unique_len)
```
The pattern is the same for memory usage and time spent.
According to my tests, indexing in the newer pandas version takes considerably more memory than in the older version, and takes more time to do the same thing. Even if recording memory affects performance, the difference is still visible.
The difference seems to grow with the size of the unique-value set in each column.
Indexing is a feature we use heavily where I work: we often put 15-30 columns in the index and only a few (1-5) in the columns, and some of the columns we index on may contain more than 10e4 unique values.
Indexing on so many columns may not be good practice, but it is very convenient for representing and explaining high-dimensional data, and selections in the data frame are also faster. Maybe we're doing it wrong?
Thanks in advance for your help or suggestions.
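As an illustration of the selection convenience mentioned above, here is a sketch (made-up data, not our production schema) of partial- and full-key lookups on a sorted MultiIndex:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": np.repeat(["north", "south"], 4),
    "year": np.tile([2017, 2017, 2018, 2018], 2),
    "value": np.arange(8),
})

# Index on two columns; sort the index for fast, warning-free lookups.
indexed = df.set_index(["region", "year"]).sort_index()

# Partial-key selection on the first level.
north = indexed.loc["north"]
print(len(north))  # → 4

# Full-key selection on both levels.
north_2018 = indexed.loc[("north", 2018)]
print(len(north_2018))  # → 2
```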
All tests were run in two identical environments, except for the pandas version.
The pd.show_versions() output for both environments is in the 'details' section below.
For information, I used numpy+mkl 1.16.2, obtained from:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS (0.24.2)
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.24.2
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: 1.3.2
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None
None
INSTALLED VERSIONS (0.20.3)
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Windows
OS-release: 2008ServerR2
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.20.3
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: None
numpy: 1.16.2
scipy: 1.1.0
xarray: None
IPython: 7.4.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.5.1
numexpr: 2.6.9
feather: None
matplotlib: 3.0.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.3.2
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None
None