#### Description

- [x] I have checked that this issue has not already been reported.
- [x] I have confirmed this bug exists on the latest version of pandas.
- [ ] (optional) I have confirmed this bug exists on the master branch of pandas.
#### Code Sample, a copy-pastable example

Minimal sample:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                   'b': np.random.randint(0, 100, size=55)})

# Assign 100 new columns to the dataframe, one at a time
for i in range(0, 100):
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    # Alternative assignment - triggers the PerformanceWarning here already:
    # df[f'n_{i}'] = np.random.randint(0, 100, size=55)

df1 = df.copy()
# Triggers the performance warning again
df1['c'] = np.random.randint(0, 100, size=55)

# Visualize blocks (`_data` is the deprecated alias of the internal block manager)
print(df._data.nblocks)
print(df1._data.nblocks)
```
#### Problem description

Since pandas 1.3.0, the minimal sample above emits a PerformanceWarning. While I think I understand the warning, I don't understand how to mitigate it: I couldn't find any help for this in the docs, and the proposed solution (`copy()`) does not seem to work.

```
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider using pd.concat instead. To get a de-fragmented frame, use `newframe = frame.copy()`
```
While this is certainly not an ideal access pattern (assigning single columns one after the other), I also don't see how it could be changed in our use case. The proposed `df.copy()` does not mitigate the warning, and the block count remains the same.

Based on my understanding, `df.loc[:, 'colname'] = ...` is the recommended way to assign new columns. This creates a new block for every insert, and `df.copy()` (which is proposed in the warning) does not consolidate the blocks into one block, which means the warning can't really be mitigated.
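For what it's worth, the only thing I found that actually merges the blocks is the private `_consolidate()` helper. A minimal sketch of what I mean, assuming that internal, unsupported API (it may change or disappear between versions):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55)})
for i in range(100):
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)

print(df._data.nblocks)         # heavily fragmented: roughly one block per insert
print(df.copy()._data.nblocks)  # copy(), as the warning suggests, leaves this unchanged for me

# Private/unsupported: returns a frame with same-dtype blocks merged
print(df._consolidate()._data.nblocks)  # expected: a single int block
```

Being internal API, this obviously isn't a real fix, which is why I'm asking here.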
Strangely enough, the behaviour of `df['colname'] = ...` and `df.loc[:, 'colname'] = ...` is not identical: the first triggers the PerformanceWarning while the second does not (although the fragmentation still happens in the background).
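A minimal sketch to reproduce that asymmetry, counting the emitted PerformanceWarnings for each assignment style (the ~100-block threshold is my reading of the internals, not documented behaviour):

```python
import warnings
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randint(0, 100, size=55)})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # Fragment the frame well past the warning threshold via .loc
    for i in range(110):
        df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    n_loc = sum(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught)
    # One plain item assignment on the already-fragmented frame
    df['c'] = np.random.randint(0, 100, size=55)
    n_total = sum(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught)

# On 1.3.0 I'd expect 0 warnings from the .loc path and 1 from df[...],
# matching the behaviour described above.
print(n_loc, n_total - n_loc)
```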
So this leaves me with a few questions:

- How should the above scenario correctly handle inserts to keep performance and avoid this warning? (A batched `pd.concat` sketch follows below.)
- How can the dataframe be effectively consolidated? (The `frame.copy()` proposed in the warning does not do that.)
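Regarding the first question: the batched construction the warning hints at would presumably look like the sketch below (build every new column first, then `pd.concat` once). This avoids the per-insert fragmentation, but it doesn't fit an incremental use case like ours where columns arrive one at a time:

```python
import numpy as np
import pandas as pd

base = pd.DataFrame({'a': np.random.randint(0, 100, size=55),
                     'b': np.random.randint(0, 100, size=55)})

# Build all new columns up front, then concatenate in a single operation
new_cols = pd.DataFrame({f'n_{i}': np.random.randint(0, 100, size=55)
                         for i in range(100)})
df = pd.concat([base, new_cols], axis=1)

print(df._data.nblocks)  # far fewer blocks, and no PerformanceWarning is emitted
```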
#### Expected Output
#### Output of `pd.show_versions()`

```
INSTALLED VERSIONS
------------------
commit : f00ed8f47020034e752baf0250483053340971b0
python : 3.9.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.12.11-arch1-1
Version : #1 SMP PREEMPT Wed, 16 Jun 2021 15:25:28 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.utf8
LOCALE : en_US.UTF-8
pandas : 1.3.0
numpy : 1.21.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.3
setuptools : 57.0.0
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : None
blosc : 1.10.4
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : 1.0.2
psycopg2 : 2.8.6 (dt dec pq3 ext lo64)
jinja2 : 3.0.1
IPython : 7.21.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.4.1
numexpr : 2.7.3
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.0
sqlalchemy : 1.4.20
tables : 3.6.1
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : None
```