
1.3.0 PerformanceWarning: DataFrame is highly fragmented. #42477

Closed
@xmatthias

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

Minimal sample

import pandas as pd
import numpy as np
df = pd.DataFrame({'a': np.random.randint(0, 100, size=55), 'b': np.random.randint(0, 100, size=55)})

# Assign 100 new columns to the dataframe, one at a time
for i in range(0, 100):
    df.loc[:, f'n_{i}'] = np.random.randint(0, 100, size=55)
    # Alternative assignment - already triggers the PerformanceWarning here.
    # df[f'n_{i}'] = np.random.randint(0, 100, size=55)

df1 = df.copy()
# Triggers performance warning again
df1['c'] = np.random.randint(0, 100, size=55)

# Show the number of internal blocks (_mgr is the private BlockManager; _data is its older alias)
print(df._mgr.nblocks)
print(df1._mgr.nblocks)

Problem description

Since pandas 1.3.0, the minimal sample above emits a PerformanceWarning.
While I think I understand the warning, I don't understand how to mitigate it: I could not find any help for this in the docs, and the proposed solution (copy()) does not seem to work.

PerformanceWarning: DataFrame is highly fragmented.  This is usually the result of calling `frame.insert` many times, which has poor performance.  Consider using pd.concat instead.  To get a de-fragmented frame, use `newframe = frame.copy()`

While this certainly isn't an ideal scenario (assigning single columns one after the other), I also don't see how this can be changed in our use case.

The proposed df.copy() does not mitigate the warning, and the block count remains the same.
Based on my understanding, using df.loc[:, 'colname'] = ... is the recommended way to assign new columns.
This creates a new block for every insert, and df.copy() (which the warning proposes) does not consolidate the blocks into a single block, which means the warning can't really be mitigated.

Strangely enough, the behaviour of df['colname'] = ... and df.loc[:, 'colname'] = ... is not identical: the first triggers the PerformanceWarning and the second does not (although the problem is still there in the background).
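To make the difference between the two assignment styles easier to check on a given pandas version, here is a rough sketch that counts the PerformanceWarnings each style emits. The helper names (count_perf_warnings, via_setitem, via_loc) are made up for illustration, and the counts will depend on the installed pandas version, so no expected output is given.

```python
import warnings

import numpy as np
import pandas as pd


def count_perf_warnings(assign):
    """Add 100 columns with `assign` and count PerformanceWarnings (hypothetical helper)."""
    df = pd.DataFrame({"a": np.arange(55), "b": np.arange(55)})
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning instead of showing each one only once.
        warnings.simplefilter("always")
        for i in range(100):
            assign(df, f"n_{i}", np.random.randint(0, 100, size=55))
    n_warnings = sum(
        issubclass(w.category, pd.errors.PerformanceWarning) for w in caught
    )
    return n_warnings, df.shape[1]


def via_setitem(df, name, values):
    # Plain column assignment: df['colname'] = ...
    df[name] = values


def via_loc(df, name, values):
    # .loc-based assignment: df.loc[:, 'colname'] = ...
    df.loc[:, name] = values


print("df[...] =        (warnings, columns):", count_perf_warnings(via_setitem))
print("df.loc[:, ...] = (warnings, columns):", count_perf_warnings(via_loc))
```

Both calls end up with the same 102 columns; only the number of recorded warnings may differ between the two styles.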

So this leaves me with a few questions

  • How should the above scenario handle inserts correctly to keep performance and avoid this warning?
  • How can the dataframe be effectively consolidated? (The proposed frame.copy() in the warning does not do that.)

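For reference, the pd.concat route that the warning text suggests could look roughly like the sketch below. It assumes all new columns can be computed before they are attached, which may not hold in every use case. The name new_cols is only illustrative, and the random data mirrors the minimal sample above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": rng.integers(0, 100, size=55),
    "b": rng.integers(0, 100, size=55),
})

# Collect the new columns in a plain dict instead of inserting them one by one.
new_cols = {f"n_{i}": rng.integers(0, 100, size=55) for i in range(100)}

# Attach all of them in a single concat along the column axis.
# Reusing df.index keeps the rows aligned with the original frame.
df = pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)

print(df.shape)
```

This replaces 100 single-column inserts with one concatenation, so the repeated-insert pattern the warning complains about is avoided. Whether the resulting frame is stored as a single block is not something this sketch checks.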
Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : f00ed8f47020034e752baf0250483053340971b0
python           : 3.9.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.12.11-arch1-1
Version          : #1 SMP PREEMPT Wed, 16 Jun 2021 15:25:28 +0000
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : None
LANG             : en_US.utf8
LOCALE           : en_US.UTF-8

pandas           : 1.3.0
numpy            : 1.21.0
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 21.1.3
setuptools       : 57.0.0
Cython           : None
pytest           : 6.2.4
hypothesis       : None
sphinx           : None
blosc            : 1.10.4
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : 1.0.2
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 3.0.1
IPython          : 7.21.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.1
numexpr          : 2.7.3
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.0
sqlalchemy       : 1.4.20
tables           : 3.6.1
tabulate         : 0.8.9
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

Labels: DataFrame, Indexing, Regression