BUG: Memory leak when creating a df inside a loop #60897

Open
@Chuck321123

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import tracemalloc
import numpy as np
import time
import gc

# Start memory tracking
tracemalloc.start()

iteration = 0

Row_Number = 20000

while iteration < 1000:
    
    test_lst = [*range(12)]
    
    for i in range(12):
        
        # Create a DataFrame with Row_Number rows
        df = pd.DataFrame({
            "A": np.arange(Row_Number),  # Sequential integers from 0 to Row_Number - 1
            "B": np.random.rand(Row_Number),  # Random floats between 0 and 1
            "C": np.random.randint(0, 100, size=Row_Number),  # Random integers between 0 and 99
            "D": np.random.choice(["apple", "banana", "cherry"], size=Row_Number),  # Random categories
            "E": np.random.randn(Row_Number)  # Normally distributed random floats
        })

        test_lst[i] = df  # The bug also appears without storing df in the list

        del df  # Deleting df at the end of the loop doesn't affect the memory leak

    del test_lst  # Deleting the list at the end of the loop doesn't affect the memory leak
        
    time.sleep(0.01)
    
    iteration += 1

    # Report the top allocation sites after every iteration
    # (iteration % 1 is always 0; raise the modulus to sample less often)
    if iteration % 1 == 0:

        snapshot = tracemalloc.take_snapshot()

        # Get memory statistics without any filtering
        top_stats = snapshot.statistics("lineno")
        
        print(f"\n[ Memory Snapshot at iteration {iteration} ]")
        for stat in top_stats[:5]:  # Show top memory-consuming locations
            print(stat)

Issue Description

Using tracemalloc (the standard-library module that traces Python memory allocations), I can see that pandas doesn't seem to release memory when DataFrames are created inside a loop. The problem seems to come from pandas\core\internals\blocks around line 228. It would be nice if anyone could find a fix for this.
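
A single tracemalloc snapshot only lists what is allocated at that moment, so the top entries alone cannot show whether memory grows over time. A minimal sketch of a growth check is given here, comparing a snapshot taken after a warm-up against one taken after more iterations. The helper name `measure_growth` and its parameters are hypothetical, not part of pandas or tracemalloc, and the numbers it returns are only what tracemalloc attributes to Python-level allocations.

```python
import gc
import tracemalloc


def measure_growth(make_object, iterations=50, warmup=5):
    """Return (net_bytes, top_diffs) for repeated calls of make_object().

    Hypothetical helper: net_bytes is the change in traced memory between a
    post-warmup baseline and a snapshot taken after `iterations` more calls.
    """
    started_here = not tracemalloc.is_tracing()
    if started_here:
        tracemalloc.start()
    # Exclude allocations made by tracemalloc itself (e.g. snapshot data).
    filters = [tracemalloc.Filter(False, tracemalloc.__file__)]
    try:
        # Warm up so one-time caches do not count as growth.
        for _ in range(warmup):
            make_object()
        gc.collect()
        baseline = tracemalloc.take_snapshot().filter_traces(filters)

        for _ in range(iterations):
            obj = make_object()
            del obj
        gc.collect()
        current = tracemalloc.take_snapshot().filter_traces(filters)

        # Per-line differences between the two snapshots.
        diffs = current.compare_to(baseline, "lineno")
        net_bytes = sum(d.size_diff for d in diffs)
        return net_bytes, diffs[:5]
    finally:
        if started_here:
            tracemalloc.stop()
```

With pandas installed, `make_object` could be a function that builds the DataFrame from the example above. A `net_bytes` value that keeps rising as `iterations` increases would point to memory being retained, while a value that stays roughly flat would suggest the memory is being released.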

Expected Behavior

That the memory doesn't leak.

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.13.1
python-bits : 64
OS : Windows
OS-release : 11
Version : 10.0.22631
machine : AMD64
processor : Intel64 Family 6 Model 186 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : Norwegian Bokmål_Norway.1252

pandas : 2.2.3
numpy : 2.2.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.2
Cython : None
sphinx : 8.1.3
IPython : 8.31.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : 3.1.5
lxml.etree : None
matplotlib : 3.10.0
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.5
pandas_gbq : None
psycopg2 : None
pymysql : None
pyarrow : 19.0.0
pyreadstat : None
pytest : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.1
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2024.2
qtpy : 2.4.2
pyqt5 : None

Labels

  • Bug
  • Closing Candidate: May be closeable, needs more eyeballs
  • Constructors: Series/DataFrame/Index/pd.array Constructors
  • Performance: Memory or execution speed performance
  • Windows: Windows OS
