BUG: Performance issue with fillna() after merging DataFrames #61180

Open
@sjfakharian

Description

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
import time

# Create two large DataFrames with missing data
np.random.seed(0)
size = 1_000_000

df1 = pd.DataFrame({
    'ID': range(size),
    'Name': np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eve', None], size)
})

df2 = pd.DataFrame({
    'ID': range(size // 2, size * 3 // 2),  # Overlapping and new IDs
    'Age': np.random.choice([None, 20, 30, 40, 50, 60], size)
})

# Measure time for merge operation
start_time = time.time()
merged_df = pd.merge(df1, df2, on='ID', how='outer')
merge_time = time.time() - start_time
print(f"Merge time: {merge_time:.2f} seconds")

# Measure time for fillna operation
start_time = time.time()
merged_df['Name'].fillna('Unknown', inplace=True)
merged_df['Age'].fillna(0, inplace=True)
fillna_time = time.time() - start_time
print(f"Fillna time: {fillna_time:.2f} seconds")

# Print some statistics
print(f"Total rows after merge: {len(merged_df)}")
print(f"Null values in 'Name' after fillna: {merged_df['Name'].isnull().sum()}")
print(f"Null values in 'Age' after fillna: {merged_df['Age'].isnull().sum()}")

Issue Description

In the example above, calling fillna() on the 'Name' and 'Age' columns of the outer-merged DataFrame shows unexpected behavior and poor performance.

Expected Behavior

The fillna() operation should efficiently fill missing values after merging, without unexpected behavior or significant performance degradation.

Actual Behavior

The fillna() operation may exhibit unexpected behavior or poor performance, especially with larger datasets.
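One possible reading of the "unexpected behavior", which the report itself does not confirm, is that the example calls fillna(..., inplace=True) on a column selected with merged_df['Name']. That is chained assignment, and under Copy-on-Write the selected column behaves as a copy, so merged_df may be left unchanged. A minimal sketch of the assign-back form follows, using small hand-made frames (illustrative data, not taken from the report) in place of the 1,000,000-row ones:

```python
import pandas as pd

# Tiny illustrative frames; the IDs overlap only partially, as in the report.
df1 = pd.DataFrame({"ID": [0, 1, 2, 3], "Name": ["Alice", None, "Bob", None]})
df2 = pd.DataFrame({"ID": [2, 3, 4, 5], "Age": [20, None, 30, None]})

# The outer merge introduces extra missing values for IDs found on one side only.
merged = pd.merge(df1, df2, on="ID", how="outer")

# Fill per column and assign the result back, instead of calling
# merged["Name"].fillna(..., inplace=True) on a selected column.
merged = merged.fillna({"Name": "Unknown", "Age": 0})

name_nulls = int(merged["Name"].isna().sum())
age_nulls = int(merged["Age"].isna().sum())
print(f"Null values in 'Name' after fillna: {name_nulls}")
print(f"Null values in 'Age' after fillna: {age_nulls}")
```

If the null counts printed by the original example are non-zero while this form reports zero, the chained inplace call would be the likely cause rather than the merge itself.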

Additional Context

This issue becomes more apparent when working with larger datasets and complex merge operations. Improving the performance and reliability of fillna() after merging would greatly benefit data processing workflows.
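A diagnostic that may help narrow down the performance side: np.random.choice over a list that mixes None with integers returns an object-dtype array, so the 'Age' column likely stays object dtype through the merge, and operations on object columns are generally slower than on numeric ones. Whether this accounts for the timings in the report is an assumption. The sketch below uses a small hand-made object array (illustrative, not from the report) to show how the dtype could be checked and converted before filling:

```python
import numpy as np
import pandas as pd

# Illustrative object-dtype array mixing None and ints, similar in kind to
# what np.random.choice([None, 20, ...], size) produces in the report.
ages = np.array([None, 20, 30, None], dtype=object)

left = pd.DataFrame({"ID": [0, 1, 2, 3]})
right = pd.DataFrame({"ID": [2, 3, 4, 5], "Age": ages})
merged = pd.merge(left, right, on="ID", how="outer")

# Inspect the column dtypes after the merge.
print(merged.dtypes)

# Convert to a numeric dtype first (None becomes NaN), then fill.
age_numeric = pd.to_numeric(merged["Age"])
age_filled = age_numeric.fillna(0)
print(age_filled.dtype)
```

Timing fillna() on the converted column against the object-dtype column, at the report's original size, would show whether the dtype is a factor.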

Environment

  • pandas version: 3.0.0
  • Python version: 3.13.2
  • Operating System: Linux

Installed Versions

INSTALLED VERSIONS

commit : None
python : 3.13.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 3.0.0
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
pip : 24.0
setuptools : 69.0.2
Cython : 3.0.8
pytest : 8.0.0
hypothesis : 6.98.3
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.21.0
pandas_datareader: None

[other dependencies ...]

Metadata

Assignees

No one assigned

    Labels

    Missing-data: np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
    Needs Info: Clarification about behavior needed to assess issue
    Performance: Memory or execution speed performance

