Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
import numpy as np
import time
# Create two large DataFrames with missing data
np.random.seed(0)
size = 1_000_000
df1 = pd.DataFrame({
    'ID': range(size),
    'Name': np.random.choice(['Alice', 'Bob', 'Charlie', 'David', 'Eve', None], size)
})
df2 = pd.DataFrame({
    'ID': range(size // 2, size * 3 // 2),  # Overlapping and new IDs
    'Age': np.random.choice([None, 20, 30, 40, 50, 60], size)
})
# Measure time for merge operation
start_time = time.time()
merged_df = pd.merge(df1, df2, on='ID', how='outer')
merge_time = time.time() - start_time
print(f"Merge time: {merge_time:.2f} seconds")
# Measure time for fillna operation
start_time = time.time()
merged_df['Name'].fillna('Unknown', inplace=True)
merged_df['Age'].fillna(0, inplace=True)
fillna_time = time.time() - start_time
print(f"Fillna time: {fillna_time:.2f} seconds")
# Print some statistics
print(f"Total rows after merge: {len(merged_df)}")
print(f"Null values in 'Name' after fillna: {merged_df['Name'].isnull().sum()}")
print(f"Null values in 'Age' after fillna: {merged_df['Age'].isnull().sum()}")
Issue Description
Bug Description
When fillna() is called on columns of a DataFrame produced by an outer merge, as in the example above, unexpected behavior and performance issues occur.
Expected Behavior
The fillna() operation should efficiently fill missing values after merging: the null counts printed for 'Name' and 'Age' at the end of the example should both be 0, with no unexpected behavior or significant performance degradation.
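For comparison, here is a minimal sketch of the expected outcome on a small frame. It assumes the fill is done by assigning the result of DataFrame.fillna with a per-column dict, rather than calling fillna(..., inplace=True) on a selected column as the example does. The tiny frames and the name `filled` are illustrative and not part of the report. Under Copy-on-Write, merged_df['Name'] is a separate object, so an inplace fill on it may not propagate back to merged_df. That could be one source of the unexpected behavior, though this has not been verified against the reporter's environment.

```python
import pandas as pd

# Small stand-ins for the frames in the report (illustrative data only).
df1 = pd.DataFrame({"ID": [0, 1, 2, 3], "Name": ["Alice", None, "Charlie", "David"]})
df2 = pd.DataFrame({"ID": [2, 3, 4, 5], "Age": [20, None, 40, 50]})

merged = pd.merge(df1, df2, on="ID", how="outer")

# Fill per column and keep the returned frame instead of relying on
# inplace=True applied to a column selection.
filled = merged.fillna({"Name": "Unknown", "Age": 0})

# Both counts should be zero if the fill behaved as expected.
null_counts = filled[["Name", "Age"]].isna().sum()
print(null_counts)
```

If the assigned form reports zero nulls while the chained inplace form from the example does not, the difference points at the chained call rather than at the merge.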
Actual Behavior
The fillna() operation may exhibit unexpected behavior or poor performance, especially with larger datasets.
Additional Context
This issue becomes more apparent when working with larger datasets and complex merge operations. Improving the performance and reliability of fillna() after merging would greatly benefit data processing workflows.
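To help separate missing values introduced by the merge from those already present in the inputs, the sketch below counts where each merged row came from using the indicator option of pd.merge. It uses a small `size` as a stand-in for the 1_000_000 rows in the report and constant column values so that every null is attributable to the outer join. Both choices are assumptions made for illustration.

```python
import pandas as pd

size = 10  # small stand-in for the 1_000_000 rows used in the report
df1 = pd.DataFrame({"ID": range(size), "Name": ["Alice"] * size})
df2 = pd.DataFrame({"ID": range(size // 2, size * 3 // 2), "Age": [30] * size})

# indicator=True adds a "_merge" column marking each row as
# left_only, right_only, or both.
merged = pd.merge(df1, df2, on="ID", how="outer", indicator=True)
origin = merged["_merge"].value_counts()

# Rows found only in df2 have no Name; rows found only in df1 have no Age.
merge_nulls = merged[["Name", "Age"]].isna().sum()

print(len(merged))
print(origin)
print(merge_nulls)
```

With the ID ranges used in the report, half of each frame overlaps, so the merged frame has 1.5 times `size` rows and each of the two columns gains `size // 2` nulls from the join alone.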
Environment
- pandas version: 3.0.0
- Python version: 3.13.2
- Operating System: Linux
Installed Versions
INSTALLED VERSIONS
commit : None
python : 3.13.2.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
pip : 24.0
setuptools : 69.0.2
Cython : 3.0.8
pytest : 8.0.0
hypothesis : 6.98.3
sphinx : 7.2.6
blosc : None
feather : None
xlsxwriter : 3.1.9
lxml.etree : 5.1.0
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.3
IPython : 8.21.0
pandas_datareader: None
[other dependencies ...]