Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
[nav] In [42]: first_char = pd.Series(['a']*int(1e6)); second_char = pd.Series(['b']*int(1e6))
[ins] In [43]: %time pd.concat([first_char, second_char])
CPU times: user 7.49 ms, sys: 11.2 ms, total: 18.7 ms
Wall time: 18.2 ms
[ins] In [50]: first_nan = pd.Series([np.nan]*int(1e6)); second_nan = pd.Series([np.nan]*int(1e6))
[ins] In [51]: %time pd.concat([first_nan, second_nan])
CPU times: user 12 ms, sys: 2.47 ms, total: 14.5 ms
Wall time: 13.8 ms
Out[51]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
..
999995 NaN
999996 NaN
999997 NaN
999998 NaN
999999 NaN
Length: 2000000, dtype: float64
[ins] In [52]: first_nan_object = pd.Series([np.nan]*int(1e6), dtype=object); second_nan_object = pd.Series([np.nan]*int(1e6), dtype=object)
[ins] In [53]: %time pd.concat([first_nan_object, second_nan_object])
CPU times: user 7.77 ms, sys: 11.9 ms, total: 19.6 ms
Wall time: 19.1 ms
Out[53]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
999995 NaN
999996 NaN
999997 NaN
999998 NaN
999999 NaN
Length: 2000000, dtype: object
Issue Description
Calling pd.concat on very large datasets can be significantly slower for all-NaN object-dtype columns than for the same columns with float dtype. The discrepancy in the example above is mild compared to what I'm seeing on real datasets.
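For anyone trying to reproduce this outside IPython, the comparison could be run as a standalone script along these lines. This is only a sketch: the helper name time_concat and its parameters are made up for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def time_concat(dtype, n=1_000_000, repeat=5):
    """Best wall time in seconds for concatenating two all-NaN Series.

    `time_concat` is a hypothetical helper, not part of pandas.
    """
    first = pd.Series([np.nan] * n, dtype=dtype)
    second = pd.Series([np.nan] * n, dtype=dtype)
    # Take the minimum over several runs to reduce timing noise.
    timings = timeit.repeat(
        lambda: pd.concat([first, second]), number=1, repeat=repeat
    )
    return min(timings)


if __name__ == "__main__":
    for dtype in ("float64", object):
        seconds = time_concat(dtype)
        print(f"dtype={dtype!s:>20}: {seconds * 1e3:.2f} ms")
```

Increasing n, or concatenating more than two Series, may make the gap between the two dtypes easier to see.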
Expected Behavior
pd.concat should be able to apply an under-the-hood optimization for all-NaN columns, regardless of dtype.
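Until something like that exists, a possible user-side workaround is to cast all-NaN object Series to float64 before concatenating. The sketch below is an assumption about what may help, not a description of how pandas works internally. The helper name as_float_if_all_nan is hypothetical, and the result comes back as float64 rather than object, which may or may not be acceptable downstream.

```python
import numpy as np
import pandas as pd


def as_float_if_all_nan(series):
    """Cast an all-NaN object Series to float64; return anything else as is.

    Hypothetical helper for illustration, not a pandas API.
    """
    if series.dtype == object and series.isna().all():
        return series.astype("float64")
    return series


first_nan_object = pd.Series([np.nan] * 1000, dtype=object)
second_nan_object = pd.Series([np.nan] * 1000, dtype=object)

# Both inputs become float64, so the concat runs on float arrays.
result = pd.concat(
    [as_float_if_all_nan(s) for s in (first_nan_object, second_nan_object)]
)
print(result.dtype)
```

The isna().all() check itself scans the whole column, so whether this is a net win on real data would need to be measured.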
Installed Versions
Note that I am on an older version of pandas because my organization has restrictions.
INSTALLED VERSIONS
commit : 66e3805
python : 3.8.13.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-348.23.1.el8_5.x86_64
Version : #1 SMP Tue Apr 12 11:20:32 EDT 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.3.5
numpy : 1.23.3
pytz : 2022.4
dateutil : 2.8.2
pip : 22.2.2
setuptools : 65.4.1
Cython : 0.29.32
pytest : 7.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.8.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.5.0
pandas_datareader: None
bs4 : 4.11.1
bottleneck : 1.3.5
fsspec : 2022.8.2
fastparquet : None
gcsfs : None
matplotlib : 3.6.0
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : None
pyarrow : 9.0.0
pyxlsb : None
s3fs : None
scipy : 1.9.1
sqlalchemy : 1.4.41
tables : 3.6.1
tabulate : 0.9.0
xarray : 2022.9.0
xlrd : 2.0.1
xlwt : None
numba : 0.56.2