reset_index() on MultiIndexed empty dataframe does not preserve dtypes  #19602

Closed
@alberto-dellera

Code Sample, a copy-pastable example if possible

import pandas as pd

df = pd.DataFrame(data=[[0, 0, 0]], columns=['level_1', 'level_2', 'payload'])

# make dataframe empty
df = df[df.payload == -1]

# columns are all int64 here

df.info()  # info() prints its report itself and returns None
#output: level_1    0 non-null int64
#output: level_2    0 non-null int64
#output: payload    0 non-null int64

# set MultiIndex - levels are still int64
df = df.set_index(['level_1', 'level_2'])

print(df.index.levels[0].dtype)
print(df.index.levels[1].dtype)
#output: int64
#output: int64

# reset_index - former-levels columns are now float64
df = df.reset_index()

df.info()  # info() prints its report itself and returns None
#output: level_1    0 non-null float64
#output: level_2    0 non-null float64
#output: payload    0 non-null int64

Problem description

By contrast, the dtypes are preserved if either
a) the index is not a MultiIndex, or
b) the dataframe is not empty
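
For comparison, here is a minimal sketch of the two cases where the dtypes survive, reusing the frame from the code sample above. The variable names `empty`, `single` and `multi` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(data=[[0, 0, 0]], columns=["level_1", "level_2", "payload"])
empty = df[df.payload == -1]

# (a) empty dataframe, but a single-level index instead of a MultiIndex
single = empty.set_index("level_1").reset_index()

# (b) MultiIndex, but the dataframe is not empty
multi = df.set_index(["level_1", "level_2"]).reset_index()

# in both cases the former index columns are still int64
print(single.dtypes)
print(multi.dtypes)
```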

(b) is a big issue for programs that compute subsets of dataframes that can sometimes
be empty, since downstream code might expect a certain dtype and fail when it finds
float64 instead.

Real-world scenario: sampling a system (a collection of processes or threads) at regular intervals
and collecting some measures (cpu used, or other resources or figures); a very common strategy
in performance investigation software (e.g. check Oracle's v$active_session_history).
Here, the natural index is (sample_time, process_id), sample_time being datetime64 (a Time Series).
Even more naturally, we want to compute differences of sample_time, yielding a timedelta64,
and divide them by np.timedelta64(1,'s') to get the elapsed time in seconds; but when the
initial dataframe is empty, we end up dividing float64 / np.timedelta64(1,'s') and get an exception.
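
A minimal sketch of that scenario on a non-empty frame, where the computation works as intended. The column name `cpu_used` and all sample values are made up for illustration; only the (sample_time, process_id) index comes from the description above.

```python
import numpy as np
import pandas as pd

# hypothetical samples: one process observed every 5 seconds
samples = pd.DataFrame({
    "sample_time": pd.to_datetime([
        "2018-02-08 10:00:00",
        "2018-02-08 10:00:05",
        "2018-02-08 10:00:10",
    ]),
    "process_id": [42, 42, 42],
    "cpu_used": [0.10, 0.25, 0.40],
})
samples = samples.set_index(["sample_time", "process_id"])

# back to plain columns; sample_time is still datetime64 because the frame is not empty
flat = samples.reset_index()

# differences of datetime64 give timedelta64; dividing by one second gives float seconds
elapsed_seconds = flat["sample_time"].diff() / np.timedelta64(1, "s")
print(elapsed_seconds.tolist())
```

The same division is what fails when `sample_time` comes back from reset_index() as float64.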

An obvious workaround is to check for empty dataframes after EVERY reset_index()
and coerce the float64 columns back to their correct dtypes - but that easily becomes a maintenance/coverage nightmare :O
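
One possible way to keep that workaround in a single place is a small wrapper, sketched below. `reset_index_keep_dtypes` is a hypothetical helper name, not a pandas function; it assumes every index level is named and that casting the empty column back with `astype` is valid for the dtypes involved (only checked here for int64).

```python
import pandas as pd


def reset_index_keep_dtypes(df):
    """Hypothetical helper: reset_index() that restores level dtypes on empty frames."""
    index = df.index
    # remember the dtype of each index level before resetting
    if isinstance(index, pd.MultiIndex):
        level_dtypes = {name: level.dtype
                        for name, level in zip(index.names, index.levels)}
    else:
        level_dtypes = {index.name: index.dtype}

    out = df.reset_index()
    if out.empty:
        # cast back only the named levels whose dtype actually changed
        to_cast = {name: dtype for name, dtype in level_dtypes.items()
                   if name is not None and name in out.columns
                   and out[name].dtype != dtype}
        if to_cast:
            out = out.astype(to_cast)
    return out


df = pd.DataFrame(data=[[0, 0, 0]], columns=["level_1", "level_2", "payload"])
empty = df[df.payload == -1].set_index(["level_1", "level_2"])
restored = reset_index_keep_dtypes(empty)
print(restored.dtypes)
```

This still has to be called instead of every plain reset_index(), so it only reduces the maintenance burden rather than removing it.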

Expected Output

The reset columns should keep their initial dtype.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.1.1
openpyxl: 2.4.9
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: MultiIndex, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
