DataFrame.set_index() may not preserve dtype #30517

Open
@Dr-Irv

xref #19602

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.25.3'

In [3]: df = pd.DataFrame({'mixed': [1, 2, 'abc', 'def'], 'ints': [100, 200, 300, 400]})

In [4]: df
Out[4]:
  mixed  ints
0     1   100
1     2   200
2   abc   300
3   def   400

In [5]: df.dtypes
Out[5]:
mixed    object
ints      int64
dtype: object

In [6]: df.query('ints < 300').set_index('mixed').index
Out[6]: Int64Index([1, 2], dtype='int64', name='mixed')

In [7]: df.set_index('mixed').query('ints < 300').index
Out[7]: Index([1, 2], dtype='object', name='mixed')

Problem description

In the above, I start with a DataFrame whose column mixed holds both integer and string values, so its dtype is object.

In statement [6], I query on a different column and then set the index to the column mixed. The resulting index has dtype int64 rather than preserving the object dtype of the original column.

In statement [7], I set the index first and then run the query, and the index keeps the object dtype.

This becomes an issue if you do some computation on the queried DataFrame, set mixed as its index, and then want to merge the result back into the original DataFrame. The original has mixed with dtype 'O', while the new one has mixed with dtype 'int64'.
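A possible workaround, sketched under the assumption that casting the index back after the fact is acceptable: record the column's dtype before calling set_index(), then cast the resulting index to it with Index.astype(). The helper name set_index_keep_dtype is hypothetical and not part of pandas; it only illustrates the idea and does not change how set_index() itself infers dtypes.

```python
import pandas as pd


def set_index_keep_dtype(frame, column):
    """Hypothetical helper (not a pandas API): move `column` into the index,
    then cast the index back to the dtype the column had beforehand."""
    original_dtype = frame[column].dtype
    result = frame.set_index(column)
    # Undo any dtype inference that happened while the index was built.
    result.index = result.index.astype(original_dtype)
    return result


df = pd.DataFrame({"mixed": [1, 2, "abc", "def"], "ints": [100, 200, 300, 400]})

# Query first, then set the index while keeping the column's object dtype.
subset = set_index_keep_dtype(df.query("ints < 300"), "mixed")
full = df.set_index("mixed")

print(subset.index.dtype)  # object
print(full.index.dtype)    # object
```

With both indexes sharing the object dtype, a later merge or join on mixed no longer compares an integer index against an object one.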

Expected Output

From statement [6], I would have expected:

Index([1, 2], dtype='object', name='mixed')

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191203
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 4.54.2
sphinx : 2.3.0
blosc : None
feather : None
xlsxwriter : 1.2.6
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : 4.8.1
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.1
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.3.2
sqlalchemy : 1.3.11
tables : 3.6.1
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.6
