Code Sample, a copy-pastable example if possible
In [2]: mi = pd.MultiIndex.from_product([['i'], ['ii'], ['iii']])
In [3]: mi.rename([1,5,6]).get_level_values(1) # Interpreted as label
Out[3]: Index(['i'], dtype='object', name=1)
In [4]: mi.rename([1,5,1]).get_level_values(1) # Interpreted as index
Out[4]: Index(['ii'], dtype='object', name=5)
In [5]: mi.rename(['a',5,'a']).get_level_values('a') # ValueError is OK, KeyError is not
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
670 raise ValueError('The name %s occurs multiple times, use a '
--> 671 'level number' % level)
672 level = self.names.index(level)
ValueError: The name a occurs multiple times, use a level number
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
<ipython-input-5-e8a616f9610f> in <module>()
----> 1 mi.rename(['a',5,'a']).get_level_values('a')
/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in get_level_values(self, level)
975 Index(['d', 'e', 'f'], dtype='object', name='level_2')
976 """
--> 977 level = self._get_level_number(level)
978 values = self._get_level_values(level)
979 return values
/home/nobackup/repo/pandas/pandas/core/indexes/multi.py in _get_level_number(self, level)
673 except ValueError:
674 if not isinstance(level, int):
--> 675 raise KeyError('Level %s not found' % str(level))
676 elif level < 0:
677 level += self.nlevels
KeyError: 'Level a not found'
In [6]: mi.rename([1, 'a', 'a']).get_level_values(1) # How am I going to access the second level?!
Out[6]: Index(['i'], dtype='object', name=1)
Problem description
There are different problems, but the root cause is (I think) the same:
- the first is trivial: the KeyError in In [5] should not appear
- the second is that the interpretation of an integer changes when there is a duplicate name (compare Out[3] and Out[4])
- the third is that in a situation like In [6], I have no way whatsoever to access the second level, since it is denoted by a duplicated name and its index is also the name of another level (sure, this is a perverse example, but I suspect it can bite in cases where users have duplicate labels and pandas internal code adds integer labels)
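To make the shared root cause concrete, here is a simplified pure-Python model of the lookup. It is reconstructed from the traceback lines above (670-677); the surrounding `try` and the duplicate check are my assumption about the rest of `_get_level_number`, and the function name `get_level_number_model` is made up for this illustration. The point is that the duplicate-name ValueError is raised inside the same `try` whose handler falls back to the positional interpretation.

```python
def get_level_number_model(names, level):
    """Simplified model of the level lookup (bounds checks omitted)."""
    nlevels = len(names)
    try:
        if names.count(level) > 1:
            # Raised inside the try block, so the handler below catches it.
            raise ValueError('The name %s occurs multiple times, use a '
                             'level number' % level)
        # list.index raises ValueError when the label is absent.
        level = names.index(level)
    except ValueError:
        if not isinstance(level, int):
            # Non-integer duplicate or missing label ends up here.
            raise KeyError('Level %s not found' % str(level))
        elif level < 0:
            level += nlevels
        # Otherwise the integer is silently kept as a position.
    return level


print(get_level_number_model([1, 5, 6], 1))      # 0: unique label wins
print(get_level_number_model([1, 5, 1], 1))      # 1: duplicate label, falls back to position
print(get_level_number_model([1, 'a', 'a'], 1))  # 0: position 1 is unreachable
```

With `['a', 5, 'a']` and `'a'`, the same handler turns the ValueError into the KeyError shown in In [5].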
Expected Output
If we were to design this from scratch, the solution would be simple: prioritize the "index" interpretation of an integer over the "label" interpretation, so that the former is always unambiguous. Is this too strong an API change?
If the answer is "no", I will be happy to implement it, possibly with a temporary warning in those cases where the behaviour will change (that is: requested label is integer and is present in the names).
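A minimal sketch of what I have in mind, again in pure Python rather than as a patch to `MultiIndex`. The name `resolve_level_positional_first` and the warning text are hypothetical; the sketch already applies the new rule and only warns where the result differs from the label interpretation.

```python
import warnings


def resolve_level_positional_first(names, level):
    """Resolve `level` to a position, treating integers as positions first."""
    nlevels = len(names)
    if isinstance(level, int):
        if level in names:
            # Temporary warning: this integer is also a level name, so the
            # result may differ from the old label-first behaviour.
            warnings.warn('Integer %d is also a level name; it is now '
                          'interpreted as a position' % level,
                          FutureWarning, stacklevel=2)
        if level < 0:
            level += nlevels
        if not 0 <= level < nlevels:
            raise IndexError('Too many levels: index has only %d levels'
                             % nlevels)
        return level
    count = names.count(level)
    if count > 1:
        # Not wrapped in a try block, so nothing turns this into a KeyError.
        raise ValueError('The name %s occurs multiple times, use a '
                         'level number' % level)
    if count == 0:
        raise KeyError('Level %s not found' % str(level))
    return names.index(level)


# The In [6] case: the second level is reachable again (with a warning).
print(resolve_level_positional_first([1, 'a', 'a'], 1))  # 1
```

Under this rule an integer label can no longer be used to select a level by name, which is exactly the API change in question.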
If the answer is "yes", I would like to at least suppress the KeyError in In [5] and have In [4] raise an error rather than return a result inconsistent with In [3].
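For comparison, the conservative fallback could look roughly like the following pure-Python sketch. The name `resolve_level_label_first` is hypothetical and this is not the current pandas code: labels keep priority, but a duplicated label always raises a plain ValueError instead of falling back to a position or being re-raised as a KeyError.

```python
def resolve_level_label_first(names, level):
    """Resolve `level` to a position, treating it as a label first."""
    nlevels = len(names)
    count = names.count(level)
    if count > 1:
        # Ambiguous label: fail loudly for integers and non-integers alike.
        raise ValueError('The name %s occurs multiple times, use a '
                         'level number' % level)
    if count == 1:
        return names.index(level)
    if not isinstance(level, int):
        raise KeyError('Level %s not found' % str(level))
    # Only an integer that is not a level name is treated as a position.
    if level < 0:
        level += nlevels
    if not 0 <= level < nlevels:
        raise IndexError('Too many levels: index has only %d levels'
                         % nlevels)
    return level


print(resolve_level_label_first([1, 5, 6], 1))      # 0, same as Out[3]
print(resolve_level_label_first([1, 'a', 'a'], 2))  # 2, plain positional access
```

With names `[1, 5, 1]` and level `1` this raises ValueError, so In [4] no longer disagrees with In [3]. It does not solve the third problem: with names `[1, 'a', 'a']`, position 1 is still unreachable.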
Output of pd.show_versions()
INSTALLED VERSIONS
commit: 04db779
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8
pandas: 0.22.0.dev0+388.g04db779d4
pytest: 3.2.3
pip: 9.0.1
setuptools: 36.7.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.0dev
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.3.0
xlsxwriter: 0.9.6
lxml: 4.1.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1