Description
Pandas version checks
- I have checked that the issue still exists on the latest versions of the docs on main here
Location of the documentation
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html
Documentation problem
The documentation for DataFrame.astype states that dtype can be a dict mapping from column label to new type, or a scalar type. However, dtype can also be a pd.Series whose index contains a subset of the dataframe's column labels (possibly in a different order) and whose values are dtypes. For example:
import pandas as pd
df = pd.DataFrame([[1, 2]])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string", "object"], index=df.columns[::-1]))
print(f"new dtypes:\n{new_df.dtypes}")
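For comparison, here is a minimal sketch that passes the same mapping once as a dict and once as a Series. The column labels "a" and "b" are illustrative and not taken from the example above, and the sketch assumes a pandas version in which astype accepts a Series as described here.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Documented form: a dict mapping column label -> new dtype.
via_dict = df.astype({"b": "float64"})

# Undocumented form discussed here: the same mapping as a Series whose
# index holds column labels and whose values are dtypes.
via_series = df.astype(pd.Series(["float64"], index=["b"]))

# Both calls should yield the same dtypes; column "a" is left untouched.
print(via_dict.dtypes.equals(via_series.dtypes))
```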
If the frame has duplicate column labels, the index of the new series of dtypes may still be a subset of the column labels. However, it appears that the new series can have a non-unique index only if that index is exactly equal to the dataframe's columns. For example, this works:
import pandas as pd
df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns))
print(f"new dtypes:\n{new_df.dtypes}")
but giving the series an index with all the columns in a different order
import pandas as pd
df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns[::-1]))
print(f"new dtypes:\n{new_df.dtypes}")
raises ValueError: cannot reindex on an axis with duplicate labels. Stack trace:
ValueError Traceback (most recent call last)
Input In [14], in <module>
3 df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
4 print(f"original dtypes:\n{df.dtypes}")
----> 5 new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns[::-1]))
6 print(f"new dtypes:\n{new_df.dtypes}")
File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:5898, in NDFrame.astype(self, dtype, copy, errors)
5895 from pandas import Series
5897 dtype_ser = Series(dtype, dtype=object)
-> 5898 dtype_ser = dtype_ser.reindex(self.columns, fill_value=None, copy=False)
5900 results = []
5901 for i, (col_name, col) in enumerate(self.items()):
File /usr/local/lib/python3.9/site-packages/pandas/core/series.py:4672, in Series.reindex(self, *args, **kwargs)
4668 raise TypeError(
4669 "'index' passed as both positional and keyword argument"
4670 )
4671 kwargs.update({"index": index})
-> 4672 return super().reindex(**kwargs)
File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:4974, in NDFrame.reindex(self, *args, **kwargs)
4971 return self._reindex_multi(axes, copy, fill_value)
4973 # perform the reindex on the axes
-> 4974 return self._reindex_axes(
4975 axes, level, limit, tolerance, method, fill_value, copy
4976 ).__finalize__(self, method="reindex")
File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:4994, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
4989 new_index, indexer = ax.reindex(
4990 labels, level=level, limit=limit, tolerance=tolerance, method=method
4991 )
4993 axis = self._get_axis_number(a)
-> 4994 obj = obj._reindex_with_indexers(
4995 {axis: [new_index, indexer]},
4996 fill_value=fill_value,
4997 copy=copy,
4998 allow_dups=False,
4999 )
5000 # If we've made a copy once, no need to make another one
5001 copy = False
File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:5040, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
5037 indexer = ensure_platform_int(indexer)
5039 # TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)
-> 5040 new_data = new_data.reindex_indexer(
5041 index,
5042 indexer,
5043 axis=baxis,
5044 fill_value=fill_value,
5045 allow_dups=allow_dups,
5046 copy=copy,
5047 )
5048 # If we've made a copy once, no need to make another one
5049 copy = False
File /usr/local/lib/python3.9/site-packages/pandas/core/internals/managers.py:679, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice, use_na_proxy)
677 # some axes don't allow reindexing with dups
678 if not allow_dups:
--> 679 self.axes[axis]._validate_can_reindex(indexer)
681 if axis >= self.ndim:
682 raise IndexError("Requested axis not found in manager")
File /usr/local/lib/python3.9/site-packages/pandas/core/indexes/base.py:4107, in Index._validate_can_reindex(self, indexer)
4105 # trying to reindex on an axis with duplicates
4106 if not self._index_as_unique and len(indexer):
-> 4107 raise ValueError("cannot reindex on an axis with duplicate labels")
ValueError: cannot reindex on an axis with duplicate labels
and so does passing a series of dtypes for just the first two columns:
import pandas as pd
df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string", "object"], index=df.columns[:2]))
print(f"new dtypes:\n{new_df.dtypes}")
although passing a type for just the duplicated column label 0 converts the first two columns to the same new type:
import pandas as pd
df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string"], index=[0]))
print(f"new dtypes:\n{new_df.dtypes}")
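These outcomes look consistent with the alignment step visible in the stack trace, where the dtype mapping is turned into a Series and reindexed against the frame's columns. The sketch below reproduces only that reindex call, outside of astype. The variable names are illustrative, and it assumes the reindex behaviour of the pandas version shown in the trace.

```python
import pandas as pd

columns = pd.Index([0, 0, 1])  # duplicate labels, as in the examples above

# A dtype Series with a unique index can be aligned to duplicated columns:
# the single entry for label 0 is repeated for both columns named 0, and
# label 1 receives a missing value.
unique_spec = pd.Series(["string"], index=[0], dtype=object)
aligned = unique_spec.reindex(columns)
print(aligned.tolist())

# A dtype Series whose own index has duplicates and does not equal the
# columns cannot be aligned; reindex is expected to raise ValueError.
dup_spec = pd.Series(["string", "object"], index=columns[:2], dtype=object)
try:
    dup_spec.reindex(columns)
    reindex_error = None
except ValueError as exc:
    reindex_error = exc
print(repr(reindex_error))
```

If this reading is right, the restriction on non-unique dtype indexes is a side effect of reindex rather than a rule specific to astype.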
Finally, it's possible to pass a series of a single dtype to Series.astype. In that case, the one value in the series index must be the series name, e.g.:
import pandas as pd
s = pd.Series(1)
s.astype(pd.Series(["int64"], index=[s.name]))
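The same constraint can be sketched with a named series. The name "x" and the mismatched label "y" are illustrative; the expectation that a mismatched label raises KeyError is an assumption based on how Series.astype handles dict-like dtypes, and may differ between pandas versions.

```python
import pandas as pd

s = pd.Series([1, 2], name="x")

# The single index label of the dtype Series matches the series name.
converted = s.astype(pd.Series(["float64"], index=["x"]))
print(converted.dtype)

# A label other than the series name is expected to be rejected.
try:
    s.astype(pd.Series(["float64"], index=["y"]))
    name_error = None
except KeyError as exc:
    name_error = exc
print(repr(name_error))
```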
Suggested fix for documentation
I don't know whether all of this is an intended feature, or just a bug caused by converting dict-like dtypes to Series here. If it's a feature, it should be documented in the documentation for astype for both DataFrame and Series.