DOC: Document that astype() for Series and Dataframe can accept a series of dtypes #46353

Open
@mvashishtha

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html

https://pandas.pydata.org/docs/reference/api/pandas.Series.astype.html

Documentation problem

The documentation for DataFrame.astype states that dtype can be a scalar type or a dict mapping column labels to new types. However, dtype can also be a pd.Series whose index is a subset of the dataframe's column labels (possibly in a different order) and whose values are dtypes. For example:

import pandas as pd

df = pd.DataFrame([[1, 2]])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string", "object"], index=df.columns[::-1]))
print(f"new dtypes:\n{new_df.dtypes}")
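
For comparison with the documented dict form, here is a minimal sketch that passes the same label-to-dtype pairs once as a pd.Series and once as a dict. The variable names (dtype_series, dtype_dict, from_series, from_dict) are only illustrative, and the sketch assumes a frame with unique column labels, as in the example above.

```python
import pandas as pd

df = pd.DataFrame([[1, 2]])  # columns are labeled 0 and 1

# Series of dtypes indexed by column label, in reversed order:
# label 1 -> "string", label 0 -> "object".
dtype_series = pd.Series(["string", "object"], index=df.columns[::-1])

# The documented dict form, built from the same label -> dtype pairs.
dtype_dict = dtype_series.to_dict()

from_series = df.astype(dtype_series)
from_dict = df.astype(dtype_dict)

# Both calls should assign dtypes by label, not by position.
print(from_series.dtypes)
print(from_dict.dtypes)
```

If the Series form is matched by label like the dict form, both printouts show the same dtypes per column.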

If the frame has duplicate column labels, the index of the series of dtypes may still be a subset of the column labels. However, it appears that the series of dtypes can have a non-unique index only if that index is exactly equal to the dataframe's columns. For example, this works:

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns))
print(f"new dtypes:\n{new_df.dtypes}")

but giving the series an index with all the columns in a different order

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns[::-1]))
print(f"new dtypes:\n{new_df.dtypes}")

raises ValueError: cannot reindex on an axis with duplicate labels:

Show stack trace
ValueError                                Traceback (most recent call last)
Input In [14], in <module>
      3 df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
      4 print(f"original dtypes:\n{df.dtypes}")
----> 5 new_df = df.astype(pd.Series(["string", "object", "float"], index=df.columns[::-1]))
      6 print(f"new dtypes:\n{new_df.dtypes}")

File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:5898, in NDFrame.astype(self, dtype, copy, errors)
   5895 from pandas import Series
   5897 dtype_ser = Series(dtype, dtype=object)
-> 5898 dtype_ser = dtype_ser.reindex(self.columns, fill_value=None, copy=False)
   5900 results = []
   5901 for i, (col_name, col) in enumerate(self.items()):

File /usr/local/lib/python3.9/site-packages/pandas/core/series.py:4672, in Series.reindex(self, *args, **kwargs)
   4668         raise TypeError(
   4669             "'index' passed as both positional and keyword argument"
   4670         )
   4671     kwargs.update({"index": index})
-> 4672 return super().reindex(**kwargs)

File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:4974, in NDFrame.reindex(self, *args, **kwargs)
   4971     return self._reindex_multi(axes, copy, fill_value)
   4973 # perform the reindex on the axes
-> 4974 return self._reindex_axes(
   4975     axes, level, limit, tolerance, method, fill_value, copy
   4976 ).__finalize__(self, method="reindex")

File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:4994, in NDFrame._reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   4989 new_index, indexer = ax.reindex(
   4990     labels, level=level, limit=limit, tolerance=tolerance, method=method
   4991 )
   4993 axis = self._get_axis_number(a)
-> 4994 obj = obj._reindex_with_indexers(
   4995     {axis: [new_index, indexer]},
   4996     fill_value=fill_value,
   4997     copy=copy,
   4998     allow_dups=False,
   4999 )
   5000 # If we've made a copy once, no need to make another one
   5001 copy = False

File /usr/local/lib/python3.9/site-packages/pandas/core/generic.py:5040, in NDFrame._reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
   5037     indexer = ensure_platform_int(indexer)
   5039 # TODO: speed up on homogeneous DataFrame objects (see _reindex_multi)
-> 5040 new_data = new_data.reindex_indexer(
   5041     index,
   5042     indexer,
   5043     axis=baxis,
   5044     fill_value=fill_value,
   5045     allow_dups=allow_dups,
   5046     copy=copy,
   5047 )
   5048 # If we've made a copy once, no need to make another one
   5049 copy = False

File /usr/local/lib/python3.9/site-packages/pandas/core/internals/managers.py:679, in BaseBlockManager.reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice, use_na_proxy)
    677 # some axes don't allow reindexing with dups
    678 if not allow_dups:
--> 679     self.axes[axis]._validate_can_reindex(indexer)
    681 if axis >= self.ndim:
    682     raise IndexError("Requested axis not found in manager")

File /usr/local/lib/python3.9/site-packages/pandas/core/indexes/base.py:4107, in Index._validate_can_reindex(self, indexer)
   4105 # trying to reindex on an axis with duplicates
   4106 if not self._index_as_unique and len(indexer):
-> 4107     raise ValueError("cannot reindex on an axis with duplicate labels")

ValueError: cannot reindex on an axis with duplicate labels

and so does passing a series of dtypes for just the first two columns:

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string", "object"], index=df.columns[:2]))
print(f"new dtypes:\n{new_df.dtypes}")

although passing a type for just duplicate column label 0 converts the first two columns to the same new type:

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=[0, 0, 1])
print(f"original dtypes:\n{df.dtypes}")
new_df = df.astype(pd.Series(["string"], index=[0]))
print(f"new dtypes:\n{new_df.dtypes}")
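
The stack trace above points at the dtype_ser.reindex(self.columns, ...) call inside astype. The sketch below runs only that reindex step on a plain pd.Series, outside of astype, for the three duplicate-label cases described here. It is an approximation of what the trace shows, not the pandas implementation; the helper try_reindex is hypothetical.

```python
import pandas as pd

columns = pd.Index([0, 0, 1])  # same labels as the frame in the examples


def try_reindex(dtype_ser, target):
    """Hypothetical helper: return the reindexed values, or the error message."""
    try:
        return list(dtype_ser.reindex(target))
    except ValueError as exc:
        return f"ValueError: {exc}"


# All labels in reversed order: the dtype index is non-unique and differs from the target.
reversed_case = try_reindex(
    pd.Series(["string", "object", "float"], index=columns[::-1], dtype=object), columns
)

# Only the first two (duplicate) labels: non-unique and differs from the target.
subset_case = try_reindex(
    pd.Series(["string", "object"], index=columns[:2], dtype=object), columns
)

# A single entry for label 0: the dtype index is unique, so the reindex
# broadcasts that entry to both columns labeled 0.
single_case = try_reindex(pd.Series(["string"], index=[0], dtype=object), columns)

print(reversed_case)
print(subset_case)
print(single_case)
```

This is consistent with the observations above: a dtype series with a unique index can be reindexed against duplicate column labels, while one with a non-unique index is rejected unless it already matches the columns.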

Finally, it's possible to pass a series containing a single dtype to Series.astype. In that case, the only label in that series' index must be the name of the series being cast, e.g.:

import pandas as pd
s = pd.Series(1)
s.astype(pd.Series(["int64"], index=[s.name]))
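
A small sketch of both sides of that constraint, using a named series so the matching label is explicit. The name "a" and the non-matching label "b" are arbitrary choices for illustration.

```python
import pandas as pd

s = pd.Series([1, 2], name="a")

# The only label in the dtype series matches s.name, so the cast goes through.
ok = s.astype(pd.Series(["float64"], index=["a"]))
print(ok.dtype)

# A label that is not the series name is rejected.
try:
    s.astype(pd.Series(["float64"], index=["b"]))
    key_error_raised = False
except KeyError as exc:
    key_error_raised = True
    print(f"KeyError: {exc}")
```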

Suggested fix for documentation

I don't know whether all of this is an intended feature, or just a bug caused by converting dict-like dtypes to Series here. If it's a feature, it should be documented in the documentation for astype for both DataFrame and Series.

Labels: Docs, Dtype Conversions, Series
