Description
Calling pd.Series(EA).astype(object) always generates an intermediate NumPy array and never delegates to the astype method of the ExtensionArray.
The call stack is as follows:
pd.Series(EA).astype()
- (omitting some intermediates)
- pandas.core.internals.Block._astype calls self.get_values() in pandas/pandas/core/internals.py, line 661 in 4274b84 (introduced by https://github.com/pandas-dev/pandas/pull/20581/files)
- pandas.core.internals.ExtensionBlock.get_values then casts the ExtensionArray to a numpy.array: pandas/pandas/core/internals.py, line 1937 in 4274b84
Expected behaviour would have been that astype is called on the ExtensionArray, which can then do the casting on its own. Currently I have the problem that my underlying storage (ExtensionArray backed by Arrow) is not NumPy-compatible, and thus everything turns into np.array(…, dtype=object) before it is cast.
Happy to fix this on my own, but I would need a pointer on what the correct approach is, i.e. where one should delegate to ExtensionArray.astype.
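To make the difference between the two paths concrete, here is a rough toy model in plain Python. It is not the pandas implementation: ToyExtensionArray, ToyBlock, to_object_list, astype_via_get_values and astype_delegating are all hypothetical names made up for this sketch, and Python lists stand in for both the Arrow storage and the NumPy object array. It only illustrates which method of the array gets called in each case.

```python
# Toy model only: these classes are hypothetical stand-ins, not pandas internals.


class ToyExtensionArray:
    """Stand-in for an ExtensionArray whose storage is not NumPy-compatible."""

    def __init__(self, values):
        self._values = list(values)
        self.calls = []  # records which conversion path was used

    def to_object_list(self):
        # Models the generic fallback, i.e. materialising every element
        # as a Python object (what np.array(ea, dtype=object) amounts to).
        self.calls.append("to_object_list")
        return list(self._values)

    def astype(self, dtype):
        # Models the array's own cast, which could work on the native storage.
        self.calls.append("astype")
        return [dtype(v) for v in self._values]


class ToyBlock:
    """Stand-in for the block that wraps the array inside a Series."""

    def __init__(self, values):
        self.values = values

    def astype_via_get_values(self, dtype):
        # Reported behaviour: materialise first, then cast the result.
        materialised = self.values.to_object_list()
        return [dtype(v) for v in materialised]

    def astype_delegating(self, dtype):
        # Expected behaviour: hand the cast to the array itself.
        return self.values.astype(dtype)
```

With this model, astype_via_get_values only ever records "to_object_list" on the array, while astype_delegating records "astype". The second is the behaviour I would expect from the block for extension arrays.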
Output of pd.show_versions()
Ran into this with 0.23.0 but the code has not changed in master in this area.
INSTALLED VERSIONS
------------------
commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-112-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8
LOCALE: de_DE.UTF-8
pandas: 0.23.0
pytest: 3.6.0
pip: 9.0.3
setuptools: 39.2.0
Cython: None
numpy: 1.14.2
scipy: None
pyarrow: 0.9.0
xarray: None
IPython: 6.4.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None