Description
Code Sample, a copy-pastable example if possible
>>> import pandas as pd
>>> import numpy as np
>>> s1 = pd.Series(['1-1','1-1',np.NaN], dtype='category')
>>> s1.apply(lambda x: x.split('-')[0])
0      1
1      1
2    NaN
dtype: category
Categories (1, object): [1]
>>> s2 = pd.Series(['1-1','1-2',np.NaN], dtype='category')
>>> s2.apply(lambda x: x.split('-')[0])
0    1
1    1
2    1
dtype: object
Problem description
In the code above, s1 shows the expected behaviour. We are transforming a categorical series by taking the part before the hyphen, and for rows where the original value is NaN the output is also NaN.
The series s2 shows the unexpected behaviour. The only change from s1 is that the middle value is '1-2' instead of '1-1'. The third value, which was NaN in the original series, now becomes '1' in the output rather than staying NaN. The dtype of the result is also object rather than category. It looks as though the NaN is somehow picking up the applied value of the previous row.
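One possible explanation, offered as a guess rather than something confirmed against the pandas source: a categorical stores missing values with the sentinel code -1, and if the mapped categories are looked up by position with that code, -1 selects the last category instead of producing NaN. The sketch below only illustrates that indexing effect with plain NumPy; the variable names are made up for the illustration.

```python
import numpy as np
import pandas as pd

s2 = pd.Series(['1-1', '1-2', np.nan], dtype='category')

# Integer codes behind the categorical; the missing value is stored as -1.
codes = s2.cat.codes.values

# Apply the transformation to the categories only (NaN is not a category).
mapped_categories = np.array([c.split('-')[0] for c in s2.cat.categories])

# A naive positional lookup: code -1 wraps around to the last category,
# so the NaN row ends up with a real value.
naive = np.take(mapped_categories, codes)

print(codes)   # codes for the three rows
print(naive)   # values produced by the naive lookup
```

If this is what happens internally, it would also explain why s1 is unaffected: its mapped categories stay unique, so a different code path may be taken.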
Expected Output
0 1
1 1
2 NaN
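In the meantime, a minimal workaround sketch that should give the expected values: convert the series to object dtype first and guard against missing values explicitly. The helper name prefix_before_hyphen is hypothetical, and the result is an object series rather than a categorical one.

```python
import numpy as np
import pandas as pd


def prefix_before_hyphen(value):
    """Return the part before the first hyphen, passing NaN through."""
    if pd.isna(value):
        return np.nan
    return value.split('-')[0]


s2 = pd.Series(['1-1', '1-2', np.nan], dtype='category')

# Work on plain Python objects so every row, including NaN, reaches the helper.
result = s2.astype(object).map(prefix_before_hyphen)
print(result)
```

If a categorical result is needed, result.astype('category') can be applied afterwards.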
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.13.0-38-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8
pandas: 0.22.0
pytest: None
pip: 9.0.1
setuptools: 38.4.0
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.1
openpyxl: None
xlrd: None
xlwt: 1.3.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None