Description
There seems to be no direct way to return to the original dtype, and the documentation recommends: "To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical)"
That's slow, and a decode or decat method would be trivial:
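import numpy as np
import pandas as pd
# sizes and shapes must be integers; float literals such as 4e6 are rejected by current NumPy
df = pd.DataFrame(np.random.choice(list(u'abcde'), 4000000).reshape(1000000, 4),
                  columns=list(u'ABCD'))
for col in df.columns: df[col] = df[col].astype('category')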
%timeit for col in df.columns: df[col].astype('unicode')
1 loops, best of 3: 1.06 s per loop
%timeit for col in df.columns: cats=df[col].cat.categories; cats[df[col].cat.codes]
10 loops, best of 3: 33.2 ms per loop
I was working with ~10 categories (some of them longer strings) on a dataset of 20 million rows, where the difference was even bigger (unfortunately I can't reproduce it with dummy data). There, astype took minutes, which felt more like a bug than a mere performance issue.
Given the current limitations on exporting categorical data, having a fast decode method would be very convenient. Since categories are most often strings, an optional parameter for direct character set encoding would also be good to have for such a method.
%timeit for col in df.columns: df[col].astype('unicode').str.encode('latin1')
1 loops, best of 3: 3.95 s per loop
%timeit for col in df.columns: cats=pd.Series(df[col].cat.categories).str.encode('latin1'); cats[df[col].cat.codes]
10 loops, best of 3: 74.5 ms per loop
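For illustration, the decode method proposed above could look roughly like the following sketch. The name decode_categorical and its encoding parameter are hypothetical and not part of the pandas API; the sketch assumes that the categories are strings whenever encoding is given. It encodes each distinct category once and then expands the result through the integer codes, which is the same trick as in the fast timings above.

```python
import numpy as np
import pandas as pd


def decode_categorical(series, encoding=None):
    """Hypothetical helper: rebuild the plain values of a categorical Series.

    Not a pandas API. Assumes string categories when `encoding` is given.
    """
    cats = list(series.cat.categories)
    if encoding is not None:
        # Encode each distinct category once instead of once per row.
        cats = [c.encode(encoding) for c in cats]
    # Lookup table with a trailing NaN, so that the code -1 (missing value)
    # indexes the last slot and comes back as NaN.
    lookup = np.empty(len(cats) + 1, dtype=object)
    lookup[:-1] = cats
    lookup[-1] = np.nan
    values = lookup[series.cat.codes.to_numpy()]
    return pd.Series(values, index=series.index, name=series.name, dtype=object)


s = pd.Series(['a', 'b', None, 'a'], dtype='category')
decoded = decode_categorical(s)
encoded = decode_categorical(s, encoding='latin1')
```

The work per row is a single integer lookup, so the cost of string conversion or encoding scales with the number of categories rather than the number of rows.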