ENH: decode for Categoricals #8628

Closed
@fkaufer

There seems to be no direct way to return to the original dtype; the documentation recommends: "To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical)"

That's slow, and a decode or decat method would be trivial to add:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.choice(list(u'abcde'), 4000000).reshape(1000000, 4),
    columns=list(u'ABCD'))
for col in df.columns: df[col] = df[col].astype('category')

%timeit for col in df.columns: df[col].astype('unicode')      
1 loops, best of 3: 1.06 s per loop

%timeit for col in df.columns: cats=df[col].cat.categories; cats[df[col].cat.codes]    
10 loops, best of 3: 33.2 ms per loop   
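
The fast path timed above (index the categories by the integer codes) could be wrapped in a small helper. The following is only a sketch of what such a method might look like: the name `decode` is hypothetical and not an existing pandas API, and it assumes the Series has at least one category. Missing values are stored as code -1, which plain positional indexing would silently map to the last category, so the sketch patches those positions back to NaN.

```python
import numpy as np
import pandas as pd


def decode(series: pd.Series) -> pd.Series:
    """Hypothetical helper: rebuild the plain values of a categorical Series."""
    codes = series.cat.codes.to_numpy()
    categories = series.cat.categories.to_numpy()

    # One vectorized lookup; note that a code of -1 wraps to the last category here.
    values = categories.take(codes)

    # Restore missing values, which categoricals encode as code -1.
    missing = codes == -1
    if missing.any():
        values = values.astype(object)
        values[missing] = np.nan

    return pd.Series(values, index=series.index, name=series.name)


s = pd.Series(["a", "b", None, "a"], dtype="category")
decoded = decode(s)
```

With this, `for col in df.columns: df[col] = decode(df[col])` would convert the example frame back column by column.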

I was working with ~10 categories (some of them longer strings) on a dataset with 20 million rows, where the difference was even bigger (unfortunately I can't reproduce it with dummy data). There astype took minutes, which felt more like a bug than a mere performance issue.

Given the current limitations on exporting categorical data, having a fast decode method would be very convenient. Since the categories are most often strings, an optional parameter for encoding them directly into a given character set would also be good to have for such a method.

%timeit for col in df.columns: df[col].astype('unicode').str.encode('latin1')  
1 loops, best of 3: 3.95 s per loop
%timeit for col in df.columns: cats=pd.Series(df[col].cat.categories).str.encode('latin1'); cats[df[col].cat.codes]                                                                  
10 loops, best of 3: 74.5 ms per loop   
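
The speedup above comes from encoding each distinct category once rather than once per row. A sketch of a decode helper with an optional encoding step might look like the following. The name `decode_encoded` and its `encoding` parameter are hypothetical, and the sketch assumes string categories and at least one category.

```python
import numpy as np
import pandas as pd


def decode_encoded(series: pd.Series, encoding: str) -> pd.Series:
    """Hypothetical helper: decode a categorical Series to encoded bytes."""
    codes = series.cat.codes.to_numpy()

    # Encode every distinct category exactly once.
    encoded = np.array(
        [cat.encode(encoding) for cat in series.cat.categories], dtype=object
    )

    # Look up the encoded value for each row; take() returns a new array.
    values = encoded.take(codes)

    # Code -1 marks a missing value; do not let it wrap to the last category.
    values[codes == -1] = None

    return pd.Series(values, index=series.index, name=series.name)


s = pd.Series(["a", "é", None, "a"], dtype="category")
encoded_series = decode_encoded(s, "latin1")
```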
