Pandas get_dummies() and n-1 Categorical Encoding Option to avoid Collinearity?

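When encoding categorical variables for linear regression, perfect collinearity can be a problem: the n dummy columns produced for a variable with n levels always sum to 1, so together they duplicate the intercept (the "dummy variable trap"). The usual way around this is to keep only n-1 columns per variable. It would be useful if `pd.get_dummies()` had a boolean parameter that returns n-1 columns for each categorical column that gets encoded.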
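To make the collinearity concrete, here is a minimal sketch that compares the rank of a design matrix built from all n dummies with one built from n-1. The toy `account` series and the explicit intercept column are assumptions for illustration only, not part of the request.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with n = 3 levels.
account = pd.Series(
    ["Account1", "Account1", "Account2", "Account3", "Account2", "Account3"],
    name="Account",
)

# Full one-hot encoding: n columns that sum to 1 in every row.
full = pd.get_dummies(account).astype(float)

# Design matrices with an explicit intercept column of ones.
intercept = np.ones((len(account), 1))
X_full = np.hstack([intercept, full.to_numpy()])            # 1 + n columns
X_reduced = np.hstack([intercept, full.to_numpy()[:, 1:]])  # 1 + (n - 1) columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # 4 columns but rank 3: rank deficient
print(X_reduced.shape[1], rank_reduced)  # 3 columns and rank 3: full column rank
```

With all n dummies the intercept is an exact linear combination of the other columns, so the regression coefficients are not uniquely determined. Dropping one level per variable removes that redundancy.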

Example:

```
>>> df
    Account  Network      Device
0  Account1   Search  Smartphone
1  Account1  Display      Tablet
2  Account2   Search  Smartphone
3  Account3  Display  Smartphone
4  Account2   Search      Tablet
5  Account3   Search  Smartphone
```

```
>>> pd.get_dummies(df)
   Account_Account1  Account_Account2  Account_Account3  Network_Display  \
0                 1                 0                 0                0   
1                 1                 0                 0                1   
2                 0                 1                 0                0   
3                 0                 0                 1                1   
4                 0                 1                 0                0   
5                 0                 0                 1                0   

   Network_Search  Device_Smartphone  Device_Tablet  
0               1                  1              0  
1               0                  0              1  
2               1                  1              0  
3               0                  1              0  
4               1                  0              1  
5               1                  1              0 
```

Instead, I'd like `get_dummies()` to accept a parameter such as `drop_first=True` that does something like this:

```
>>> new_df = pd.DataFrame(index=df.index)
>>> for col in df:
...     # encode one column, then drop its first level to leave n-1 dummies
...     new_df = new_df.join(pd.get_dummies(df[col]).iloc[:, 1:])
...
>>> new_df
   Account2  Account3  Search  Tablet
0         0         0       1       0
1         0         0       0       1
2         1         0       1       0
3         0         1       0       0
4         1         0       1       1
5         0         1       1       0
```
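The loop above loses the `Account_` / `Network_` / `Device_` prefixes that `pd.get_dummies(df)` adds, and `join` would fail if two columns shared a level name. A rough sketch of the behaviour I have in mind, keeping the prefixes, could look like this. The helper name `dummies_drop_first` is hypothetical and only stands in for the proposed parameter.

```python
import pandas as pd


def dummies_drop_first(frame):
    """Hypothetical helper: encode each column and drop its first level."""
    pieces = []
    for col in frame.columns:
        # prefix=col yields names like "Network_Search", as pd.get_dummies(df) does
        encoded = pd.get_dummies(frame[col], prefix=col)
        # drop the first (reference) level, leaving n-1 columns
        pieces.append(encoded.iloc[:, 1:])
    return pd.concat(pieces, axis=1)


df = pd.DataFrame({
    "Account": ["Account1", "Account1", "Account2", "Account3", "Account2", "Account3"],
    "Network": ["Search", "Display", "Search", "Display", "Search", "Search"],
    "Device": ["Smartphone", "Tablet", "Smartphone", "Smartphone", "Tablet", "Smartphone"],
})

reduced = dummies_drop_first(df)
print(list(reduced.columns))
```

The dropped level of each variable (here `Account1`, `Display`, and `Smartphone`) becomes the reference category that the remaining coefficients are measured against.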

**Sources**
http://fastml.com/converting-categorical-data-into-numbers-with-pandas-and-scikit-learn/
http://stackoverflow.com/questions/31498390/how-to-get-pandas-get-dummies-to-emit-n-1-variables-to-avoid-co-lineraity
http://dss.princeton.edu/online_help/analysis/dummy_variables.htm


Pandas get_dummies() and n-1 Categorical Encoding Option to avoid Collinearity? #12042

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Pandas get_dummies() and n-1 Categorical Encoding Option to avoid Collinearity? #12042

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions