Adding DataFrameImputer #82

gsmafra · 2017-02-19T15:38:07Z

Code adapted from http://stackoverflow.com/questions/25239958/impute-categorical-missing-values-in-scikit-learn

dukebody · 2017-03-04T14:49:45Z

Hi @gsmafra. First of all, thanks for your contribution!

I think an imputer that fills a string column with its most frequent value is a cool feature. However, since sklearn-pandas lets you choose which transformers to apply to which columns, I believe it's better to keep the "most frequent string value imputer" separate from the traditional mean/median imputer, which is already implemented in sklearn.preprocessing.Imputer.

Could you refactor your code into a MostFrequentValueImputer class that does the following?

Takes a numpy array as input
Outputs a numpy array
For every column, imputes the NaNs with the most frequent value in that column
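A minimal sketch of what a class meeting these three requirements could look like, assuming pandas and numpy are available. The `fit`/`transform` layout, the `fill_` attribute, and the tie-breaking rule (take the first value that `mode()` returns) are illustrative assumptions, not the code that was eventually merged.

```python
import numpy as np
import pandas as pd


class MostFrequentValueImputer:
    """Hypothetical sketch: fill missing values column by column
    with that column's most frequent value."""

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=object)
        self.fill_ = []
        for j in range(X.shape[1]):
            # mode() ignores missing values by default
            modes = pd.Series(X[:, j]).mode()
            if modes.shape[0] == 0:
                raise ValueError("Column {} has no non-missing values".format(j))
            # Assumption: on ties, keep the first mode pandas returns
            self.fill_.append(modes.iloc[0])
        return self

    def transform(self, X):
        # np.array copies, so the caller's array is left untouched
        X = np.array(X, dtype=object)
        for j, fill in enumerate(self.fill_):
            mask = pd.isnull(X[:, j])
            X[mask, j] = fill
        return X


X = np.array(
    [["a", 1.0], ["b", np.nan], [np.nan, 1.0], ["a", 2.0]],
    dtype=object,
)
imputed = MostFrequentValueImputer().fit(X).transform(X)
```

Here `imputed` is a new object array of the same shape as `X`, with the missing string replaced by "a" and the missing number by 1.0.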

Thanks!

gsmafra · 2017-03-04T16:33:55Z

Hi @dukebody, thanks for your answer

Sure, I can do that, but wouldn't it make more sense to add this functionality directly to scikit-learn, if that is what we want to do?

Also, sklearn.preprocessing.Imputer already has a most_frequent option, but it doesn't accept strings in the input, so the best name would probably be StringImputer or CategoricalImputer, don't you agree?

dukebody · 2017-04-08T16:01:23Z

sklearn_pandas/categorical_imputer.py

+
+        """
+
+        self.fill = pd.Series(X).mode().values[0]


This implicitly assumes that there is only one mode value. Can you raise an explicit exception if this is not true? Something like:

    modes = pd.Series(X).mode()
    if modes.shape[0] == 0:
        raise ValueError('No value is repeated more than once in the column')
    elif modes.shape[0] > 1:
        raise ValueError("Column has multiple modes {}, can't select one to fill".format(modes.tolist()))
    else:
        self.fill = modes[0]
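To make the concern concrete, here is a small self-contained sketch contrasting the one-liner from the diff with an explicit check. The helper name `strict_mode` and the sample data are hypothetical. Which inputs yield an empty `mode()` result depends on the pandas version, so the first error message is worded generically.

```python
import pandas as pd


def strict_mode(values):
    """Return the single mode of `values`; raise if there is none or several.

    Hypothetical helper for illustration, not part of sklearn-pandas.
    """
    modes = pd.Series(values).mode()
    if modes.shape[0] == 0:
        raise ValueError("Column has no mode")
    if modes.shape[0] > 1:
        raise ValueError(
            "Column has multiple modes {}, can't select one to fill".format(
                modes.tolist()
            )
        )
    return modes.iloc[0]


tied = ["a", "b", "a", "b", None]

# The one-liner silently keeps whichever tied value pandas lists first.
silent_fill = pd.Series(tied).mode().values[0]

# The explicit check refuses to choose between tied values.
try:
    strict_mode(tied)
    raised = False
except ValueError:
    raised = True

# With a single most frequent value, the helper simply returns it.
unique_fill = strict_mode(["a", "b", "a", None])
```

Whether raising is preferable to a documented tie-breaking rule is a design choice; the reviewer's suggestion favors failing loudly.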

dukebody · 2017-04-08T18:27:48Z

I merged a rebase of your work in b40328c.

Thanks for your contribution! And sorry for the delay.

dukebody · 2017-04-08T18:28:16Z

If you want, you can submit another PR to address what I mentioned in the review comment.

gsmafra added 2 commits February 19, 2017 12:35

adding imputer

90dfb58

rm cache directory

c448b6a

gsmafra added 2 commits March 4, 2017 16:56

replacing dataframe imputer by categorical imputer

cb4b412

rm cache

0e24660

dukebody reviewed Apr 8, 2017

dukebody closed this Apr 8, 2017

dukebody mentioned this pull request Apr 17, 2017

CategoricalImputer enhancements. #87

Closed
