Skip to content

Adding DataFrameImputer #82

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed

Conversation

gsmafra
Copy link
Contributor

@gsmafra gsmafra commented Feb 19, 2017

@dukebody
Copy link
Collaborator

dukebody commented Mar 4, 2017

Hi @gsmafra. First of all, thanks for your contribution!

I think having an imputer for a string column that imputes with the most frequent value is a cool feature. However, given that sklearn-pandas allows you to select to which columns apply which transformers, I believe it's better to have the "most frequent string value imputer" separated from the traditional mean/median imputer, already implemented in sklearn.preprocessing.Imputer.

Could you refactor your code into a MostFrequentValueImputer class that does the following?

  • Takes a numpy array as input
  • Outputs a numpy array
  • For every column, imputes the NaNs with the most frequent value in that column

Thanks!

@gsmafra
Copy link
Contributor Author

gsmafra commented Mar 4, 2017

Hi @dukebody, thanks for your answer

Sure I can do that, but wouldn't it make more sense to add this functionality directly on scikit-learn if that is what we want to do?

Also, sklearn.preprocessing.Imputer already has a most_frequent option, but it doesn't accept strings in the input, so the best name would probably be StringImputer or CategoricalImputer, don't you agree?


"""

self.fill = pd.Series(X).mode().values[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is implicitly assuming that there will be only one mode value. Can you raise an explicit exception if this is not true? Something like:

modes = pd.Series(X).mode()
if modes.shape[0] == 0:
    raise ValueError('No value is repeteated more than twice in the column')
elif modes.shape[0] > 1:
    raise ValueError('Column has multiple modes {}, can't select one to fill'.format(modes.tolist())
else:
    self.fill = modes[0]

@dukebody
Copy link
Collaborator

dukebody commented Apr 8, 2017

I merged a rebase of your work here in b40328c

Thanks for your contribution! And sorry for the delay.

@dukebody dukebody closed this Apr 8, 2017
@dukebody
Copy link
Collaborator

dukebody commented Apr 8, 2017

If you want you can submit another PR to improve what I mentioned in the comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants