Commit 5c0b88d

gsmafra authored and dukebody committed
Add CategoricalImputer working with string columns
It replaces the null-like values with the mode of the column and works with string-like columns (object dtype in pandas).
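As a quick illustration of the rule this commit implements (replace null-like values with the column mode), here is a minimal sketch in plain pandas; `impute_mode` is a hypothetical helper for illustration, not part of the commit:

```python
import numpy as np
import pandas as pd

# Hypothetical helper sketching the imputation rule described above:
# replace null-like values with the most frequent value (the mode).
def impute_mode(values):
    s = pd.Series(values)
    fill = s.mode().values[0]   # most frequent value in the column
    s[s.isnull()] = fill
    return s.values

print(impute_mode(np.array(['a', 'b', 'b', np.nan], dtype=object)))
# ['a' 'b' 'b' 'b']
```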
1 parent 2fc6286 commit 5c0b88d

File tree

5 files changed: +149 −2 lines changed

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -2,4 +2,5 @@
 *.pyc
 .tox/
 build/
-dist/
+dist/
+.cache/

README.rst

Lines changed: 21 additions & 1 deletion
@@ -8,6 +8,7 @@ In particular, it provides:
 
 1. A way to map ``DataFrame`` columns to transformations, which are later recombined into features.
 2. A compatibility shim for old ``scikit-learn`` versions to cross-validate a pipeline that takes a pandas ``DataFrame`` as input. This is only needed for ``scikit-learn<0.16.0`` (see `#11 <https://github.com/paulgb/sklearn-pandas/issues/11>`__ for details). It is deprecated and will likely be dropped in ``sklearn-pandas==2.0``.
+3. A ``CategoricalImputer`` that replaces null-like values with the mode and works with string columns.
 
 Installation
 ------------
@@ -249,7 +250,7 @@ Working with sparse features
 The stacking of the sparse features is done without ever densifying them.
 
 Cross-Validation
-----------------
+****************
 
 Now that we can combine features from pandas DataFrames, we may want to use cross-validation to see whether our model works. ``scikit-learn<0.16.0`` provided features for cross-validation, but they expect numpy data structures and won't work with ``DataFrameMapper``.
 
@@ -263,13 +264,31 @@ To get around this, sklearn-pandas provides a wrapper on sklearn's ``cross_val_s
 
 Sklearn-pandas' ``cross_val_score`` function provides exactly the same interface as sklearn's function of the same name.
 
+``CategoricalImputer``
+**********************
+
+Since the ``scikit-learn`` ``Imputer`` transformer currently only works with
+numbers, ``sklearn-pandas`` provides an equivalent helper transformer that does
+work with strings, substituting null values with the most frequent value in
+that column.
+
+Example:
+
+>>> from sklearn_pandas import CategoricalImputer
+>>> data = np.array(['a', 'b', 'b', np.nan], dtype=object)
+>>> imputer = CategoricalImputer()
+>>> imputer.fit_transform(data)
+array(['a', 'b', 'b', 'b'], dtype=object)
+
 
 Changelog
 ---------
 
 Development
 ***********
 * Capture output columns generated names in ``transformed_names_`` attribute (#78).
+* Add ``CategoricalImputer`` that replaces null-like values with the mode
+  for string-like columns.
 
 
 1.3.0 (2017-01-21)
@@ -324,6 +343,7 @@ Other contributors:
 
 * Arnau Gil Amat
 * Cal Paterson
+* Gustavo Sena Mafra
 * Israel Saeta Pérez
 * Jeremy Howard
 * Olivier Grisel

sklearn_pandas/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@
 
 from .dataframe_mapper import DataFrameMapper # NOQA
 from .cross_validation import cross_val_score, GridSearchCV, RandomizedSearchCV # NOQA
+from .categorical_imputer import CategoricalImputer

sklearn_pandas/categorical_imputer.py

Lines changed: 73 additions & 0 deletions
(new file)

"""
Impute missing values from a categorical/string np.ndarray or pd.Series
with the most frequent value on the training data.
"""

import pandas as pd
import numpy as np

from sklearn.base import TransformerMixin


class CategoricalImputer(TransformerMixin):
    """
    Attributes
    ----------
    fill : str
        Most frequent value of the training data.
    """

    def __init__(self):
        self.fill = None

    def fit(self, X):
        """
        Get the most frequent value.

        Parameters
        ----------
        X : np.ndarray or pd.Series
            Training data.

        Returns
        -------
        CategoricalImputer
            Itself.
        """
        self.fill = pd.Series(X).mode().values[0]
        return self

    def transform(self, X):
        """
        Replaces null values in the input data with the most frequent value
        of the training data.

        Parameters
        ----------
        X : np.ndarray or pd.Series
            Data with values to be imputed.

        Returns
        -------
        np.ndarray
            Data with imputed values.
        """
        X = X.copy()
        X[pd.isnull(X)] = self.fill
        return np.asarray(X)
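The fit/transform split above means the mode is learned once on the training column and then reused to impute unseen data. A self-contained sketch of that behaviour (``ModeImputer`` is a hypothetical stand-in for the class above, without the ``TransformerMixin`` base):

```python
import numpy as np
import pandas as pd

# Hypothetical minimal stand-in for CategoricalImputer:
# fit() learns the training mode, transform() reuses it on new data.
class ModeImputer(object):
    def fit(self, X):
        self.fill = pd.Series(X).mode().values[0]
        return self

    def transform(self, X):
        X = X.copy()
        X[pd.isnull(X)] = self.fill
        return np.asarray(X)

imp = ModeImputer().fit(np.array(['a', 'b', 'b', np.nan], dtype=object))
print(imp.fill)  # b
print(imp.transform(np.array([np.nan, 'a'], dtype=object)))
# ['b' 'a']
```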

tests/test_categorical_imputer.py

Lines changed: 52 additions & 0 deletions
(new file)

import pytest

import numpy as np
import pandas as pd

from sklearn_pandas import CategoricalImputer
from sklearn_pandas import DataFrameMapper


@pytest.mark.parametrize('none_value', [None, np.nan])
@pytest.mark.parametrize('input_type', ['np', 'pd'])
def test_unit(input_type, none_value):

    data = ['a', 'b', 'b', none_value]

    if input_type == 'pd':
        X = pd.Series(data)
    else:
        X = np.asarray(data)

    Xc = X.copy()

    Xt = CategoricalImputer().fit_transform(X)

    assert (np.asarray(X) == np.asarray(Xc)).all()
    assert type(Xt) == np.ndarray
    assert len(X) == len(Xt)
    assert len(Xt[pd.isnull(Xt)]) == 0


@pytest.mark.parametrize('none_value', [None, np.nan])
def test_integration(none_value):

    df = pd.DataFrame({'cat': ['a', 'a', 'a', none_value, 'b'],
                       'num': [1, 2, 3, 4, 5]})

    mapper = DataFrameMapper([
        ('cat', CategoricalImputer()),
        ('num', None)
    ], df_out=True).fit(df)

    df_t = mapper.transform(df)

    assert pd.notnull(df_t).all().all()

    val_idx = pd.notnull(df['cat'])
    nan_idx = ~val_idx

    assert (df['num'] == df_t['num']).all()
    assert (df['cat'][val_idx] == df_t['cat'][val_idx]).all()
    assert (df_t['cat'][nan_idx] == df['cat'].mode().values[0]).all()
