Commit 80062d9

Merge branch 'master' into issue772
2 parents fc2cde9 + 8a42dfc commit 80062d9

146 files changed, +4664 -1484 lines changed

.pre-commit-config.yaml

+4
@@ -20,3 +20,7 @@ repos:
 - id: mypy
 files: sklearn/
 additional_dependencies: [pytest==6.2.4]
+- repo: https://github.com/PyCQA/isort
+rev: 5.10.1
+hooks:
+- id: isort

README.rst

+4 -89
@@ -30,7 +30,7 @@
 .. |PythonMinVersion| replace:: 3.8
 .. |NumPyMinVersion| replace:: 1.17.3
 .. |SciPyMinVersion| replace:: 1.3.2
-.. |ScikitLearnMinVersion| replace:: 1.1.0
+.. |ScikitLearnMinVersion| replace:: 1.0.2
 .. |MatplotlibMinVersion| replace:: 3.1.2
 .. |PandasMinVersion| replace:: 1.0.5
 .. |TensorflowMinVersion| replace:: 2.4.3
@@ -154,92 +154,7 @@ One way of addressing this issue is by re-sampling the dataset as to offset this
 imbalance with the hope of arriving at a more robust and fair decision boundary
 than you would otherwise.

-Re-sampling techniques are divided in two categories:
-1. Under-sampling the majority class(es).
-2. Over-sampling the minority class.
-3. Combining over- and under-sampling.
-4. Create ensemble balanced sets.
-
-Below is a list of the methods currently implemented in this module.
-
-* Under-sampling
-1. Random majority under-sampling with replacement
-2. Extraction of majority-minority Tomek links [1]_
-3. Under-sampling with Cluster Centroids
-4. NearMiss-(1 & 2 & 3) [2]_
-5. Condensed Nearest Neighbour [3]_
-6. One-Sided Selection [4]_
-7. Neighboorhood Cleaning Rule [5]_
-8. Edited Nearest Neighbours [6]_
-9. Instance Hardness Threshold [7]_
-10. Repeated Edited Nearest Neighbours [14]_
-11. AllKNN [14]_
-
-* Over-sampling
-1. Random minority over-sampling with replacement
-2. SMOTE - Synthetic Minority Over-sampling Technique [8]_
-3. SMOTENC - SMOTE for Nominal and Continuous [8]_
-4. SMOTEN - SMOTE for Nominal [8]_
-5. bSMOTE(1 & 2) - Borderline SMOTE of types 1 and 2 [9]_
-6. SVM SMOTE - Support Vectors SMOTE [10]_
-7. ADASYN - Adaptive synthetic sampling approach for imbalanced learning [15]_
-8. KMeans-SMOTE [17]_
-9. ROSE - Random OverSampling Examples [19]_
-
-* Over-sampling followed by under-sampling
-1. SMOTE + Tomek links [12]_
-2. SMOTE + ENN [11]_
-
-* Ensemble classifier using samplers internally
-1. Easy Ensemble classifier [13]_
-2. Balanced Random Forest [16]_
-3. Balanced Bagging
-4. RUSBoost [18]_
-
-* Mini-batch resampling for Keras and Tensorflow
-
-The different algorithms are presented in the sphinx-gallery_.
-
-.. _sphinx-gallery: https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/index.html
-
-
-References:
------------
-
-.. [1] : I. Tomek, “Two modifications of CNN,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6, pp. 769-772, 1976.
-
-.. [2] : I. Mani, J. Zhang. “kNN approach to unbalanced data distributions: A case study involving information extraction,” In Proceedings of the Workshop on Learning from Imbalanced Data Sets, pp. 1-7, 2003.
-
-.. [3] : P. E. Hart, “The condensed nearest neighbor rule,” IEEE Transactions on Information Theory, vol. 14(3), pp. 515-516, 1968.
-
-.. [4] : M. Kubat, S. Matwin, “Addressing the curse of imbalanced training sets: One-sided selection,” In Proceedings of the 14th International Conference on Machine Learning, vol. 97, pp. 179-186, 1997.
-
-.. [5] : J. Laurikkala, “Improving identification of difficult small classes by balancing class distribution,” Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe, pp. 63-66, 2001.
-
-.. [6] : D. Wilson, “Asymptotic Properties of Nearest Neighbor Rules Using Edited Data,” IEEE Transactions on Systems, Man, and Cybernetrics, vol. 2(3), pp. 408-421, 1972.
-
-.. [7] : M. R. Smith, T. Martinez, C. Giraud-Carrier, “An instance level analysis of data complexity,” Machine learning, vol. 95(2), pp. 225-256, 2014.
-
-.. [8] : N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
-
-.. [9] : H. Han, W.-Y. Wang, B.-H. Mao, “Borderline-SMOTE: A new over-sampling method in imbalanced data sets learning,” In Proceedings of the 1st International Conference on Intelligent Computing, pp. 878-887, 2005.
-
-.. [10] : H. M. Nguyen, E. W. Cooper, K. Kamei, “Borderline over-sampling for imbalanced data classification,” In Proceedings of the 5th International Workshop on computational Intelligence and Applications, pp. 24-29, 2009.
-
-.. [11] : G. E. A. P. A. Batista, R. C. Prati, M. C. Monard, “A study of the behavior of several methods for balancing machine learning training data,” ACM Sigkdd Explorations Newsletter, vol. 6(1), pp. 20-29, 2004.
-
-.. [12] : G. E. A. P. A. Batista, A. L. C. Bazzan, M. C. Monard, “Balancing training data for automated annotation of keywords: A case study,” In Proceedings of the 2nd Brazilian Workshop on Bioinformatics, pp. 10-18, 2003.
-
-.. [13] : X.-Y. Liu, J. Wu and Z.-H. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 39(2), pp. 539-550, 2009.
-
-.. [14] : I. Tomek, “An experiment with the edited nearest-neighbor rule,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 6(6), pp. 448-452, 1976.
-
-.. [15] : H. He, Y. Bai, E. A. Garcia, S. Li, “ADASYN: Adaptive synthetic sampling approach for imbalanced learning,” In Proceedings of the 5th IEEE International Joint Conference on Neural Networks, pp. 1322-1328, 2008.
-
-.. [16] : C. Chao, A. Liaw, and L. Breiman. "Using random forest to learn imbalanced data." University of California, Berkeley 110 (2004): 1-12.
-
-.. [17] : Felix Last, Georgios Douzas, Fernando Bacao, "Oversampling for Imbalanced Learning Based on K-Means and SMOTE"
-
-.. [18] : Seiffert, C., Khoshgoftaar, T. M., Van Hulse, J., & Napolitano, A. "RUSBoost: A hybrid approach to alleviating class imbalance." IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 40.1 (2010): 185-197.
+You can refer to the `imbalanced-learn`_ documentation to find details about
+the implemented algorithms.

-.. [19] : Menardi, G., Torelli, N.: "Training and assessing classification rules with unbalanced data", Data Mining and Knowledge Discovery, 28, (2014): 92–122
+.. _imbalanced-learn: https://imbalanced-learn.org/stable/user_guide.html

azure-pipelines.yml

+26 -14
@@ -45,13 +45,16 @@ jobs:
 versionSpec: '3.9'
 - bash: |
 # Include pytest compatibility with mypy
-pip install pytest flake8 mypy==0.782 black==22.3
+pip install pytest flake8 mypy==0.782 black==22.3 isort
 displayName: Install linters
 - bash: |
 black --check --diff .
 displayName: Run black
 - bash: |
-./build_tools/circle/linting.sh
+isort --check --diff .
+displayName: Run isort
+- bash: |
+./build_tools/azure/linting.sh
 displayName: Run linting
 - bash: |
 mypy imblearn/
@@ -102,8 +105,8 @@ jobs:
 # Check compilation with Ubuntu bionic 18.04 LTS and scipy from conda-forge
 - template: build_tools/azure/posix.yml
 parameters:
-name: Ubuntu_Bionic
-vmImage: ubuntu-18.04
+name: Ubuntu_Jammy_Jellyfish
+vmImage: ubuntu-22.04
 dependsOn: [git_commit, linting]
 condition: |
 and(
@@ -112,7 +115,7 @@ jobs:
 ne(variables['Build.Reason'], 'Schedule')
 )
 matrix:
-py37_conda_forge_openblas_ubuntu_1804:
+py38_conda_forge_openblas_ubuntu_1804:
 DISTRIB: 'conda'
 CONDA_CHANNEL: 'conda-forge'
 PYTHON_VERSION: '3.8'
@@ -141,12 +144,12 @@ jobs:
 THREADPOOLCTL_VERSION: 'min'
 COVERAGE: 'false'
 # Linux + Python 3.8 build with OpenBLAS and without SITE_JOBLIB
-py37_conda_defaults_openblas:
+py38_conda_defaults_openblas:
 DISTRIB: 'conda'
 CONDA_CHANNEL: 'conda-forge'
 PYTHON_VERSION: '3.8'
 BLAS: 'openblas'
-NUMPY_VERSION: '1.19.5' # we cannot get an older version of the dependencies resolution
+NUMPY_VERSION: '1.21.0' # we cannot get an older version of the dependencies resolution
 SCIPY_VERSION: 'min'
 SKLEARN_VERSION: 'min'
 MATPLOTLIB_VERSION: 'none'
@@ -155,10 +158,18 @@ jobs:
 # Linux environment to test the latest available dependencies and MKL.
 pylatest_pip_openblas_pandas:
 DISTRIB: 'conda-pip-latest'
-PYTHON_VERSION: '3.9'
+PYTHON_VERSION: '*'
 TEST_DOCS: 'true'
 TEST_DOCSTRINGS: 'true'
 CHECK_WARNINGS: 'true'
+# Test the intermediate version of scikit-learn
+pylatest_pip_openblas_sklearn_intermediate:
+DISTRIB: 'conda-pip-latest'
+PYTHON_VERSION: '3.10'
+TEST_DOCS: 'true'
+TEST_DOCSTRINGS: 'true'
+CHECK_WARNINGS: 'false'
+SKLEARN_VERSION: '1.1.3'
 pylatest_pip_tensorflow:
 DISTRIB: 'conda-pip-latest-tensorflow'
 CONDA_CHANNEL: 'conda-forge'
@@ -178,11 +189,13 @@ jobs:
 DISTRIB: 'conda-minimum-tensorflow'
 CONDA_CHANNEL: 'conda-forge'
 PYTHON_VERSION: '3.8'
+NUMPY_VERSION: '1.19.5' # This version is the minimum requrired by tensorflow
+SCIPY_VERSION: 'min'
 SKLEARN_VERSION: 'min'
 TENSORFLOW_VERSION: 'min'
 TEST_DOCS: 'true'
 TEST_DOCSTRINGS: 'false' # it is going to fail because of scikit-learn inheritance
-CHECK_WARNINGS: 'true'
+CHECK_WARNINGS: 'false' # in case the older version raise some FutureWarnings
 pylatest_pip_keras:
 DISTRIB: 'conda-pip-latest-keras'
 CONDA_CHANNEL: 'conda-forge'
@@ -202,11 +215,13 @@ jobs:
 DISTRIB: 'conda-minimum-keras'
 CONDA_CHANNEL: 'conda-forge'
 PYTHON_VERSION: '3.8'
+NUMPY_VERSION: '1.19.5' # This version is the minimum requrired by tensorflow
+SCIPY_VERSION: 'min'
 SKLEARN_VERSION: 'min'
 KERAS_VERSION: 'min'
 TEST_DOCS: 'true'
 TEST_DOCSTRINGS: 'false' # it is going to fail because of scikit-learn inheritance
-CHECK_WARNINGS: 'true'
+CHECK_WARNINGS: 'false' # in case the older version raise some FutureWarnings

 # Currently runs on Python 3.8 while only Python 3.7 available
 # - template: build_tools/azure/posix-docker.yml
@@ -233,7 +248,7 @@ jobs:
 - template: build_tools/azure/posix.yml
 parameters:
 name: macOS
-vmImage: macOS-10.15
+vmImage: macOS-11
 dependsOn: [linting, git_commit]
 condition: |
 and(
@@ -275,6 +290,3 @@ jobs:
 PYTHON_ARCH: '64'
 PYTEST_VERSION: '*'
 COVERAGE: 'true'
-py38_pip_openblas_32bit:
-PYTHON_VERSION: '3.8'
-PYTHON_ARCH: '32'

build_tools/azure/install.sh

+2 -1
@@ -67,7 +67,8 @@ elif [[ "$DISTRIB" == "conda-pip-latest" ]]; then
 make_conda "python=$PYTHON_VERSION"
 python -m pip install -U pip

-python -m pip install scikit-learn pandas matplotlib
+python -m pip install pandas matplotlib
+python -m pip install scikit-learn

 elif [[ "$DISTRIB" == "conda-pip-latest-tensorflow" ]]; then
 make_conda "python=$PYTHON_VERSION"

build_tools/azure/linting.sh

+43
@@ -0,0 +1,43 @@
+#!/bin/bash
+
+set -e
+# pipefail is necessary to propagate exit codes
+set -o pipefail
+
+flake8 --show-source .
+echo -e "No problem detected by flake8\n"
+
+# For docstrings and warnings of deprecated attributes to be rendered
+# properly, the property decorator must come before the deprecated decorator
+# (else they are treated as functions)
+
+# do not error when grep -B1 "@property" finds nothing
+set +e
+bad_deprecation_property_order=`git grep -A 10 "@property" -- "*.py" | awk '/@property/,/def /' | grep -B1 "@deprecated"`
+
+if [ ! -z "$bad_deprecation_property_order" ]
+then
+echo "property decorator should come before deprecated decorator"
+echo "found the following occurrencies:"
+echo $bad_deprecation_property_order
+exit 1
+fi
+
+# Check for default doctest directives ELLIPSIS and NORMALIZE_WHITESPACE
+
+doctest_directive="$(git grep -nw -E "# doctest\: \+(ELLIPSIS|NORMALIZE_WHITESPACE)")"
+
+if [ ! -z "$doctest_directive" ]
+then
+echo "ELLIPSIS and NORMALIZE_WHITESPACE doctest directives are enabled by default, but were found in:"
+echo "$doctest_directive"
+exit 1
+fi
+
+joblib_import="$(git grep -l -A 10 -E "joblib import.+delayed" -- "*.py" ":!sklearn/utils/_joblib.py" ":!sklearn/utils/fixes.py")"
+
+if [ ! -z "$joblib_import" ]; then
+echo "Use from sklearn.utils.fixes import delayed instead of joblib delayed. The following files contains imports to joblib.delayed:"
+echo "$joblib_import"
+exit 1
+fi
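
Reviewer note: the decorator-order check in this script is compact but not easy to read as a grep/awk pipeline. The following is a rough Python sketch of the same idea, not code from the repository. `find_bad_decorator_order` and the two sample strings are hypothetical, and the function only approximates the shell pipeline: it flags any `@deprecated` line that appears between a `@property` line and the next `def`, and it does not reproduce the 10-line window of `git grep -A 10`.

```python
def find_bad_decorator_order(source: str) -> list:
    """Return 1-based line numbers of `@deprecated` lines that sit
    between a `@property` line and the following `def` line."""
    bad_lines = []
    inside_property_block = False
    for lineno, line in enumerate(source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("@property"):
            # Start of a range, like the awk pattern /@property/,/def /
            inside_property_block = True
        elif inside_property_block and stripped.startswith("@deprecated"):
            bad_lines.append(lineno)
        elif stripped.startswith("def "):
            # End of the range
            inside_property_block = False
    return bad_lines


# Hypothetical samples: the first order passes the check, the second is flagged.
GOOD = '''
class A:
    @deprecated("use b instead")
    @property
    def a(self):
        return 1
'''

BAD = '''
class B:
    @property
    @deprecated("use b instead")
    def a(self):
        return 1
'''

print(find_bad_decorator_order(GOOD))
print(find_bad_decorator_order(BAD))
```

The sketch works on a single string for clarity; the shell script instead scans every tracked `*.py` file through `git grep`.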

build_tools/azure/posix-docker.yml

+1
@@ -30,6 +30,7 @@ jobs:
 THREADPOOLCTL_VERSION: 'latest'
 COVERAGE: 'false'
 TEST_DOCSTRINGS: 'false'
+CHECK_WARNINGS: 'false'
 BLAS: 'openblas'
 # Set in azure-pipelines.yml
 DISTRIB: ''

build_tools/azure/posix.yml

+1
@@ -36,6 +36,7 @@ jobs:
 COVERAGE: 'true'
 TEST_DOCS: 'false'
 TEST_DOCSTRINGS: 'false'
+CHECK_WARNINGS: 'false'
 SHOW_SHORT_SUMMARY: 'false'
 strategy:
 matrix:

build_tools/azure/test_script.sh

+1 -1
@@ -34,7 +34,7 @@ if [[ "$COVERAGE" == "true" ]]; then
 TEST_CMD="$TEST_CMD --cov-config='$COVERAGE_PROCESS_START' --cov imblearn --cov-report="
 fi

-if [[ -n "$CHECK_WARNINGS" ]]; then
+if [[ "$CHECK_WARNINGS" == "true" ]]; then
 # numpy's 1.19.0's tostring() deprecation is ignored until scipy and joblib removes its usage
 TEST_CMD="$TEST_CMD -Werror::DeprecationWarning -Werror::FutureWarning -Wignore:tostring:DeprecationWarning"
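
Reviewer note: this one-line change matters because the job templates touched in this commit (`posix.yml`, `posix-docker.yml`, `windows.yml`) now define `CHECK_WARNINGS: 'false'` by default. In bash, `-n` only tests that a string is non-empty, so the old condition would also be satisfied by the literal string `false`. The Python sketch below mirrors the two conditions; both helper names are hypothetical and exist only for illustration.

```python
def enabled_if_non_empty(value: str) -> bool:
    """Mirror of bash `[[ -n "$VAR" ]]`: true for any non-empty string."""
    return len(value) > 0


def enabled_if_true(value: str) -> bool:
    """Mirror of bash `[[ "$VAR" == "true" ]]`: true only for the literal 'true'."""
    return value == "true"


# Compare both conditions for the values the pipeline can plausibly set.
for value in ("true", "false", ""):
    print(repr(value), enabled_if_non_empty(value), enabled_if_true(value))
```

With the old condition, setting the variable to `'false'` would still turn warnings into errors; with the new one, only `'true'` does.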

build_tools/azure/windows.yml

+1
@@ -21,6 +21,7 @@ jobs:
 PYTEST_XDIST_VERSION: 'latest'
 TEST_DIR: '$(Agent.WorkFolder)/tmp_folder'
 CPU_COUNT: '2'
+CHECK_WARNINGS: 'false'
 strategy:
 matrix:
 ${{ insert }}: ${{ parameters.matrix }}

conftest.py

+1
@@ -6,6 +6,7 @@
 # rather than the one from site-packages.

 import os
+
 import pytest


doc/common_pitfalls.rst

+17 -1
@@ -130,8 +130,24 @@ cross-validation::
 ...     f"{cv_results['test_score'].std():.3f}"
 ... )
 Balanced accuracy mean +/- std. dev.: 0.724 +/- 0.042
+
+The cross-validation performance looks good, but evaluating the classifiers
+on the left-out data shows a different picture::

-We see that the statistical performance are worse than in the previous case.
+>>> scores = []
+>>> for fold_id, cv_model in enumerate(cv_results["estimator"]):
+...     scores.append(
+...         balanced_accuracy_score(
+...             y_left_out, cv_model.predict(X_left_out)
+...         )
+...     )
+>>> print(
+...     f"Balanced accuracy mean +/- std. dev.: "
+...     f"{np.mean(scores):.3f} +/- {np.std(scores):.3f}"
+... )
+Balanced accuracy mean +/- std. dev.: 0.698 +/- 0.014
+
+We see that the performance is now worse than the cross-validated performance.
 Indeed, the data leakage gave us too optimistic results due to the reason
 stated earlier in this section.
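
Reviewer note: the added doctest reports the mean and standard deviation of balanced accuracy across the cross-validation estimators. As a reminder of what is being aggregated, here is a small standard-library sketch. It assumes the usual definition of balanced accuracy as the average of per-class recall, and the labels and predictions are made-up toy values, not data from the documentation. `pstdev` is used because `np.std` computes the population standard deviation by default.

```python
from statistics import mean, pstdev


def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall (toy re-implementation for illustration)."""
    recalls = []
    for label in sorted(set(y_true)):
        indices = [i for i, y in enumerate(y_true) if y == label]
        hits = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(hits / len(indices))
    return mean(recalls)


# Toy left-out labels (8 majority, 2 minority) and predictions from two
# hypothetical fold models.
y_left_out = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
fold_predictions = [
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
]

scores = [balanced_accuracy(y_left_out, pred) for pred in fold_predictions]
print(
    "Balanced accuracy mean +/- std. dev.: "
    f"{mean(scores):.3f} +/- {pstdev(scores):.3f}"
)
```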

doc/conf.py

+1 -6
@@ -15,8 +15,8 @@
 import os
 import sys
 from datetime import datetime
-from pathlib import Path
 from io import StringIO
+from pathlib import Path

 # If extensions (or modules to document with autodoc) are in another directory,
 # add these directories to sys.path here. If the directory is relative to the
@@ -82,11 +82,6 @@
 # The name of the Pygments (syntax highlighting) style to use.
 pygments_style = "sphinx"

-# -- Options for math equations -----------------------------------------------
-
-extensions.append("sphinx.ext.imgmath")
-imgmath_image_format = "svg"
-
 # -- Options for HTML output ----------------------------------------------

 # The theme to use for HTML and HTML Help pages. See the documentation for
