Imbalanced Sampling - Under-sampling - Over-sampling - SMOTE - Combination - Ensemble - Model - Feature

Imbalanced Sampling

Various sampling methods for imbalanced data: under-sampling, over-sampling, SMOTE balanced sampling, combination sampling, ensemble sampling, model selection, and feature selection.

Reference:

Summary of imblearn algorithms for imbalanced data

Under-sampling

Random under-sampling

class imblearn.under_sampling.RandomUnderSampler(*, sampling_strategy='auto', random_state=None, replacement=False)

The example below is the simplest form of under-sampling; there are of course many other under-sampling methods, see: Under-sampling methods

from collections import Counter
from sklearn.datasets import make_classification

# Toy imbalanced dataset: roughly 10% minority class vs 90% majority class
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=2, n_redundant=0, flip_y=0,
                           n_features=2, n_clusters_per_class=1, n_samples=100, random_state=10)

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
print(Counter(y_resampled))
print(Counter(y))
ClusterCentroids (cluster-based under-sampling)

Basic usage

class imblearn.under_sampling.ClusterCentroids(*, sampling_strategy='auto', random_state=None, estimator=None, voting='auto', n_jobs='deprecated')

from imblearn.under_sampling import ClusterCentroids
cc = ClusterCentroids(random_state=0)
X_resampled, y_resampled = cc.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))

Main parameters: sampling_strategy='auto', random_state=None, estimator=None, voting='auto'

sampling_strategy is a ratio:

sampling_strategy: float, str, dict, callable, default='auto'

Sampling information to sample the data set.

  • When float, it corresponds to the desired ratio of the number of samples in the minority class over the number of samples in the majority class after resampling. Therefore, the ratio is expressed as \alpha_{us} = N_{m} / N_{rM} where N_{m} is the number of samples in the minority class and N_{rM} is the number of samples in the majority class after resampling.

estimator: estimator object, default=None

Pass a KMeans estimator. By default, it will be a default KMeans estimator.

voting: {'hard', 'soft', 'auto'}, default='auto'

Voting strategy to generate the new samples:

  • If 'hard', the nearest-neighbors of the centroids found using the clustering algorithm will be used.
  • If 'soft', the centroids found by the clustering algorithm will be used.
  • If 'auto', 'hard' will be used when the input is sparse; otherwise, 'soft' will be used.
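For illustration, a minimal sketch (assuming the X, y created in the example above) that requests a 1:2 minority-to-majority ratio after resampling and uses hard voting:

# Hedged sketch: sampling_strategy=0.5 asks for N_minority / N_majority = 0.5 after resampling;
# voting='hard' replaces each centroid by its nearest real sample
from imblearn.under_sampling import ClusterCentroids
cc_hard = ClusterCentroids(sampling_strategy=0.5, voting='hard', random_state=0)
X_resampled, y_resampled = cc_hard.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))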
NearMiss

class imblearn.under_sampling.NearMiss(*, sampling_strategy=’auto’, version=1, n_neighbors=3, n_neighbors_ver3=3, n_jobs=None)

Basic usage

from imblearn.under_sampling import NearMiss

# A larger toy dataset: 20 features, 1000 samples, 10% / 90% class split
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))

nm = NearMiss()
X_res, y_res = nm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

Key parameters: version=1, n_neighbors=3, n_neighbors_ver3=3

version: int, default=1

Version of the NearMiss to use. Possible values are 1, 2 or 3.

n_neighbors:int or estimator object, default=3

If int, size of the neighbourhood to consider to compute the average distance to the minority point samples. If object, an estimator that inherits from KNeighborsMixin that will be used to find the k_neighbors. By default, it will be a 3-NN.

n_neighbors_ver3: int or estimator object, default=3

If int, the NearMiss-3 algorithm starts with a re-sampling phase. This parameter corresponds to the number of neighbours selected to create the subset in which the selection is performed. If object, an estimator that inherits from KNeighborsMixin that will be used to find the k_neighbors. By default, it will be a 3-NN.

Regarding the NearMiss versions (see the official documentation), a brief explanation:

  1. NearMiss-1
    Select the majority-class samples whose average distance to their three nearest minority-class samples is smallest

  2. NearMiss-2
    Select the majority-class samples whose average distance to their three farthest minority-class samples is smallest

  3. NearMiss-3
    For each minority-class sample, select a given number of the nearest majority-class samples, so that every minority sample is surrounded by some majority samples


Note: experimental results suggest that NearMiss-2 generally gives the best performance for imbalanced classification.
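As a small illustration (assuming the X, y from the example above), selecting NearMiss-2 only requires setting version:

# Hedged sketch: version=2 uses the average distance to the three farthest minority samples
nm2 = NearMiss(version=2, n_neighbors=3)
X_res2, y_res2 = nm2.fit_resample(X, y)
print('NearMiss-2 resampled shape %s' % Counter(y_res2))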

TomekLinks

class imblearn.under_sampling.TomekLinks(*, sampling_strategy='auto', n_jobs=None)

Basic usage

from imblearn.under_sampling import TomekLinks

tl = TomekLinks()
X_res, y_res = tl.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

TomekLinks detects the so-called Tomek’s links [Tom76b]. A Tomek’s link between two samples of different class x and y is defined such that for any sample z:

d(x,y) < d(x, z) \text{ and } d(x, y) < d(y, z)

where d(.) is the distance between the two samples. In other words, a Tomek's link exists if the two samples are each other's nearest neighbours. In the figure below, a Tomek's link is illustrated by highlighting the samples of interest in green.

[Figure: illustration of a Tomek's link]

Key parameters:

sampling_strategy: str, list or callable

Sampling information to sample the data set.

  • When str, specify the class targeted by the resampling. Note that the number of samples will not be equal in each class. Possible choices are:

    'majority': resample only the majority class;

    'not minority': resample all classes but the minority class;

    'not majority': resample all classes but the majority class;

    'all': resample all classes;

    'auto': equivalent to 'not minority'.

  • When list, the list contains the classes targeted by the resampling.

  • When callable, function taking y and returns a dict. The keys correspond to the targeted classes. The values correspond to the desired number of samples for each class.

The parameter sampling_strategy controls which sample of the link will be removed. For instance, the default (i.e., sampling_strategy='auto') will remove the sample from the majority class. Both samples from the majority and minority class can be removed by setting sampling_strategy to 'all'. The figure illustrates this behaviour.

[Figure: samples removed from Tomek's links with sampling_strategy='auto' vs 'all']
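A minimal sketch of that option (assuming the X, y defined earlier), removing both sides of every Tomek's link:

# Hedged sketch: sampling_strategy='all' drops the majority and the minority sample of each link
tl_all = TomekLinks(sampling_strategy='all')
X_res_all, y_res_all = tl_all.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res_all))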

EditedNearestNeighbours

Because random under-sampling does not take the distribution of the samples into account, the sampling is highly random and may delete informative majority-class samples. To address this shortcoming, Wilson et al. proposed the edited nearest neighbour (ENN) rule.

Basic idea: delete every sample whose class differs from that of two or more of its three nearest neighbours.
Drawback: because most majority-class samples are surrounded by other majority-class samples, the number of majority samples this method can actually remove is quite limited.

Basic usage

from imblearn.under_sampling import EditedNearestNeighbours 
enn = EditedNearestNeighbours()
X_res, y_res = enn.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

Key parameters:

n_neighbors=3

kind_sel: {'all', 'mode'}, default='all'

Strategy to use in order to exclude samples.

  • If 'all', all neighbours will have to agree with the samples of interest to not be excluded. / all neighbours must be of the same class
  • If 'mode', the majority vote of the neighbours will be used in order to exclude a sample. / a majority of the neighbours must be of the same class

The strategy "all" will be less conservative than 'mode'. Thus, more samples will be removed when kind_sel="all" generally.

RepeatedEditedNearestNeighbours

RepeatedEditedNearestNeighbours extends EditedNearestNeighbours by repeating the algorithm multiple times [Tom76a]. Generally, repeating the algorithm will delete more data:

Key parameter: max_iter=100

from imblearn.under_sampling import RepeatedEditedNearestNeighbours
renn = RepeatedEditedNearestNeighbours()
X_resampled, y_resampled = renn.fit_resample(X, y)
print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 208), (2, 4551)]
AllKNN

This method applies ENN several times, varying the number of nearest neighbours.

AllKNN differs from the previous RepeatedEditedNearestNeighbours since the number of neighbors of the internal nearest neighbors algorithm is increased at each iteration [Tom76a]

CondensedNearestNeighbour

CondensedNearestNeighbour iterates with a 1-nearest-neighbour rule to decide whether a sample should be kept or removed; a minimal usage sketch follows the list. The steps are:

  1. Set C: all minority-class samples;
  2. Add one majority-class sample (the class to be under-sampled) to set C, and put the remaining samples of that class into set S;
  3. Train a 1-NN classifier on set C and use it to classify the samples in set S;
  4. Add the misclassified samples of set S to set C;
  5. Repeat until no more samples are added to set C.
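A minimal usage sketch (assuming the X, y from the earlier make_classification examples):

# Hedged sketch: CondensedNearestNeighbour keeps only the majority samples a 1-NN rule needs
from imblearn.under_sampling import CondensedNearestNeighbour
cnn = CondensedNearestNeighbour(random_state=0)
X_res, y_res = cnn.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))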
NeighbourhoodCleaningRule

Basic usage

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000, random_state=10)
print('Original dataset shape %s' % Counter(y))
# Original dataset shape Counter({1: 900, 0: 100})
ncr = NeighbourhoodCleaningRule()
X_res, y_res = ncr.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
# Resampled dataset shape Counter({1: 877, 0: 100})
InstanceHardnessThreshold

Basic usage:

InstanceHardnessThreshold is a specific algorithm in which a classifier is trained on the data and the samples with lower probabilities are removed [SMGC14]. The class can be used as:

>>> from sklearn.linear_model import LogisticRegression
>>> from imblearn.under_sampling import InstanceHardnessThreshold
>>> iht = InstanceHardnessThreshold(random_state=0,
... estimator=LogisticRegression(
... solver='lbfgs', multi_class='auto'))
>>> X_resampled, y_resampled = iht.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 64), (1, 64), (2, 64)]

This class has 2 important parameters. estimator will accept any scikit-learn classifier which has a method predict_proba. The classifier training is performed using cross-validation; the parameter cv sets the number of folds to use.

Key parameters:

estimator: estimator object, default=None

Classifier to be used to estimate instance hardness of the samples. By default a RandomForestClassifier will be used. If str, the choices using a string are the following: 'knn', 'decision-tree', 'random-forest', 'adaboost', 'gradient-boosting' and 'linear-svm'. If object, an estimator inherited from ClassifierMixin and having an attribute predict_proba.

cv: int, default=5

Number of folds to be used when estimating samples’ instance hardness.
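For instance, a small sketch (assuming the X, y used above) relying on the default RandomForestClassifier estimator but with 10 cross-validation folds:

# Hedged sketch: more folds give a smoother estimate of each sample's instance hardness
iht_cv10 = InstanceHardnessThreshold(cv=10, random_state=0)
X_res10, y_res10 = iht_cv10.fit_resample(X, y)
print(sorted(Counter(y_res10).items()))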

Determining the instance-hardness threshold

Over-sampling

class imblearn.over_sampling.RandomOverSampler(*, sampling_strategy='auto', random_state=None, shrinkage=None)

The example below is the simplest form of over-sampling; there are of course many other over-sampling methods, see: Over-sampling methods

from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(Counter(y_resampled))
print(Counter(y))

SMOTE

Official documentation: SMOTE

Naive SMOTE

class imblearn.over_sampling.SMOTE(*, sampling_strategy=’auto’, random_state=None, k_neighbors=5, n_jobs=None)

Paper: http://xueshu.baidu.com/usercenter/paper/show?paperid=28300870422e64fd0ac338860cd0010a&site=xueshu_se

SMOTE (Synthetic Minority Oversampling Technique) is an over-sampling algorithm that improves on random over-sampling.

  First, a sample x_i is chosen from the minority class. Then, according to the sampling ratio N, N samples x_zi are chosen at random from the K nearest neighbours of x_i. Finally, a new sample is synthesized between each x_zi and x_i with the following formula:

x_new = x_i + rand(0, 1) × (x_zi − x_i)

SMOTE is simple to implement, but its weaknesses are also obvious: because it treats all minority samples equally and ignores the class information of neighbouring samples, it often produces overlapping samples, which degrades classification performance.

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=2, n_redundant=0, flip_y=0,
                           n_features=2, n_clusters_per_class=1, n_samples=100, random_state=10)
print('Original dataset shape %s' % Counter(y))

sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

[Figure: comparison before and after SMOTE sampling]

Borderline SMOTE

Paper: http://xueshu.baidu.com/usercenter/paper/show?paperid=f95a4bd0843c4c6389cc878bc1d525a2&site=xueshu_se
Borderline SMOTE is an over-sampling algorithm improved on SMOTE: it synthesizes new samples only from the minority samples lying on the class border, which improves the class distribution.
Borderline SMOTE first divides the minority samples into three categories, Safe, Danger and Noise, described below; only the samples labelled Danger are then over-sampled.
Safe: more than half of the surrounding samples belong to the minority class, e.g. point A in the figure
Danger: more than half of the surrounding samples belong to the majority class; these are regarded as borderline samples, e.g. point B
Noise: all surrounding samples belong to the majority class; these are regarded as noise, e.g. point C

[Figure: Safe (A), Danger (B) and Noise (C) minority samples]

Borderline-SMOTE comes in two variants, Borderline-SMOTE1 and Borderline-SMOTE2. When generating new samples for a Danger point, Borderline-SMOTE1 interpolates towards a randomly chosen minority-class sample among the K nearest neighbours (as in SMOTE), whereas Borderline-SMOTE2 interpolates towards any of the k nearest neighbours regardless of its class.

Borderline-SMOTE usage in Python

class imblearn.over_sampling.BorderlineSMOTE(*, sampling_strategy='auto', random_state=None, k_neighbors=5, n_jobs=None, m_neighbors=10, kind='borderline-1')

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=2, n_redundant=0, flip_y=0,
                           n_features=2, n_clusters_per_class=1, n_samples=100, random_state=9)
print('Original dataset shape %s' % Counter(y))
sm = BorderlineSMOTE(random_state=42, kind="borderline-1")
X_res, y_res = sm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))
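As a hedged sketch, switching to the second variant only changes the kind argument:

# Sketch: 'borderline-2' interpolates from Danger points towards any neighbour, not only minority ones
sm2 = BorderlineSMOTE(random_state=42, kind="borderline-2")
X_res2, y_res2 = sm2.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res2))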


ADASYN

Paper: http://xueshu.baidu.com/usercenter/paper/show?paperid=13cbcaf6a33e0e3df06c0c0c421209d0&site=xueshu_se
  ADASYN (adaptive synthetic sampling), similar to Borderline SMOTE, assigns different weights to different minority samples and therefore generates a different number of synthetic samples for each of them. The procedure is roughly as follows:

  1. Compute the number of synthetic samples to generate, G = (m_l − m_s) × β, where m_l and m_s are the sizes of the majority and minority classes and β ∈ [0, 1] controls the desired balance level.
  2. For each minority sample x_i, find its K nearest neighbours and compute r_i = Δ_i / K, where Δ_i is the number of majority samples among those neighbours; normalize the r_i so that they sum to 1.
  3. Generate g_i = r̂_i × G synthetic samples for each x_i by SMOTE-style interpolation between x_i and randomly chosen minority neighbours.

ADASYN usage in Python

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN
X, y = make_classification(n_classes=2, class_sep=2,
                           weights=[0.1, 0.9], n_informative=3, n_redundant=1, flip_y=0,
                           n_features=20, n_clusters_per_class=1, n_samples=1000,
                           random_state=10)
print('Original dataset shape %s' % Counter(y))
ada = ADASYN(random_state=42)
X_res, y_res = ada.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

[Figure: comparison before and after ADASYN sampling]

SMOTENC

class imblearn.over_sampling.SMOTENC(categorical_features, *, sampling_strategy=’auto’, random_state=None, k_neighbors=5, n_jobs=None)

Because SMOTE and ADASYN rely on distances when over-sampling, they cannot be used directly when x is heterogeneous, i.e. contains categorical variables (e.g. 0 for male, 1 for female), since Euclidean distance is not meaningful for such variables. The SMOTENC variant handles categorical features by giving the new sample the most frequent category value among its K nearest neighbours; the column indices of the categorical features must be specified in advance.

# When handling mixed data, none of the samplers above except RandomOverSampler work
# (they rely on distances), but the following algorithm does: SMOTENC
from imblearn.over_sampling import SMOTENC
# Suppose, for example, that the features at indices 0 and 2 are categorical
smote_nc = SMOTENC(categorical_features=[0, 2], random_state=0)
X_resampled, y_resampled = smote_nc.fit_resample(X, y)

In addition, there is SMOTEN, designed for data consisting entirely of categorical features: "Over-sample using the SMOTE variant specifically for categorical features only."
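A minimal SMOTEN sketch on a purely categorical toy column (illustrative data, not from this post):

import numpy as np
from imblearn.over_sampling import SMOTEN
# Hypothetical toy data: one categorical feature, 10 minority vs 90 majority samples
X_cat = np.array(["red"] * 10 + ["blue"] * 40 + ["green"] * 50, dtype=object).reshape(-1, 1)
y_cat = np.array([0] * 10 + [1] * 90)
X_res_cat, y_res_cat = SMOTEN(random_state=0).fit_resample(X_cat, y_cat)
print(Counter(y_res_cat))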

SVMSMOTE

class imblearn.over_sampling.SVMSMOTE(*, sampling_strategy=’auto’, random_state=None, k_neighbors=5, n_jobs=None, m_neighbors=10, svm_estimator=None, out_step=0.5)

Uses an SVM classifier to find the support vectors and takes them into account when generating new samples. The SVM C parameter determines how many support vectors are selected.

Variant of the SMOTE algorithm which uses an SVM algorithm to detect the samples to use for generating new synthetic samples, as proposed in [2].

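A hedged usage sketch (assuming the X, y from the earlier examples; the C value is only illustrative):

from imblearn.over_sampling import SVMSMOTE
from sklearn.svm import SVC
# A larger C keeps a harder margin, so fewer support vectors are selected near the boundary
svmsm = SVMSMOTE(random_state=42, svm_estimator=SVC(C=1.0))
X_res_svm, y_res_svm = svmsm.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res_svm))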

KMeansSMOTE

class imblearn.over_sampling.KMeansSMOTE(*, sampling_strategy=’auto’, random_state=None, k_neighbors=2, n_jobs=None, kmeans_estimator=None, cluster_balance_threshold=’auto’, density_exponent=’auto’)

Apply KMeans clustering before over-sampling with SMOTE.

This is an implementation of the algorithm described in [1].

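A hedged sketch (assuming the 1000-sample X, y from earlier; the kmeans_estimator and cluster_balance_threshold values are illustrative and may need tuning, since KMeansSMOTE raises an error when no cluster contains enough minority samples):

from imblearn.over_sampling import KMeansSMOTE
from sklearn.cluster import MiniBatchKMeans
kms = KMeansSMOTE(random_state=42,
                  kmeans_estimator=MiniBatchKMeans(n_clusters=10, random_state=42),
                  cluster_balance_threshold=0.1)
X_res_km, y_res_km = kms.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res_km))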

Combination of over- and under-sampling

We previously presented SMOTE and showed that this method can generate noisy samples by interpolating new points between marginal outliers and inliers. This issue can be solved by cleaning the space resulting from over-sampling.

In this regard, Tomek's link and edited nearest-neighbours are the two cleaning methods that have been added to the pipeline after applying SMOTE over-sampling to obtain a cleaner space. The two ready-to-use classes imbalanced-learn implements for combining over- and under-sampling methods are: (i) SMOTETomek [BPM04] and (ii) SMOTEENN [BBM03].

Basic usage

>>> from collections import Counter
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=5000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94],
... class_sep=0.8, random_state=0)
>>> print(sorted(Counter(y).items()))
[(0, 64), (1, 262), (2, 4674)]
>>> from imblearn.combine import SMOTEENN
>>> smote_enn = SMOTEENN(random_state=0)
>>> X_resampled, y_resampled = smote_enn.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4060), (1, 4381), (2, 3502)]
>>> from imblearn.combine import SMOTETomek
>>> smote_tomek = SMOTETomek(random_state=0)
>>> X_resampled, y_resampled = smote_tomek.fit_resample(X, y)
>>> print(sorted(Counter(y_resampled).items()))
[(0, 4499), (1, 4566), (2, 4413)]

Key parameters:

smote: sampler object, default=None

The SMOTE object to use. If not given, a SMOTE object with default parameters will be given.

enn: sampler object, default=None

The EditedNearestNeighbours object to use. If not given, an EditedNearestNeighbours object with sampling_strategy='all' will be given.

tomek: sampler object, default=None

The TomekLinks object to use. If not given, a TomekLinks object with sampling_strategy='all' will be given.
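A small sketch showing how these parameters can be set explicitly instead of relying on the defaults (assuming the X, y from the combined-sampling example above):

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
# Hedged sketch: configure the over- and under-sampling steps by hand
smote_enn = SMOTEENN(smote=SMOTE(k_neighbors=5, random_state=0),
                     enn=EditedNearestNeighbours(sampling_strategy='all'),
                     random_state=0)
X_res_c, y_res_c = smote_enn.fit_resample(X, y)
print(sorted(Counter(y_res_c).items()))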

Ensemble resampling

BalancedBaggingClassifier

In ensemble classifiers, bagging methods build several estimators on different randomly selected subsets of data. In scikit-learn, this classifier is named BaggingClassifier. However, this classifier does not allow each subset of data to be balanced. Therefore, when trained on an imbalanced data set, this classifier will favour the majority classes:

>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=10000, n_features=2, n_informative=2,
... n_redundant=0, n_repeated=0, n_classes=3,
... n_clusters_per_class=1,
... weights=[0.01, 0.05, 0.94], class_sep=0.8,
... random_state=0)
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import balanced_accuracy_score
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.tree import DecisionTreeClassifier
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> bc = BaggingClassifier(base_estimator=DecisionTreeClassifier(),
... random_state=0)
>>> bc.fit(X_train, y_train)
BaggingClassifier(...)
>>> y_pred = bc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.77...

In BalancedBaggingClassifier, each bootstrap sample will be further resampled to achieve the sampling_strategy desired. Therefore, BalancedBaggingClassifier takes the same parameters as the scikit-learn BaggingClassifier. In addition, the sampling is controlled by the parameter sampler, or by the two parameters sampling_strategy and replacement if one wants to use the RandomUnderSampler:

>>> from imblearn.ensemble import BalancedBaggingClassifier
>>> bbc = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
... sampling_strategy='auto',
... replacement=False,
... random_state=0)
>>> bbc.fit(X_train, y_train)
BalancedBaggingClassifier(...)
>>> y_pred = bbc.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.8...

Changing the sampler will give rise to different known implementations [MO97], [HKT09], [WY09]. You can refer to the following example, which shows these different methods in practice: Bagging classifiers using sampler
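For instance, a hedged sketch that plugs a RandomUnderSampler into the ensemble via the sampler parameter (reusing X_train and y_train from above; mirrors the call pattern shown earlier):

>>> from imblearn.under_sampling import RandomUnderSampler
>>> bbc_rus = BalancedBaggingClassifier(base_estimator=DecisionTreeClassifier(),
...                                     sampler=RandomUnderSampler(),
...                                     random_state=0)
>>> bbc_rus.fit(X_train, y_train)
BalancedBaggingClassifier(...)
>>> y_pred = bbc_rus.predict(X_test)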

BalancedRandomForestClassifier

BalancedRandomForestClassifier is another ensemble method in which each tree of the forest will be provided a balanced bootstrap sample [CLB+04]. This class provides all functionality of the RandomForestClassifier:

>>> from imblearn.ensemble import BalancedRandomForestClassifier
>>> brf = BalancedRandomForestClassifier(n_estimators=100, random_state=0)
>>> brf.fit(X_train, y_train)
BalancedRandomForestClassifier(...)
>>> y_pred = brf.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.8...

RUSBoostClassifier

Several methods taking advantage of boosting have been designed.

RUSBoostClassifier randomly under-samples the dataset before performing each boosting iteration [SKVHN09]:

>>> from imblearn.ensemble import RUSBoostClassifier
>>> rusboost = RUSBoostClassifier(n_estimators=200, algorithm='SAMME.R',
... random_state=0)
>>> rusboost.fit(X_train, y_train)
RUSBoostClassifier(...)
>>> y_pred = rusboost.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0...

Bagging - AdaBoostClassifier

A specific method which uses AdaBoostClassifier as learners in the bagging classifier is called “EasyEnsemble”.

The EasyEnsembleClassifier allows bagging AdaBoost learners that are trained on balanced bootstrap samples [LWZ08]. Similarly to the BalancedBaggingClassifier API, one can construct the ensemble as:

>>> from imblearn.ensemble import EasyEnsembleClassifier
>>> eec = EasyEnsembleClassifier(random_state=0)
>>> eec.fit(X_train, y_train)
EasyEnsembleClassifier(...)
>>> y_pred = eec.predict(X_test)
>>> balanced_accuracy_score(y_test, y_pred)
0.6...

Model selection

Skipping this for now…

Feature selection

Skipping this one too…
