Imbalanced datasets are very common in real life, for example detecting illegitimate users or screening for a disease. Without specific handling, machine learning models tend to perform badly on them, especially in the prediction accuracy of the minority class. For example, if the data is highly imbalanced, say 9995 negative vs. 5 positive, a model that simply labels every instance negative reaches an accuracy of 99.95%, yet the result is meaningless. Another issue is that misclassifying the minority can be very costly: imagine classifying a sick patient as healthy.
For this problem, two kinds of weighting mechanisms are commonly provided: `class_weight` and `sample_weight` (the names used by scikit-learn). For `class_weight`, you can specify the weights for the different classes yourself, such as `{0: 0.1, 1: 0.9}`, or set it to `balanced`, in which case the weights are computed as \(\frac{\#samples}{\#classes \times np.bincount(y)}\). With `fit(sample_weight=)`, you give every instance its own weight. When computing the loss for an instance, the effective term is `class_weight * sample_weight * loss`.
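A minimal sketch of both knobs, assuming scikit-learn (the `LogisticRegression` estimator, the toy `make_classification` dataset, and the 10x factor are illustrative choices, not from the original post):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced dataset: roughly 99% negative, 1% positive (illustrative only).
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# Option 1: explicit per-class weights, as in the {0: 0.1, 1: 0.9} example above.
clf = LogisticRegression(class_weight={0: 0.1, 1: 0.9}).fit(X, y)

# Option 2: class_weight='balanced' computes
# n_samples / (n_classes * np.bincount(y)) automatically.
clf = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 3: per-instance weights passed to fit(); here minority samples
# simply count 10x as much in the loss (an arbitrary illustrative factor).
w = np.where(y == 1, 10.0, 1.0)
clf = LogisticRegression().fit(X, y, sample_weight=w)
```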
The other family of fixes is re-sampling. Over-sampling means enlarging the minority class. The classic algorithm here is SMOTE: first find the `k_neighbors` of a minority sample \(x_i\) within the minority class, then select one neighbor \(x_{zi}\) at random and produce a new sample by interpolation, \(x_{new} = x_i + \lambda (x_{zi} - x_i)\) with \(\lambda \in [0, 1]\). There are some variants such as Borderline SMOTE, SVM SMOTE and KMeans SMOTE.
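A minimal sketch of plain SMOTE, assuming the imbalanced-learn package (`imblearn`); the toy dataset is again illustrative:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)
print(Counter(y))  # heavily skewed toward class 0

# For each minority sample x_i, SMOTE picks one of its k_neighbors x_zi at
# random and adds x_i + lambda * (x_zi - x_i) until the classes are balanced.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))  # both classes now the size of the original majority
```

The variants mentioned above live in the same module as `BorderlineSMOTE`, `SVMSMOTE` and `KMeansSMOTE`.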
Under-sampling goes the other way: it shrinks the majority class.
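A minimal sketch, again assuming `imblearn`; its `RandomUnderSampler`, which just discards random majority samples, is the simplest option:

```python
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01], random_state=0)

# Randomly discard majority samples until both classes are the same size.
X_res, y_res = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(Counter(y_res))  # both classes now the size of the original minority
```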
Original article: Dealing with Imbalanced Datasets, https://www.cnblogs.com/EIMadrigal/p/14738860.html