
Nuno Moniz
nuno.moniz@fc.up.pt
A balanced distribution
An imbalanced distribution
There are two main problems when learning with imbalanced domains
1. How to learn?
2. How to evaluate?
Standard Evaluation
Non-Standard Evaluation
UBL (R Package)
Imagine the following scenario: each case in a data set is classified as either Fraud or Normal. The vast majority of cases are Normal, and only 10 cases are classified as Fraud.
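To see why this is problematic, note that a model which simply predicts Normal for every case achieves near-perfect accuracy while catching no fraud at all. A minimal sketch of this accuracy paradox, using hypothetical counts of 990 Normal and 10 Fraud cases (the totals are assumptions for illustration):

# illustrative sketch with made-up counts: a "model" that always predicts the majority class
truth <- factor(c(rep("Normal", 990), rep("Fraud", 10)), levels = c("Normal", "Fraud"))
preds <- factor(rep("Normal", 1000), levels = c("Normal", "Fraud"))
mean(preds == truth)   # accuracy of 0.99...
table(truth, preds)    # ...yet every single Fraud case is missed

High accuracy here says nothing about the cases we actually care about, which is exactly the evaluation problem raised above.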
However, learning algorithms make choices: they have assumptions.
The most hazardous one for imbalanced domain learning concerns what they optimise:
Instead of "It's all about the bass", it's in fact all about the mean/mode. Standard algorithms optimise average behaviour, so the majority concept dominates what they learn.
Remember this?
# installing UBL from CRAN
install.packages("UBL")

# installing UBL from GitHub
library(devtools)
# stable release
install_github("paobranco/UBL", ref = "master")
# development release
install_github("paobranco/UBL", ref = "develop")
library(UBL)
We will use the iris data set. The iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936. We consider setosa as being the rare class, and the other classes as being the normal class.
library(UBL)
# generating an artificially imbalanced data set
data(iris)
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa","rare","common"))
# checking the class distribution of this artificial data set
table(data$Species)
##
## common rare
## 100 50
# using a percentage provided by the user to perform undersampling
datU <- RandUnderClassif(Species ~ ., data, C.perc = list(common=0.4))
table(datU$Species)
##
## common rare
## 40 50
# automatically balancing the data distribution
datB <- RandUnderClassif(Species ~ ., data, C.perc = "balance")
table(datB$Species)
##
## common rare
## 50 50
# using a percentage provided by the user to perform oversampling
datO <- RandOverClassif(Species ~ ., data, C.perc = list(rare=3))
table(datO$Species)
##
## common rare
## 100 150
# automatically balancing the data distribution
datB <- RandOverClassif(Species ~ ., data, C.perc = "balance")
table(datB$Species)
##
## common rare
## 100 100
# using SMOTE to oversample only the rare class
datSM1 <- SmoteClassif(Species ~ ., data, C.perc = list(common=1,rare=6))
table(datSM1$Species)
##
## common rare
## 100 300
# user defined percentages for both undersample and oversample
datSM2 <- SmoteClassif(Species~., data, C.perc=list(common=0.2, rare=2))
table(datSM2$Species)
##
## common rare
## 20 100
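With UBL, the resampled data frame is used exactly like the original one: it is simply passed to whatever learner you prefer, and the effect is then judged on the rare class rather than on overall accuracy. A minimal sketch of that workflow follows (rpart is an assumption here, and evaluating on the training data is only to keep the example short; this toy problem is not a benchmark):

# sketch: train on the original and on the SMOTEd data, then compare recall on the rare class
library(rpart)
m.orig  <- rpart(Species ~ ., data)
m.smote <- rpart(Species ~ ., datSM2)
recall.rare <- function(model, newdata) {
  p <- predict(model, newdata, type = "class")
  sum(p == "rare" & newdata$Species == "rare") / sum(newdata$Species == "rare")
}
recall.rare(m.orig, data)
recall.rare(m.smote, data)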
For the regression examples, we will use the algae data set from the package DMwR, with a7 as the target variable.
# loading the algae data set
library(DMwR)   # note: if DMwR is unavailable on CRAN, the DMwR2 package should also provide algae and knnImputation()
data(algae)
algae <- knnImputation(algae)   # impute missing values with k-nearest neighbours
# checking the density distribution of the target variable
plot(density(algae$a7))
Ribeiro, R. (2011). Utility-based Regression. PhD thesis, Department of Computer Science, Faculty of Sciences, University of Porto.
# visualizing the extreme (rare) values of a7 with an adjusted boxplot
library(robustbase)   # assumption: adjbox() comes from the robustbase package
adjbox(algae$a7, horizontal = TRUE)
library(IRon)         # utilities for imbalanced regression, e.g. the relevance function phi() and the SERA metric
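The relevance function from Ribeiro's thesis maps each target value to an importance score in [0, 1], and it is what the regression variants of UBL use, together with a threshold (0.5 by default), to decide which values are rare. As a sketch of how to inspect it directly, assuming IRon exports phi() and phi.control() roughly with the signatures used below:

# sketch (assumed IRon API): compute and plot the relevance of each a7 value
ph  <- phi.control(algae$a7)            # automatic relevance parameters (boxplot-based extremes)
rel <- phi(algae$a7, ph)                # relevance in [0, 1] for each observation
plot(algae$a7, rel, xlab = "a7", ylab = "relevance")
abline(h = 0.5, lty = 2)                # the default relevance threshold used below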
# using the automatic method for defining the relevance function and the default threshold (0.5)
Alg.U <- RandUnderRegress(a7 ~ ., algae, C.perc = list(0.5)) # 50% undersample
Alg.UBal <- RandUnderRegress(a7 ~ ., algae, C.perc = "balance")
# comparing the original distribution of a7 with the undersampled versions
plot(density(algae$a7), main = "User-defined percentage")
lines(density(Alg.U$a7), col = 2)
plot(density(algae$a7), main = "Automatic balancing")
lines(density(Alg.UBal$a7), col = 2)
# using the automatic method for defining the relevance function and the default threshold (0.5)
Alg.O <- RandOverRegress(a7 ~ ., algae, C.perc = list(4.5))
Alg.OBal <- RandOverRegress(a7 ~ ., algae, C.perc = "balance")
# comparing the original distribution of a7 with the undersampled and oversampled versions
plot(density(algae$a7), main = "User-defined percentages")
lines(density(Alg.U$a7), col = 2)
lines(density(Alg.O$a7), col = 3)
plot(density(algae$a7), main = "Automatic balancing")
lines(density(Alg.UBal$a7), col = 2)
lines(density(Alg.OBal$a7), col = 3)
# we have two bumps: the first must be undersampled and the second oversampled.
# Thus, we can choose the following percentages:
thr.rel <- 0.8; C.perc <- list(0.2, 4)

# using these percentages and the relevance threshold of 0.8, with all other parameters at their default values
Alg.SM <- SmoteRegress(a7 ~ ., algae, thr.rel = thr.rel, C.perc = C.perc, dist = "HEOM")
# using the automatic method for obtaining a balanced data set
Alg.SMBal <- SmoteRegress(a7 ~ ., algae, thr.rel = thr.rel, C.perc = "balance", dist = "HEOM")

# comparing the original distribution of a7 with the undersampled, oversampled and SMOTEd versions
plot(density(algae$a7), main = "User-defined percentages")
lines(density(Alg.U$a7), col = 2)
lines(density(Alg.O$a7), col = 3)
lines(density(Alg.SM$a7), col = 4)
plot(density(algae$a7), main = "Automatic balancing")
lines(density(Alg.UBal$a7), col = 2)
lines(density(Alg.OBal$a7), col = 3)
lines(density(Alg.SMBal$a7), col = 4)
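As with classification, the resampled sets are meant to be handed to an ordinary regression learner and judged on the rare, extreme values of the target rather than on average error. The sketch below illustrates that workflow; rpart, the quantile-based definition of "rare", and evaluation on the training data are all simplifying assumptions for illustration (in practice, use a held-out set and a utility-based metric such as SERA from the IRon package):

# sketch: compare errors on the rare (high) values of a7 before and after SMOTE for regression
library(rpart)
m.orig <- rpart(a7 ~ ., algae)
m.sm   <- rpart(a7 ~ ., Alg.SMBal)
rare <- algae$a7 > quantile(algae$a7, 0.9)      # crude stand-in for the relevance threshold
mae  <- function(y, yhat) mean(abs(y - yhat))
mae(algae$a7[rare], predict(m.orig, algae)[rare])
mae(algae$a7[rare], predict(m.sm, algae)[rare])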
1. Machine Learning has a lot of faces and some of them are not pretty
Imbalanced Domain Learning is considered one of the most important topics for Machine Learning and Data Mining
There are many strategies for tackling this type of task, but each of them has its advantages and disadvantages
Solutions are domain-dependent
Remember: before you begin tackling any ML problem, investigate the domain and your objective.
1. Auto-Machine Learning and Imbalanced Domain Learning
Targeted resampling: reducing the variance of outcomes across resampling strategies
How to "force" a model to account for small concepts without sampling