Imbalanced Domain Learning

Fraud Detection Course - 2020/2021

Nuno Moniz
nuno.moniz@fc.up.pt

Today

  1. Beyond Standard ML
  2. Imbalanced Domain Learning
    • 2.1 Problem Formulation
    • 2.2 Evaluation/Learning
  3. Strategies for Imbalanced Domain Learning
  4. Practical Examples

Beyond Standard Machine Learning

Hey Model 1, all apples are red, yellow or green.

Hey Model 1, what's the colour of this apple?

Famous ML Mistakes #1

Famous ML Mistakes #2

Machine Learning, Predictive Modelling


  • The goal of predictive modelling is to obtain a good approximation for an unknown function
    • \(Y = f(x_1, x_2,\cdots)\)
  • What you need (the most basic):
    • A dataset
    • A target variable
    • Learning algorithm(s)
    • Evaluation metric(s)
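A minimal sketch of these four ingredients in R (assuming the rpart package, which is not part of these slides):

# a dataset and a target variable
data(iris)
# a learning algorithm: a classification tree
library(rpart)
model <- rpart(Species ~ ., iris)
# an evaluation metric: accuracy on the training data
preds <- predict(model, iris, type = "class")
mean(preds == iris$Species)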

Modelling Wally


  • Your task is to find Wally among 100 people
  • You train two models using the state-of-the-art in Deep Learning
    • Model 1 obtains 99% Accuracy
    • Model 2 obtains 1% Accuracy
  • Confusion Matrix?
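A minimal sketch of why accuracy misleads here (hypothetical data: 1 Wally among 100 people):

# ground truth: 1 Wally among 100 people
truth <- factor(c("wally", rep("not_wally", 99)))
# Model 1 never finds Wally; Model 2 sees Wally everywhere
pred1 <- factor(rep("not_wally", 100), levels = levels(truth))
pred2 <- factor(rep("wally", 100), levels = levels(truth))
mean(pred1 == truth)  # 0.99 accuracy - and Wally is never found
mean(pred2 == truth)  # 0.01 accuracy - and Wally is always found
table(truth, pred1)   # the confusion matrix exposes the problem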


Predicting Popular News


  • Your task is to anticipate the most popular news of the day
  • You train two models using the state-of-the-art Ensemble Learning techniques
    • Model 1 obtains 0.1 NMSE
    • Model 2 obtains 0.5 NMSE
  • Normalized Mean Squared Error: \(\frac{1}{n} \sum_{i=1}^{n} \frac{(\hat{y}_i - y_i)^2}{y_i^2}\)
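To see how an average metric can hide what matters here, consider a minimal sketch (hypothetical data: 99 ordinary articles and one viral article):

# NMSE as defined above (note: undefined when y_i = 0)
nmse <- function(y, yhat) mean((yhat - y)^2 / y^2)
# 99 ordinary articles (10 shares each) and 1 viral article (1000 shares)
y <- c(rep(10, 99), 1000)
# Model 1 treats every article as ordinary
m1 <- rep(10, 100)
# Model 2 is rougher on ordinary articles but anticipates the viral one
m2 <- c(rep(30, 99), 800)
nmse(y, m1)  # ~0.0098: the "better" score, yet it misses the popular news entirely
nmse(y, m2)  # ~3.96: the "worse" score, yet it anticipates the viral article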


Imbalanced Domain Learning

Imbalanced Domain Learning?


  • It's still predictive modelling, and as such...
  • "The goal of predictive modelling is to obtain a good approximation for an unknown function"
    • \(Y = f(x_1, x_2,\cdots)\)
  • Standard predictive modelling has some assumptions:
    • The distribution of the target variable is balanced
    • Users have uniform preferences: all errors were born equal
  • Assumptions in Imbalanced Domain Learning:
    • The distribution of the target variable is imbalanced
    • Users have non-uniform preferences: some cases are more important
    • The more important/relevant cases are those which are rare or extreme

Imbalanced Domain Learning - Nominal Target

[Plot: a balanced distribution]

[Plot: an imbalanced distribution]

Imbalanced Domain Learning - Numerical Target

[Plot: a balanced distribution]

[Plot: an imbalanced distribution]

Problems with Imbalanced Domains


There are two main problems when learning with imbalanced domains
1. How to learn?
2. How to evaluate?

How to learn?

  • Models are optimized to represent as much of the information in the data as possible
  • When the information is imbalanced, this means they will most likely represent the majority type of information to the detriment of the minority (rare/extreme cases)

How to evaluate?

  • Most well-known evaluation metrics focus on assessing the average behaviour of models
  • However, in many cases the evaluation objective is to understand whether a model is capable of predicting a certain class or subset of values, i.e. imbalanced domain learning

The Problem of Evaluation


  • This problem is different for classification and regression tasks
  • Imbalanced learning has been explored in classification for over 20 years, but its study in regression is much more recent
  • The main ideas when evaluating this type of problem are:
    • Remember that not all cases are equal
    • You're focused on the ability of models to predict rare cases
    • Missing the prediction of a rare case is worse than missing a normal case

The Problem of Evaluation


Classification

Standard Evaluation

  • Accuracy
  • Error Rate (complement)

Non-Standard Evaluation

  • F-Score
  • G-Mean
  • ROC Curves / AUC

Regression

Standard Evaluation

  • MSE
  • RMSE

Non-Standard Evaluation

  • MSE\(_\phi\)
  • RMSE\(_\phi\)
  • Utility-Based Metrics (UBL R Package)
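To see why the non-standard classification metrics matter, here is a minimal sketch with a hypothetical confusion matrix for a fraud-like task (990 normal cases, 10 frauds):

# hypothetical confusion matrix: the model catches 6 of 10 frauds,
# and falsely flags 20 of the 990 normal cases
TP <- 6; FN <- 4; FP <- 20; TN <- 970
(TP + TN) / (TP + TN + FP + FN)               # accuracy: 0.976 - looks great
precision <- TP / (TP + FP)                   # ~0.23
recall    <- TP / (TP + FN)                   # 0.6
2 * precision * recall / (precision + recall) # F-Score (F1): ~0.33
sqrt(recall * (TN / (TN + FP)))               # G-Mean: ~0.77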

The Problem of Learning


Imagine the following scenario:

  • You have a dataset of 10,000 cases of credit transactions classified as Fraud or Normal
  • This dataset has 9,990 cases classified as Normal, and only 10 cases classified as Fraud
  • Learning algorithms are not human beings: they're programmed to operate in a pre-determined way
  • This usually means that the problem they want to solve is: how can we accurately represent the data?

However, learning algorithms make choices - they have assumptions.

The most hazardous for imbalanced domain learning are:

  1. Assuming that all cases are equal
  2. Internal optimization/decisions based on standard metrics

The Problem of Learning


Instead of "It's all about the bass", it's in fact all about the mean/mode.

Remember this?

[Plots: the Wally and popular news examples revisited]

Until now


  1. There's more to machine learning than standard tasks
  2. Learning algorithms are biased, and
  3. Focused on reducing the average error, i.e. on representing the majority of cases
  4. Beware of standard evaluation metrics if your task is imbalanced domain learning

Strategies for Imbalanced Domain Learning

Strategies for Imbalanced Domain Learning


[Diagram: the three families of strategies - data pre-processing, special-purpose learning methods, prediction post-processing]

Data Pre-Processing


  • Goal: change the examples distribution before applying any learning algorithm
  • Advantages: any standard learning algorithm can then be used
  • Disadvantages
    • difficult to decide the optimal distribution (a perfect balance does not always provide the optimal results)
    • the strategies applied may severely increase/decrease the total number of examples


Special-purpose Learning Methods


  • Goal: change existing algorithms to provide a better fit to the imbalanced distribution
  • Advantages
    • very effective in the contexts for which they were designed
    • more comprehensible to the user
  • Disadvantages
    • difficult task because it requires a deep knowledge of both the algorithm and the domain
    • difficulty of using an already adapted method in a different learning system


Prediction Post-Processing


  • Goal: change the predictions after applying any learning algorithm
  • Advantages: any standard learning algorithm can be used
  • Disadvantages: potential loss of model interpretability
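One concrete post-processing technique (not detailed in these slides) is threshold moving; a minimal sketch with hypothetical predicted probabilities:

# hypothetical predicted probabilities for the rare class
p_rare <- c(0.90, 0.45, 0.30, 0.05)
# with the default 0.5 threshold, borderline rare cases are missed
ifelse(p_rare > 0.5, "rare", "normal")
# lowering the threshold trades false alarms for catching more rare cases
ifelse(p_rare > 0.25, "rare", "normal")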


Practical Examples


Practical Examples - Data Pre-Processing

  • Data Pre-Processing strategies are also known as Resampling Strategies
  • These are the most common strategies to tackle imbalanced domain learning tasks
  • We will look at practical examples for both classification and regression using:
    • Random Undersampling
    • Random Oversampling
    • SMOTE

Preliminaries in R


  • Install the package UBL from CRAN
install.packages("UBL")
  • Install UBL from GitHub
library(devtools)
# stable release
install_github("paobranco/UBL",ref="master")
# development release
install_github("paobranco/UBL",ref="develop")
  • After installation, the package can be used as any other R package
library(UBL)

Data for Practical Examples - Classification Tasks

  • We will use the well-known dataset iris. The iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936
  • The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
  • For the purpose of the practical examples, we will consider the class setosa as being the rare class, and the other classes as being the normal class
library(UBL)
# generating an artificially imbalanced data set
data(iris)
data <- iris[, c(1, 2, 5)]
data$Species <- factor(ifelse(data$Species == "setosa","rare","common"))
## checking the class distribution of this artificial data set
table(data$Species)
## 
## common   rare 
##    100     50

Random Undersampling


  • To force models to focus on the most important and least represented class(es), this technique randomly removes examples from the most represented, and therefore less important, class(es)
  • As such, the modified data set is smaller than the original one
  • The user must always be aware that, to obtain a more balanced data set, this strategy may discard useful data
  • Therefore, this strategy should be applied with caution, especially on smaller data sets


Random Undersampling


# using a percentage provided by the user to perform undersampling
datU <- RandUnderClassif(Species ~ ., data, C.perc = list(common=0.4))
table(datU$Species)
## 
## common   rare 
##     40     50
# automatically balancing the data distribution
datB <- RandUnderClassif(Species ~ ., data, C.perc = "balance")
table(datB$Species)
## 
## common   rare 
##     50     50

Random Undersampling

[Plots: data before and after random undersampling]

Random Oversampling


  • This strategy introduces replicas of randomly selected examples from the relevant classes of the data set, e.g. replicating fraud cases
  • This allows obtaining a more balanced data set without discarding any important examples
  • However, this method is highly prone to over-fitting (think of a data set with only red apples)
  • It also has a strong impact on the number of examples in the new data set, which can raise difficulties for the learning algorithm


Random Oversampling


# using a percentage provided by the user to perform oversampling
datO <- RandOverClassif(Species ~ ., data, C.perc = list(rare=3))
table(datO$Species)
## 
## common   rare 
##    100    150
# automatically balancing the data distribution
datB <- RandOverClassif(Species ~ ., data, C.perc = "balance")
table(datB$Species)
## 
## common   rare 
##    100    100

Random Oversampling

[Plots: data before and after random oversampling]

SMOTE


  • Synthetic Minority Over-Sampling Technique
  • The SMOTE algorithm is a strategy that performs oversampling via the generation of synthetic examples
  • A synthetic case of the minority class is generated via the interpolation of two minority class cases
  • A new minority class example is obtained from a seed example of that class and one of its \(k\) nearest neighbours, selected at random
  • Given these two examples, a new synthetic case is obtained by interpolating their features, as sketched below
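A minimal sketch of this interpolation step on two hypothetical minority-class feature vectors (illustrative only, not the UBL implementation):

# seed example and one of its k nearest minority-class neighbours
seed      <- c(Sepal.Length = 5.0, Sepal.Width = 3.4)
neighbour <- c(Sepal.Length = 4.6, Sepal.Width = 3.1)
# the synthetic case is a random point on the segment between the two
set.seed(42)
synthetic <- seed + runif(1) * (neighbour - seed)
synthetic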

SMOTE


# using SMOTE just to oversample the rare class
datSM1 <- SmoteClassif(Species ~ ., data, C.perc = list(common=1,rare=6))
table(datSM1$Species)
## 
## common   rare 
##    100    300
# user defined percentages for both undersample and oversample
datSM2 <- SmoteClassif(Species~., data, C.perc=list(common=0.2, rare=2))
table(datSM2$Species)
## 
## common   rare 
##     20    100

SMOTE

[Plots: data before and after SMOTE]

SMOTE

[Plots: additional SMOTE resampling results]

Data for Practical Examples - Regression Tasks


  • We will use the algae data set from the package DMwR
  • This data set contains observations on 11 variables as well as the concentration levels of 7 harmful algae. Values were measured in several European rivers
  • We will focus on the target variable a7
# loading the algae data set
library("DMwR")
data(algae)
algae <- knnImputation(algae)
# checking the density distribution of the data set
plot(density(algae$a7))

Extreme Values?


  • In classification tasks, it is easy to know if a distribution is imbalanced: class frequency
  • How can we do it in regression tasks?
  • Specifically, how can we determine the importance (or relevance) of a given value in a domain?
    • Which values should be considered extreme?

Relevance Functions

  • When dealing with imbalanced regression tasks the user should specify which are the important values
  • However, this can be a very difficult task: continuous domains are potentially infinite
  • Professor Rita Ribeiro (Ribeiro, 2011) proposed a framework for defining relevance functions of a given continuous target variable. This framework:
    • Includes an automatic method for obtaining a relevance function from the distribution of the target variable. It is based on boxplot rules (look up the adjusted boxplot) and assumes that the most important values for users are those considered extreme by such rules
    • It also allows users to manually specify which values are relevant and irrelevant via a matrix of control points (the user supplies a set of points, and the relevance of the remaining values is interpolated)

Ribeiro, R. (2011). Utility-based Regression. PhD thesis, Dep. of Computer Science, Faculty of Sciences, University of Porto
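In UBL, this framework is available through phi.control() (estimates the parameters of the relevance function) and phi() (evaluates it); a minimal sketch using the automatic, boxplot-based method (function names assumed from the UBL documentation):

# automatic relevance function for a7, derived from boxplot statistics
pc  <- phi.control(algae$a7, method = "extremes")
rel <- phi(algae$a7, pc)  # relevance of each observed value, in [0, 1]
# values flagged as extreme by the boxplot rule get relevance near 1
plot(sort(algae$a7), phi(sort(algae$a7), pc), type = "l",
     xlab = "a7", ylab = "relevance")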

Relevance Threshold

  • After obtaining a relevance function for a given target variable, we still have to define the range of relevance values that should be considered most important
# adjusted boxplot of the target variable (extreme values by the adjusted boxplot rule)
library(robustbase)
adjbox(algae$a7, horizontal = TRUE)

Random Undersampling


  • The Random Undersampling approach is simply based on the random removal of examples from the original data set
  • The examples removed are randomly selected from the subset of examples with less important ranges of the target variable, those with a relevance that is below the user-defined threshold
  • You need to define a relevance function, a relevance threshold and the percentage of undersampling to perform
# using the automatic method for defining the relevance function and the default threshold (0.5)
Alg.U <- RandUnderRegress(a7 ~ ., algae, C.perc = list(0.5)) # 50% undersample
Alg.UBal <- RandUnderRegress(a7 ~ ., algae, C.perc = "balance")

Random Undersampling

[Plots: density of a7 in the original data vs. the undersampled data sets Alg.U and Alg.UBal]

Random Oversampling


  • The Random Oversampling approach is simply based on the introduction of random copies of examples from the training data set
  • These replicas are only introduced in the most important ranges of the target variable, i.e. for cases with a relevance value that is above the user-defined threshold
  • It is necessary to define a relevance function, a relevance threshold and the percentage of oversampling to perform
# using the automatic method for defining the relevance function and the default threshold (0.5)
Alg.O <- RandOverRegress(a7 ~ ., algae, C.perc = list(4.5))
Alg.OBal <- RandOverRegress(a7 ~ ., algae, C.perc = "balance")

Random Oversampling

[Plots: density of a7 in the original data vs. the undersampled and oversampled data sets]

SMOTE

  • The relevance function and the relevance threshold defined determine which are the relevant and non-relevant (normal) cases
  • This algorithm combines an oversampling strategy by interpolation of important examples with a random undersampling approach
  • The procedure is in all respects similar to the generation of synthetic examples in SMOTE for classification tasks
# we have two bumps: the first must be undersampled and the second oversampled.
# Thus, we can choose the following percentages:
thr.rel=0.8; C.perc=list(0.2, 4)
# using these percentages and a relevance threshold of 0.8, with default values for the other parameters
Alg.SM <- SmoteRegress(a7 ~ ., algae, thr.rel = thr.rel, C.perc = C.perc, dist = "HEOM")
# use the automatic method for obtaining a balanced data set
Alg.SMBal <- SmoteRegress(a7 ~ ., algae, thr.rel = thr.rel, C.perc = "balance", dist = "HEOM")

SMOTE

[Plots: density of a7 in the original data vs. the undersampled, oversampled and SMOTE-resampled data sets]

Wrap-up

Summary


1. Machine Learning has a lot of faces and some of them are not pretty

  1. Imbalanced Domain Learning is considered one of the most important topics for Machine Learning and Data Mining

  2. There are a lot of strategies to tackle this type of tasks, but all of them have their advantages and disadvantages

  3. Solutions are domain-dependent

  4. Remember: before you begin tackling any ML problem, investigate the domain and your objective.

Three Challenges for the Future


1. Auto-Machine Learning and Imbalanced Domain Learning

2. Targeted Resampling: reduce the variance of outcomes in strategies

3. How to "force" a model to account for small concepts without sampling