
Nuno Moniz
nuno.moniz@fc.up.pt
What?
Why? (Dietterich, 2002)
Dietterich, T. G. (2002). Ensemble Learning. In The Handbook of Brain Theory and Neural Networks, Second edition, (M.A. Arbib, Ed.), pp. 405-408. The MIT Press.
Result: there is a well-known bias-variance trade-off in learning a prediction model
Ensembles? They are able to reduce both components of the error
Their approach consists of applying the same algorithm to different samples of the data and using the resulting models in a voting/averaging schema to obtain predictions for new cases
Expected error
E \left [ \left ( y - \hat{f} \left ( x \right ) \right )^2 \right ] = \left ( {Bias} \left [ \hat{f} \left ( x \right ) \right ] \right )^2 + {Var} \left [ \hat{f} \left ( x \right ) \right ] + \sigma^2
Bias
{Bias} \left [ \hat{f} \left ( x \right ) \right ] = E \left [ \hat{f} \left ( x \right ) \right ] - f \left ( x \right )
Variance
{Var} \left [ \hat{f} \left ( x \right ) \right ] = E \left [ \hat{f} \left ( x \right )^2 \right ] - E \left [ \hat{f} \left ( x \right ) \right ]^2
The expected error of a model \hat{f}(x) on an unseen case is the sum of three components:
Bias: the difference between the average value of the predictions and the true value of the unseen case
Variance: the difference between the average of the squared predictions and the square of the average prediction
Irreducible error: the previous two components are always non-negative, so this noise term (\sigma^2) sets the lower bound on the expected error
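As an illustration of this decomposition, the following sketch (an assumed toy setup, not from the slides: a known function f(x) = sin(x) plus Gaussian noise, and rpart regression trees) estimates the bias and variance terms of a model's prediction at a single unseen case by repeatedly drawing new training samples:
library(rpart)
set.seed(1234)
f <- function(x) sin(x)              # the true (in practice unknown) function
sigma <- 0.3                         # standard deviation of the irreducible noise
x0 <- data.frame(x=2)                # the unseen case we want to predict
nReps <- 500
preds <- numeric(nReps)
for(r in 1:nReps) {
  x <- runif(100, 0, 2*pi)                               # a fresh training sample
  d <- data.frame(x=x, y=f(x) + rnorm(100, sd=sigma))
  m <- rpart(y ~ x, d)                                   # one regression tree per sample
  preds[r] <- predict(m, x0)
}
bias2 <- (mean(preds) - f(x0$x))^2            # (E[f^(x0)] - f(x0))^2
variance <- mean(preds^2) - mean(preds)^2     # E[f^(x0)^2] - E[f^(x0)]^2
c(bias2=bias2, variance=variance, irreducible=sigma^2)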
Breiman, L. (1996): Bagging predictors. In Machine Learning, 24: 123–140.
simpleBagging <- function(form, data, model='rpartXse', nModels=100, ...) {
  ms <- list()
  n <- nrow(data)
  for(i in 1:nModels) {
    tr <- sample(n, n, replace=TRUE)                          # bootstrap sample (with replacement)
    ms[[i]] <- do.call(model, c(list(form, data[tr,]), ...))  # one model per sample
  }
  ms
}
predict.simpleBagging <- function(models, test) {
  ps <- sapply(models, function(m) predict(m, test))   # predictions of each model
  apply(ps, 1, mean)                                   # average them to get the ensemble prediction
}
data(Boston, package='MASS')
library(DMwR)
trPerc <- 0.7
sp <- sample(1:nrow(Boston), trPerc * nrow(Boston))
tr <- Boston[sp, ]
ts <- Boston[-sp, ]
m <- simpleBagging(medv ~ ., tr, nModels=300, se=0.5)
ps <- predict.simpleBagging(m, ts)
mean(abs(ps - ts$medv))
## [1] 2.203965
adabag
library(adabag)
data(iris)
trPerc <- 0.7
sp <- sample(1:nrow(iris), trPerc * nrow(iris))
tr <- iris[sp, ]; ts <- iris[-sp, ]
m <- bagging(Species ~ ., tr, mfinal = 50)
ps <- predict(m, ts)
ps$confusion
## Observed Class
## Predicted Class setosa versicolor virginica
## setosa 9 0 0
## versicolor 0 12 0
## virginica 0 5 19
ps$error
## [1] 0.1111111
Breiman, L. (2001): "Random Forests". Machine Learning, 45 (1): 5-32.
randomForest
(look into the ranger package also - it is faster)
library(randomForest)
data(Boston,package="MASS")
samp <- sample(1:nrow(Boston),354)
tr <- Boston[samp,]
ts <- Boston[-samp,]
m <- randomForest(medv ~ ., tr)
ps <- predict(m,ts)
mean(abs(ts$medv-ps))
## [1] 2.467972
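As the note above suggests, the ranger package provides a faster implementation; a minimal sketch of the equivalent regression forest on the same split (assuming ranger is installed) could be:
library(ranger)
m2 <- ranger(medv ~ ., data=tr, num.trees=500)   # 500 trees (randomForest's default ntree)
ps2 <- predict(m2, data=ts)$predictions
mean(abs(ts$medv - ps2))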
data(Glass,package='mlbench')
sp <- sample(1:nrow(Glass),150)
tr <- Glass[sp,]; ts <- Glass[-sp,]
m <- randomForest(Type ~ ., tr, ntree=200)
ps <- predict(m,ts)
table(ps,ts$Type)
##
## ps 1 2 3 5 6 7
## 1 17 3 5 0 0 1
## 2 1 19 0 0 0 2
## 3 0 0 1 0 0 0
## 5 0 1 0 3 0 0
## 6 0 0 0 0 2 0
## 7 0 0 0 0 0 9
mc <- table(ps,ts$Type)
err <- 100*(1-sum(diag(mc))/sum(mc))
err
## [1] 20.3125
data(Boston,package='MASS')
library(randomForest)
m <- randomForest(medv ~ ., Boston,
importance=T)
importance(m)
## %IncMSE IncNodePurity
## crim 15.192643 2507.2427
## zn 3.776905 234.2308
## indus 10.882079 2260.5562
## chas 4.392328 242.1772
## nox 19.566910 3136.5959
## rm 37.051416 12803.6063
## age 15.099170 1060.0131
## dis 18.906455 2508.2821
## rad 5.649464 305.0809
## tax 12.739131 1247.6428
## ptratio 15.368937 2772.9928
## black 7.408346 796.7219
## lstat 29.521515 12158.9617
varImpPlot(m, main="Feature Relevance Scores")
Boosting (Schapire, 1990) was developed with the goal of answering the question: Can a set of weak learners create a single strong learner?
Schapire, R. (1990). The Strength of Weak Learnability. Machine Learning, 5: 197-227.
Freund, Y. and Schapire, R. (1996). Experiments with a New Boosting Algorithm. In Proc. of the 13th International Conference on Machine Learning
adabag
library(adabag)
data(iris)
trPerc <- 0.7
sp <- sample(1:nrow(iris),as.integer(trPerc*nrow(iris)))
tr <- iris[sp,]
ts <- iris[-sp,]
m <- boosting(Species ~ ., tr)
ps <- predict(m,ts)
ps$confusion
## Observed Class
## Predicted Class setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 16 0
## virginica 0 2 15
This package also includes a function that allows you to check the evolution of the error as you increase the number of weak learners.
library(adabag)
data(BreastCancer,package="mlbench")
trPerc <- 0.7
sp <- sample(1:nrow(BreastCancer),as.integer(trPerc*nrow(BreastCancer)))
tr <- BreastCancer[sp,-1]
ts <- BreastCancer[-sp,-1]
m <- bagging(Class ~ ., tr,mfinal=100)
ps <- predict(m,ts)
ptr <- errorevol(m,tr)
pts <- errorevol(m,ts)
plot(ptr$error,type="l",xlab="nr.models",ylab="error",ylim=c(0,0.1))
lines(pts$error,col="red")
gbm
Package gbm implements the Gradient Boosting Machine (Friedman, 2001)
library(gbm)
## Loaded gbm 2.1.8
data(Boston,package='MASS')
trPerc <- 0.7
sp <- sample(1:nrow(Boston),as.integer(trPerc*nrow(Boston)))
tr <- Boston[sp,]; ts <- Boston[-sp,]
m <- gbm(medv ~ .,distribution='gaussian',data=tr, n.trees=20000,verbose=F)
ps <- predict(m,ts,type='response',n.trees=20000)
mean(abs(ps-ts$medv))
## [1] 3.198266
Load in the data set algae and answer the following questions:
How would you obtain a random forest to forecast the value of alga a4?
Repeat the previous exercise, but now using a boosting model
Obtain the predictions of the two previous models for the data used to obtain them (the training set). Draw a scatterplot comparing these predictions
Load the data set testAlgae. It contains a data frame named test.algae with 140 extra water samples for which we want predictions. Use the previous two models to obtain predictions of a4 for these new samples. Check what happened to the test cases with NA's. Fill in the NA's on the test set (remember the function knnImputation() from the DMwR package) and repeat the experiment
Generic Goals of Descriptive Analytics
Generic Goals of Clustering
Remember the Euclidean, Minkowski and heterogeneous distance functions from the previous class?
Function dist() can be used to obtain a distance matrix between a set of cases
data(iris)
dm <- dist(iris[,-5]) # excluding the nominal target
as.matrix(dm)[1,4] # because dm is of class "dist"
## [1] 0.6480741
Parameter method of function dist() allows you to specify different types of distance functions (it defaults to the Euclidean distance), such as Manhattan or Minkowski (check the help of the function)
Note that this function only accepts numeric variables
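For instance, Manhattan and Minkowski distance matrices over the iris measurements could be obtained as follows (a small sketch):
dm.manh <- dist(iris[,-5], method="manhattan")        # L1 (city-block) distances
dm.mink <- dist(iris[,-5], method="minkowski", p=4)   # Minkowski distance with p=4
as.matrix(dm.manh)[1,4]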
Function daisy() from package cluster can calculate distances between cases that include both numeric and nominal variables (heterogeneous distance)
library(cluster)
data(iris)
dm <- daisy(iris)
as.matrix(dm)[1,4] # because dm is of class "dist"
## [1] 0.06450094
Parameter metric of function daisy() allows you to specify different types of distance functions (it defaults to the Euclidean distance), namely Euclidean, Manhattan or Gower (check the help of the function)
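For instance, a Gower distance matrix over the full iris data (including the nominal Species column) could be obtained as follows (a small sketch):
library(cluster)
dm.gower <- daisy(iris, metric="gower")   # handles the mix of numeric and nominal columns
as.matrix(dm.gower)[1,4]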
Hierarchical or Partitional
Exclusive or with Overlapping
Goal: Partition the given set of data into k groups by either minimizing or maximizing a pre-specified criterion
Some key issues:
The number of distinct ways of partitioning n cases into k non-empty groups is given by
N(n,k) = \frac{1}{k!} \sum_{i=1}^k (-1)^{k-i} \binom{k}{i} i^n
e.g. for n = 100 and k = 5, N(100, 5) \approx 6.6 \cdot 10^{67}, so an exhaustive search over all possible partitions is infeasible
\bar{v}_i^c is the average value of variable i in cluster c, and \tilde{v}_i^c the median
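As a quick check of the N(100, 5) figure above, the formula can be evaluated directly in R (a small sketch; the helper name nPartitions is ours):
nPartitions <- function(n, k)    # number of ways of splitting n cases into k non-empty groups
  sum((-1)^(k - (1:k)) * choose(k, 1:k) * (1:k)^n) / factorial(k)
nPartitions(100, 5)              # roughly 6.6e67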
k-Means is a very simple partition-based method that obtains k groups of a data set
The algorithm: (i) select k initial centroids; (ii) assign each case to the nearest centroid; (iii) recompute each centroid as the average of the cases assigned to it; (iv) repeat steps (ii)-(iii) until the assignments no longer change (a minimal sketch of these steps is shown below)
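A minimal, purely illustrative R sketch of these steps (the function myKmeans is ours, assumes numeric data and Euclidean distances, and is not meant to replace the kmeans() function used below):
myKmeans <- function(data, k, maxIter=100) {
  data <- as.matrix(data)
  cent <- data[sample(nrow(data), k), , drop=FALSE]   # (i) k random cases as initial centroids
  memb <- rep(0, nrow(data))
  for(it in 1:maxIter) {
    ## (ii) assign each case to its nearest centroid (squared Euclidean distance)
    d2 <- sapply(1:k, function(cl) colSums((t(data) - cent[cl,])^2))
    newMemb <- max.col(-d2)
    if (all(newMemb == memb)) break                   # (iv) stop when nothing changes
    memb <- newMemb
    ## (iii) recompute each centroid as the mean of its cases (no handling of empty clusters)
    for(cl in 1:k) cent[cl,] <- colMeans(data[memb == cl, , drop=FALSE])
  }
  list(cluster=memb, centers=cent)
}
myKmeans(iris[,-5], 3)$centers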
data(iris)
k3 <- kmeans(iris[,-5],centers=3,iter.max=200)
k3
## K-means clustering with 3 clusters of sizes 38, 50, 62
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 6.850000 3.073684 5.742105 2.071053
## 2 5.006000 3.428000 1.462000 0.246000
## 3 5.901613 2.748387 4.393548 1.433871
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [75] 3 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 3 1 1 1 1 3 1 1 1 1
## [112] 1 1 3 3 1 1 1 1 3 1 3 1 3 1 1 3 3 1 1 1 1 1 3 1 1 1 1 3 1 1 1 3 1 1 1 3 1
## [149] 1 3
##
## Within cluster sum of squares by cluster:
## [1] 23.87947 15.15100 39.82097
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
How to validate/evaluate/compare the results obtained with some clustering method?
Related Questions
Popular coefficient that tries to incorporate both the notions of cohesion and separation
For each object i, let a_i be the average distance between i and the other objects of its cluster, and b_i the lowest average distance between i and the objects of each of the other clusters; the silhouette of i is
s_i = \frac{b_i - a_i}{\max(a_i,b_i)}
The coefficient takes values between -1 and 1. Ideally all objects should have positive values (a_i < b_i), and the a_i's should be near zero
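As an illustration (a small sketch using the k-means solution k3 obtained before and the Euclidean distance matrix), the values of a_1 and b_1 for the first flower can be computed by hand and compared with the output of silhouette() shown below:
dm <- as.matrix(dist(iris[,-5]))
cl <- k3$cluster
i <- 1
own <- which(cl == cl[i])                  # objects in the same cluster as i
a_i <- mean(dm[i, setdiff(own, i)])        # average distance to the other members of its cluster
b_i <- min(sapply(setdiff(unique(cl), cl[i]),
                  function(g) mean(dm[i, cl == g])))   # closest of the other clusters
(b_i - a_i) / max(a_i, b_i)                # should match the first sil_width given by silhouette() below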
table(k3$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 0 2 36
## 2 50 0 0
## 3 0 48 14
library(cluster)
s <- silhouette(k3$cluster,
dist(iris[, -5]))
plot(s)
Among many different possible strategies, we could use the average silhouette coefficient value to try several possible numbers of clusters and select the "best" one. An illustrative example in R:
d <- dist(iris[,-5])
avgS <- c()
for(k in 2:6) {
  cl <- kmeans(iris[,-5], centers=k, iter.max=200)
  s <- silhouette(cl$cluster, d)
  avgS <- c(avgS, mean(s[,3]))
}
library(ggplot2)
ggplot(data.frame(nClus=2:6,Silh=avgS),
aes(x=nClus,y=Silh)) +
geom_point(size=3,color="red") +
geom_line() + xlab("Nr.Clusters") +
ylab("Silh.Coef.")
H(C,k) = \sum_{i=1}^k \sum_{\mathbf{x} \in C_i} | \mathbf{x} - \bar{\mathbf{x}}_{C_i} |
where \bar{\mathbf{x}}_{C_i} is the centroid of cluster C_i
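Reading |.| as the Euclidean norm, this criterion can be computed for the k-means solution k3 obtained before; a small sketch (note that the tot.withinss component reported by kmeans() uses squared distances instead):
H <- sum(sapply(1:3, function(cl) {
  memb <- as.matrix(iris[k3$cluster == cl, -5])   # the cases of cluster cl
  ctr  <- k3$centers[cl,]                         # its centroid (vector of averages)
  sum(sqrt(rowSums(sweep(memb, 2, ctr)^2)))       # sum of the distances to the centroid
}))
H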
library(cluster)
pc <- pam(iris[,-5],k=3)
table(pc$clustering,iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 14
## 3 0 2 36
s <- silhouette(pc$clustering,
dist(iris[,-5]))
plot(s)
library(cluster)
cl <- clara(iris[,-5],3)
table(cl$clustering,iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 48 13
## 3 0 2 37
clusplot(cl)
The DBSCAN Algorithm: groups together cases that lie in densely populated regions (cases with at least a minimum number of neighbours within a given radius), leaving cases in low-density regions marked as noise
Note that this method does not require the user to specify the number of groups.
(But you need to specify the radius eps and the minimum number of neighbours MinPts)
library(fpc)
d <- scale(iris[,-5])
db <- dbscan(d,0.9)
db
## dbscan Pts=150 MinPts=5 eps=0.9
## 0 1 2
## border 4 1 4
## seed 0 48 93
## total 4 49 97
table(db$cluster,iris$Species)
##
## setosa versicolor virginica
## 0 1 0 3
## 1 49 0 0
## 2 0 50 47
plot(db, d)
library(cluster)
f <- fanny(iris[,-5], 3, stand=T,metric='euclidean')
head(f$membership)
## [,1] [,2] [,3]
## [1,] 0.8297737 0.07841458 0.0918117
## [2,] 0.6409940 0.15528310 0.2037229
## [3,] 0.7534768 0.10943177 0.1370914
## [4,] 0.6870170 0.13724986 0.1757331
## [5,] 0.7931229 0.09626758 0.1106095
## [6,] 0.6080978 0.18974175 0.2021604
table(f$clustering,iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 12 39
## 3 0 38 11
clusplot(f)
The difference between two groups that are candidates for merging can be measured in several ways:
First obtaining the clustering with the Complete Linkage method (default)
data(iris)
d <- dist(scale(iris[,-5]))
h <- hclust(d)
Now cutting the dendrogram to form 3 clusters
cls <- cutree(h,3)
table(cls,iris$Species)
##
## cls setosa versicolor virginica
## 1 49 0 0
## 2 1 21 2
## 3 0 29 48
plot(h)
library(cluster)
d <- dist(scale(iris[,-5]))
methds <- c('complete','single','average')
avgS <- matrix(NA,ncol=3,nrow=5,
dimnames=list(2:6,methds))
for(k in 2:6) {
  for(m in seq_along(methds)) {
    h <- hclust(d, method=methds[m])
    cl <- cutree(h, k)
    s <- silhouette(cl, d)
    avgS[k-1, m] <- mean(s[,3])
  }
}
library(reshape2)
dt <- melt(avgS)
colnames(dt) <- c("NClusts","Meth","AvgS")
library(ggplot2)
ggplot(dt,aes(x=NClusts,y=AvgS,color=Meth)) +
geom_line()
These methods start with a single cluster and, at each stage, decide which cluster is split in two.
The DIANA method
library(cluster)
di <- diana(iris[,-5],
metric='euclidean',
stand=TRUE)
table(cutree(di,3),iris$Species)
##
## setosa versicolor virginica
## 1 50 0 0
## 2 0 11 33
## 3 0 39 17
pltree(di)
Using the iris data set, answer the following questions (remember to exclude the target variable Species!):
Apply the k-means algorithm, and find the optimal number of clusters
Using the best k, apply the PAM, CLARA, DBSCAN, FANNY, Agglomerative Hierarchical Clustering (hclust()) and DIANA algorithms
Compare all results using the silhouette coefficient