
Nuno Moniz
nuno.moniz@fc.up.pt
The Basic Idea!
Case 1 is “closer” to case 2 than to 3
$\large{d(\mathbf{x,y}) = \sqrt{\sum_{i=1}^p (x_i - y_i)^2} }$
where \(x_i\) is the value of case \(\mathbf{x}\) on variable \(i\)
Example!
$\large{d(\mathbf{x,y}) = \sqrt{\sum_{i=1}^p (x_i - y_i)^2} }$
$\large{d(\mathbf{x,y}) = \sqrt{(3-12)^2 + (5-5.4)^2 + (1-(-3))^2 } }$
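A quick check of this calculation in R, using the three values of each case from the example above:
x <- c(3, 5, 1); y <- c(12, 5.4, -3)
sqrt(sum((x - y)^2))  # Euclidean distance between the two example cases (~9.86)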
$d(\mathbf{x,y}) = \left ( \sum_{i=1}^p | x_i - y_i |^r \right )^{1/r}$
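A minimal R sketch of this general (Minkowski) distance; the helper minkowski() below is illustrative, not taken from any package:
# Minkowski distance between two numeric vectors;
# r = 2 gives the Euclidean case, r = 1 the Manhattan distance
minkowski <- function(x, y, r = 2) sum(abs(x - y)^r)^(1/r)
minkowski(c(3, 5, 1), c(12, 5.4, -3))         # r = 2: same value as the Euclidean example
minkowski(c(3, 5, 1), c(12, 5.4, -3), r = 1)  # r = 1: Manhattan distance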
How to calculate the distance between two cases described by variables with different types (e.g. numeric and nominal variables)? A possible solution:
$\large{ d(\mathbf{x,y}) = \left ( \sum_{i=1}^p \delta_i (x_i, y_i) \right ) }$
where,
$\delta_i(x_i,y_i) = \left\{\begin{matrix} 1 & \text{if } i \text{ is nominal and } x_i \neq y_i \\ 0 & \text{if } i \text{ is nominal and } x_i = y_i \\ \frac{|x_i-y_i|}{range(i)} & \text{if } i \text{ is numeric} \end{matrix}\right.$
The distance between \(<2500, f, director, 35>\) and \(<2750, f, director, 30>\) would be given by \(\frac{|2500-2750|}{range(Salary)} + 0 + 0 + \frac{|35-30|}{range(Age)}\)
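A possible sketch of this mixed-type distance in R; the helper hetDist() and the variable ranges below are illustrative assumptions, not values from the slides:
# 0/1 contribution for nominal variables, range-normalised absolute
# difference for numeric ones (hypothetical helper)
hetDist <- function(x, y, ranges) {
  sum(sapply(seq_along(x), function(i) {
    if (is.numeric(x[[i]])) abs(x[[i]] - y[[i]]) / ranges[i]
    else as.numeric(x[[i]] != y[[i]])
  }))
}
x <- list(2500, "f", "director", 35)
y <- list(2750, "f", "director", 30)
ranges <- c(2000, NA, NA, 45)  # assumed ranges of Salary and Age
hetDist(x, y, ranges)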
Method
Very simple method
May suffer in the presence of outliers
Frequently achieves good results
knn()
library(class)
sp <- sample(1:150,100)
tr <- iris[sp,]; ts <- iris[-sp,]
nn3 <- knn(tr[,-5],ts[,-5],tr[,5],k=3)
(mtrx <- table(nn3,ts$Species))
##
## nn3 setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 19 1
## virginica 0 2 18
(err <- 1-sum(diag(mtrx))/sum(mtrx))
## [1] 0.06
library(class); library(DMwR)
sp <- sample(1:150,100)
tr <- iris[sp,]; ts <- iris[-sp,]
nn3 <- kNN(Species ~ .,tr,ts,k=3,norm=TRUE)
(mtrx <- table(nn3,ts$Species))
##
## nn3 setosa versicolor virginica
## setosa 18 0 0
## versicolor 0 12 2
## virginica 0 2 16
(err <- 1-sum(diag(mtrx))/sum(mtrx))
## [1] 0.08
trials <- c(1,3,5,7,11); nreps <- 10
res <- matrix(NA, nrow=length(trials), ncol=2)
for(k in seq_along(trials)) {
  errs <- rep(0,nreps)
  for(r in 1:nreps) {
    sp <- sample(1:150,100); tr <- iris[sp,]; ts <- iris[-sp,]
    nn3 <- kNN(Species ~ .,tr,ts,k=trials[k],norm=TRUE)
    mtrx <- table(nn3,ts$Species)
    errs[r] <- 1-sum(diag(mtrx))/sum(mtrx)
  }
  res[k,] <- c(mean(errs),sd(errs))
}
dimnames(res) <- list(paste('k',trials,sep='='),c('avg','std'))
res
## avg std
## k=1 0.074 0.04115013
## k=3 0.054 0.03777124
## k=5 0.046 0.02988868
## k=7 0.044 0.02458545
## k=11 0.056 0.03373096
library(caret)
data(Boston,package="MASS")
sp <- sample(1:506,354)
tr <- Boston[sp,]
ts <- Boston[-sp,]
tgt <- which(colnames(Boston) == "medv")
nn3 <- knnreg(tr[,-tgt],tr[,tgt],k=3)
pnn3 <- predict(nn3,ts[,-tgt])
(mse <- mean((pnn3-ts[,tgt])^2))
## [1] 46.14592
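Since k-NN is distance-based, the scale of the predictors matters; a possible variant of the example above, standardising the Boston predictors using the training-set statistics:
trS <- scale(tr[,-tgt])                               # standardise training predictors
tsS <- scale(ts[,-tgt], center=attr(trS,"scaled:center"),
             scale=attr(trS,"scaled:scale"))          # apply the same transform to the test set
nn3s <- knnreg(trS, tr[,tgt], k=3)
mean((predict(nn3s, tsS) - ts[,tgt])^2)               # MSE with standardised predictors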
Build a distance matrix between all cases (look up the function dist())
Load the data and split it into two random partitions: one with 100 cases that will form the training set, and the other with the remaining 50 cases that form the test set.
Using the previously obtained distance matrix, how would you find the 2 nearest neighbours of the first test case? (One possible approach is sketched after this list.)
Using the strategy of the previous question, what would be the prediction for the same test case if we used 3, 5 or 7 nearest neighbours?
Obtain the predictions for all test cases using the 3 nearest neighbours strategy (without using the function knn())
Obtain the confusion matrix and the error rate of the previous predictions
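A possible sketch for the first of these questions, assuming the data in question is iris (150 cases) and using only dist():
d <- as.matrix(dist(iris[,-5]))        # distance matrix between all 150 cases
sp <- sample(1:150,100)
tr <- iris[sp,]; ts <- iris[-sp,]
first <- setdiff(1:150,sp)[1]          # row id of the first test case
sort(d[first,sp])[1:2]                 # its 2 nearest training neighbours
k <- 3                                 # prediction: majority class among the k neighbours
ids <- as.integer(names(sort(d[first,sp])[1:k]))
names(which.max(table(iris$Species[ids])))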
Map the original data into a new (higher-dimensional) coordinate system where the classes are linearly separable
$L_D = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j}^n \alpha_i \alpha_j y_i y_j (\mathbf{x_i} \cdot \mathbf{x_j})$
Subject to:
\(\alpha_i \geq 0\) and \(\sum_i \alpha_i y_i = 0\)
In the solution found, the cases with \(\alpha_i > 0\) are the support vectors, which define the optimal separating hyperplane
The Kernel Trick
Instead of calculating the dot products in the high-dimensional space, take advantage of the fact that, for suitable kernel functions, \(K(\mathbf{x,z}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{z})\), and simply replace the expensive dot products by these simpler and more efficient kernel evaluations
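As a concrete illustration (not from the slides): for two-dimensional vectors, the quadratic kernel \(K(\mathbf{x,z}) = (\mathbf{x} \cdot \mathbf{z})^2\) corresponds to an explicit feature map \(\phi\), so evaluating \(K\) directly gives the dot product in the mapped space without ever constructing \(\phi\):
$(\mathbf{x} \cdot \mathbf{z})^2 = (x_1 z_1 + x_2 z_2)^2 = x_1^2 z_1^2 + 2 x_1 x_2 z_1 z_2 + x_2^2 z_2^2 = \phi(\mathbf{x}) \cdot \phi(\mathbf{z}), \quad \text{where } \phi(\mathbf{x}) = (x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2)$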
The Algorithm
library(e1071)
data(Glass,package='mlbench')
tr <- Glass[1:200,]
ts <- Glass[201:214,]
s <- svm(Type ~ .,tr)
predict(s,ts)
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214
## 7 2 7 7 7 7 7 2 7 7 7 7 7 7
## Levels: 1 2 3 5 6 7
ps <- predict(s,ts)
table(ps,ts$Type)
##
## ps 1 2 3 5 6 7
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 2
## 3 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 12
mc <- table(ps,ts$Type)
error <- 100*(1-sum(diag(mc))/sum(mc))
error
## [1] 14.28571
V. Vapnik (1995). The Nature of Statistical Learning Theory. Springer.
$| \xi |_\epsilon = \left\{\begin{matrix} 0 & | \xi | \leq \epsilon \\ | \xi | - \epsilon & \text{otherwise} \end{matrix}\right.$
Minimize: \(\large{\frac{1}{2} || \mathbf{w} ||^2 + C \sum_{i=1}^n (\xi_i + \xi_i^\star )}\)
Subject to:
$\left\{\begin{matrix} y_i - \mathbf{w} \cdot \mathbf{x}_i - b & \leq \epsilon + \xi_i \\ \mathbf{w} \cdot \mathbf{x}_i + b - y_i & \leq \epsilon + \xi_i^{\star} \\ \xi_i, \xi_i^{\star} & \geq 0 \end{matrix}\right.$
where \(C\) corresponds to the cost to pay for each violation of the error limit \(\epsilon\)
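A small illustration of the \(\epsilon\)-insensitive loss in R (the residual values are arbitrary): residuals inside the \(\epsilon\) tube cost nothing, while larger ones pay only the excess over \(\epsilon\), weighted by \(C\) in the objective above.
epsLoss <- function(res, eps=0.1) pmax(abs(res) - eps, 0)  # epsilon-insensitive loss
epsLoss(c(-0.05, 0.08, 0.3, -0.5))  # first two residuals fall inside the tube -> 0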
library(e1071)
library(DMwR)
data(Boston,package='MASS')
sp <- sample(1:nrow(Boston),354)
tr <- Boston[sp,]
ts <- Boston[-sp,]
s <- svm(medv ~ .,tr,cost=10,epsilon=0.02)
preds <- predict(s,ts)
regr.eval(ts$medv,preds)
## mae mse rmse mape
## 2.0674690 12.5715754 3.5456417 0.1214812
plot(ts$medv,preds,main='Errors Scatterplot',
ylab='Predictions',xlab='True')
abline(0,1,col='red',lty=2)
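The cost and epsilon values above were fixed by hand; a possible way of searching over them with e1071's tune() (the grid below is only illustrative):
tuned <- tune(svm, medv ~ ., data=tr,
              ranges=list(cost=c(1,10,100), epsilon=c(0.01,0.1,0.5)))
summary(tuned)                         # cross-validation error of each combination
preds2 <- predict(tuned$best.model, ts)
regr.eval(ts$medv, preds2)             # evaluate the best model on the test set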
File Wine.data contains a data frame with data on green wine quality. The data set contains a series of tests with green wines (red and white); for each test, the values of several physicochemical variables are given, together with a quality score assigned by wine experts (column Proline).
Obtain an SVM model for forecasting the quality of the Class 2 variant
Split the data set in two parts: one with 70% of the samples and the other with the remaining 30%. Obtain an SVM with the first part and apply it to the second. What was the resulting mean absolute error?
Using the round() function, round the predictions obtained in the previous question to the nearest integer. Calculate the error rate of the resulting integers when compared to the true values (a possible sketch follows).
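A possible sketch for the last two questions, assuming the file has been read into a data frame named wine and that the target is the quality column (called Proline above); the file-reading call is an assumption about the data format:
wine <- read.csv("Wine.data")                 # assumed file format
sp <- sample(1:nrow(wine), round(0.7*nrow(wine)))
tr <- wine[sp,]; ts <- wine[-sp,]
s <- svm(Proline ~ ., tr)
preds <- predict(s, ts)
mean(abs(preds - ts$Proline))                 # mean absolute error
rp <- round(preds)                            # round to the nearest integer
1 - sum(rp == ts$Proline)/nrow(ts)            # error rate of the rounded predictions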