
Nuno Moniz
nuno.moniz@fc.up.pt
Prediction is:
the ability to anticipate the future (forecasting).
possible if we assume that there is some regularity in what we observe, i.e. if the observed events are not random.
Example: Medical Diagnosis
Given a historical record containing the symptoms observed in several patients and the respective diagnoses, try to forecast the correct diagnosis for a new patient for whom we know the symptoms.
Predictive models address two main types of tasks:
Classification Task
\(Species = f(Sepal.Width, \cdots)\)
## Sepal.Width Petal.Length Petal.Width Species
## 1 3.5 1.4 0.2 setosa
## 2 3.0 1.4 0.2 setosa
## 3 3.2 1.3 0.2 setosa
## 4 3.1 1.5 0.2 setosa
## 5 3.6 1.4 0.2 setosa
## 6 3.9 1.7 0.4 setosa
Regression Task
\(a1 = f(Cl, \cdots)\)
## Cl NO3 NH4 oPO4 PO4 Chla a1
## 1 60.800 6.238 578.000 105.000 170.000 50.0 0.0
## 2 57.750 1.288 370.000 428.750 558.750 1.3 1.4
## 3 40.020 5.330 346.667 125.667 187.057 15.6 3.3
## 4 77.364 2.302 98.182 61.182 138.700 1.4 3.1
## 5 55.350 10.416 233.700 58.222 97.580 10.5 9.2
## 6 65.750 9.248 430.000 18.250 56.667 28.4 15.1
The setting
The approach
$\large{L_{0/1} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} I( \hat{h}_\theta (\mathbf{x}_i), y_i)}$
where \(I()\) is an indicator function such that \(I(x,y) = 0\) if \(x = y\) and \(1\) otherwise; and \(\hat{h}_\theta(\mathbf{x}_i)\) is the prediction of the model being evaluated for the test case \(i\), whose true class is \(y_i\).
trueVals <- c("c1", "c1", "c2", "c1", "c3", "c1", "c2", "c3", "c2", "c3")
preds <- c("c1", "c2", "c1", "c3", "c3", "c1", "c1", "c3", "c1", "c2")
confMatrix <- table(trueVals, preds)
confMatrix
## preds
## trueVals c1 c2 c3
## c1 2 1 1
## c2 3 0 0
## c3 0 1 2
errorRate <- 1 - sum(diag(confMatrix)) / sum(confMatrix)
errorRate
## [1] 0.6
Cost/benefit Matrices
Table where each entry specifies the cost (negative benefit) or benefit of each type of prediction
trueVals <- c("c1", "c1", "c2", "c1", "c3", "c1", "c2", "c3", "c2", "c3")
preds <- c("c1","c2","c1","c3","c3","c1","c1","c3","c1","c2")
confMatrix <- table(trueVals, preds)
costMatrix <- matrix(c(10, -2, -4, -2, 30, -3, -5, -6, 12), ncol = 3)
colnames(costMatrix) <- c("predC1", "predC2", "predC3")
rownames(costMatrix) <- c("obsC1", "obsC2", "obsC3")
costMatrix
## predC1 predC2 predC3
## obsC1 10 -2 -5
## obsC2 -2 30 -6
## obsC3 -4 -3 12
utilityPreds <- sum(confMatrix * costMatrix)
utilityPreds
## [1] 28
$Prec = \frac{TP}{TP+FP}$
$Rec = \frac{TP}{TP+FN}$
$F_\beta = \frac{ ( \beta^2 + 1 ) \times Prec \times Rec }{ \beta^2 \times Prec + Rec }$
where \(\beta\) controls the relative importance of \(Prec\) and \(Rec\).
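As a quick illustration of these measures, here is a minimal sketch (not from the slides; the labels are made up) for a two-class problem where "pos" is taken as the positive class:
trueVals <- c("pos", "pos", "neg", "pos", "neg", "neg", "pos", "neg")
preds <- c("pos", "neg", "neg", "pos", "pos", "neg", "pos", "neg")
TP <- sum(preds == "pos" & trueVals == "pos") # correctly predicted positives
FP <- sum(preds == "pos" & trueVals == "neg") # negatives predicted as positive
FN <- sum(preds == "neg" & trueVals == "pos") # positives predicted as negative
prec <- TP / (TP + FP)
rec <- TP / (TP + FN)
beta <- 1 # beta = 1 gives the usual F1, weighing Prec and Rec equally
fbeta <- ((beta^2 + 1) * prec * rec) / (beta^2 * prec + rec)
c(Prec = prec, Rec = rec, F1 = fbeta)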
The setting
The approach
$MSE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} ( \hat{y}_i - y_i )^2$
where \(\hat{y_i}\) is the prediction of the model under evaluation for the case \(i\) and \(y_i\) the respective true target variable value.
$MAE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} | \hat{y}_i - y_i |$
where \(\hat{y_i}\) is the prediction of the model under evaluation for the case \(i\) and \(y_i\) the respective true target variable value.
$NMSE = \frac{ \sum_{i=1}^{N_{test}} ( \hat{y}_i - y_i)^2 }{ \sum_{i=1}^{N_{test}} ( \bar{y} - y_i)^2 }$
$NMAE = \frac{ \sum_{i=1}^{N_{test}} | \hat{y}_i - y_i | }{ \sum_{i=1}^{N_{test}} | \bar{y} - y_i | }$
where \(\bar{y}\) is the mean of the true target values, i.e. the prediction of a trivial baseline model.
The Mean Absolute Percentage Error (MAPE) is given by,
$MAPE = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \frac{ | \hat{y_i} - y_i | }{ y_i }$
The Correlation between the predictions and the true values (\(\rho_{\hat{y}, y}\)) is given by,
$\rho_{\hat{y}, y} = \frac{ \sum_{i=1}^{N_{test}} (\hat{y}_i - \bar{\hat{y}}) ( y_i - \bar{y} ) }{ \sqrt{ \sum_{i=1}^{N_{test}} ( \hat{y}_i - \bar{\hat{y}})^2 \sum_{i=1}^{N_{test}} ( y_i - \bar{y})^2 } }$
trueVals <- c(10.2, -3, 5.4, 3, -43, 21,
32.4, 10.4, -65, 23)
preds <- c(13.1, -6, 0.4, -1.3, -30, 1.6,
3.9, 16.2, -6, 20.4)
mse <- mean((trueVals - preds)^2); mse
## [1] 493.991
rmse <- sqrt(mse); rmse
## [1] 22.22591
mae <- mean(abs(trueVals - preds)); mae
## [1] 14.35
nmse <- sum((trueVals - preds)^2) / sum((trueVals - mean(trueVals))^2); nmse
## [1] 0.5916071
nmae <- sum(abs(trueVals - preds)) / sum(abs(trueVals - mean(trueVals))); nmae
## [1] 0.65633
mape <- mean(abs(trueVals - preds)/trueVals); mape
## [1] 0.290773
corr <- cor(trueVals, preds); corr
## [1] 0.6745381
Overly large trees tend to overfit the training data and will perform badly on new data, which raises the question of the reliability of error estimates
The prediction at each leaf should be the value that best represents the cases in that leaf
A test is good if it is able to split the cases of the sample into partitions that are “purer” than the parent node (see the impurity sketch below)
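To make the idea of “purity” concrete, here is a minimal sketch (not from the slides; the labels and the candidate split are made up) that scores a split with the Gini index, one common impurity measure:
gini <- function(y) { # Gini impurity of a set of class labels
  p <- table(y) / length(y)
  1 - sum(p^2)
}
parent <- c("a", "a", "a", "b", "b", "b") # labels reaching the parent node
left <- c("a", "a", "a", "b") # cases sent left by a candidate test
right <- c("b", "b") # cases sent right
# weighted impurity of the children; lower than the parent means a useful test
splitGini <- (length(left) / length(parent)) * gini(left) +
  (length(right) / length(parent)) * gini(right)
c(parent = gini(parent), split = splitGini)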
This infra-structure is implemented in package rpart. In package DMwR you may find the function rpartXse(), which grows and prunes a tree in a way similar to CART using the above infra-structure.
library(DMwR)
library(rpart.plot)
data(Glass,package='mlbench')
ac <- rpartXse(Type ~ .,Glass, model=TRUE)
prp(ac,type=4,extra=101)
tr <- Glass[1:200,]
ts <- Glass[201:214,]
ac <- rpartXse(Type ~ .,tr, model=TRUE)
predict(ac,ts)
## 1 2 3 5 6 7
## 201 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 202 0.2 0.2666667 0.4666667 0.00000000 0.06666667 0.0000000
## 203 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 204 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 205 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 206 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 207 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 208 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 209 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 210 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 211 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 212 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 213 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
## 214 0.0 0.0000000 0.0000000 0.09090909 0.00000000 0.9090909
predict(ac, ts, type='class')
## 201 202 203 204 205 206 207 208 209 210 211 212 213 214
## 7 3 7 7 7 7 7 7 7 7 7 7 7 7
## Levels: 1 2 3 5 6 7
ps <- predict(ac, ts, type='class'); table(ps, ts$Type)
##
## ps 1 2 3 5 6 7
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 1
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 13
mc <- table(ps, ts$Type); err <- 100 * (1-sum(diag(mc)) / sum(mc)); err
## [1] 7.142857
library(DMwR); library(rpart.plot);
data(algae)
d <- algae[,1:12]
ar <- rpartXse(a1 ~ .,d, model=TRUE)
prp(ar,type=4,extra=101)
tr <- d[1:150,]
ts <- d[151:200,]
ar <- rpartXse(a1 ~ .,tr, model=TRUE)
preds <- predict(ar,ts)
mae <- mean(abs(preds-ts$a1)); mae
## [1] 12.27911
cr <- cor(preds,ts$a1); cr
## [1] 0.512407
File Wine.data
contains a data frame with data on green wine quality. The data set contains a series of tests with green wines (red and white). For each of these tests, the values of several physicochemical variables are given, together with a quality score assigned by wine experts (column Proline).
Build a regression tree for the data set
Plot the obtained regression tree
Apply the tree to the data used to obtain the model and calculate the mean squared error of the predictions
Split the data set into two parts: 70% of the tests (training data) and the remaining 30% (test data). Use the larger part to obtain a regression tree and apply it to the other part. Calculate the mean squared error again and compare with the previous score (a possible solution sketch follows this list).
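One possible way to approach this exercise is sketched below. It assumes (this is not specified above) that Wine.data has already been read into a data frame called wine and that Proline is the target column; adapt the names as needed.
library(DMwR); library(rpart.plot)
wt <- rpartXse(Proline ~ ., wine, model = TRUE) # tree grown on the full data
prp(wt, type = 4, extra = 101)
mseFull <- mean((predict(wt, wine) - wine$Proline)^2); mseFull
sp <- sample(1:nrow(wine), round(0.7 * nrow(wine))) # 70% / 30% split
tr <- wine[sp, ]; ts <- wine[-sp, ]
wt2 <- rpartXse(Proline ~ ., tr, model = TRUE)
mseTest <- mean((predict(wt2, ts) - ts$Proline)^2); mseTest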
In R, the Naive Bayes classifier is available in package e1071, through function naiveBayes().
$\large{P(H|\mathbf{x}) = \frac{P(\mathbf{x}|H) P(H)}{P(\mathbf{x})}}$
We have a data set \(D\) with cases belonging to one of m classes \(c_1, c_2, \cdots, c_m\)
Given a new test case \(\mathbf{x}\), this classifier produces as prediction the class with the highest estimated probability, i.e. \(\arg\max_{i \in \{1, 2, \cdots, m\}} P(c_i|\mathbf{x})\)
Given that \(P(\mathbf{x})\) is constant for all classes, according to the Bayes Theorem the class with the highest probability is the one maximizing the quantity \(P(\mathbf{x}|c_i) P(c_i)\). The “naive” assumption is that the attributes are conditionally independent given the class, i.e. \(P(\mathbf{x}|c_i) = \prod_j P(x_j|c_i)\), so each factor can be estimated separately from the training data.
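Before turning to the e1071 implementation, here is a minimal hand-rolled sketch (not from the slides; the tiny data set is made up) of the quantity \(P(\mathbf{x}|c_i)P(c_i)\) estimated by counting, under the independence assumption:
d <- data.frame(Outlook = c("sun", "sun", "rain", "rain", "sun", "rain"),
                Windy = c("yes", "no", "yes", "no", "no", "yes"),
                Play = c("no", "yes", "no", "yes", "yes", "no"))
x <- list(Outlook = "sun", Windy = "yes") # new case to classify
score <- sapply(unique(d$Play), function(cl) {
  dc <- d[d$Play == cl, ]
  prior <- nrow(dc) / nrow(d) # P(c)
  lik <- prod(sapply(names(x), function(a) mean(dc[[a]] == x[[a]]))) # prod_j P(x_j|c)
  prior * lik
})
score / sum(score) # normalized estimates of P(c|x)
Note that in this toy data Windy = "yes" never occurs together with Play = "yes", so that class gets an estimated probability of zero no matter what the other attributes say; this is exactly the problem addressed below by the Laplace correction.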
library(e1071); sp <- sample(1:150, 100)
tr <- iris[sp, ]; ts <- iris[-sp, ]
nb <- naiveBayes(Species ~ ., tr)
(mtrx <- table(predict(nb, ts), ts$Species))
##
## setosa versicolor virginica
## setosa 14 0 0
## versicolor 0 16 2
## virginica 0 2 16
(err <- 1 - sum(diag(mtrx)) / sum(mtrx))
## [1] 0.08
head(predict(nb, ts, type='raw'), 2)
## setosa versicolor virginica
## [1,] 1 1.115293e-17 3.730156e-28
## [2,] 1 1.209394e-13 5.587356e-23
What if one of the probability estimates used in the Bayes Theorem is equal to zero? If some attribute value never occurs together with a class in the training data, the estimated \(P(x_j|c_i)\) is zero and the whole product \(P(\mathbf{x}|c_i)P(c_i)\) becomes zero, regardless of the remaining attributes. The Laplace correction (smoothing) avoids this by adding a small count to every frequency.
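For reference (this formula is not in the slides), with smoothing parameter \(l\) (the laplace argument used below) and a categorical attribute with \(K\) distinct values, the corrected estimate is
$P(x_j = v | c_i) = \frac{ N_{v,c_i} + l }{ N_{c_i} + l \cdot K }$
where \(N_{v,c_i}\) is the number of training cases of class \(c_i\) with value \(v\), and \(N_{c_i}\) the total number of training cases of class \(c_i\).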
library(e1071); sp <- sample(1:150,100)
tr <- iris[sp,]; ts <- iris[-sp,]
nb <- naiveBayes(Species ~ ., tr,laplace=1)
(mtrx <- table(predict(nb,ts),ts$Species))
##
## setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 18 1
## virginica 0 3 13
(err <- 1-sum(diag(mtrx))/sum(mtrx))
## [1] 0.08
head(predict(nb,ts,type='raw'),2)
## setosa versicolor virginica
## [1,] 1 1.636668e-18 2.342638e-28
## [2,] 1 3.748327e-19 1.061047e-28
Using the dataset iris, and remembering that the target variable is Species:
Separate the data set into 70% / 30% for training and test set, respectively
Build a Bayesian Classifier for the data set
Calculate the Accuracy metric
Rebuild the Bayesian Classifier with a Laplace Correction, and calculate the Accuracy metric once more (a possible solution sketch is given below).
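A possible solution sketch (one of several valid approaches; the seed is arbitrary):
library(e1071)
set.seed(1234) # makes the random 70% / 30% split reproducible
sp <- sample(1:nrow(iris), round(0.7 * nrow(iris)))
tr <- iris[sp, ]; ts <- iris[-sp, ]
nb <- naiveBayes(Species ~ ., tr) # plain Bayesian classifier
acc <- sum(predict(nb, ts) == ts$Species) / nrow(ts); acc
nbL <- naiveBayes(Species ~ ., tr, laplace = 1) # with Laplace correction
accL <- sum(predict(nbL, ts) == ts$Species) / nrow(ts); accL
Note that the laplace argument of naiveBayes() only affects categorical predictors; the iris attributes are all numeric, so the two models above give the same predictions for the same split (the two runs shown earlier differ because a new random sample was drawn for the second one).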