Hands-on: Tree-Based Models
1. Build a regression tree for the data set
ds <- read.csv("~/Desktop/Academia/Teaching/2020 - 2021/2º Semestre/DF_MSI/Classes/Class 5 - Predictive Modelling 1/wine.data")
head(ds)
## Class Alcohol Malic_acid Ash Ash_Alcalinity Magnesium Total_Phenols
## 1 1 14.23 1.71 2.43 15.6 127 2.80
## 2 1 13.20 1.78 2.14 11.2 100 2.65
## 3 1 13.16 2.36 2.67 18.6 101 2.80
## 4 1 14.37 1.95 2.50 16.8 113 3.85
## 5 1 13.24 2.59 2.87 21.0 118 2.80
## 6 1 14.20 1.76 2.45 15.2 112 3.27
## Flavanoids Nonfla_Phenols Proanthocyanins ColorInt Hue OD Proline
## 1 3.06 0.28 2.29 5.64 1.04 3.92 1065
## 2 2.76 0.26 1.28 4.38 1.05 3.40 1050
## 3 3.24 0.30 2.81 5.68 1.03 3.17 1185
## 4 3.49 0.24 2.18 7.80 0.86 3.45 1480
## 5 2.69 0.39 1.82 4.32 1.04 2.93 735
## 6 3.39 0.34 1.97 6.75 1.05 2.85 1450
any(is.na(ds))
## [1] FALSE
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
m <- rpartXse(Proline ~ ., ds, model = TRUE)  # grow a regression tree for Proline and post-prune it
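rpartXse() is DMwR's convenience wrapper around rpart: it grows an overly large tree and then post-prunes it with the X-SE rule (1 SE by default), i.e. it keeps the simplest subtree whose cross-validated error is within se standard errors of the minimum. A minimal sketch of the same idea using rpart directly (rpartXse also lowers minsplit to 6 by default, so the resulting trees may differ slightly):
library(rpart)
# Grow a deliberately large tree (cp = 0), then prune with the 1-SE rule
full <- rpart(Proline ~ ., ds, cp = 0, model = TRUE)
cptab <- full$cptable
best <- which.min(cptab[, "xerror"])                 # lowest cross-validated error
thr  <- cptab[best, "xerror"] + cptab[best, "xstd"]  # 1-SE threshold
row1 <- which(cptab[, "xerror"] <= thr)[1]           # simplest tree under the threshold
m.se <- prune(full, cp = cptab[row1, "CP"])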
2. Obtain a graph of the regression tree
library(rpart.plot)
## Loading required package: rpart
prp(m, type = 3, extra = 101)  # extra = 101 labels each node with the number and percentage of observations

3. Apply the tree to the data used to obtain the model and calculate the mean squared error of the predictions
p <- predict(m, ds)                   # predictions on the training data itself
mse <- mean((ds$Proline - p)^2); mse  # resubstitution MSE
## [1] 31167.56
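Note that this is a resubstitution estimate: the tree is scored on the very cases it was grown from, so it tends to be optimistic. As a quick sanity check (not asked for in the exercise), the score can be compared against a trivial baseline that always predicts the mean of Proline:
baseline <- mean((ds$Proline - mean(ds$Proline))^2)  # MSE of always predicting the mean
baseline
sqrt(mse)  # RMSE: the tree's typical error on the original Proline scale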
4. Split the data set into two parts: 70% of the cases (training data) and the remaining 30% (test data). Use the larger part to obtain a regression tree and apply it to the other part. Calculate the mean squared error again, and compare with the previous score.
ind.train <- sample(1:nrow(ds), round(0.7*nrow(ds)), replace = FALSE)  # indices of the 70% training sample
train <- ds[ind.train,]
test <- ds[-ind.train,]
m <- rpartXse(Proline ~ ., train, model = TRUE)  # fit on the training part only
p <- predict(m, test)
mse <- mean((test$Proline - p)^2); mse           # holdout (test-set) MSE
## [1] 31818.95
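As expected, the holdout MSE (31818.95) is somewhat higher than the resubstitution score (31167.56), since the tree is now evaluated on cases it never saw. Keep in mind that a single random split is itself a noisy estimate, and because sample() was not seeded, these numbers change between runs. A sketch that repeats the 70/30 split a few times and averages the MSE gives a more stable picture (the seed and the number of repetitions are arbitrary choices):
set.seed(42)   # arbitrary seed, for reproducibility
reps <- 10
mses <- numeric(reps)
for (i in seq_len(reps)) {
  idx <- sample(1:nrow(ds), round(0.7*nrow(ds)))
  tr <- ds[idx, ]
  ts <- ds[-idx, ]
  mi <- rpartXse(Proline ~ ., tr, model = TRUE)
  mses[i] <- mean((ts$Proline - predict(mi, ts))^2)
}
mean(mses); sd(mses)  # average holdout MSE and its spread across splits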