Hands-on Tree-Based Models

1. Build a regression tree for the data set

ds <- read.csv("~/Desktop/Academia/Teaching/2020 - 2021/2º Semestre/DF_MSI/Classes/Class 5 - Predictive Modelling 1/wine.data")
head(ds)
##   Class Alcohol Malic_acid  Ash Ash_Alcalinity Magnesium Total_Phenols
## 1     1   14.23       1.71 2.43           15.6       127          2.80
## 2     1   13.20       1.78 2.14           11.2       100          2.65
## 3     1   13.16       2.36 2.67           18.6       101          2.80
## 4     1   14.37       1.95 2.50           16.8       113          3.85
## 5     1   13.24       2.59 2.87           21.0       118          2.80
## 6     1   14.20       1.76 2.45           15.2       112          3.27
##   Flavanoids Nonfla_Phenols Proanthocyanins ColorInt  Hue   OD Proline
## 1       3.06           0.28            2.29     5.64 1.04 3.92    1065
## 2       2.76           0.26            1.28     4.38 1.05 3.40    1050
## 3       3.24           0.30            2.81     5.68 1.03 3.17    1185
## 4       3.49           0.24            2.18     7.80 0.86 3.45    1480
## 5       2.69           0.39            1.82     4.32 1.04 2.93     735
## 6       3.39           0.34            1.97     6.75 1.05 2.85    1450
any(is.na(ds))
## [1] FALSE
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
m <- rpartXse(Proline ~ ., ds, model=TRUE) # grow a regression tree and post-prune it using the X-SE rule

2. Obtain a graph of the obtained regression tree

library(rpart.plot)
## Loading required package: rpart
prp(m,type=3,extra=101) # type=3: separate left/right split labels; extra=101: show node counts and percentages

3. Apply the tree to the data used to obtain the model and calculate the mean squared error of the predictions

p <- predict(m, ds) # predictions on the same data used to fit the tree (resubstitution)
mse <- mean((ds$Proline - p)^2); mse
## [1] 31167.56
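To put this score in perspective (this comparison is not part of the original exercise), a useful sketch is to compute the MSE of a trivial baseline that always predicts the mean of `Proline`; a tree that is learning anything should beat it. This reuses the `ds` data frame loaded above.

```r
# Baseline: always predict the mean of the target variable.
p.base <- mean(ds$Proline)
mse.base <- mean((ds$Proline - p.base)^2) # equals the (population) variance of Proline
mse.base
```

The tree's resubstitution MSE should be well below this baseline value.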

4. Split the data set in two parts: 70% of the cases (training data) and the remaining 30% (test data). Use the larger part to obtain a regression tree and apply it to the other part. Calculate the mean squared error again and compare it with the previous score.

ind.train <- sample(1:nrow(ds),0.7*nrow(ds),replace = FALSE) # random split; results vary between runs unless a seed is set
train <- ds[ind.train,]
test <- ds[-ind.train,]

m <- rpartXse(Proline ~ ., train, model=TRUE)
p <- predict(m, test)
mse <- mean((test$Proline - p)^2); mse
## [1] 31818.95
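The test-set MSE above is slightly higher than the resubstitution MSE from step 3, as expected: evaluating on held-out data removes the optimistic bias of testing on the training set. Because `sample()` draws a different split each run, the exact value will change every time. A sketch of a reproducible version, assuming the DMwR package is installed as above and reusing the `ds` data frame:

```r
library(DMwR) # provides rpartXse()

set.seed(1234) # fix the RNG so the split -- and hence the MSE -- is repeatable
ind.train <- sample(1:nrow(ds), 0.7 * nrow(ds), replace = FALSE)
train <- ds[ind.train, ]
test  <- ds[-ind.train, ]

m.tr <- rpartXse(Proline ~ ., train, model = TRUE)
mse.train <- mean((train$Proline - predict(m.tr, train))^2) # resubstitution error
mse.test  <- mean((test$Proline  - predict(m.tr, test))^2)  # held-out error
c(train = mse.train, test = mse.test)
```

Typically the held-out MSE is the larger of the two; reporting both makes the gap between optimistic and honest error estimates explicit.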