1. How would you obtain a random forest to forecast the value of alga a4?
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(DMwR)
## Loading required package: lattice
## Loading required package: grid
data(algae)
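# keep the 11 predictors (columns 1 to 11) and the target a4 (column 15)
algae <- algae[,c(1:11,15)]
# drop incomplete rows: randomForest() fails on NAs by default (na.action = na.fail)
algae <- algae[complete.cases(algae),]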
rf.a4 <- randomForest(a4 ~ ., algae)
2. Repeat the previous exercise, but now using a boosting model.
library(gbm)
## Loaded gbm 2.1.5
boost.a4 <- gbm(a4 ~ ., data = algae, n.trees = 100)
## Distribution not specified, assuming gaussian ...
3. Obtain the predictions of the two previous models for the data used to obtain them (train set). Draw a scatterplot comparing these predictions.
preds.rf <- predict(rf.a4, algae)
preds.boost <- predict(boost.a4, algae, n.trees = 100)
plot(preds.rf, preds.boost, xlab="Random Forest Predictions", ylab="Boosting Predictions")
abline(0,1, col="red")
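The scatterplot can be backed by a couple of numbers. The sketch below uses base R only and defines a hypothetical helper, compare.preds() (not part of DMwR, randomForest or gbm), that reports the correlation and the mean absolute difference between two prediction vectors. The toy vectors are made up and merely stand in for preds.rf and preds.boost.

```r
# Hypothetical helper (not from any package): summarise how closely
# two vectors of predictions agree.
compare.preds <- function(p1, p2) {
  stopifnot(length(p1) == length(p2))
  ok <- !is.na(p1) & !is.na(p2)   # ignore cases where either prediction is NA
  c(cor = cor(p1[ok], p2[ok]),
    mean.abs.diff = mean(abs(p1[ok] - p2[ok])))
}

# Toy vectors, made up for illustration only
a <- c(1, 2, 3, 4)
b <- c(1.5, 2.5, 3.5, 4.5)
res <- compare.preds(a, b)
print(res)
```

With the objects from this exercise, the call would be compare.preds(preds.rf, preds.boost). A correlation close to 1 together with a small mean absolute difference would match points lying close to the red line.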

4. Load the data set testAlgae. It contains a data frame named test.algae with 140 extra water samples for which we want predictions. Use the previous two models to obtain predictions for a4 on these new samples. Check what happened to the test cases with NAs. Fill in the NAs on the test set (remember the function knnImputation() from the DMwR package) and repeat the experiment.
data(testAlgae)
preds.rf <- predict(rf.a4, test.algae)
preds.boost <- predict(boost.a4, test.algae, n.trees=100)
any(is.na(preds.rf)); any(is.na(preds.boost))
## [1] TRUE
## [1] FALSE
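To check what happened to the test cases with NAs, it may help to compare the rows that have missing predictor values with the rows that received an NA prediction. A minimal base R sketch on a made-up data frame follows; the names toy and toy.preds are illustrative only and stand in for test.algae and preds.rf.

```r
# Made-up data frame standing in for test.algae; row 2 has a missing value
toy <- data.frame(x1 = c(1.0, NA, 3.0), x2 = c(10, 20, 30))

# Rows with at least one missing predictor
incomplete <- which(!complete.cases(toy))

# Made-up predictions standing in for preds.rf: NA where the row was incomplete
toy.preds <- c(0.7, NA, 1.2)

# TRUE when the NA predictions fall exactly on the incomplete rows
same.rows <- setequal(which(is.na(toy.preds)), incomplete)
print(incomplete)
print(same.rows)
```

On the exercise objects, and before the imputation step, the analogous check would be setequal(which(is.na(preds.rf)), which(!complete.cases(test.algae))). If it returns TRUE, the random forest returned NA exactly for the incomplete test cases, while the boosting model still produced a prediction for them.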
test.algae <- knnImputation(test.algae)
preds.rf <- predict(rf.a4, test.algae)
preds.boost <- predict(boost.a4, test.algae, n.trees=100)
any(is.na(preds.rf)); any(is.na(preds.boost))
## [1] FALSE
## [1] FALSE