Random Forests

The combination of bagging and tree-based methods is a good one, as we saw in the last section. However, Breiman and Cutler saw that more improvement could be obtained by injecting extra variability into the procedure, and they proposed a number of modifications leading to the technique called Random Forests (Breiman 2001). Again, bootstrapping is used to generate data sets that are used to train an ensemble of trees. One key element is that the trees are constrained to be very simple: only a few nodes are allowed, and no pruning is applied. Moreover, at every split, only a subset of all variables is considered for use. Both adaptations force diversity into the ensemble, which is the key to why improvements can be obtained with aggregating.
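The two sources of randomness described above can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: the built-in iris data are a stand-in for the wine data, and taking floor(sqrt(p)) candidate variables per split is assumed here because it is the documented default of randomForest for classification.

```r
# Illustration only: the two random ingredients of a random forest.
# iris is a stand-in data set (an assumption, not the wine data of the text).
set.seed(1)
X <- iris[, 1:4]
n <- nrow(X)
p <- ncol(X)

# 1. Bootstrap sample: draw n rows with replacement.
in.bag <- sample(n, n, replace = TRUE)
out.of.bag <- setdiff(seq_len(n), in.bag)

# 2. Candidate variables for one split: a random subset of size mtry.
mtry <- floor(sqrt(p))   # default for classification in randomForest
candidates <- sample(names(X), mtry)

length(out.of.bag) / n   # fraction of rows this tree never sees
candidates               # variables this split may choose from
```

Every tree gets its own bootstrap sample and every split its own candidate set, which is what makes the trees differ from one another.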

It can be shown (Breiman 2001) that an upper bound for the generalization error is given by $\hat{E} \leq \bar{\rho}\,(1 - q^2)/q^2$, where $\bar{\rho}$ is the average correlation between predictions of individual trees, and $q$ is a measure of prediction quality. This means that the optimal gain is obtained when many good yet diverse classifiers are combined, something that is intuitively logical: there is not much point in averaging the outcomes of identical models, and combining truly bad models is unlikely to lead to good results either.
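To get a feeling for the bound, it can be evaluated for a few combinations of correlation and quality. The function name rf.bound and the numerical values below are made up for illustration and do not come from any data set.

```r
# Upper bound on the generalization error, as quoted in the text:
#   E <= rho * (1 - q^2) / q^2
# rho: average correlation between trees; q: measure of prediction quality.
# rf.bound is a hypothetical helper name used only in this sketch.
rf.bound <- function(rho, q) rho * (1 - q^2) / q^2

rf.bound(rho = 0.2, q = 0.8)   # good and diverse trees: 0.1125
rf.bound(rho = 0.8, q = 0.8)   # same quality, highly correlated trees: 0.45
rf.bound(rho = 0.2, q = 0.4)   # diverse but poor trees: 1.05
```

The bound tightens when the trees are less correlated or individually better. In the last case it exceeds one and therefore says nothing useful.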

The R package randomForest provides a convenient interface to the original Fortran code of Breiman and Cutler. The basic function is randomForest, which either takes a formula or the usual combination of a data matrix and an outcome vector:

    > wines.df <- data.frame(vint = vintages, wines)
    > (wines.rf <- randomForest(vint ~ ., subset = wines.odd,
    +                           data = wines.df))

    Call:
     randomForest(formula = vint ~ ., data = wines.df, subset = wines.odd)
                   Type of random forest: classification
                         Number of trees: 500
    No. of variables tried at each split: 3

            OOB estimate of  error rate: 4.49%
    Confusion matrix:
               Barbera Barolo Grignolino class.error
    Barbera         24      0          0    0.000000
    Barolo           0     28          1    0.034483
    Grignolino       2      1         33    0.083333
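The two calling conventions mentioned above can be put side by side. The sketch below assumes the randomForest package is installed and uses the built-in iris data as a stand-in for the wine data.

```r
library(randomForest)

# iris is a stand-in for the wine data (an assumption for this sketch).
set.seed(7)

# Formula interface
rf.formula <- randomForest(Species ~ ., data = iris)

# Matrix/vector interface: predictors in x, class labels (a factor) in y
rf.xy <- randomForest(x = iris[, 1:4], y = iris$Species)

rf.formula$type   # "classification", because the outcome is a factor
rf.xy$ntree       # 500 trees by default
rf.xy$mtry        # 2 variables tried at each split: floor(sqrt(4))
```

Both calls fit the same kind of model. The formula interface is convenient with data frames, the x/y form when the predictors are already available as a matrix.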

The print method shows the result of the fit in terms of the error rate of the out-of-bag samples, in this case less than 5%. Because the algorithm fits trees to many different bootstrap samples, this error estimate comes for free. Prediction is done in the usual way:

    > wines.rf.predict <-
    +   predict(wines.rf, newdata = wines.df[wines.even, ])
    > table(wines.rf.predict, vintages[wines.even])

    wines.rf.predict Barbera Barolo Grignolino
          Barbera         24      0          0
          Barolo           0     29          0
          Grignolino       0      0         35

So prediction for the even rows in the data set is perfect here. Note that repeated training may lead to small differences because of the randomness involved in selecting bootstrap samples and variables in the training process. In many other applications, too, random forests have shown very good predictive abilities (see, e.g., Svetnik et al. 2003 for an application in chemical modelling).
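The run-to-run variation mentioned above, and a way to make a fit reproducible, can be illustrated as follows. The sketch assumes the randomForest package is installed, uses iris as a stand-in data set, and introduces the hypothetical helper fit.oob.error.

```r
library(randomForest)

# iris is a stand-in data set (an assumption for this sketch).
# fit.oob.error is a hypothetical helper: fit a forest with a given seed
# and return its out-of-bag error estimate.
fit.oob.error <- function(seed) {
  set.seed(seed)
  rf <- randomForest(Species ~ ., data = iris)
  # err.rate has one row per tree; the "OOB" column of the last row is
  # the out-of-bag error of the complete forest.
  rf$err.rate[rf$ntree, "OOB"]
}

fit.oob.error(1)                                # an OOB error estimate
fit.oob.error(2)                                # may differ slightly
identical(fit.oob.error(1), fit.oob.error(1))   # same seed, same result
```

Fixing the seed before each call to randomForest is enough to obtain the same forest again, which is useful when results have to be reported.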

So it seems the most important disadvantage of tree-based methods, the generally low quality of the predictions, has been countered sufficiently. Does this come at a price? At first sight, yes. Not only does a random forest add complexity to the original algorithm in the form of tuning parameters, the interpretability suffers as well. Indeed, an ensemble of trees would seem more difficult to interpret than one simple sequence of yes/no questions. Yet in reality things are not so simple. The interpretability, one of the big advantages of trees, becomes less of an issue when one realizes that a slight change in the data may lead to a completely different tree, and therefore a completely different interpretation. Such a small change may, e.g., be formed by the difference between successive cross-validation or bootstrap iterations; thus, the resulting error estimate may be formed by predictions from trees using different variables in completely different ways.

The technique of random forests addresses these issues in the following ways. A measure of the importance of a particular variable is obtained by comparing the out-of-bag errors for the trees in the ensemble with the out-of-bag errors when the values for that variable are permuted randomly. Differences are averaged over all trees, and divided by the standard error. If one variable shows a big difference, this means that the variable, in general, is important for the classification: the scrambled values lead to models with decreased predictivity. This approach can be used for both classification (using, e.g., classification error rate as a measure) and regression (using a value like MSE). An alternative is to consider the total increase in node purity. 
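The permutation idea can be sketched by hand. Note the simplifications, which are assumptions of this sketch and not what the package does internally: iris replaces the wine data, a held-out test set replaces the per-tree out-of-bag samples, and the differences are averaged over repeated permutations without dividing by a standard error.

```r
library(randomForest)

# Simplified sketch: iris as stand-in data, and a held-out test set instead
# of the per-tree out-of-bag samples used by the actual algorithm.
set.seed(3)
train <- sample(nrow(iris), 100)
test.set <- iris[-train, ]
rf <- randomForest(Species ~ ., data = iris[train, ])

# Classification error rate of a model on a data set
error.rate <- function(model, data) {
  mean(predict(model, newdata = data) != data$Species)
}
baseline <- error.rate(rf, test.set)

# Permute one variable at a time and record the mean increase in error.
vars <- setdiff(names(iris), "Species")
perm.importance <- sapply(vars, function(v) {
  increases <- replicate(20, {
    scrambled <- test.set
    scrambled[[v]] <- sample(scrambled[[v]])
    error.rate(rf, scrambled) - baseline
  })
  mean(increases)
})

sort(perm.importance, decreasing = TRUE)
```

Variables whose scrambling raises the error most are the ones the forest relies on. Values near zero indicate variables the model can do without.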

In package randomForest this is implemented in the following way. When setting the parameter importance = TRUE in the call to randomForest, the importances of all variables are calculated during the fit. These are available through the extractor function importance, and for visualization using the function varImpPlot:

    > wines.rf <- randomForest(vint ~ ., data = wines.df,
    +                          importance = TRUE)
    > varImpPlot(wines.rf)
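The numbers behind such a plot can also be inspected directly with the extractor function. The sketch below assumes the randomForest package is installed and again uses iris as a stand-in for the wine data.

```r
library(randomForest)

# iris is a stand-in for the wine data (an assumption for this sketch).
set.seed(5)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

imp.acc  <- importance(rf, type = 1)   # mean decrease in accuracy
imp.gini <- importance(rf, type = 2)   # mean decrease in node impurity (Gini)

# Rank the variables under both measures
rownames(imp.acc)[order(imp.acc[, 1], decreasing = TRUE)]
rownames(imp.gini)[order(imp.gini[, 1], decreasing = TRUE)]
```

Comparing the two rankings is a quick check on whether both importance measures tell the same story.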

 The result is shown in Fig. 9.8. The left plot shows the importance measured using the mean decrease in accuracy; the right plot using the mean decrease in node impurity, as measured by the Gini index. Although there are small differences, the overall picture is the same using both indices. 

The second disadvantage, the large number of parameters to set in using tree-based models, is implicitly taken care of in the definition of the algorithm: by requiring all trees in the forest to be small and simple, no elaborate pruning schemes are necessary, and the degrees of freedom of the fitting algorithm have been cut back drastically.
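The few settings that remain can be explored using the out-of-bag error as a guide. The sketch below assumes the randomForest package is installed, uses iris as a stand-in data set, and introduces the hypothetical helper oob.error. The arguments mtry and maxnodes are passed on to randomForest.

```r
library(randomForest)

# iris is a stand-in data set (an assumption for this sketch).
# oob.error is a hypothetical helper: fit a forest with the given settings
# and return the out-of-bag error of the complete forest.
oob.error <- function(...) {
  set.seed(11)
  rf <- randomForest(Species ~ ., data = iris, ...)
  rf$err.rate[rf$ntree, "OOB"]
}

# OOB error for each possible number of variables tried per split
mtry.errors <- sapply(1:4, function(m) oob.error(mtry = m))
mtry.errors

# Restricting tree size explicitly: at most 4 terminal nodes per tree
small.trees.error <- oob.error(maxnodes = 4)
small.trees.error
```

Because the out-of-bag estimate comes with every fit, such comparisons need no separate cross-validation loop.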
