Decision Trees
A ‘random forest’ is a supervised classification and regression method that aggregates the ‘votes’ of many decision trees. Random forests provide higher-accuracy predictions and reliable feature-importance rankings, but at the cost of lower interpretability.
For this example, we will use a random forest for regression: we want to estimate an actual value rather than assign a classification group. The ‘leaf nodes’ of each tree hold the predicted values.
“…In a regression tree, since the target variable is a real valued number, we fit a regression model to the target variable using each of the independent variables. Then for each independent variable, the data is split at several split points. We calculate Sum of Squared Error(SSE) at each split point between the predicted value and the actual values. The variable resulting in minimum SSE is selected for the node. Then this process is recursively continued till the entire data is covered.”
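To make the SSE criterion concrete, here is a minimal sketch (on a hypothetical toy data frame with columns x and y, not part of the FIFA data used below) that evaluates candidate split points for a single predictor and reports the one with the lowest total SSE.
#toy illustration of choosing a split point by minimum SSE (hypothetical data)
set.seed(1)
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 2 * toy$x + rnorm(100)
#total SSE if we split at 'cut' and predict the mean of y on each side
sse_at_split <- function(cut, x, y) {
  sum((y[x <= cut] - mean(y[x <= cut]))^2) + sum((y[x > cut] - mean(y[x > cut]))^2)
}
#evaluate a grid of interior candidate split points and pick the best one
cuts <- seq(min(toy$x), max(toy$x), length.out = 50)[-c(1, 50)]
sse <- sapply(cuts, sse_at_split, x = toy$x, y = toy$y)
cuts[which.min(sse)]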
Cross-validation is not strictly required for random forests.
“In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree. Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests.”
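In the randomForest package this shows up directly on the fitted object, which stores the out-of-bag predictions and the running OOB error. A minimal sketch, assuming a fitted regression forest called rf like the one built later in this document:
#OOB predictions: each case is predicted only by trees whose bootstrap sample excluded it
head(rf$predicted)
#OOB mean squared error after each tree; the final value is the OOB error estimate
tail(rf$mse, 1)
#OOB proportion of variance explained (reported as '% Var explained' by the print method)
tail(rf$rsq, 1)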
#general tools
library(tidyverse) # data organization and plotting
library(knitr) # nice tables and for rmarkdown
#library(reticulate) # incorporate python (not used here)
#random forest analysis
library(randomForest) # 'random forest' tools
library(caTools) # splitting data into training/test sets
#sentiment analysis
library(SnowballC) # text analysis
library(tm) # text mining
library(twitteR) # Twitter API access
library(syuzhet) # sentiment analysis
#load data
data <- read.csv("fifa19.csv",header=TRUE)
#data dimensions (rows, columns)
dim(data)
## [1] 18207 89
#get classes for each column
#sapply(data, class)
#column names
names(data)
## [1] "X" "ID"
## [3] "Name" "Age"
## [5] "Photo" "Nationality"
## [7] "Flag" "Overall"
## [9] "Potential" "Club"
## [11] "Club.Logo" "Value"
## [13] "Wage" "Special"
## [15] "Preferred.Foot" "International.Reputation"
## [17] "Weak.Foot" "Skill.Moves"
## [19] "Work.Rate" "Body.Type"
## [21] "Real.Face" "Position"
## [23] "Jersey.Number" "Joined"
## [25] "Loaned.From" "Contract.Valid.Until"
## [27] "Height" "Weight"
## [29] "LS" "ST"
## [31] "RS" "LW"
## [33] "LF" "CF"
## [35] "RF" "RW"
## [37] "LAM" "CAM"
## [39] "RAM" "LM"
## [41] "LCM" "CM"
## [43] "RCM" "RM"
## [45] "LWB" "LDM"
## [47] "CDM" "RDM"
## [49] "RWB" "LB"
## [51] "LCB" "CB"
## [53] "RCB" "RB"
## [55] "Crossing" "Finishing"
## [57] "HeadingAccuracy" "ShortPassing"
## [59] "Volleys" "Dribbling"
## [61] "Curve" "FKAccuracy"
## [63] "LongPassing" "BallControl"
## [65] "Acceleration" "SprintSpeed"
## [67] "Agility" "Reactions"
## [69] "Balance" "ShotPower"
## [71] "Jumping" "Stamina"
## [73] "Strength" "LongShots"
## [75] "Aggression" "Interceptions"
## [77] "Positioning" "Vision"
## [79] "Penalties" "Composure"
## [81] "Marking" "StandingTackle"
## [83] "SlidingTackle" "GKDiving"
## [85] "GKHandling" "GKKicking"
## [87] "GKPositioning" "GKReflexes"
## [89] "Release.Clause"
#subset data to variables of interest
data <- select(data,Name,Club,Age,Overall,Potential,Height,Weight,Crossing,Dribbling,Agility,Strength,Vision)
#check data
head(data)
#correct weight (factor to integer)
#remove "lbs" from weight
data$Weight <- gsub("lbs", "", as.character(data$Weight), fixed=TRUE)
#convert character to integer
data$Weight <- as.integer(as.character(data$Weight))
#convert height (feet/inches to cm)
data <- data %>% separate(Height, c('feet', 'inches'), "'", convert = TRUE) %>% mutate(cm = (12*feet + inches)*2.54)
#drop rows with missing height data
data <- data %>% drop_na(cm)
#drop unnecessary columns
#we have a better height column so these are redundant and might skew analysis
data <- data %>% select (-c(Name, Club, feet, inches))
#create new binomial variable to classify players
#data$good = ifelse(data$Overall >= 90, 1, 0)
#convert numeric variable to factor
#data$good <- as.factor(data$good)
#check data
head(data)
We can preview our data using density plots to investigate the distribution of a variable. Is Age normally distributed? This variable seems useful for predicting Overall score, but should it be considered at all?
plot(density(data$Age))
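If the shape of the distribution matters for your purposes, a normal quantile-quantile plot is a quick check; a small sketch using base graphics:
#compare the Age distribution to a theoretical normal distribution
qqnorm(data$Age, main = "Normal Q-Q Plot of Age")
qqline(data$Age, col = "red")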
Our goal with this random forest analysis is to predict Overall player score using other variables in the data set. So, are we able to do this? How effective is our prediction?
#set seed
set.seed(12345)
#select variable for splitting
splitVariable<-data$Overall
#split data using the Overall variable
sample = sample.split(splitVariable, SplitRatio = .75)
#create training and test sets
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
#get dimensions of training and test sets
dim(train)
## [1] 13620 10
dim(test)
## [1] 4539 10
#create random forest object
#Number of variables randomly sampled as candidates at each split.
#Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)
rf=randomForest(Overall~.,data=train,importance=TRUE, mtry=6, ntree=500)
#display random forest object
rf
##
## Call:
## randomForest(formula = Overall ~ ., data = train, importance = TRUE, mtry = 6, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 6
##
## Mean of squared residuals: 1.92202
## % Var explained: 95.98
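The value mtry=6 above was set manually; the randomForest package also provides tuneRF(), which searches over mtry values using the OOB error. A minimal sketch (the step and improvement thresholds here are illustrative, not tuned):
#search for an mtry value with low OOB error (x = predictors only, y = response)
tuned <- tuneRF(x = subset(train, select = -Overall), y = train$Overall,
                mtryStart = 3, ntreeTry = 200, stepFactor = 1.5,
                improve = 0.01, trace = TRUE, plot = FALSE)
tuned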
#predict test data using the random forest object and all test data except for the Overall scores
pred = predict(rf,newdata=test[-2])
#combine (P)redicted values and (T)est overall values in a data frame
pt<-data.frame(pred,test$Overall)
#rename dataframe containing (P)redicted and (T)est values
names(pt)<-c("pred","test")
A regression of test values on predicted values indicates a strong, significant relationship, which is also evident in the scatter plot of the two sets of values.
#regression model
lm<-lm(pt$test ~ pt$pred)
#print regression summary
summary(lm)
##
## Call:
## lm(formula = pt$test ~ pt$pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1474 -0.5537 0.0023 0.5592 10.2817
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.244435 0.203446 -6.117 1.04e-09 ***
## pt$pred 1.019067 0.003056 333.425 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.368 on 4537 degrees of freedom
## Multiple R-squared: 0.9608, Adjusted R-squared: 0.9608
## F-statistic: 1.112e+05 on 1 and 4537 DF, p-value: < 2.2e-16
#MSE calculation
mse<-mean((pt$pred-pt$test)^2)
#print statement using MSE values
cat("So the test set MSE is", mse, "with its square root around", sqrt(mse),"meaning that this model gives predictions that are within around", round(sqrt(mse),digits=1),"of the true Overall score.")
## So the test set MSE is 1.888257 with its square root around 1.374139 meaning that this model gives predictions that are within around 1.4 of the true Overall score.
#adjust margins to fit
marginValue<-5
par(mar=c(marginValue,marginValue,marginValue,marginValue))
#simple plot of predicted values (x) vs test values (y)
plot(pt$pred, pt$test, xlab="Overall Score (Predicted)", ylab="Overall Score (Test)")
#plot 'perfect fit' line
abline(0,1, col="blue")
#plot the fitted regression line
abline(lm, col="red")
#correlation test
cor(pt$pred,pt$test)
## [1] 0.9801988
The variables ‘Potential’ and ‘Age’ are the most important variables in predicting ‘Overall’ score across all trees in the random forest. varImpPlot() produces a “dotchart of variable importance as measured by a Random Forest.” “For regression, (node impurity) is measured by residual sum of squares.” Higher values of %IncMSE and IncNodePurity indicate greater importance.
More on the interpretation of these output measures follows the importance table below.
#print importance ranking
importance(rf)
## %IncMSE IncNodePurity
## Age 358.89193 219390.344
## Potential 223.89555 265281.824
## Weight 26.87930 4171.452
## Crossing 39.37886 15593.835
## Dribbling 29.98280 85254.099
## Agility 35.19627 3968.709
## Strength 43.26772 20138.127
## Vision 34.13022 33083.628
## cm 24.49219 2453.574
#dotchart of importance
varImpPlot(rf)
Important features are those that are more closely related to the dependent variable and that contribute more to its variation.
“The importance measures show how much MSE or Impurity increase when that variable is randomly permuted. If you randomly permute a variable that does not gain you anything in prediction, then predictions won’t change much and you will only see small changes in impurity and mse. On the other hand the important variables will change the predictions by quite a bit if randomly permuted, so you will see bigger changes. Turn this around and you see big changes indicate important variables.”
“%IncMSE is the most robust and informative measure. It is the increase in mse of predictions(estimated with out-of-bag-CV) as a result of variable j being permuted(values randomly shuffled).”
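To see what ‘permuted’ means in practice, here is a rough by-hand sketch of the %IncMSE idea for a single variable, using the test set (the package itself uses the out-of-bag cases, so the numbers will not match importance(rf) exactly):
#baseline test-set MSE from the fitted forest
baseMSE <- mean((predict(rf, newdata = test) - test$Overall)^2)
#shuffle one predictor (here Potential) and predict again
set.seed(12345)
shuffled <- test
shuffled$Potential <- sample(shuffled$Potential)
permMSE <- mean((predict(rf, newdata = shuffled) - shuffled$Overall)^2)
#a large increase in MSE suggests the forest relies heavily on that variable
permMSE - baseMSE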
If you like, you can print and visualize one of the trees in the ‘forest’. This could perhaps be useful in understanding the variable used for each branching event in a tree.
#print a tree structure
getTree(rf, 1, labelVar=TRUE)
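Another quick way to get a feel for the forest’s structure is treesize(), which returns the number of terminal nodes in each tree; a small sketch:
#distribution of tree sizes (terminal nodes) across all 500 trees
hist(treesize(rf), main = "Terminal nodes per tree", xlab = "Number of terminal nodes")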
Let’s try this again using only a few variables.
#keep only the response and three predictors
data <- data %>% select(c(Overall,Crossing,Dribbling,Agility))
#set seed
set.seed(12345)
#select variable for splitting
splitVariable<-data$Overall
#split data using the Overall variable
sample = sample.split(splitVariable, SplitRatio = .75)
#create training and test sets
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
#create random forest object
#mtry=6 exceeds the 3 available predictors, so randomForest resets it (see the warning below)
rf=randomForest(Overall~.,data=train,importance=TRUE, mtry=6, ntree=500)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within
## valid range
#predict test data using the random forest object and all test data except for the Overall scores
pred = predict(rf,newdata=test[-1])
#combine (P)redicted values and (T)est overall values in a data frame
pt<-data.frame(pred,test$Overall)
#rename dataframe containing (P)redicted and (T)est values
names(pt)<-c("pred","test")
#regression model
lm<-lm(pt$test ~ pt$pred)
#print regression summary
summary(lm)
##
## Call:
## lm(formula = pt$test ~ pt$pred)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4232 -3.3119 -0.1601 3.0662 25.8714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.54566 1.04169 10.12 <2e-16 ***
## pt$pred 0.84150 0.01569 53.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.406 on 4537 degrees of freedom
## Multiple R-squared: 0.388, Adjusted R-squared: 0.3879
## F-statistic: 2877 on 1 and 4537 DF, p-value: < 2.2e-16
#adjust margins to fit
marginValue<-5
par(mar=c(marginValue,marginValue,marginValue,marginValue))
#simple plot of predicted values (x) vs test values (y)
plot(pt$pred, pt$test, xlab="Overall Score (Predicted)", ylab="Overall Score (Test)")
#plot 'perfect fit' line
abline(0,1, col="blue")
#plot the fitted regression line
abline(lm, col="red")
#print importance ranking
importance(rf)
## %IncMSE IncNodePurity
## Crossing 100.60081 141439.4
## Dribbling 165.19141 305109.3
## Agility 94.47846 130372.4
Sentiment Analysis
The Twitter credentials (consumer_key, consumer_secret, access_token, access_secret) are hidden here. You’ll need a Twitter Developer Account to generate your own.
# load your keys
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
#define the account of interest and number of tweets to obtain
tweets <- userTimeline("MIT", n=50) #up to 3200
#store tweets in data frame
df <- twListToDF(tweets)
#clean up the text and store separately
dfClean <- gsub("http.*","", df$text)
dfClean <- gsub("https.*","",dfClean)
dfClean <- gsub("#.*","", dfClean)
dfClean <- gsub("@.*","", dfClean)
#check output
head(dfClean)
## [1] "Graduate student Elizabeth Wallace and her colleagues examined sediment found in blue holes in the Caribbean for cl… "
## [2] "SMART discovers nondisruptive way to characterize the surface of nanoparticles "
## [3] "Max NG, a Nigerian startup co-founded by two MIT alumni, provides package delivery services and on-demand transport… "
## [4] "New process could make hydrogen peroxide available in remote places: MIT-developed method may lead to portable devi… "
## [5] "Associate Professor Sangbae Kim’s research group recently demonstrated the agility and physical prowess of the robo… "
## [6] "New tools could improve the way cement seals oil wells "
This analysis uses the ‘syuzhet’ library.
“The package comes with four sentiment dictionaries and provides a method for accessing the robust, but computationally expensive, sentiment extraction tool developed in the NLP group at Stanford.” - https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html
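get_sentiment() accepts a method argument for choosing among these dictionaries; a quick sketch comparing the cleaned tweets under several lexicons (the methods use different scales, so only the signs are directly comparable):
#compare sentiment scores from several dictionaries on the same text
head(data.frame(
  syuzhet = get_sentiment(dfClean, method = "syuzhet"),
  bing = get_sentiment(dfClean, method = "bing"),
  afinn = get_sentiment(dfClean, method = "afinn"),
  nrc = get_sentiment(dfClean, method = "nrc")
))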
# vectorize the data
word.df <- as.vector(dfClean)
#get sentiment scores
emotion.df <- get_nrc_sentiment(word.df)
#bind the cleaned tweet text to the per-tweet emotion counts
emotion.df2 <- cbind(dfClean, emotion.df)
#shorten the name a bit
eDF<-emotion.df2
Examine all of the columns in this data set; each NRC emotion category has its own count column.
#convert to tibble
eDF<-as_tibble(eDF)
#check the data
head(eDF)
#get sentiment values
sent.value <- get_sentiment(word.df)
#use values to determine most positive/negative tweets
most.positive <- word.df[sent.value == max(sent.value)]
most.negative <- word.df[sent.value <= min(sent.value)]
#print
most.positive
## [1] "Associate Professor Sangbae Kim’s research group recently demonstrated the agility and physical prowess of the robo… "
most.negative
## [1] "Chemists observe “spooky” quantum tunneling: Extremely large electric fields can prevent umbrella-shaped ammonia mo… "
#sent.value
#separate positive, neutral, and negative tweets
category_senti <- ifelse(sent.value < 0, "Negative", ifelse(sent.value > 0, "Positive", "Neutral"))
#bind the output to the original data frame
category_senti2 <- cbind(dfClean,category_senti,sent.value)
#head(category_senti2)
#create table of sentiment scores
table(category_senti)
## category_senti
## Negative Neutral Positive
## 3 4 26
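Since tidyverse is already loaded, the same counts can be shown as a bar chart; a minimal sketch:
#bar chart of tweet counts by sentiment category
ggplot(data.frame(category = category_senti), aes(x = category)) +
  geom_bar() +
  labs(x = "Sentiment", y = "Number of tweets")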