Machine Learning

Intended Learning Objectives

  • Describe the purpose of using the Random Forest method
  • Describe and explore the steps of a regression task using the Random Forest method
  • Bonus: Sentiment Analysis with Twitter

Random Forest Regression

Background

Decision Trees: A ‘random forest’ is a supervised classification and regression method that aggregates the ‘votes’ of many decision trees. Random forests provide higher-accuracy predictions and more reliable feature-importance ratings than a single tree, at the cost of lower interpretability.

Jansen

Method Overview

  • Create a bootstrapped, aggregated (‘bagged’) data set – randomly select data with replacement.
  • Create decision trees using randomly selected variables.
  • Repeat the first two steps to create many trees.
  • Run data through each tree to predict an outcome.
  • Which outcome received the most ‘votes’?
  • How well does the outcome match the out-of-bag (OOB) data that were not included in the bootstrapped data set?

For this example, we will use a random forest for regression: we want to predict a continuous value rather than a class label. The ‘leaf nodes’ of each tree will hold the predicted values.
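
To make the bootstrapping step above concrete, here is a minimal sketch of a single bagging iteration: drawing one bootstrap sample and identifying its out-of-bag rows. This is purely illustrative (with a made-up sample size n); the randomForest package performs these steps internally.

#one illustrative bagging iteration (made-up sample size; randomForest does this internally)
n <- 1000

#bootstrap sample: draw n row indices with replacement
bagIdx <- sample(seq_len(n), size = n, replace = TRUE)

#out-of-bag (OOB) rows: those never drawn into the bootstrap sample
oobIdx <- setdiff(seq_len(n), bagIdx)

#roughly two-thirds of rows land in the bag and one-third are out-of-bag
length(unique(bagIdx)) / n
length(oobIdx) / n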

“…In a regression tree, since the target variable is a real valued number, we fit a regression model to the target variable using each of the independent variables. Then for each independent variable, the data is split at several split points. We calculate Sum of Squared Error(SSE) at each split point between the predicted value and the actual values. The variable resulting in minimum SSE is selected for the node. Then this process is recursively continued till the entire data is covered.”

Ankit Sharma
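
To illustrate the split rule described above, the sketch below evaluates candidate split points for a single numeric predictor by minimizing SSE. The vectors x and y are made up for demonstration; the actual tree-growing code in randomForest is more involved.

#toy example: choose a regression-tree split point by minimum SSE
set.seed(1)
x <- runif(100, 0, 10)       #hypothetical predictor
y <- 2 * x + rnorm(100)      #hypothetical target

#candidate split points: midpoints between sorted unique x values
xs <- sort(unique(x))
splits <- (head(xs, -1) + tail(xs, -1)) / 2

#SSE of a split = SSE within the left group + SSE within the right group
sseAtSplit <- function(s) {
  left  <- y[x <  s]
  right <- y[x >= s]
  sum((left - mean(left))^2) + sum((right - mean(right))^2)
}

#the split with the minimum SSE would be selected for this node
sse <- sapply(splits, sseAtSplit)
splits[which.min(sse)]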

Cross Validation

Cross-validation is not strictly required for random forests; the out-of-bag samples described below provide a built-in estimate of the test error.

“In random forests, there is no need for cross-validation or a separate test set to get an unbiased estimate of the test set error. It is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree. Put each case left out in the construction of the kth tree down the kth tree to get a classification. In this way, a test set classification is obtained for each case in about one-third of the trees. At the end of the run, take j to be the class that got most of the votes every time case n was oob. The proportion of times that j is not equal to the true class of n averaged over all cases is the oob error estimate. This has proven to be unbiased in many tests.”

Leo Breiman and Adele Cutler

Libraries

#general tools
library(tidyverse)  # data organization and plotting
library(knitr)      # nice tables and for rmarkdown
#library(reticulate) # incorporate python (not used here) 

#random forest analysis 
library(randomForest) # 'random forest' tools
library(caTools)      # splitting data into training/test sets

#sentiment analysis 
library(SnowballC) # text analysis
library(tm)        # text mining
library(twitteR)   # Twitter API access
library(syuzhet)   # sentiment analysis
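
If any of these packages are missing on your machine, they can be installed from CRAN first (a one-time step):

#install any packages that are not yet available (run once, as needed)
install.packages(c("tidyverse", "knitr", "randomForest", "caTools",
                   "SnowballC", "tm", "twitteR", "syuzhet"))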

Data Preparation

#load data
data <- read.csv("fifa19.csv",header=TRUE)

#data dimensions (rows, columns)
dim(data)
## [1] 18207    89
#get classes for each column
#sapply(data, class)

#column names
names(data)
##  [1] "X"                        "ID"                      
##  [3] "Name"                     "Age"                     
##  [5] "Photo"                    "Nationality"             
##  [7] "Flag"                     "Overall"                 
##  [9] "Potential"                "Club"                    
## [11] "Club.Logo"                "Value"                   
## [13] "Wage"                     "Special"                 
## [15] "Preferred.Foot"           "International.Reputation"
## [17] "Weak.Foot"                "Skill.Moves"             
## [19] "Work.Rate"                "Body.Type"               
## [21] "Real.Face"                "Position"                
## [23] "Jersey.Number"            "Joined"                  
## [25] "Loaned.From"              "Contract.Valid.Until"    
## [27] "Height"                   "Weight"                  
## [29] "LS"                       "ST"                      
## [31] "RS"                       "LW"                      
## [33] "LF"                       "CF"                      
## [35] "RF"                       "RW"                      
## [37] "LAM"                      "CAM"                     
## [39] "RAM"                      "LM"                      
## [41] "LCM"                      "CM"                      
## [43] "RCM"                      "RM"                      
## [45] "LWB"                      "LDM"                     
## [47] "CDM"                      "RDM"                     
## [49] "RWB"                      "LB"                      
## [51] "LCB"                      "CB"                      
## [53] "RCB"                      "RB"                      
## [55] "Crossing"                 "Finishing"               
## [57] "HeadingAccuracy"          "ShortPassing"            
## [59] "Volleys"                  "Dribbling"               
## [61] "Curve"                    "FKAccuracy"              
## [63] "LongPassing"              "BallControl"             
## [65] "Acceleration"             "SprintSpeed"             
## [67] "Agility"                  "Reactions"               
## [69] "Balance"                  "ShotPower"               
## [71] "Jumping"                  "Stamina"                 
## [73] "Strength"                 "LongShots"               
## [75] "Aggression"               "Interceptions"           
## [77] "Positioning"              "Vision"                  
## [79] "Penalties"                "Composure"               
## [81] "Marking"                  "StandingTackle"          
## [83] "SlidingTackle"            "GKDiving"                
## [85] "GKHandling"               "GKKicking"               
## [87] "GKPositioning"            "GKReflexes"              
## [89] "Release.Clause"
#subset data to variables of interest
data <- select(data,Name,Club,Age,Overall,Potential,Height,Weight,Crossing,Dribbling,Agility,Strength,Vision)

#check data
head(data)
#correct weight (factor to integer)
#remove "lbs" from weight
data$Weight <- gsub("lbs", "", as.character(data$Weight), fixed=TRUE)

#convert character to integer
data$Weight <- as.integer(as.character(data$Weight))

#convert height (feet/inches to cm)
data <- data %>% separate(Height, c('feet', 'inches'), "'", convert = TRUE) %>% mutate(cm = (12*feet + inches)*2.54)

#drop rows with missing height data
data <- data %>% drop_na(cm)

#drop unnecessary columns 
#we have a better height column so these are redundant and might skew analysis
data <- data %>% select (-c(Name, Club, feet, inches))

#create new binomial variable to classify players 
#data$good = ifelse(data$Overall >= 90, 1, 0)

#convert numeric variable to factor
#data$good <- as.factor(data$good)

#check data
head(data)
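
Before modeling, a quick check for any remaining missing values can be reassuring (an optional step, not part of the original workflow):

#count missing values in each column of the prepared data
colSums(is.na(data))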

Data Preview

We can preview our data using density plots to investigate how the variables are distributed. Is age normally distributed? This variable seems to be useful in predicting Overall score, but should it be considered at all?

plot(density(data$Age))
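
Since the tidyverse is already loaded, an equivalent ggplot density and a normal quantile-quantile plot can help answer the normality question; this is an optional sketch. Note that random forests do not require predictors to be normally distributed, so this check is purely exploratory.

#ggplot density of Age (equivalent to the base plot above)
ggplot(data, aes(x = Age)) + geom_density()

#normal Q-Q plot: clear departures from the line suggest non-normality
qqnorm(data$Age)
qqline(data$Age)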

Analysis

Our goal with this random forest analysis is to predict Overall player score using other variables in the data set. So, are we able to do this? How effective is our prediction?

Create Training and Test Data Sets

#set seed
set.seed(12345)

#select variable for splitting
splitVariable<-data$Overall

#split data using the Overall variable
sample = sample.split(splitVariable, SplitRatio = .75)

#create training and test sets
train = subset(data, sample == TRUE)
test  = subset(data, sample == FALSE)

#get dimensions of training and test sets
dim(train)
## [1] 13620    10
dim(test)
## [1] 4539   10

Initialize Random Forest Instance

#create random forest object
#Number of variables randomly sampled as candidates at each split. 
#Note that the default values are different for classification (sqrt(p) where p is number of variables in x) and regression (p/3)
rf=randomForest(Overall~.,data=train,importance=TRUE, mtry=6, ntree=500)

#display random forest object
rf
## 
## Call:
##  randomForest(formula = Overall ~ ., data = train, importance = TRUE,      mtry = 6, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 1.92202
##                     % Var explained: 95.98
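
The ‘Mean of squared residuals’ reported above is the out-of-bag error estimate described earlier. It is stored per number of trees in the fitted object, so we can also check how quickly it stabilizes:

#OOB mean squared error after all 500 trees (matches the summary above)
rf$mse[rf$ntree]

#OOB error as a function of the number of trees
plot(rf)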

Generate Prediction

#predict Overall for the test data using the random forest object (column 2, the actual Overall scores, is excluded)
pred = predict(rf,newdata=test[-2])

#combine (P)redicted values and (T)est overall values in a data frame
pt<-data.frame(pred,test$Overall)

#rename dataframe containing (P)redicted and (T)est values 
names(pt)<-c("pred","test")

Performance

A regression of test values on predicted values indicates a strong, significant relationship, which the scatter plot of values further supports.

#regression model
lm<-lm(pt$test ~ pt$pred)

#print regression summary
summary(lm) 
## 
## Call:
## lm(formula = pt$test ~ pt$pred)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1474 -0.5537  0.0023  0.5592 10.2817 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.244435   0.203446  -6.117 1.04e-09 ***
## pt$pred      1.019067   0.003056 333.425  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.368 on 4537 degrees of freedom
## Multiple R-squared:  0.9608, Adjusted R-squared:  0.9608 
## F-statistic: 1.112e+05 on 1 and 4537 DF,  p-value: < 2.2e-16
#MSE calculation
mse<-mean((pt$pred-pt$test)^2)

#print statement using MSE values
cat("So the test set MSE is", mse, "with its square root around", sqrt(mse),"meaning that this model gives predictions that are within around", round(sqrt(mse),digits=1),"of the true Overall score.")
## So the test set MSE is 1.888257 with its square root around 1.374139 meaning that this model gives predictions that are within around 1.4 of the true Overall score.
#adjust margins to fit
marginValue<-5
par(mar=c(marginValue,marginValue,marginValue,marginValue))

#simple plot of predicted values (x) vs test values (y)
plot(pt$pred, pt$test, xlab="Overall Score (Predicted)", ylab="Overall Score (Test)")

#plot 'perfect fit' line
abline(0,1, col="blue")

#plot the fitted regression line
abline(lm, col="red")

#correlation test
cor(pt$pred,pt$test)
## [1] 0.9801988

Variable Importance

The variables ‘Potential’ and ‘Age’ are the most important variables for predicting ‘Overall’ score across all trees in the random forest. varImpPlot produces a “dotchart of variable importance as measured by a Random Forest.” “For regression, (node impurity) is measured by residual sum of squares.” Higher values of %IncMSE and IncNodePurity indicate greater importance.

More on interpretation of these output variables

#print importance ranking
importance(rf)
##             %IncMSE IncNodePurity
## Age       358.89193    219390.344
## Potential 223.89555    265281.824
## Weight     26.87930      4171.452
## Crossing   39.37886     15593.835
## Dribbling  29.98280     85254.099
## Agility    35.19627      3968.709
## Strength   43.26772     20138.127
## Vision     34.13022     33083.628
## cm         24.49219      2453.574
#dotchart of importance
varImpPlot(rf)

Discussion

“Important features mean the features that are more closely related with dependent variable and contribute more for variation of the dependent variable.”

Prince Grover, Intuitive Interpretation of Random Forest

%IncMSE and IncNodePurity

“The importance measures show how much MSE or Impurity increase when that variable is randomly permuted. If you randomly permute a variable that does not gain you anything in prediction, then predictions won’t change much and you will only see small changes in impurity and mse. On the other hand the important variables will change the predictions by quite a bit if randomly permuted, so you will see bigger changes. Turn this around and you see big changes indicate important variables.”

Greg Snow, PhD

“%IncMSE is the most robust and informative measure. It is the increase in mse of predictions(estimated with out-of-bag-CV) as a result of variable j being permuted(values randomly shuffled).”

Soren Havelund Welling
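
As a rough, hands-on illustration of this permutation idea, the increase in MSE after shuffling a single variable can be computed directly. The sketch below uses our held-out test set and the ‘Potential’ column; the package itself works with the out-of-bag samples instead.

#baseline test-set MSE of the fitted forest
baseMSE <- mean((predict(rf, newdata = test) - test$Overall)^2)

#copy the test set and randomly shuffle one predictor
set.seed(999)
permTest <- test
permTest$Potential <- sample(permTest$Potential)

#test-set MSE after permuting 'Potential'
permMSE <- mean((predict(rf, newdata = permTest) - test$Overall)^2)

#a large increase indicates that 'Potential' carries real predictive information
permMSE - baseMSE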

Miscellaneous

If you like, you can print the structure of one of the trees in the ‘forest’. This can be useful for understanding which variable is used at each branching event in a tree.

More information

#print a tree structure
getTree(rf, 1, labelVar=TRUE)

Analysis (Part 2)

A Minimal Prediction Model

Let’s try this again using only a few variables.

data <- data %>% select (c(Overall,Crossing,Dribbling,Agility))

#set seed
set.seed(12345)

#select variable for splitting
splitVariable<-data$Overall

#split data using the Overall variable
sample = sample.split(splitVariable, SplitRatio = .75)

#create training and test sets
train = subset(data, sample == TRUE)
test  = subset(data, sample == FALSE)

#create random forest object (mtry=6 exceeds the 3 available predictors, so randomForest resets it; see the warning below)
rf=randomForest(Overall~.,data=train,importance=TRUE, mtry=6, ntree=500)
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within
## valid range
#predict test data using the random forest object and all test data except for the Overall scores
pred = predict(rf,newdata=test[-1])

#combine (P)redicted values and (T)est overall values in a data frame
pt<-data.frame(pred,test$Overall)

#rename dataframe containing (P)redicted and (T)est values 
names(pt)<-c("pred","test")

#regression model
lm<-lm(pt$test ~ pt$pred)

#print regression summary
summary(lm) 
## 
## Call:
## lm(formula = pt$test ~ pt$pred)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4232  -3.3119  -0.1601   3.0662  25.8714 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.54566    1.04169   10.12   <2e-16 ***
## pt$pred      0.84150    0.01569   53.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.406 on 4537 degrees of freedom
## Multiple R-squared:  0.388,  Adjusted R-squared:  0.3879 
## F-statistic:  2877 on 1 and 4537 DF,  p-value: < 2.2e-16
#adjust margins to fit
marginValue<-5
par(mar=c(marginValue,marginValue,marginValue,marginValue))

#simple plot of predicted values (x) vs test values (y)
plot(pt$pred, pt$test, xlab="Overall Score (Predicted)", ylab="Overall Score (Test)")

#plot 'perfect fit' line
abline(0,1, col="blue")

#plot the fitted regression line
abline(lm, col="red")

#print importance ranking
importance(rf)
##             %IncMSE IncNodePurity
## Crossing  100.60081      141439.4
## Dribbling 165.19141      305109.3
## Agility    94.47846      130372.4
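
For a direct comparison with the full model from Part 1 (test MSE of about 1.9), the same MSE calculation can be repeated for this minimal model; given the much lower R-squared above (0.39 versus 0.96), we should expect a considerably larger error.

#test-set MSE of the minimal model, computed as in Part 1
mse2 <- mean((pt$pred - pt$test)^2)
mse2
sqrt(mse2)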

Twitter Sentiment Analysis

Idea Source

The credentials consumer_key, consumer_secret, access_token, and access_secret are hidden here; you’ll need a Twitter Developer Account to generate your own.
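
One way to keep these credentials out of the document is to read them from environment variables; the variable names below are only an example.

#read Twitter API credentials from environment variables (names are illustrative)
consumer_key    <- Sys.getenv("TWITTER_CONSUMER_KEY")
consumer_secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
access_token    <- Sys.getenv("TWITTER_ACCESS_TOKEN")
access_secret   <- Sys.getenv("TWITTER_ACCESS_SECRET")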

Initialization

# load your keys
setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
## [1] "Using direct authentication"
#define the account of interest and number of tweets to obtain
tweets <- userTimeline("MIT", n=50) #up to 3200
 
#store tweets in data frame
df <- twListToDF(tweets) 

#clean up the text and store separately
dfClean <- gsub("http.*","", df$text)
dfClean <- gsub("https.*","",dfClean)
dfClean <- gsub("#.*","",    dfClean)
dfClean <- gsub("@.*","",    dfClean)

#check output
head(dfClean)
## [1] "Graduate student Elizabeth Wallace and her colleagues examined sediment found in blue holes in the Caribbean for cl… "
## [2] "SMART discovers nondisruptive way to characterize the surface of nanoparticles "                                      
## [3] "Max NG, a Nigerian startup co-founded by two MIT alumni, provides package delivery services and on-demand transport… "
## [4] "New process could make hydrogen peroxide available in remote places: MIT-developed method may lead to portable devi… "
## [5] "Associate Professor Sangbae Kim’s research group recently demonstrated the agility and physical prowess of the robo… "
## [6] "New tools could improve the way cement seals oil wells "

Sentiment Analysis

This analysis uses the ‘syuzhet’ library.

“The package comes with four sentiment dictionaries and provides a method for accessing the robust, but computationally expensive, sentiment extraction tool developed in the NLP group at Stanford.” - https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html

# vectorize the data
word.df <- as.vector(dfClean)

#get sentiment scores
emotion.df <- get_nrc_sentiment(word.df)

#attach the emotion/sentiment counts to the cleaned tweet text
emotion.df2 <- cbind(dfClean, emotion.df) 

#shorten the name a bit
eDF<-emotion.df2

Examine all of the columns in this data set (in the interactive HTML output, you can page through them by clicking the arrow).

#convert to tibble
eDF<-as_tibble(eDF)

#check the data
head(eDF)
#get sentiment values 
sent.value <- get_sentiment(word.df)

#use values to determine most positive/negative tweets
most.positive <- word.df[sent.value == max(sent.value)]
most.negative <- word.df[sent.value <= min(sent.value)]

#print 
most.positive
## [1] "Associate Professor Sangbae Kim’s research group recently demonstrated the agility and physical prowess of the robo… "
most.negative
## [1] "Chemists observe “spooky” quantum tunneling: Extremely large electric fields can prevent umbrella-shaped ammonia mo… "
#sent.value

#separate positive, neutral, and negative tweets
category_senti <- ifelse(sent.value < 0, "Negative", ifelse(sent.value > 0, "Positive", "Neutral"))

#bind the output to the original data frame
category_senti2 <- cbind(dfClean,category_senti,sent.value) 

#head(category_senti2)

#create table of sentiment scores
table(category_senti)
## category_senti
## Negative  Neutral Positive 
##        3        4       26
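
If a quick visual summary is helpful, the category counts above can also be shown as a simple bar chart:

#bar chart of tweet counts per sentiment category
barplot(table(category_senti),
        xlab = "Sentiment", ylab = "Number of tweets")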