Session 02: Fundamentals

Goals: introduce some useful R libraries and demonstrate common probability scenarios with R code.

Solve the following using R.

Permutations and Combinations

Optional: use the library “arrangements” for this section.

Scenario

The English alphabet contains 26 unique letters.

Question

  • How many three-letter permutations can be created when no letter is used more than once?

  • How many three-letter permutations can be created when replacement is allowed?

  • How many three-letter combinations can be created when no letter is used more than once?

  • How many three-letter combinations can be created when replacement is allowed?

Recall: Order is considered for permutations but not combinations. For example, 2345, 2435, 4325 are all different permutations of {2,3,4,5} but they all are equivalent combinations of these numbers.

Intended Learning Objective(s)

Find solutions for common permutation and combination problems.

Solution

Permutations without replacement

#load the arrangements library for its counting functions
library(arrangements)

#count the number of permutations without replacement
npermutations(LETTERS[1:26], k=3)
## [1] 15600

Permutations with replacement

#count the number of permutations with replacement
npermutations(LETTERS[1:26], k=3, replace=TRUE)
## [1] 17576

Combinations without replacement

#count the number of combinations without replacement
ncombinations(LETTERS[1:26], k=3)
## [1] 2600

Combinations with replacement

#count the number of combinations with replacement
ncombinations(LETTERS[1:26], k=3, replace=TRUE)
## [1] 3276
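
These counts can also be cross-checked against the closed-form formulas using base R alone (a quick sanity check; no extra packages needed):

#permutations without replacement: 26 * 25 * 24
prod(26:24)
#permutations with replacement: 26^3
26^3
#combinations without replacement: 26 choose 3
choose(26, 3)
#combinations with replacement: (26 + 3 - 1) choose 3
choose(26 + 3 - 1, 3)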

Independence

Scenario

Patients (n=105) in a medical study were treated with either a new medication or a placebo. Each patient was observed for a positive or negative response to the treatment.

Your null hypothesis (\(H_0\)) states that the treatment and patient response are independent variables.

             Positive   Negative
Medication         35         15
Placebo            26         29

Question

Determine whether you should reject or not reject \(H_0\) using the chi-square method and a significance level of \(\alpha = 0.05\). Interpret your finding in the context of the problem scenario.

Note: “The Chi-square test is intended to test how likely it is that an observed distribution is due to chance. It is also called a ‘goodness of fit’ statistic, because it measures how well the observed distribution of data fits with the distribution that is expected if the variables are independent.” (https://www.ling.upenn.edu/~clight/chisquared.htm)

Intended Learning Objective(s)

Test independence of two variables using the chi-square test and interpret the results.

Solution

“The Chi-Square statistic is commonly used for testing relationships between categorical variables. The null hypothesis of the Chi-Square test is that no relationship exists on the categorical variables in the population; they are independent.”

“The Chi-Square statistic is most commonly used to evaluate Tests of Independence when using a crosstabulation (also known as a bivariate table). Crosstabulation presents the distributions of two categorical variables simultaneously, with the intersections of the categories of the variables appearing in the cells of the table. The Test of Independence assesses whether an association exists between the two variables by comparing the observed pattern of responses in the cells to the pattern that would be expected if the variables were truly independent of each other. Calculating the Chi-Square statistic and comparing it against a critical value from the Chi-Square distribution allows the researcher to assess whether the observed cell counts are significantly different from the expected cell counts.”

Source: https://www.statisticssolutions.com/using-chi-square-statistic-in-research/

#import the data from a .csv file
df <- read.csv("medicalTrial.csv", header=TRUE)

#check header for info
#colnames(df)
head(df)
#run the chi-square test with Yates' continuity correction off
chisq.test(df$treatment, df$response, correct=FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  df$treatment and df$response
## X-squared = 0.27592, df = 1, p-value = 0.5994
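
To make the comparison concrete, we can also inspect the observed crosstabulation and the expected cell counts under independence, using the same columns as above (a short sketch; output not shown):

#observed crosstabulation of treatment vs. response
obs <- table(df$treatment, df$response)
obs

#expected cell counts under independence:
#(row total * column total) / grand total, as returned by chisq.test()
chisq.test(obs, correct = FALSE)$expected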

Our chi-square test yields a p-value of 0.5994. This p-value is greater than the pre-determined \(\alpha = 0.05\), so we fail to reject the null hypothesis: there is insufficient evidence of an association between treatment and response, and the data are consistent with the two variables being independent.

In the context of the problem, we found no evidence that the medication treatment is related to patient response. Perhaps the medication is not effective for treating this condition.

Probability Distribution

Scenario

A carnival is in town and you decide to play a dice game. To win, you must roll two dice for a sum of 12.

Question

  • Before playing, you would like to know the long-term (n=10,000) frequencies and relative frequencies for each possible outcome of a two-dice roll.

  • If you could convince the attendant to change the winning value, to what value would you change it to maximize your probability of winning?

Intended Learning Objective(s)

Plot frequency and relative frequency for a discrete random variable.

Solution

#first create a function to give the sum of n dice
#the function will take any number of dice (diceCount) when it is called
diceSum <- function(diceCount){
  diceValues <- sample(1:6, size = diceCount, replace = TRUE)
  return(sum(diceValues))
}

#set inputs
nRolls = 10000 #roll this many times
nDice = 2   #use this many dice

#use replicate to simulate many dice rolls with the dice
#store the resulting values in a vector
#replicate takes a count and an expression; it will run the expression 'n' times.
sums<-replicate(n = nRolls, expr = diceSum(nDice))

#load plyr to help with calculations
library(plyr)

#get frequency values for each sum possibility
sumFrequency<-count(sums)

#calculate the relative probability
relFreq<-sumFrequency$freq/nRolls

#store the values in a data frame
df<-data.frame(sumFrequency,relFreq)

#rename columns
colnames(df)<-c("sum", "freq", "relFreq")

#load ggplot2 for plotting
library(ggplot2)

#plot frequency
ggplot(data=df, aes(x=factor(sum), y=freq)) +
    geom_bar(stat="identity", fill="#779911")+
    theme_minimal()+
    xlab("Value") + 
    ylab("Frequency")+ 
    scale_y_continuous(limits = c(0, 2000), breaks = seq(0, 2000, by = 100))

#plot relative frequency
ggplot(data=df, aes(x=factor(sum), y=relFreq)) +
    geom_bar(stat="identity", fill="#991177")+
    theme_minimal()+
    xlab("Value") + 
    ylab("Relative Frequency")+ 
    scale_y_continuous(limits = c(0, 0.20), breaks = seq(0, 0.20, by = 0.05))
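
The simulated frequencies can be compared with the exact distribution of the sum of two fair dice, which can be tabulated directly. The sum 7 is the most likely outcome (6/36 ≈ 0.167), while 12 is the least likely (1/36 ≈ 0.028), so switching the winning value to 7 would maximize your probability of winning.

#exact distribution: enumerate all 36 equally likely two-dice outcomes
allSums <- outer(1:6, 1:6, "+")
#tabulate each possible sum and divide by the number of outcomes
table(allSums) / length(allSums)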

Cumulative Distribution Function

Scenario

Stock prices have fluctuating ‘open’ and ‘close’ prices daily as shares are traded. The stock volume refers to the number of shares traded in a given time period, usually daily. We will look at the daily volume for Google stock using the quantmod library.

Question

Approximately what proportion of Google’s trading days had a total daily volume of less than 10,000,000 shares?

Intended Learning Objective(s)

Plot a cumulative distribution function for a continuous random variable.

Solution

library(quantmod)

#optional: disable scientific notation
options(scipen=999)

#retrieve stock data
getSymbols("GOOG",src="yahoo")
## [1] "GOOG"
#check column names for info
#colnames(GOOG)

#plot with quantmod's barChart function
chartSeries(GOOG)

#bar chart of daily trading volume
barChart(GOOG$GOOG.Volume)

#Calculate and plot CDF
ggplot(GOOG, aes(GOOG.Volume)) + 
  stat_ecdf(geom = "step") +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, by = 0.05))+
  theme_minimal()+
  labs(title="Empirical Cumulative Density Function \n (Google: GOOG)", y = "F(volume)", x="volume (v)")

The CDF \(F(x)\) gives the probability that \(X\) is less than or equal to \(x\). Here we can write this as \(F(v) = P(\text{Volume} \leq v)\). From this plot, we can see that the probability of an observed daily volume being less than ten million shares is around 87%.
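
The value read from the plot can be checked numerically with base R's ecdf(), which returns the empirical CDF as a function (the exact proportion depends on the date range that getSymbols() retrieved, so it may differ slightly from the 87% read off the plot):

#build the empirical CDF of daily volume and evaluate it at 10,000,000 shares
#as.numeric() drops the xts structure before building the ECDF
volumeCDF <- ecdf(as.numeric(GOOG$GOOG.Volume))
volumeCDF(10000000)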

Continuous Random Variables

Scenario

A Pokemon trainer has approached you, the Senior Pokemon Data Scientist, with a question: “Are attack and defense strength related to each other in a meaningful way?”

Question

Find and remove outliers, determine the normality of the attack and defense variables, find the appropriate correlation coefficient between them, and create a contour plot to show the density of Pokemon for each level of Attack and Defense.

Intended Learning Objective(s)

  • Examine, compare, and interpret plots of continuous variables.
  • Complete a variety of tests to assess normality of data prior to assessing correlation.
  • Use appropriate correlation tests.

Solution

library(tidyverse) #load the core tidyverse packages
pokemonData <- as_tibble(read.csv("pokemon.csv")) 

#begin by taking a look at the distribution of the variables
boxplot(pokemonData$Attack)

boxplot(pokemonData$Defense)

#store the values of the outliers 
#boxplot can tell us the outliers
#we don't need to see the plot again
outliersAttack  <- boxplot(pokemonData$Attack,  plot=FALSE)$out
outliersDefense <- boxplot(pokemonData$Defense, plot=FALSE)$out

#remove the outlier rows (filter Defense on the already Attack-filtered data)
pokemonDataNoOut <- pokemonData[-which(pokemonData$Attack %in% outliersAttack),]
pokemonDataNoOut <- pokemonDataNoOut[-which(pokemonDataNoOut$Defense %in% outliersDefense),]

#review for normality checks
#https://www.sheffield.ac.uk/polopoly_fs/1.579191!/file/stcp-karadimitriou-normalR.pdf
#http://www.sthda.com/english/wiki/normality-test-in-r

#ggpubr has some tools for normality checks
library("ggpubr")

#check normality by examining the density curves
ggdensity(pokemonDataNoOut$Attack, main = "Density plot", xlab = "Attack Strength")

ggdensity(pokemonDataNoOut$Defense, main = "Density plot", xlab = "Defense Strength")

#overlapping translucent density plots to show difference with/without outliers

#define some colors to use
cols <- c("Outliers"="#FF000055","NoOutliers"="#0000FF55")

#attack comparison with and without outliers
ggplot() + 
  geom_density(data = pokemonData,      aes(x = Attack, fill = "Outliers")) +
  geom_density(data = pokemonDataNoOut, aes(x = Attack, fill = "NoOutliers")) +
  scale_fill_manual(name="Outliers",values=cols)+
  xlab('Attack') +
  ylab('Density')+
  theme_minimal()

#defense comparison with and without outliers
ggplot() + 
  geom_density(data = pokemonData,      aes(x = Defense, fill = "Outliers")) +
  geom_density(data = pokemonDataNoOut, aes(x = Defense, fill = "NoOutliers")) +
  scale_fill_manual(name="Outliers",values=cols)+
  xlab('Defense') +
  ylab('Density')+
  theme_minimal()

#check normality with a qq plot
#points should lie close to the line
ggqqplot(pokemonDataNoOut$Attack, conf.int = TRUE, conf.int.level = 0.95)

ggqqplot(pokemonDataNoOut$Defense, conf.int = TRUE, conf.int.level = 0.95)

#follow up with a hypothesis test
#Shapiro-Wilk Normality test
#if p > 0.05, we do not reject normality
shapiro.test(pokemonDataNoOut$Attack)
## 
##  Shapiro-Wilk normality test
## 
## data:  pokemonDataNoOut$Attack
## W = 0.97862, p-value = 0.000000002551
shapiro.test(pokemonDataNoOut$Defense)
## 
##  Shapiro-Wilk normality test
## 
## data:  pokemonDataNoOut$Defense
## W = 0.97668, p-value = 0.0000000006932
#results: not normal data, use non-parametric tests 
#use Spearman's Correlation Coefficient instead of Pearson
cor.test(x=pokemonDataNoOut$Attack, y=pokemonDataNoOut$Defense, method = 'spearman')
## Warning in cor.test.default(x = pokemonDataNoOut$Attack, y =
## pokemonDataNoOut$Defense, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  pokemonDataNoOut$Attack and pokemonDataNoOut$Defense
## S = 38769652, p-value < 0.00000000000000022
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.5227789
#results: Spearman's rho ≈ 0.52, a moderate positive relationship

#contour plot with margins
#reference: https://plot.ly/r/
#reference: https://plot.ly/r/contour-plots/
library(plotly)

p <- plot_ly(
  type = 'histogram2dcontour',
  x = pokemonDataNoOut$Attack,
  y = pokemonDataNoOut$Defense
)
p
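
If you would rather stay within ggplot2, a similar filled density contour can be drawn with geom_density_2d_filled() (available in ggplot2 >= 3.3.0); this is a rough sketch of an alternative, not a reproduction of the plotly figure above:

#ggplot2 alternative to the plotly contour plot
ggplot(pokemonDataNoOut, aes(x = Attack, y = Defense)) +
  geom_density_2d_filled() +
  theme_minimal() +
  labs(title = "Density of Pokemon by Attack and Defense",
       x = "Attack", y = "Defense")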

Discrete Random Variables: Fire Counts by Year

Question

Draw some conclusions from fire data by year and by state.

Intended Learning Objective(s)

Examine, compare, and interpret plots of discrete variables.

Solution

library(tidyverse) #load the core tidyverse packages

#import data
amazonData <- as_tibble(read.csv("amazon.csv")) 

#check data 
#head(amazonData)

#create a new summary tibble but group by year and summarize the count by year
summaryDataY <- amazonData %>%
  group_by(year) %>%
  summarise(count = sum(number))

#check data
#head(summaryDataY)

#define a plot color
plotColor<-"#fc4e03"

#plot 
summaryDataY %>%
  ggplot(aes(x = year, y = count)) +
      geom_bar(stat = "identity", fill = plotColor) +
      labs(title = "Fires in Brazil, 1998-2107",
           subtitle = "Fire Count by Year",
           y = "Count",
           x = "") + theme_bw(base_size = 15) +
  scale_x_discrete(limits=1998:2017, labels = 1998:2017)+
  theme_light()+
  theme(axis.text.x = element_text(angle = 45,vjust=-0.025))

library(ggpubr)
ggdotchart(summaryDataY, x = "year", y = "count",
           color = plotColor,                            # Dot color
           sorting = "descending",                       # Sort value in descending order
           rotate = TRUE,                                # Rotate vertically
           dot.size = 10,                                # Large dot size
           ggtheme = theme_pubr()                        # ggplot2 theme
           )+
  theme_cleveland()

#summarize data by year and state
summaryDataYS <- amazonData %>%
  group_by(year,state) %>%
  summarise(count = sum(number))

#check data
#head(summaryDataYS)

#facet plot by state by year
summaryDataYS %>%
  ggplot(aes(x = year, y = count)) +
      geom_bar(stat = "identity", fill = plotColor) +
  facet_wrap(~ state, ncol = 4) +
      labs(title = "Fires in Brazil, 1998-2107",
           subtitle = "Fire Count by State",
           y = "Count",
           x = "Year") + theme_bw(base_size = 15)

Follow-up Questions

Are there any issues with these plots? Brazil has 26 states plus the Federal District (27 federative units). How might this influence your views on the previous plots? What would you change, if anything?
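
One quick check is to count how many distinct state labels actually appear in the file (a small sketch; note that label encodings in the CSV may split or merge some states):

#number of distinct state labels in the data
length(unique(amazonData$state))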