Module 05: Distributions

Scenario

Students entering medical school take the MCAT test. Each medical school has a different threshold for entry. For example, a more prestigious private school requires students to score in the top 10% of all test takers that year. A more affordable school requires students to score in the top 25%. The MCAT scores from last year were normally distributed and had a mean of 500 and a standard deviation of 10.6.

Goal

Use the ggplot2 library to create a plot that describes MCAT score as a function of the threshold 0 to 1. Add vertical lines and appropriate annotations to highlight scores that would place students among the top 25% and 10%of all test takers. Use pleasing design choices.

#set up a sequence of percentages
p<-seq(0,1,by=0.01)

#create a function using qnorm
mcat <- function(x) {
  mean <-500
  sd   <-10.6
  
  output <- qnorm(x,mean,sd)
  return(output)
}

#run the function for values of the sequence
score<-mcat(p)

#create a data frame
df<-data.frame(p,score)

#use dplyr slice() to remove first and last values
#these show up as Inf and we don't want to plot Inf values
#"second value (2) through the total count (n()) minus 1"
#we can store the output using the same dataframe name
df<-slice(df, 2:(n()-1))

#add new column for color factors
df$Rank <- cut(df$p,
               breaks = c(-Inf, 0.75, 0.9, Inf),
               labels = c("Bottom 75%","Top 25%","Top 10%"))

#Rank Colors
#use light grey to de-emphasize some points
#use color and alpha to reinforce points of interest
under75<-"#bbbbbb"
top25<-"#10798f99"
top10<-"#10798fff"

#thresholds of interest
t1<-0.75 #(top 25%)
t2<-0.9  #(top 10%)

#set up the plot
p <- ggplot(data = df, aes(x = p, y = score, color = Rank)) +
  theme_minimal()+
  scale_x_continuous(breaks = seq(0,1,by=0.1), 
                     limits = c(0,1))+
  scale_y_continuous(breaks = seq(470,530,by=20),
                     limits=c(470,530))+ 
  # geom_rect(aes(xmin = 0.75, xmax = 1, ymin = 470, ymax = 530),
  #           fill = "#efefef", 
  #           alpha = 0.1)+
  # geom_rect(aes(xmin = 0.9, xmax = 1, ymin = 470, ymax = 530),
  #           fill = "#ff0000", 
  #           alpha = 0.1)+
  # geom_point(shape=21,fill="#149AB5", col="#333333", stroke=1, alpha=0.75, size=2)+
  geom_point(size=2)+
  scale_color_manual(values = c(under75,top25,top10))+
  labs(x="Threshold", 
       y="Score",
       title="MCAT Scores",
       subtitle="",
       caption="Data Source: -[ artificial data ]- ")+
  theme(plot.title = element_text(hjust = 0.5), 
        plot.subtitle = element_text(hjust = 0.5))+
  theme(axis.text=element_text(size=14),
        axis.title=element_text(size=14,
                                face="plain"),
        plot.title = element_text(size=16,
                                  face="bold"))+
  geom_vline(xintercept=t1, 
             col="#888888", 
             linetype = "longdash")+
  geom_vline(xintercept=t2, 
             col="#888888", 
             linetype = "longdash")+
  annotate("text", 
           x=t1,
           y=530, #to locate near the line: df$score[df$p==t1]+10
           size=5,
           hjust=-0.2, 
           label = expression(paste("Top 25%"))
           )+
  annotate("text", 
           x=t2,
           y=530,
           size=5,
           hjust=-0.2, 
           label = expression(paste("Top 10%"))
           )

#draw the plot
p

#save the image
ggsave("mcat.png", width = 35, units = "cm")

Sampling, Estimators, Mean, and Median

Let’s suppose MCAT scores are independent and normally distributed. We can simulate a sample of 239,681 scores from this hypothetical dataset using the mean and standard deviation for MCAT scores, 2015 to 2017.

#https://www.aamc.org/system/files/c/2/462316-mcatguide.pdf

set.seed(12345)

n<-239681
mean <-500.5
sd   <-10.5
  
mcatNorm <- rnorm(n, mean, sd)
df<-data.frame(mcatNorm)

mainFill <- "#00a18e"
mainBorder <- "#005c51"

ggplot(df, aes(mcatNorm)) + geom_density(fill = mainFill, colour=mainBorder, alpha = 0.5) + theme_minimal()

The admissions process for each school differs. Medical School A uses mean scores while Medical School B uses median scores. Which school will consider more students with top scores?

#sample size
n<-100

#mean
mean <-500.5

#standard deviation
sd   <-10.5

#number of repetitions
reps<-1000

#replicate a function n times 
#this creates 1000 samples with sample size (n=100) from our defined distribution
rep<-replicate(n = reps, 
               expr = rnorm(n, mean, sd), 
               simplify = FALSE)

#calculate means from each sample
means<-1*sapply(rep, mean)

#calculate median from each sample
medians<-1*sapply(rep, median)

#organize output in a dataframe
df<-data.frame(means,medians)

#melt the data (library: reshape2)
df.x <- melt(df)

#plot the data
#two overlapping density plots
ggplot(df.x) + geom_density(aes(x = value, color = variable)) + theme_minimal()