Students entering medical school take the MCAT test. Each medical school has a different threshold for entry. For example, a more prestigious private school requires students to score in the top 10% of all test takers that year. A more affordable school requires students to score in the top 25%. The MCAT scores from last year were normally distributed and had a mean of 500 and a standard deviation of 10.6.
Use the ggplot2 library to create a plot that describes MCAT score as a function of the threshold 0 to 1. Add vertical lines and appropriate annotations to highlight scores that would place students among the top 25% and 10%of all test takers. Use pleasing design choices.
#set up a sequence of percentages
p<-seq(0,1,by=0.01)
#create a function using qnorm
mcat <- function(x) {
mean <-500
sd <-10.6
output <- qnorm(x,mean,sd)
return(output)
}
#run the function for values of the sequence
score<-mcat(p)
#create a data frame
df<-data.frame(p,score)
#use dplyr slice() to remove first and last values
#these show up as Inf and we don't want to plot Inf values
#"second value (2) through the total count (n()) minus 1"
#we can store the output using the same dataframe name
df<-slice(df, 2:(n()-1))
#add new column for color factors
df$Rank <- cut(df$p,
breaks = c(-Inf, 0.75, 0.9, Inf),
labels = c("Bottom 75%","Top 25%","Top 10%"))
#Rank Colors
#use light grey to de-emphasize some points
#use color and alpha to reinforce points of interest
under75<-"#bbbbbb"
top25<-"#10798f99"
top10<-"#10798fff"
#thresholds of interest
t1<-0.75 #(top 25%)
t2<-0.9 #(top 10%)
#set up the plot
p <- ggplot(data = df, aes(x = p, y = score, color = Rank)) +
theme_minimal()+
scale_x_continuous(breaks = seq(0,1,by=0.1),
limits = c(0,1))+
scale_y_continuous(breaks = seq(470,530,by=20),
limits=c(470,530))+
# geom_rect(aes(xmin = 0.75, xmax = 1, ymin = 470, ymax = 530),
# fill = "#efefef",
# alpha = 0.1)+
# geom_rect(aes(xmin = 0.9, xmax = 1, ymin = 470, ymax = 530),
# fill = "#ff0000",
# alpha = 0.1)+
# geom_point(shape=21,fill="#149AB5", col="#333333", stroke=1, alpha=0.75, size=2)+
geom_point(size=2)+
scale_color_manual(values = c(under75,top25,top10))+
labs(x="Threshold",
y="Score",
title="MCAT Scores",
subtitle="",
caption="Data Source: -[ artificial data ]- ")+
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5))+
theme(axis.text=element_text(size=14),
axis.title=element_text(size=14,
face="plain"),
plot.title = element_text(size=16,
face="bold"))+
geom_vline(xintercept=t1,
col="#888888",
linetype = "longdash")+
geom_vline(xintercept=t2,
col="#888888",
linetype = "longdash")+
annotate("text",
x=t1,
y=530, #to locate near the line: df$score[df$p==t1]+10
size=5,
hjust=-0.2,
label = expression(paste("Top 25%"))
)+
annotate("text",
x=t2,
y=530,
size=5,
hjust=-0.2,
label = expression(paste("Top 10%"))
)
#draw the plot
p
#save the image
ggsave("mcat.png", width = 35, units = "cm")
Let’s suppose MCAT scores are independent and normally distributed. We can simulate a sample of 239,681 scores from this hypothetical dataset using the mean and standard deviation for MCAT scores, 2015 to 2017.
#https://www.aamc.org/system/files/c/2/462316-mcatguide.pdf
set.seed(12345)
n<-239681
mean <-500.5
sd <-10.5
mcatNorm <- rnorm(n, mean, sd)
df<-data.frame(mcatNorm)
mainFill <- "#00a18e"
mainBorder <- "#005c51"
ggplot(df, aes(mcatNorm)) + geom_density(fill = mainFill, colour=mainBorder, alpha = 0.5) + theme_minimal()
The admissions process for each school differs. Medical School A uses mean scores while Medical School B uses median scores. Which school will consider more students with top scores?
#sample size
n<-100
#mean
mean <-500.5
#standard deviation
sd <-10.5
#number of repetitions
reps<-1000
#replicate a function n times
#this creates 1000 samples with sample size (n=100) from our defined distribution
rep<-replicate(n = reps,
expr = rnorm(n, mean, sd),
simplify = FALSE)
#calculate means from each sample
means<-1*sapply(rep, mean)
#calculate median from each sample
medians<-1*sapply(rep, median)
#organize output in a dataframe
df<-data.frame(means,medians)
#melt the data (library: reshape2)
df.x <- melt(df)
#plot the data
#two overlapping density plots
ggplot(df.x) + geom_density(aes(x = value, color = variable)) + theme_minimal()