# general libraries
library(tidyverse) # data organization and plotting
library(knitr) # nice tables and R Markdown support
library(ggplot2) # grammar-of-graphics plotting; also provides the economics data set
# libraries for the RDD example
library(foreign) # read .dta data format
library(rdd) # regression discontinuity
library(rddtools) # regression discontinuity
“Nonparametric regression analysis relaxes the assumption of linearity, substituting the much weaker assumption of a smooth population regression function.”
In what important way does nonparametric regression contrast with linear regression?
It makes no assumption of linearity; it assumes only that the population regression function is smooth.
Nonparametric simple regression is nevertheless useful for two reasons:
The concept of binning is central to nonparametric regression. Consider a continuous independent variable (X) and a continuous dependent variable (Y), such as age and its relationship to salary. You might expect a positive relationship, but not necessarily a linear one. Binning creates local groups along the X variable and then finds the average Y value within each group. This works well with a large data set but less well with a small one (a short sketch follows below).
What happens during ‘binning’?
Average dependent variable values are found within small subsets of the independent variable. You end up with a unique y-hat for each bin of x values.
What is better for estimation: larger bins or smaller bins?
It depends. Smaller bins capture more local variation in the data but yield a more erratic function; larger bins give a smoother estimate but may miss important variation. The bin size (*h*, or bandwidth) is a choice the analyst must make.
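To make the binning idea concrete, here is a minimal sketch on simulated data. The age/salary values, the relationship, and the 5-year bin width are illustrative assumptions, not part of the lesson.
#a minimal binning sketch on simulated age/salary data (values are illustrative)
set.seed(1)
age    <- runif(500, 20, 65)                        #continuous X
salary <- 10000 * log(age) + rnorm(500, sd = 3000)  #nonlinear, positive relationship
#group X into bins of width h = 5 years
bins <- cut(age, breaks = seq(20, 65, by = 5), include.lowest = TRUE)
tapply(salary, bins, mean)                          #one y-hat (mean salary) per bin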
More information on Nonparametric Regression in R
Let’s take a look at an example of local polynomial regression fitting (loess) smoothing, a type of nonparametric regression. For this example, we will use the economics data set found in ggplot2, which represents a US economic time series. We will examine the median duration of unemployment (in weeks) over time.
The data set includes the following variables:

- `date`: month of data collection
- `pce`: personal consumption expenditures, in billions of dollars
- `pop`: total population, in thousands
- `uempmed`: median duration of unemployment, in weeks
Load and prepare the data.
#load the data
data(economics, package="ggplot2")
#create an index variable
economics$index <- 1:nrow(economics)
#limit data to 80 rows
#this is easier for understanding the smoothing results
economics <- economics[1:80, ]
#examine the data
head(economics)
Carry out the smoothing process.
#Fit a polynomial surface determined by one or more numerical predictors, using local fitting.
loessMod10 <- loess(uempmed ~ index, data=economics, span=0.10) # 10% smoothing span
loessMod25 <- loess(uempmed ~ index, data=economics, span=0.25) # 25% smoothing span
loessMod50 <- loess(uempmed ~ index, data=economics, span=0.50) # 50% smoothing span
loessMod100 <- loess(uempmed ~ index, data=economics, span=1.0) # 100% smoothing span
#Get smoothed output (predict values)
smoothed10 <- predict(loessMod10)
smoothed25 <- predict(loessMod25)
smoothed50 <- predict(loessMod50)
smoothed100 <- predict(loessMod100)
Now the original data can be plotted along with the smoothed, predicted values. Comparing several spans shows the effect of the smoothing level.
#Create plots of the original data with each smoothed fit
#10% smoothing
plot(economics$date, economics$uempmed, type="l", main="Loess Smoothing and Prediction (span=0.10)", xlab="Date", ylab="Median Unemployment Duration (weeks)")
lines(economics$date, smoothed10, col="red")
#25% smoothing
plot(economics$date, economics$uempmed, type="l", main="Loess Smoothing and Prediction (span=0.25)", xlab="Date", ylab="Median Unemployment Duration (weeks)")
lines(economics$date, smoothed25, col="red")
#50% smoothing
plot(economics$date, economics$uempmed, type="l", main="Loess Smoothing and Prediction (span=0.50)", xlab="Date", ylab="Median Unemployment Duration (weeks)")
lines(economics$date, smoothed50, col="red")
#100% smoothing
plot(economics$date, economics$uempmed, type="l", main="Loess Smoothing and Prediction (span=1.0)", xlab="Date", ylab="Median Unemployment Duration (weeks)")
lines(economics$date, smoothed100, col="red")
What would be a really good question here about the smoothing level?
An excellent question would be: "How do you determine the optimal smoothing value (span) for loess?" Span optimization is not covered in detail here.
Are you able to identify cases of underfitting and overfitting in the previous plots? Explain.
The 10% span appears to be a case of overfitting: the 'smooth' function still retains a lot of noise and erratic variation. The 100% span is clearly underfit, as it does not capture the peaks in the series. A good value likely lies between 0.25 and 0.5, but confirming that would require further analysis to optimize the span, as sketched below.
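As one way to make that further analysis concrete, here is a minimal sketch of span selection by k-fold cross-validation. It is an illustration, not part of the original lesson: the grid of candidate spans, the number of folds, and the use of `loess.control(surface = "direct")` (so the model can predict at held-out points outside a training subset's range) are all assumptions.
#a minimal sketch (not from the lesson): choose the span by 5-fold cross-validation
set.seed(42)
spans <- seq(0.20, 1.00, by = 0.05)  #assumed grid of candidate spans
k <- 5
folds <- sample(rep(1:k, length.out = nrow(economics)))  #random fold assignment
cv_mse <- sapply(spans, function(s) {
  mean(sapply(1:k, function(f) {
    train <- economics[folds != f, ]
    test  <- economics[folds == f, ]
    #surface="direct" lets loess predict outside the training subset's range
    fit <- loess(uempmed ~ index, data = train, span = s,
                 control = loess.control(surface = "direct"))
    mean((test$uempmed - predict(fit, newdata = test))^2)  #held-out MSE
  }))
})
spans[which.min(cv_mse)]  #span with the lowest estimated prediction error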
Predictions can be made from the fitted loess model for values within the fitted data range.
#choose a data point within the fitted range
selectedIndex <- 47
#get the date for this data point
economics$date[selectedIndex]
## [1] "1971-05-01"
#get the actual value for this data point
economics$uempmed[selectedIndex]
## [1] 6.7
#get the predicted value for this data point using our model
predict(loessMod50, newdata=data.frame(index=selectedIndex))
## 1
## 6.215168
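One caveat worth noting (an observation about loess defaults, not from the original text): with the default interpolation surface, `predict()` will not extrapolate beyond the fitted range of the predictor, so an index outside 1-80 returns `NA`.
#outside the fitted index range (1-80), the default loess surface does not extrapolate
predict(loessMod50, newdata=data.frame(index=100))  #returns NA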