# general libraries
library(tidyverse) # data organization and plotting
library(knitr) # nice tables and R Markdown support
library(ggplot2) # grammar-of-graphics plotting; also provides the economics data set
# libraries for the RDD example
library(foreign) # read .dta data format
library(rdd) # regression discontinuity
library(rddtools) # regression discontinuity
“Nonparametric regression analysis relaxes the assumption of linearity, substituting the much weaker assumption of a smooth population regression function.”
In what important way does nonparametric regression contrast with linear regression?
It makes no assumption of linearity; it assumes only that the population regression function is smooth.
Nonparametric simple regression is nevertheless useful for two reasons:
The concept of binning is central to nonparametric regression. Consider a continuous independent variable (X) and a continuous dependent variable (Y), such as age and its relationship to salary. You might expect a positive relationship, but not necessarily a linear one. Binning creates local groups along the X variable and then finds the average Y value within each group. This works well with a large data set but less well with a small one (a short sketch follows below).
What happens during ‘binning’?
Average dependent variable values are found within small subsets of the independent variable. You end up with a unique y-hat for each bin of x values.
What is better for estimation: larger bins or smaller bins?
It depends. Smaller bins capture more local variation in the data but yield a more erratic function; larger bins give a smoother estimate but may miss important variation. The bin size (*h*, or bandwidth) is a choice the analyst must make.
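To make the binning idea concrete, here is a minimal sketch on simulated data. The age/salary values, the relationship, and the 5-year bin width are illustrative assumptions, not part of the lesson.
#a minimal binning sketch on simulated age/salary data (values are illustrative)
set.seed(1)
age    <- runif(500, 20, 65)                        #continuous X
salary <- 10000 * log(age) + rnorm(500, sd = 3000)  #nonlinear, positive relationship
#group X into bins of width h = 5 years
bins <- cut(age, breaks = seq(20, 65, by = 5), include.lowest = TRUE)
tapply(salary, bins, mean)                          #one y-hat (mean salary) per bin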
More information on Nonparametric Regression in R
Let’s take a look at an example of local polynomial regression fitting (loess) smoothing, a type of nonparametric regression. For this example, we will use the economics data set found in ggplot2, which represents a US economic time series. We will examine the median duration of unemployment (in weeks) over time.
The data set includes the following variables:

- `date`: month of data collection
- `pce`: personal consumption expenditures, in billions of dollars
- `pop`: total population, in thousands
- `uempmed`: median duration of unemployment, in weeks
Load and prepare the data.
#load the data
data(economics, package="ggplot2")
#create an index variable
economics$index <- 1:nrow(economics)
#limit data to 80 rows
#this is easier for understanding the smoothing results
economics <- economics[1:80, ]
#examine the data
head(economics)
Carry out the smoothing process.
#Fit a polynomial surface determined by one or more numerical predictors, using local fitting.
loessMod10 <- loess(uempmed ~ index, data=economics, span=0.10) # 10% smoothing span
loessMod25 <- loess(uempmed ~ index, data=economics, span=0.25) # 25% smoothing span
loessMod50 <- loess(uempmed ~ index, data=economics, span=0.50) # 50% smoothing span
loessMod100 <- loess(uempmed ~ index, data=economics, span=1.0) # 100% smoothing span
#Get smoothed output (predict values)
smoothed10 <- predict(loessMod10)
smoothed25 <- predict(loessMod25)
smoothed50 <- predict(loessMod50)
smoothed100 <- predict(loessMod100)
Now the original data can be plotted along with the smoothed, predicted values. Comparing several spans shows the effect of the smoothing level.
#Create plots of the original data with each smoothed fit
#10% smoothing
plot(economics$date, economics$uempmed, type="l", main="Loess Smoothing and Prediction (span=0.10)", xlab="Date", ylab="Median Unemployment Duration (weeks)")
lines(economics$date, smoothed10, col="red")
#25% smoothing
plot(economics$date, economics$uempmed, type="l", main="Loess Smoothing and Prediction (span=0.25)", xlab="Date", ylab="Median Unemployment Duration (weeks)")
lines(economics$date, smoothed25, col="red")
#50% smoothing
plot(economics$date, economics$uempmed, type="l", main="Loess Smoothing and Prediction (span=0.50)", xlab="Date", ylab="Median Unemployment Duration (weeks)")
lines(economics$date, smoothed50, col="red")
#100% smoothing
plot(economics$date, economics$uempmed, type="l", main="Loess Smoothing and Prediction (span=1.0)", xlab="Date", ylab="Median Unemployment Duration (weeks)")
lines(economics$date, smoothed100, col="red")
What would be a really good question here about the smoothing level?
An excellent question would be: "How do you determine the optimal smoothing value (span) for loess?" Span optimization is not covered in detail here.
Are you able to identify cases of underfitting and overfitting in the previous plots? Explain.
The 10% span appears to be a case of overfitting: the 'smooth' function still retains a lot of noise and erratic variation. The 100% span is clearly underfit, as it does not capture the peaks in the series. A good value likely lies between 0.25 and 0.5, but confirming that would require further analysis to optimize the span, as sketched below.
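As one way to make that further analysis concrete, here is a minimal sketch of span selection by k-fold cross-validation. It is an illustration, not part of the original lesson: the grid of candidate spans, the number of folds, and the use of `loess.control(surface = "direct")` (so the model can predict at held-out points outside a training subset's range) are all assumptions.
#a minimal sketch (not from the lesson): choose the span by 5-fold cross-validation
set.seed(42)
spans <- seq(0.20, 1.00, by = 0.05)  #assumed grid of candidate spans
k <- 5
folds <- sample(rep(1:k, length.out = nrow(economics)))  #random fold assignment
cv_mse <- sapply(spans, function(s) {
  mean(sapply(1:k, function(f) {
    train <- economics[folds != f, ]
    test  <- economics[folds == f, ]
    #surface="direct" lets loess predict outside the training subset's range
    fit <- loess(uempmed ~ index, data = train, span = s,
                 control = loess.control(surface = "direct"))
    mean((test$uempmed - predict(fit, newdata = test))^2)  #held-out MSE
  }))
})
spans[which.min(cv_mse)]  #span with the lowest estimated prediction error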
Predictions can be made from the fitted loess model for values within the fitted data range.
#choose a data point within the fitted range
selectedIndex <- 47
#get the date for this data point
economics$date[selectedIndex]
## [1] "1971-05-01"
#get the actual value for this data point
economics$uempmed[selectedIndex]
## [1] 6.7
#get the predicted value for this data point using our model
predict(loessMod50, newdata=data.frame(index=selectedIndex))
## 1
## 6.215168
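One caveat worth noting (an observation about loess defaults, not from the original text): with the default interpolation surface, `predict()` will not extrapolate beyond the fitted range of the predictor, so an index outside 1-80 returns `NA`.
#outside the fitted index range (1-80), the default loess surface does not extrapolate
predict(loessMod50, newdata=data.frame(index=100))  #returns NA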