You are a data scientist on a research team studying circadian rhythms of floral scent emission. Some species of flowering plants emit insect-attracting scents on a 24-hour cycle; the petunia is one real-world example.
Assume your team has collected the following data from flowers of the totally-not-real species Megaflora bostoniensis (simulated below). Let X represent time in hours and let Y represent the production of scent compounds in parts per million (ppm).
#make the simulation reproducible
set.seed(123)
sampleSize <- 300
#sample observation times uniformly over roughly four days (0 to 30*pi hours)
X = runif(sampleSize, min=0, max=30*pi)
#sine-wave parameters: amplitude (a), phase shift (h), period control (b), vertical shift (k)
a = 3
h = 5.8
b = 3.8
k = 2
#scent level: a sine wave with period 2*pi*b (about 24 h) plus Gaussian noise
Y = a * (sin((X-h)/b)+k) + rnorm(sampleSize, sd=0.7)
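As an aside, X spans roughly four days of clock time, which is why the plots below extend the x-axis to 96 hours. A quick check:
#X was drawn from 0 to 30*pi, i.e. about 94.25 hours
range(X)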
We can use a boxplot to explore the quartiles of the response.
#create a boxplot for the 'scent' response variable
boxplot(Y, col="#F8F8F8", las=2, ylab="Scent (ppm)")
We can use a histogram to explore the distribution.
#create a histogram for the 'scent' response variable
#suppress the y axis so we can draw a custom one
hist(Y, col="#F8F8F8", xlab="Scent (ppm)", yaxt="n")
#custom y axis with rotated values and a new tick sequence
axis(2, at=seq(0,60,5), las=2)
An exploratory plot reveals the cyclical pattern in the data. What is the period (time for one cycle) of this wave?
#create plot with no axis labels
plot(X,Y, pch=20,
xlim=c(0,96),ylim=c(0,12),
axes=FALSE,
xlab="Time (h)",
ylab="Scent (ppm)")
#add custom X axis labels
axis(1, at = seq(0, 96, by = 2), las=1)
#add custom Y axis labels
axis(2, at = seq(0, 12, by = 1), las=1)
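We can answer this from the generating code: sin((X-h)/b) completes one cycle when (X-h)/b increases by 2*pi, so the period is 2*pi*b.
#period of the simulated wave in hours
2 * pi * b  #= 23.8761, i.e. roughly one day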
#create a data frame from the data set
df<-data.frame(X,Y)
#head(df)
#ggplot2 is needed for the plots below
library(ggplot2)
#set up the plot (stored as 'p' to avoid masking the base plot() function)
p<-ggplot(df,aes(x=X,y=Y))+
geom_point(color="#FF3300")+
labs(x="Time (h)", y="Scent (ppm)",title="Circadian Rhythm of Floral Scent", subtitle="Megaflora bostoniensis", caption="Data Source: simulated data")+
theme_classic()+
scale_x_continuous(limits=c(0,100), breaks=seq(0,100,by=5))+
scale_y_continuous(limits=c(0,12), breaks=seq(0,12,by=2))+
theme(
text = element_text(size=18),
axis.title= element_text(size=20, color = "#333333"),
axis.text.x = element_text(angle=0, hjust=1),
plot.title = element_text(color = "#333333", size = 24, face = "bold"),
plot.subtitle = element_text(color = "#333333", face = "italic"),
plot.caption = element_text(color = "#333333")
)
#draw the plot
p
Linear regression models a linear relationship between X and Y. Here the relationship is clearly nonlinear, so a straight line is a poor approximation of how the two variables are related. A different method is required to capture the relationship more accurately.
#create linear regression model
fit = lm(Y~X)
#create plot with no axis labels
plot(X,Y, pch=21,
col="#CCCCCC", bg="#F0F0F0",
xlim=c(0,96),ylim=c(0,12),
axes=FALSE,
xlab="Time (h)",
ylab="Scent (ppm)")
#add custom X axis labels
axis(1, at = seq(0, 96, by = 2), las=1)
#add custom Y axis labels
axis(2, at = seq(0, 12, by = 1), las=1)
#add the linear regression line of best fit
abline(fit, lwd=2, col="#FF3300")
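As a quick, optional check (not part of the original analysis), the model's R-squared quantifies how little of the variation the straight line explains:
#proportion of variance explained by the straight-line fit;
#a low value reflects how little of the cyclical pattern a line captures
summary(fit)$r.squared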
Note “se=FALSE” in the geom_smooth() call below. If that argument is set to ‘TRUE’, the regression line is drawn with a standard-error band.
ggplot(df,aes(x=X,y=Y))+
geom_point(color="#CCCCCC",fill="#F0F0F0", pch=21,size=3)+
labs(x="Time (h)", y="Scent (ppm)",title="Circadian Rhythm of Floral Scent", subtitle="Megaflora bostoniensis", caption="Data Source: simulated data")+
theme_classic()+
scale_x_continuous(limits=c(0,100), breaks=seq(0,100,by=5))+
scale_y_continuous(limits=c(0,12), breaks=seq(0,12,by=2))+
theme(
text = element_text(size=18),
axis.title= element_text(size=20, color = "#333333"),
axis.text.x = element_text(angle=0, hjust=1),
plot.title = element_text(color = "#333333", size = 24, face = "bold"),
plot.subtitle = element_text(color = "#333333", face = "italic"),
plot.caption = element_text(color = "#333333")
)+
#color set outside aes() so it is used literally (inside aes() it would be
#treated as a data value and mapped to the default palette)
geom_smooth(method='lm', se=FALSE, color="#0033FF")
The idea of kernel regression is that, for a given point x, we estimate the regression function as a weighted average of the responses (Y) of nearby points. Observations whose covariate (X) is close to x receive a high weight; observations whose covariate is far from x receive a low weight (or even no weight). For example, at x = 4, points whose covariate X is near 4 are weighted most heavily, and the estimate at 4 is the weighted average of all the responses.
The kernel function k(x) determines how the weight decreases for observations far from x, and the smoothing bandwidth (h) controls how quickly it decreases. In a sense, we use the kernel function and smoothing bandwidth to smooth out the data points and obtain a local estimate of the regression function.
When we set kernel=“normal”, we use the Gaussian function as the kernel. An alternative is kernel=“box”, which gives equal weight to all points whose distance to x is less than h.
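To make the weighted-average idea concrete, here is a minimal hand-rolled sketch of the estimate at a single point, assuming Gaussian weights with standard deviation h. The nw_at() helper is illustrative only; its numbers will not match ksmooth() exactly, because ksmooth() scales its bandwidth argument differently.
#estimate the regression function at one point x0 as a weighted
#average of all responses, with weights that shrink as |X - x0| grows
nw_at <- function(x0, X, Y, h) {
  w <- dnorm((X - x0) / h)  #largest for observations near x0
  sum(w * Y) / sum(w)
}
#estimated scent level at hour 4, using h = 1
nw_at(4, X, Y, 1)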
Let’s create three versions of the estimated function using different bandwidth (h) values. The kernel type is ‘normal’, which uses a Gaussian function to weight observations close to x more heavily.
#calculate kernel regression (kr) line using a few different bandwidth values
kr1 = ksmooth(x=X,y=Y,kernel = "normal",bandwidth = 0.1)
kr2 = ksmooth(x=X,y=Y,kernel = "normal",bandwidth = 1.0)
kr3 = ksmooth(x=X,y=Y,kernel = "normal",bandwidth = 10.0)
The model in this first plot does not fit well: it is not smooth at all. The bandwidth value is too low.
#PLOT 1
#create plot with no axis labels
plot(X,Y, pch=21,
col="#CCCCCC", bg="#F0F0F0",
xlim=c(0,96),ylim=c(0,12),
axes=FALSE,
xlab="Time (h)",
ylab="Scent (ppm)")
#add custom X axis labels
axis(1, at = seq(0, 96, by = 2), las=1)
#add custom Y axis labels
axis(2, at = seq(0, 12, by = 1), las=1)
lines(kr1, lwd=4, col="#FF3300")
legend("topright", c("h=0.1"), lwd=6, col=c("#FF3300"))
This model fits the data better but is still not smooth enough; the jagged shape indicates that noise remains in the estimated function. This bandwidth value is an improvement, but still too low.
#PLOT 2
#create plot with no axis labels
plot(X,Y, pch=21,
col="#CCCCCC", bg="#F0F0F0",
xlim=c(0,96),ylim=c(0,12),
axes=FALSE,
xlab="Time (h)",
ylab="Scent (ppm)")
#add custom X axis labels
axis(1, at = seq(0, 96, by = 2), las=1)
#add custom Y axis labels
axis(2, at = seq(0, 12, by = 1), las=1)
lines(kr2, lwd=4, col="#FF3300")
legend("topright", c("h=1.0"), lwd=6, col=c("#FF3300"))
This model is too smooth and does not match the data very well either. This bandwidth value is too high.
#PLOT 3
#create plot with no axis labels
plot(X,Y, pch=21,
col="#CCCCCC", bg="#F0F0F0",
xlim=c(0,96),ylim=c(0,12),
axes=FALSE,
xlab="Time (h)",
ylab="Scent (ppm)")
#add custom X axis labels
axis(1, at = seq(0, 96, by = 2), las=1)
#add custom Y axis labels
axis(2, at = seq(0, 12, by = 1), las=1)
lines(kr3, lwd=4, col="#FF3300")
legend("topright", c("h=10.0"), lwd=6, col=c("#FF3300"))
Notice that when the bandwidth (h) is small, the estimated function is highly variable but has small bias, since the curve lies close to the data; in effect, a small bandwidth means less smoothing. With a large bandwidth, smoothing increases (low variability) but bias grows. The goal is to balance variability and bias by choosing an optimal bandwidth, a selection we can make through a process called cross-validation.
Cross-validation estimates the prediction error of our regression estimator. The process divides the data set into two parts: a ‘training’ set and a ‘validation’ set. Splitting the data this way mitigates the problem of overfitting.
The method used here is leave-one-out cross-validation. The data set is split many times, and an error value is calculated for each split. In each split, one observation is held out as the validation set; the remaining values are used to fit the model and predict the held-out value. The mean of the squared errors estimates the prediction error for a given bandwidth, and many values of h are tried to map the relationship between bandwidth and prediction quality.
Note: The ksmooth() function computes the Nadaraya–Watson kernel regression estimate. Its kernels are scaled so that their quartiles sit at ±0.25*bandwidth.
The loop below carries out the procedure:
#number of observations
n = length(X)
#candidate smoothing bandwidths (h)
h_seq = seq(from=1.0, to=5.0, by=0.1)
#empty vector to hold the mean error for each bandwidth
CV_err_h = rep(NA, length(h_seq))
for(j in 1:length(h_seq)){
  #select the bandwidth for this pass
  h_using = h_seq[j]
  #temporary vector for the leave-one-out errors
  CV_err = rep(NA, n)
  for(i in 1:n){
    #validation set: the single held-out observation
    X_val = X[i]
    Y_val = Y[i]
    #training set: everything else
    X_tr = X[-i]
    Y_tr = Y[-i]
    #predict the held-out Y with ksmooth() at the chosen bandwidth
    Y_val_predict = ksmooth(x=X_tr,
                            y=Y_tr,
                            kernel = "normal",
                            bandwidth = h_using,
                            x.points = X_val)
    #squared difference between the observed and predicted Y
    CV_err[i] = (Y_val - Y_val_predict$y)^2
  }
  #mean of the squared errors for this bandwidth
  CV_err_h[j] = mean(CV_err)
}
We can inspect the validation and training sets from the final loop iteration to see what they contain.
head(data.frame(X_val,Y_val))
head(data.frame(X_tr,Y_tr))
Finally, let’s print the list of errors to see the values.
#print the list of mean squared errors for each value of h
CV_err_h
## [1] 0.5986115 0.5909694 0.5847113 0.5793305 0.5746780 0.5706373 0.5670490
## [8] 0.5638779 0.5610762 0.5585901 0.5564044 0.5544821 0.5528276 0.5513792
## [15] 0.5501665 0.5491311 0.5482943 0.5476274 0.5471176 0.5467580 0.5464952
## [22] 0.5463934 0.5464156 0.5465476 0.5468091 0.5471883 0.5476957 0.5483336
## [29] 0.5490952 0.5500051 0.5510583 0.5522540 0.5536037 0.5551381 0.5568374
## [36] 0.5587220 0.5608079 0.5630801 0.5655675 0.5682751 0.5712234
We can print the minimum error, then find the bandwidth value associated with it.
cat("The minimum error is:", min(CV_err_h))
## The minimum error is: 0.5463934
#dplyr provides the pipe and slice() used below
library(dplyr)
#store the errors in a data frame alongside the sequence of h values
df<-data.frame(h_seq,CV_err_h)
#head(df)
#get the row with the bandwidth (h) that corresponds to the minimum error
#(named 'best' rather than 'min' to avoid masking base::min)
best<-df %>% slice(which.min(CV_err_h))
best
cat("The bandwidth corresponding to the minimum error:", best$h_seq)
## The bandwidth corresponding to the minimum error: 3.1
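For comparison, the same lookup can be done in base R without dplyr:
#index of the smallest error, then the matching bandwidth
h_seq[which.min(CV_err_h)]
## [1] 3.1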
Our goal was to find the ‘best’ bandwidth, i.e., the value associated with the lowest mean error. We can also plot the bandwidth values against the resulting error values to see the relationship.
#plot bandwidth vs error
#look for lowest point
plot(x=h_seq, y=CV_err_h,
type="p", pch=21,
col="#CCCCCC",
bg="#F0F0F0",
xlim=c(1,5),
ylim=c(0.54,0.6),
axes=FALSE,
xlab="Smoothing bandwidth (h)",
ylab="LOOCV prediction error (MSE)")
#add lines for clarity
abline(v=best$h_seq,col="red",lty=2)
abline(h=best$CV_err_h,col="red",lty=2)
#add a single, slightly larger accent point for the lowest data point
#place code after the lines so the point draws on top of them
points(best$h_seq,best$CV_err_h,bg="red",col="red", pch=21,cex=1.5)
#add custom X axis labels
axis(1, at = seq(1, 5, by = 0.2), las=1)
#add custom Y axis labels
axis(2, at = seq(0.54, 0.6, by = 0.01), las=1)
The cross-validation analysis provides a ‘best’ (optimized) value for h, so we can repeat the previous plot with this suggested value and visualize the predicted function. Because of the cross-validation process, we have some confidence that this bandwidth is better than the other choices.
kregCV = ksmooth(x=X,y=Y,kernel = "normal",bandwidth = best$h_seq)
plot(X,Y, pch=21,
col="#CCCCCC", bg="#F0F0F0",
xlim=c(0,96),ylim=c(0,12),
axes=FALSE,
xlab="Time (h)",
ylab="Scent (ppm)")
#add custom X axis labels
axis(1, at = seq(0, 96, by = 2), las=1)
#add custom Y axis labels
axis(2, at = seq(0, 12, by = 1), las=1)
#add the kernel regression curve
lines(kregCV, lwd=2, col="#FF3300")
#add legend
legend("topright", paste("h=",min$h_seq), lwd=4, col=c("#FF3300"))
Our final modeled function is our current best estimate of how scent (ppm) responds to time (h). The prediction shows that scent levels vary over each 24-hour period, peaking around noon and bottoming out around midnight.
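This matches the simulation parameters: the noiseless curve peaks where (X-h)/b equals pi/2 and bottoms out where it equals 3*pi/2.
#times (within the first day) at which the noiseless curve peaks and dips
h + b * pi/2    #= 11.77, close to noon
h + b * 3*pi/2  #= 23.71, close to midnight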
Other methods exist for fitting a smooth curve to nonlinear data. One option is the smooth.spline() function, which fits a cubic smoothing spline and, by default, chooses its smoothing parameter by generalized cross-validation.
fit<-smooth.spline(X, Y)
df<-data.frame(X,Y)
plot(df$X,df$Y,pch=21,
col="#CCCCCC", bg="#F0F0F0",
xlab="Time (h)", ylab="Scent (ppm)",
xlim=c(0,96),ylim=c(0,12),axes=FALSE)
#add custom X axis labels
axis(1, at = seq(0, 96, by = 2), las=1)
#add custom Y axis labels
axis(2, at = seq(0, 12, by = 1), las=1)
#add spline
lines(fit$x, fit$y, col = "#FF3300", lwd = 2)
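If you want to see how much smoothing smooth.spline() settled on, the fitted object reports it:
#smoothing parameter chosen by generalized cross-validation (the default)
fit$spar
#equivalent degrees of freedom of the fitted spline
fit$df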
You may be interested in visualizing the joint distribution of the data set. Here is a method using ggplot2. Note the use of geom_density2d() and the bunching of data around the daily extremes, near midday and midnight, where the curve flattens.
ggplot(df,aes(x=X,y=Y))+
geom_point(color="#CCCCCC",fill="#F0F0F0", pch=21,size=3)+
geom_density2d(col="#FF3300")+
labs(x="Time (h)", y="Scent (ppm)",title="Circadian Rhythm of Floral Scent", subtitle="Megaflora bostoniensis", caption="Data Source: simulated data")+
theme_classic()+
scale_x_continuous(limits=c(0,100), breaks=seq(0,100,by=5))+
scale_y_continuous(limits=c(0,12), breaks=seq(0,12,by=2))+
theme(
text = element_text(size=18),
axis.title= element_text(size=20, color = "#333333"),
axis.text.x = element_text(angle=0, hjust=1),
plot.title = element_text(color = "#333333", size = 24, face = "bold"),
plot.subtitle = element_text(color = "#333333", face = "italic"),
plot.caption = element_text(color = "#333333")
)
Here is a variation on the same plot. Note the order of the density layer and the data points: the points are drawn last so they appear on top of the density information. The point colors have been changed, and a zero-alpha fill creates hollow points. The points can be toggled on or off by removing or keeping the comment mark.
ggplot(df,aes(x=X,y=Y))+
stat_density_2d(aes(fill = after_stat(level)), geom = "polygon") +
scale_fill_continuous(low="#F0F0F0", high="#FF3300") +
#geom_point(color="#333333",fill="#33333300", pch=21,size=3)+
labs(x="Time (h)", y="Scent (ppm)",title="Circadian Rhythm of Floral Scent", subtitle="Megaflora bostoniensis", caption="Data Source: simulated data",fill = "Density")+
theme_classic()+
scale_x_continuous(limits=c(0,100), breaks=seq(0,100,by=5))+
scale_y_continuous(limits=c(0,12), breaks=seq(0,12,by=2))+
theme(
text = element_text(size=18),
axis.title= element_text(size=20, color = "#333333"),
axis.text.x = element_text(angle=0, hjust=1),
plot.title = element_text(color = "#333333", size = 24, face = "bold"),
plot.subtitle = element_text(color = "#333333", face = "italic"),
plot.caption = element_text(color = "#333333")
)