Data Analysis for Social Scientists

Helpful Hints

You might find the R Cheatsheets helpful for remembering many of the useful R commands. You may also be interested in “Semiology of Graphics” (1967) by Jacques Bertin or any of the books by Edward Tufte.

Intended Learning Objectives

This module focuses on improving a stock price graphic by increasing the data-ink ratio and reducing chartjunk. Edward Tufte describes the two ideas as follows.

Data-ink ratio: “A large share of ink on a graphic should present data-information, the ink changing as the data change. Data-ink is the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented.” (Edward Tufte)

Chartjunk: “The interior decoration of graphics generates a lot of ink that does not tell the viewer anything new. The purpose of decoration varies — to make the graphic appear more scientific and precise, to enliven the display, to give the designer an opportunity to exercise artistic skills. Regardless of its cause, it is all non-data-ink or redundant data-ink, and it is often chartjunk.” (Edward Tufte)

By the end of this module, you should be able to:

Recognize when plots violate Tufte’s design principles
Determine plot types that most accurately represent the data
Describe how you would create an improved plot using Tufte’s design principles
Bonus: Create & justify improved plots

Netflix vs Tesla: 100 Days

The original plot compared closing prices of Netflix (NFLX) and Tesla (TSLA) stock over two different windows of time. The comparison suggests that TSLA is following a trend similar to the one seen earlier in Netflix's closing price. Is this an accurate comparison?

Eddie Yoon, founder of think tank EddieWouldGrow, …says that Tesla’s stock in 2019 looks very similar to Netflix’s stock performance in 2011. Image Source.

Questions

Is the message clear?
Are the colors helpful?
Do the data support the intended story?
Is the 3D effect of the line justified?
Is the Netflix axis equivalent to the Tesla axis?
Assuming the time frame is the same for both stocks, what is the approximate slope for NFLX and for TSLA? Is that clear from the visualization?
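
One way to make the slope question concrete is to fit a straight line to each series and read off the price change per day. The sketch below uses invented prices (`stock_a` and `stock_b` are hypothetical stand-ins, not the real NFLX or TSLA data) and also expresses each slope relative to the starting price, since a raw dollar slope depends on the price level.

```r
# Hypothetical closing prices, invented for illustration only
day     <- 1:100
stock_a <- 40 - 0.25 * day     # falls 0.25 dollars per day
stock_b <- 350 - 1.50 * day    # falls 1.50 dollars per day

# Slope of a least-squares line: price change per day
slope_a <- unname(coef(lm(stock_a ~ day))[2])
slope_b <- unname(coef(lm(stock_b ~ day))[2])

# Dollar slopes depend on the price level, so also express them
# as a fraction of each day-1 price
rel_a <- slope_a / stock_a[1]
rel_b <- slope_b / stock_b[1]

round(c(slope_a = slope_a, slope_b = slope_b), 2)
round(100 * c(rel_a = rel_a, rel_b = rel_b), 2)  # percent of day-1 price per day
```

In this made-up case the second series has the steeper dollar slope but the shallower relative slope, which is exactly the kind of difference a dual-scale graphic can hide.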

The Good, the Bad, and the Ugly

We can summarize some thoughts about the original graphic before starting on an improved version.

library(knitr)  # provides kable()

good <- c("complementary colors", "strong visual impact")
bad  <- c("scale", "distracting graphics")
ugly <- c("3D lines", "misleading")

df <- data.frame(good, bad, ugly)
names(df) <- c("The Good", "The Bad", "The Ugly")

kable(df)

The Good	The Bad	The Ugly
complementary colors	scale	3D lines
strong visual impact	distracting graphics	misleading

Analysis

Our goal is to recreate the data visualization with data integrity as the priority. We will also recreate the overlapped lines from the original graphic to assess how accurate that plot is.

First, we use library(quantmod) to import stock data. The quantmod documentation describes the package this way: “The quantmod package for R is designed to assist the quantitative trader in the development, testing, and deployment of statistically based trading models.”

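library(quantmod)  # getSymbols(); also loads xts and zoo
library(ggplot2)   # ggplot(), fortify()
library(tibble)    # rowid_to_column()
library(dplyr)     # slice()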

#set dates for stock price retrieval
start <- as.Date("2011-06-01")
end <- as.Date("2013-11-01")

start2 <- as.Date("2019-01-01")
end2 <- as.Date("2019-07-01")

#import data using the dates above
getSymbols("NFLX", src = "yahoo", from = start, to = end)

## [1] "NFLX"

getSymbols("TSLA", src = "yahoo", from = start2, to = end2)

## [1] "TSLA"

#plot(NFLX[,"NFLX.Close"], main = "NFLX Close")
#plot(TSLA[,"TSLA.Close"], main = "TSLA Close")

#creates an xts time series
nf <- as.xts(data.frame(NFLX = NFLX[,4]))
ts <- as.xts(data.frame(TSLA = TSLA[,4]))

#fortify will convert xts to a dataframe with Index as a column
nff <- fortify(nf)
tsf <- fortify(ts)

#convert Index columns in each dataframe to date type 
#(scale_x_date below only works with date types)
nff$Index<-as.Date(nff$Index)
tsf$Index<-as.Date(tsf$Index)
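
The conversion above matters because scale_x_date only accepts Date values. A minimal sketch of the check-then-convert pattern, using a made-up data frame (`prices` and its values are hypothetical) whose date column arrived as character strings:

```r
# Made-up data frame whose date column arrived as character strings
prices <- data.frame(
  Index = c("2019-01-02", "2019-01-03", "2019-01-04"),
  Close = c(310.1, 300.4, 317.7),
  stringsAsFactors = FALSE
)

class(prices$Index)            # "character"

# Convert only when the column is not already a Date
if (!inherits(prices$Index, "Date")) {
  prices$Index <- as.Date(prices$Index, format = "%Y-%m-%d")
}

class(prices$Index)            # "Date"
diff(range(prices$Index))      # date arithmetic now works
```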

Use ggplot() to plot a more honest comparison with accurate scales for time and price. Note the use of ‘scale_x_date’ for the X axis. Be careful when using dates: check the data type using class(myVariable), and if the data are not ‘Date’ type, convert them using as.Date(myVariable). The axis text angle is modified after theme_minimal(), because a complete theme applied afterwards would overwrite that setting. Highlighting is implemented here to draw attention to the regions of interest in each stock data set. Highlighting can help draw user attention to desired features, and color matching can help reinforce associations. Notice that the compared regions represent very little of the Netflix data, but a large proportion of the Tesla data.

ggplot() +
  geom_rect(data=NULL,aes(xmin=as.Date("2011-06-01"),xmax=as.Date("2011-10-20"),ymin=-Inf,ymax=Inf), fill="#00000011")+
  geom_rect(data=NULL,aes(xmin=as.Date("2019-01-02"),xmax=as.Date("2019-05-24"),ymin=-Inf,ymax=Inf), fill="#FF000011")+
  geom_line(data=nff, aes(x=Index, y=NFLX.Close, color="Netflix"))+
  geom_line(data=tsf, aes(x=Index, y=TSLA.Close, color="Tesla"))+
  scale_x_date(date_labels = "%b %y",date_breaks = "6 months")+
  scale_y_continuous(breaks = seq(0, 400, by=50), name="Closing Price ($USD)")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))+
  scale_colour_manual(
  values = c("#333333", "#FF0000"),
  labels = c("Netflix", "Tesla"))+
  labs(color="Company")
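
The text before the plot notes that the highlighted regions cover very little of the Netflix data but a large proportion of the Tesla data. As a rough check, the sketch below computes those shares from the dates used in the code above. It counts calendar days, so the shares of trading days would differ slightly; `window_share` is a helper introduced here for illustration.

```r
# Retrieval ranges and highlighted windows, taken from the code above
nflx_range  <- as.Date(c("2011-06-01", "2013-11-01"))
nflx_window <- as.Date(c("2011-06-01", "2011-10-20"))
tsla_range  <- as.Date(c("2019-01-01", "2019-07-01"))
tsla_window <- as.Date(c("2019-01-02", "2019-05-24"))

# Share of calendar days that a window covers within the full range
window_share <- function(window, full) {
  as.numeric(diff(window)) / as.numeric(diff(full))
}

nflx_share <- window_share(nflx_window, nflx_range)
tsla_share <- window_share(tsla_window, tsla_range)

round(c(NFLX = nflx_share, TSLA = tsla_share), 2)
# NFLX TSLA
# 0.16 0.78
```

Under that calendar-day assumption, the highlighted window is roughly a sixth of the Netflix series and more than three quarters of the Tesla series.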

Now we can force an overlap by scaling TSLA down. We can also add a second Y axis for the Tesla data. Note the use of ‘sec.axis’ to create a scale equal to the existing Y values times 10. This converts the scaled-down data values back to the correct values for the axis. Is this figure easier or more difficult to interpret?

#add ID columns using tibble function rowid_to_column
nff <- rowid_to_column(nff, "ID")
tsf <- rowid_to_column(tsf, "ID")

#merge the dataframes by ID number
mergeDF<-merge(nff,tsf,by="ID")

newPlot<-ggplot() +
  geom_line(data=nff, aes(x=ID, y=NFLX.Close, color="Netflix"))+
  geom_line(data=tsf, aes(x=ID, y=TSLA.Close/10, color="Tesla"))+
  scale_y_continuous("Closing Price ($USD, NFLX)",
                     breaks = seq(15, 50, by=5), 
                     sec.axis = sec_axis(~.*10, name = "Closing Price ($USD, TSLA)",
                                         breaks = seq(150, 500, by=50)))+
  #scale_y_continuous(name="Closing Price ($USD, NFLX)", breaks = seq(15, 40, by=5))+
  #scale_y_continuous(sec.axis = sec_axis(~.*10, name = "Closing Price ($USD, TSLA)",breaks = seq(150, 400, by=50)))+
  scale_colour_manual(values = c("#CCCCCC","#FF3300"),name = "Company")+
  scale_x_continuous(name = "Day")+
  theme_minimal()+
  theme(axis.title.y      = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))+
  theme(axis.title.y.right= element_text(margin = margin(t = 0, r = 0, b = 0, l = 10)))+
  theme(axis.text.y.right = element_text(color = "red"))+
  labs(title = "TECH STOCK THROWBACK?",
              subtitle = "Comparison of closing price (NFLX: 2011-06-01 to 2011-10-20; TSLA: 2019-01-02 to 2019-05-24)",
              caption = "Data Source: Yahoo")
newPlot
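
The arithmetic behind the second axis can be checked without drawing anything. The sketch below uses made-up Tesla-like prices (`tsla_close` is hypothetical) and the same factor of 10 as the plot code: the data are divided by 10 before plotting, and the secondary axis multiplies the labels by 10 to undo that.

```r
# Made-up Tesla-like closing prices, for illustration only
tsla_close   <- c(310, 295, 280, 260, 240)
scale_factor <- 10                       # same factor as in the plot code above

plotted   <- tsla_close / scale_factor   # values drawn against the left (NFLX) axis
relabeled <- plotted * scale_factor      # what sec_axis(~ . * 10) shows on the right

# The round trip recovers the original prices
all.equal(relabeled, tsla_close)         # TRUE

# Each left-axis break maps to a right-axis break ten times larger
left_breaks  <- seq(15, 50, by = 5)
right_breaks <- left_breaks * scale_factor
range(right_breaks)                      # 150 500
```

The factor of 10 is a choice, not a property of the data. A different factor would move the Tesla line relative to the Netflix line, which is one reason dual-axis overlaps should be read with care.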

How does this graphic compare to the new one? Image Source.

We can restrict values to 100 days for a more direct comparison.

Zooming in on the important message can help with clarity. The original plot suggests that perhaps Tesla will rebound like Netflix. Is there any reason to believe this given the current trend? Why or why not?

#select first 100 rows for comparison
selectedDF<-slice(mergeDF, 1:100)
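
It helps to know what mergeDF contains before slicing it. merge() with by = "ID" performs an inner join by default, so only ID values present in both data frames survive. The sketch below shows this with two made-up frames of different lengths (`a` and `b` are hypothetical stand-ins for nff and tsf) and uses base R head() in place of slice().

```r
# Made-up series of different lengths, standing in for nff and tsf
a <- data.frame(ID = 1:6, A.Close = c(30, 29, 28, 27, 26, 25))
b <- data.frame(ID = 1:4, B.Close = c(310, 300, 290, 280))

# merge() defaults to an inner join: only IDs present in both frames survive
both <- merge(a, b, by = "ID")
nrow(both)                     # 4, the length of the shorter series

# Base R counterpart of slice(both, 1:3): keep the first three rows
first_rows <- head(both, 3)
first_rows$ID                  # 1 2 3
```

Because the ID is a row counter and not a date, row 1 of one series is paired with row 1 of the other even though the two were recorded years apart.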

Here’s a focused version of the data. Does the Tesla trend remind you of Netflix as suggested by Eddie Yoon? Why or why not?

ggplot() +
  geom_line(data=selectedDF, aes(x=ID, y=NFLX.Close, color="Netflix"))+
  geom_line(data=selectedDF, aes(x=ID, y=TSLA.Close/10, color="Tesla"))+
  scale_y_continuous("Closing Price ($USD, NFLX)",
                     breaks = seq(15, 40, by=5), 
                     sec.axis = sec_axis(~.*10, name = "Closing Price ($USD, TSLA)",
                                         breaks = seq(150, 400, by=50)))+
  #scale_y_continuous(name="Closing Price ($USD, NFLX)", breaks = seq(15, 40, by=5))+
  #scale_y_continuous(sec.axis = sec_axis(~.*10, name = "Closing Price ($USD, TSLA)",breaks = seq(150, 400, by=50)))+
  scale_colour_manual(values = c("#CCCCCC","#FF3300"),name = "Company")+
  scale_x_continuous(name = "Day")+
  theme_minimal()+
  theme(axis.title.y      = element_text(margin = margin(t = 0, r = 10, b = 0, l = 0)))+
  theme(axis.title.y.right= element_text(margin = margin(t = 0, r = 0, b = 0, l = 10)))+
  theme(axis.text.y.right = element_text(color = "red"))+
  labs(title = "TECH STOCK THROWBACK?",
              subtitle = "Comparison of closing price (NFLX: 2011-06-01 to 2011-10-20; TSLA: 2019-01-02 to 2019-05-24)",
              caption = "Data Source: Yahoo")
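
An alternative to a second axis, not used in the code above, is to rebase each series so that day 1 equals 100 and then compare percent changes on a single scale. The sketch below illustrates the idea with invented prices (`stock_a`, `stock_b`, and the helper `rebase` are hypothetical names).

```r
# Made-up closing prices at very different price levels
stock_a <- c(40, 38, 35, 30, 24)        # hypothetical NFLX-like series
stock_b <- c(350, 340, 320, 300, 290)   # hypothetical TSLA-like series

# Rebase a series so that its first value equals 100
rebase <- function(x) 100 * x / x[1]

indexed <- data.frame(
  Day = seq_along(stock_a),
  A   = rebase(stock_a),
  B   = rebase(stock_b)
)

# Cumulative percent change by the last day
round(indexed[nrow(indexed), c("A", "B")] - 100, 1)
```

In this invented example the first series ends 40 percent below its starting price and the second about 17 percent below, a difference that is easy to read once both share one axis.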

Intended Learning Objectives

Are you confident you can:

Recognize when plots violate Tufte’s design principles
Determine plot types that most accurately represent the data
Describe how you would create an improved plot using Tufte’s design principles
Bonus: Create & justify improved plots
