Seminar I: Principles of Design

Design for Scientific Communication

Introduction

Effective scientific communication requires attention to accuracy along with consideration for a simple message in what could otherwise be complicated content. In this module, we will explore some principles of design philosophy that may be useful to you as students and professionals. This list is not exhaustive and is intended as a starting point. At the end of this module, you should be capable of creating and explaining improved plots that are more-accurate, well-formatted charts designed to visualize biological data.

Data

The data sets used in this module include:

# Load useful libraries
library(igraph)    #for network data
library(networkD3) #for network data / interactives
library(tibble)    #for data organization
library(plyr)      # data tools
library(tidyverse) # a collection of data organization and visualization tools
library(dplyr)     # used to wrangle data
library(ggplot2)   # plotting with the 'grammar of graphics'
library(lubridate) # makes date data easier to work with
library(readr)     # helps with time information
library(stringr)   # helps with strings (text)
library(viridisLite) #color palettes
library(knitr)     # for tables

#read in the original data sets
gsaf5<-read_csv("data/global-shark-attacks/GSAF5.csv")
flyData<-read_csv("data/drosophila/genes-genscan.csv")
lsData<-read_csv("data/unitedNations/lifespan.csv")

#save the data as some other name so you can retain the original safely
#we will use names that are easy to remember and type
attacks<-as_tibble(gsaf5)
fly<-as_tibble(flyData)
ls<-as_tibble(lsData)

#take a look at the first few rows
#head(attacks, n=3)
#head(seizures, n=3)

Marks

The ‘marks’ of a visualization are the main components independent of any decoration or placement.

Points
Lines
Areas

Points

Let’s take a look at a simple example using the ‘ggplot2’ library and points.

Lines

Lines are another type of mark. What do the points and lines in this visualization of dolphins’ frequent associations represent?

Areas

If points are 0-dimensional, and lines are 1-dimensional, then areas are 2-dimensional. We won’t cover it here, but volumes are the next step with a 3-dimensional appearance. They have more limited use in data visualization, but do play a role. Areas represent two dimensions of data. Here we see the United States distribution of the tree species Sassafras albidum as an area plotted on a map. We use the ‘leaflet’ library to display the map and the ‘rgdal’ library to read in the shapefile data. This type of information is common in Geographic Information Systems (GIS) research.

Channels

Position
- x,y coordinate
- represents: magnitude
Size
- relative scale
- represents: magnitude
Shape
- data point style
- represents: category
Value
- greyscale value
- represents: magnitude
Color
- hue
- represents: category, magnitude

Position

Position is a very strong channel and is used frequently to indicate differences in data. Here we see the magnitude along the X axis represent the year while the magnitude along the Y axis represents time of day. The position of each data point indicates the year and time of day of the recorded incident.

Size

Size can be incorporated into a plot to indicate an additional variable. Here we see the magnitude along the X axis represent the year while the magnitude along the Y axis represents time of day. A third variable, age of the victim, is represented by the size of the data point. The alpha channel, or transparency, of the data point has been reduced from 1 to 0.1 to improve visibility. Do you think this is a good way to represent the data in this case? Why or why not?

Color

Color is often used to represent categorical data such as groups, countries, and regions. Color also can be effectively used to represent magnitude quite well. Eye-catching color is a good way to represent groups but requires some attention due to the prevalence of color-blindness and the ability of too many colors muddying the visual waters. Color is most effective when its use is limited and contrast is maximized to highlight the underlying message of the visualization.

Color to Highlight

In this plot, color is used to highlight a specific age range for individuals over age 65. This would be useful if you are interested in telling a story about elderly victims or were interested in exploring trends for this particular age group. It is much easier for the viewer to see your intentions when highlighting is used effectively versus expecting the viewer to visually separate the data points to find your message.

Pop-Quiz! Do you have any ideas why there are no reported attacks for the over-65 group before the mid-1980s?

Color for Magnitude

A quick note on gradients and color choice: consider carefully what your color is intended to convey. Do you have sequential data that are intended to suggest a linear effect? Do you have diverging data that suggest a ‘left’ and a ‘right’ side to a midpoint? Do you have qualitative data that represent categories? Using a poorly-chosen palette for these types of information can be misleading or confusing. Here is an example using a commonly-misused rainbow palette. Ask youself:

what do these colors mean to the viewer?
what is the meaning of green in this example?
How far away from red is the blue?
Does red imply higher or lower altitude than pink?
Why is pink not next to red?

Here we see a diverging color palette. Does this help? It could be better because it looks like the blue area is perhaps not important. Again, what does the white area signify here?

Now consider the use of a gradient that work a little more clearly with terrain data. This color palette is inteded to mimic actual terrain which helps the viewer to make sense of the data intuitively. The result is improved but still a bit confusing perhaps. The green could be the ground, but what does the intense yellow mean?

These data represent positional data and height for a volcano. From the dataset description: “Maunga Whau (Mt Eden) is one of about 50 volcanos in the Auckland volcanic field. This data set gives topographic information for Maunga Whau on a 10m by 10m grid.”

Hexadecimal Colors

Color is most easily represented as, well, a color. Because computers need to have instructions written as text, we must use other representations when programming.

In R code, you may see the color red (isn’t that a bit subjective?) written as ‘red’, ‘rgb(1,0,0)’, or ‘#FF0000’. You probably recognize that first one, but the other two use the red-green-blue and hexadecimal notation. These methods are usually interchangable without too much trouble, but I find the hexadecimal notation to be more broadly useful for web development and other needs.

What is hexadecimal notation and what does it have to do with color?

Very briefly, representing color as a hexadecimal form allows computers to store information more effectively. The values for each digit go from 0 to F, that is: 0-9 and then A,B,C,D,E,F. “FF” is equivalent to 255 and 1 in the other systems. So, FFFFFF is white and 000000 is black.

Web Tools for Color

You can search “color chooser” in the Google Browser to find this tool, but you can also select from many other online tools to find a hexadecimal code.

Fonts

Keep it simple. Remember your audience and the setting of your presentation. I like sans-serif fonts because they are cleaner. Prioritize legible text for whatever medium you are using. Presentation in a room? Make the text larger. Poster? Make the main text larger. Computer display? Smaller is usually okay as the audience will be ‘seated’ closely. If you are unsure, make the text a little larger than usual.

The Story

Start with a message in mind and edit to ensure you are delivering that message. Show your data visualization to a friend or two to hear their interpretation of your intended message. Use the title and caption to clarify if the message is a bit complicated (and there is nothing wrong with that). Simplify if you can: remove extraneous colors, backgrounds, decoration, line weight, and other distracting details. Move the legend into a position that doesn’t distract. Follow principles for accurate representation of the data.

Tufte’s Principles

A compact reading of Tufte in three points:

Prioritize the data
Maximize the data-ink ratio
Eliminate chart junk

Real-world example!

We have access to the fruit fly (Drosophila melanogaster) genome and we would like to determine, from a predicted gene dataset, the approximate frequency of genes that contain three to four exons (coding regions).

We will use a .csv file from the Drosophila data. These particular data were generated using AUGUSTUS software to predict genes from the genome sequence. Specifically, we will examine the ‘genes-genscan.csv’ file and observe the ‘exonCount’ column. Essentially, we will be plotting the exon count frequency for the fruit fly genome as predicted by AUGUSTUS.

To do this, we will use a histogram which groups count values into ranges such as 0-5, 6-10, etc. Histograms are not bar charts and typically display the bars together. The bars touch with no gaps in between representing a continuous series of values along the x axis.

Excel Histogram

Here’s a chart to start. This one (cropped) is the default histogram created in MS Excel and the only change was to the title.

What problems do you see with this chart?

What actions would you take to improve it?

R Histogram

Let’s make some changes! First let’s recreate the histogram with R and the ggplot2 library.

ggplot(fly, aes(x=exonCount)) + 
  geom_histogram()

Let’s add labels and a caption for our data source.

ggplot(fly, aes(x=exonCount)) + 
  geom_histogram() + 
  labs(title = "Predicted Genome Exon Frequency", 
       subtitle = "Species: Drosophila melanogaster",
       x="Count",
       y="Frequency",
       caption="Data: https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome")

If you think the ‘bins’ on the x axis are not quite to your liking, you can adjust them easily with ‘binwidth’. We’ll set the binwidth to 2.

ggplot(fly, aes(x=exonCount)) + 
  geom_histogram(binwidth = 2) + 
  labs(title = "Predicted Genome Exon Frequency", 
       subtitle = "Species: Drosophila melanogaster",
       x="Count",
       y="Frequency",
       caption="Data: https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome")

Let’s adjust the scales on the axes. This way we can see values more clearly and adjust the spacing to our preferred setting.

ggplot(fly, aes(x=exonCount)) + 
  geom_histogram(binwidth = 2) + 
  labs(title = "Predicted Genome Exon Frequency", 
       subtitle = "Species: Drosophila melanogaster",
       x="Count",
       y="Frequency",
       caption="Data: https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome")+
  scale_x_continuous(breaks=seq(2,60,2))+
  scale_y_continuous(breaks=seq(0,10000,1000))

We can make this look a lot better by diminishing the background and adding a pop of color.

ggplot(fly, aes(x=exonCount)) + 
  geom_histogram(binwidth = 2, fill="#03a5fc", color="#00517d") + 
  labs(title = "Predicted Genome Exon Frequency", 
       subtitle = "Species: Drosophila melanogaster",
       x="Count",
       y="Frequency",
       caption="Data: https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome")+
  scale_x_continuous(breaks=seq(2,60,2))+
  scale_y_continuous(breaks=seq(0,10000,1000))+
  theme_minimal()

We can add a highlight to call attention to the frequency in the second bar (assuming this is important to the story). We can also increase font size and tweak the axes just a bit to limit the size of the unused white space.

custom <- c(rep("#eeeeee",1), rep("#03a5fc",1), rep("#eeeeee",35))

ggplot(fly, aes(x=exonCount)) + 
  geom_histogram(binwidth = 2, fill=custom, color="#cccccc") + 
  labs(title = "Predicted Genome Exon Frequency", 
       subtitle = "Species: Drosophila melanogaster",
       x="Count",
       y="Frequency",
       caption="Data: https://www.kaggle.com/mylesoneill/drosophila-melanogaster-genome")+
  scale_x_continuous(breaks=seq(2,60,2))+
  scale_y_continuous(breaks=seq(0,9000,1000))+
  coord_cartesian(xlim=c(2,60), ylim=c(0,9000))+
  theme_minimal()+
  theme(plot.title     = element_text(size = 25),
        plot.subtitle  = element_text(size = 15),
        axis.title  = element_text(size = 20),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10))

Open Data

Scientists often create data but you may be interested in creating practice visualizations using data made available online from a variety of contributors. You can find a wealth of data resources published by government entities such as the US Census Bureau but also through sites such as Kaggle.com.

Introduction to R and RStudio (Cloud)

Please open RStudio Cloud and navigate to the materials for Seminar 1. We will work through these activities together.

BioBuilder Virtual Bootcamp