Seminar II: Data Visualization
Thinking about Data
Let’s start by thinking about the following:
- What are data?
- How do we communicate our data to an audience?
- What audience cares about our data?
- What is a plot?
- What types of plots are there?
- What kinds of data are appropriate for which plots?
Organizing your data
To do effective research, you’ll need to organize your data, such as observations, notes, quantitative values, and measurements. For numeric values, you will often use a table. For large sets of values, you will likely use a spreadsheet program such as Excel or Google Sheets.
Simple data sets can be stored as a table.
id | date | value |
---|---|---|
1 | 2020-07-01 | 0.335 |
2 | 2020-07-02 | 0.909 |
3 | 2020-07-03 | 0.412 |
4 | 2020-07-04 | 0.044 |
5 | 2020-07-05 | 0.764 |
6 | 2020-07-06 | 0.750 |
7 | 2020-07-07 | 0.800 |
8 | 2020-07-08 | 0.885 |
9 | 2020-07-09 | 0.449 |
10 | 2020-07-10 | 0.488 |
11 | 2020-07-11 | 0.946 |
12 | 2020-07-12 | 0.106 |
13 | 2020-07-13 | 0.383 |
14 | 2020-07-14 | 0.626 |
15 | 2020-07-15 | 0.578 |
16 | 2020-07-16 | 0.371 |
17 | 2020-07-17 | 0.893 |
18 | 2020-07-18 | 0.064 |
19 | 2020-07-19 | 0.236 |
20 | 2020-07-20 | 0.116 |
21 | 2020-07-21 | 0.473 |
22 | 2020-07-22 | 0.502 |
23 | 2020-07-23 | 0.635 |
24 | 2020-07-24 | 0.771 |
25 | 2020-07-25 | 0.104 |
26 | 2020-07-26 | 0.183 |
27 | 2020-07-27 | 0.873 |
28 | 2020-07-28 | 0.188 |
29 | 2020-07-29 | 0.111 |
30 | 2020-07-30 | 0.084 |
31 | 2020-07-31 | 0.942 |
32 | 2020-08-01 | 0.972 |
33 | 2020-08-02 | 0.529 |
34 | 2020-08-03 | 0.653 |
35 | 2020-08-04 | 0.120 |
36 | 2020-08-05 | 0.043 |
37 | 2020-08-06 | 0.056 |
38 | 2020-08-07 | 0.164 |
39 | 2020-08-08 | 0.227 |
40 | 2020-08-09 | 0.924 |
41 | 2020-08-10 | 0.686 |
42 | 2020-08-11 | 0.426 |
43 | 2020-08-12 | 0.499 |
44 | 2020-08-13 | 0.361 |
45 | 2020-08-14 | 0.496 |
46 | 2020-08-15 | 0.849 |
47 | 2020-08-16 | 0.724 |
48 | 2020-08-17 | 0.898 |
49 | 2020-08-18 | 0.460 |
50 | 2020-08-19 | 0.435 |
51 | 2020-08-20 | 0.406 |
52 | 2020-08-21 | 0.338 |
53 | 2020-08-22 | 0.519 |
54 | 2020-08-23 | 0.240 |
55 | 2020-08-24 | 0.617 |
56 | 2020-08-25 | 0.648 |
57 | 2020-08-26 | 0.202 |
58 | 2020-08-27 | 0.398 |
59 | 2020-08-28 | 0.085 |
60 | 2020-08-29 | 0.937 |
61 | 2020-08-30 | 0.556 |
62 | 2020-08-31 | 0.196 |
63 | 2020-09-01 | 0.746 |
64 | 2020-09-02 | 0.642 |
65 | 2020-09-03 | 0.871 |
66 | 2020-09-04 | 0.319 |
67 | 2020-09-05 | 0.317 |
68 | 2020-09-06 | 0.471 |
69 | 2020-09-07 | 0.803 |
70 | 2020-09-08 | 0.852 |
71 | 2020-09-09 | 0.226 |
72 | 2020-09-10 | 0.619 |
73 | 2020-09-11 | 0.430 |
74 | 2020-09-12 | 0.581 |
75 | 2020-09-13 | 0.774 |
76 | 2020-09-14 | 0.340 |
77 | 2020-09-15 | 0.594 |
78 | 2020-09-16 | 0.799 |
79 | 2020-09-17 | 0.898 |
80 | 2020-09-18 | 0.975 |
81 | 2020-09-19 | 0.442 |
82 | 2020-09-20 | 0.369 |
83 | 2020-09-21 | 0.882 |
84 | 2020-09-22 | 0.971 |
85 | 2020-09-23 | 0.662 |
86 | 2020-09-24 | 0.559 |
87 | 2020-09-25 | 0.220 |
88 | 2020-09-26 | 0.963 |
89 | 2020-09-27 | 0.763 |
90 | 2020-09-28 | 0.617 |
91 | 2020-09-29 | 0.540 |
92 | 2020-09-30 | 0.455 |
93 | 2020-10-01 | 0.850 |
94 | 2020-10-02 | 0.503 |
95 | 2020-10-03 | 0.800 |
96 | 2020-10-04 | 0.061 |
97 | 2020-10-05 | 0.536 |
98 | 2020-10-06 | 0.553 |
99 | 2020-10-07 | 0.508 |
100 | 2020-10-08 | 0.276 |
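A table like the one above is commonly exported from a spreadsheet as a CSV file and then read into a program. The following is a minimal sketch, assuming Python and its standard `csv` module (the hands-on part of this seminar uses RStudio instead). It uses only the first five rows of the table, written inline as text rather than read from a real file.

```python
import csv
import io

# First five rows of the table above, written as CSV text.
# In practice this text would live in a file exported from a spreadsheet.
csv_text = """id,date,value
1,2020-07-01,0.335
2,2020-07-02,0.909
3,2020-07-03,0.412
4,2020-07-04,0.044
5,2020-07-05,0.764
"""

# Read each row into a dictionary keyed by column name,
# converting the text fields to more useful types.
rows = []
for record in csv.DictReader(io.StringIO(csv_text)):
    rows.append({
        "id": int(record["id"]),
        "date": record["date"],  # kept as text in year-month-day form
        "value": float(record["value"]),
    })

values = [row["value"] for row in rows]
print(len(rows), "rows; largest value:", max(values))
```

Keeping one observation per row and one variable per column, as in the table, is what makes this kind of loading straightforward.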
- Histograms
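- Useful for examining the shape of the distribution. These are NOT bar charts: a histogram groups a single numeric variable into bins and shows how many values fall in each bin, whereas a bar chart compares totals, means, or other values across categories.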
Example: consider the set of values in the table. How many of those values are between 0 and 0.1? How many between 0.2 and 0.3? Are there relatively more large values or small values? Are the values distributed ‘normally’? A histogram will give us a clue.
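The counting questions in this example can be answered by binning the values, which is what a histogram does before it draws anything. The following is a rough sketch in Python, chosen here as an assumption since the hands-on session uses RStudio. The bin width of 0.1 is also an assumption for illustration, and the text "bars" are only a stand-in for a real plot.

```python
# The 100 values from the table above, in id order.
values = [
    0.335, 0.909, 0.412, 0.044, 0.764, 0.750, 0.800, 0.885, 0.449, 0.488,
    0.946, 0.106, 0.383, 0.626, 0.578, 0.371, 0.893, 0.064, 0.236, 0.116,
    0.473, 0.502, 0.635, 0.771, 0.104, 0.183, 0.873, 0.188, 0.111, 0.084,
    0.942, 0.972, 0.529, 0.653, 0.120, 0.043, 0.056, 0.164, 0.227, 0.924,
    0.686, 0.426, 0.499, 0.361, 0.496, 0.849, 0.724, 0.898, 0.460, 0.435,
    0.406, 0.338, 0.519, 0.240, 0.617, 0.648, 0.202, 0.398, 0.085, 0.937,
    0.556, 0.196, 0.746, 0.642, 0.871, 0.319, 0.317, 0.471, 0.803, 0.852,
    0.226, 0.619, 0.430, 0.581, 0.774, 0.340, 0.594, 0.799, 0.898, 0.975,
    0.442, 0.369, 0.882, 0.971, 0.662, 0.559, 0.220, 0.963, 0.763, 0.617,
    0.540, 0.455, 0.850, 0.503, 0.800, 0.061, 0.536, 0.553, 0.508, 0.276,
]

BIN_COUNT = 10  # ten equal-width bins: [0.0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]

counts = [0] * BIN_COUNT
for v in values:
    # Work in whole thousandths so edge values such as 0.800 land in the
    # intended bin despite floating-point rounding.
    index = min(round(v * 1000) // 100, BIN_COUNT - 1)
    counts[index] += 1

# Print one text "bar" per bin.
for i, count in enumerate(counts):
    low, high = i / 10, (i + 1) / 10
    print(f"{low:.1f}-{high:.1f} | {'#' * count} ({count})")
```

Changing `BIN_COUNT` (and the divisor that goes with it) changes how the same data look, which is one reason to try more than one bin width before judging the shape of a distribution.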
- Scatterplot
- Used to evaluate the relationship between two variables. Each data point is plotted at a coordinate formed by pairing one value from each variable.
Example: Let’s examine the relationship between the ID of each data point and the value of each data point. Do you predict these might be related? If so, what would the plot look like? If not, what do you think the plot would look like?
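A scatterplot is the visual check for this question. A correlation coefficient is a rough numeric companion to it. The sketch below, written in Python as an assumption (the hands-on session uses RStudio), computes Pearson's r for ID against value using only the first ten rows of the table. The function name `pearson_r` is introduced here for illustration only.

```python
import math

# First ten rows of the table above as (id, value) pairs.
points = [
    (1, 0.335), (2, 0.909), (3, 0.412), (4, 0.044), (5, 0.764),
    (6, 0.750), (7, 0.800), (8, 0.885), (9, 0.449), (10, 0.488),
]


def pearson_r(pairs):
    """Return Pearson's correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / math.sqrt(var_x * var_y)


r = pearson_r(points)
print(f"Pearson r for id vs value (first 10 rows): {r:.3f}")
```

A value near 0 suggests little linear relationship, and values near 1 or -1 suggest a strong one. Pearson's r only measures linear association, so it complements the plot rather than replacing it.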
- Bar chart
- Used to compare a numeric quantity, such as a total, mean, or raw value, across the levels of a categorical variable. Bar charts often carry other decorations, such as ‘error bars’, to help the viewer understand variability within the data set.
- Pie Chart
- Used ONLY to show percentage of a whole. Not recommended due to perception issues. Use a bar chart or a dot plot instead.
- Dot plot
- Displays frequency by category. Preferable to a pie chart.
- Line plot
- Used to display a trend of a continuous variable, such as height, relative to another continuous variable, such as time. If your data are not continuous, consider a bar chart or similar type of plot. The line implies that values exist between the observations, so be careful with that assumption.
- Box plot
- Shows distribution and quartile information about a continuous data vector. Can be useful for comparing variables at a glance. Showing the data points provides additional useful information that can help viewers to interpret the data. Additional reading
- 3D plots
- Sometimes you need to show the relationship of three variables but this is somewhat rare. Excel and other programs offer options to ‘enhance’ your plots with shading, 3D effects, and other decorations that can be very distracting at best and misleading at worst. 3D plots are okay and even good if they are required to convey certain information, but 3D effects to entertain the viewer are not recommended.
Do you recall the volcano data that showed height as color? This is a time when 3D is justified. Color alone does not convey differences in height very well. It helps, but what if we used color AND 3D? AND made it interactive? Super helpful!
Here is the 3D interactive version. Source
Hands-on Plotting
Please open RStudio Cloud and navigate to the materials for Seminar 2. We will work through these activities together.