Network graphs are great for demonstrating the relationships between items. This could be a food web showing how organisms transfer energy, a diagram of how a business team communicates and shares information, or in this case how various items are crafted from components in a video game. This method isn’t necessarily suited to rigorous analysis, but it’s a great way to visualize these relationships, and it can help you discover patterns that weren’t readily apparent.
These data visualizations contain elements called nodes and edges, where the nodes are typically circles and the edges or connections are typically lines or perhaps arrows. If your graph contains arrows, it’s called a directed graph because it indicates the flow of information in a particular direction. This could be something like a bacteriophage attacking a bacterium or an employee reporting up to a boss or perhaps a distributor providing goods to a vendor.
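As a minimal sketch of the idea, a directed graph can be stored as nothing more than a table of from-to pairs. The names below are made up for illustration and are not part of the recipe dataset.

```r
# Toy directed edge list: each row is one edge pointing from -> to
# (hypothetical names, for illustration only)
edges <- data.frame(
  from = c("distributor", "distributor", "vendor"),
  to   = c("vendor", "warehouse", "customer"),
  stringsAsFactors = FALSE
)

# Nodes are the unique names that appear at either end of an edge
nodes <- sort(unique(c(edges$from, edges$to)))

n_nodes <- length(nodes)
n_edges <- nrow(edges)
```

The rest of this walkthrough builds exactly this kind of two-column table from the recipe data.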
In this example, I’m using a dataset of craftable recipes from the game No Man’s Sky from Hello Games. The data table describes how various ingredients are combined to generate other items as output.
To get the data, I used the IMPORTHTML function in the first cell of a spreadsheet in Google Sheets to grab the data from the page:
=IMPORTHTML("https://www.xainesworld.com/all-refiner-recipes-in-no-mans-sky/","table",1)
Then, I exported that sheet as a .csv file.
First I load a few libraries to help. Many of these are from the tidyverse collection and a few others are for the network diagram itself. In particular, the networkD3 library will be used for generating the diagram, and data.table is loaded for its rowid function.
#data manipulation
library(readr)
library(dplyr)
library(tidyr)
library(tibble)
library(stringr)
library(data.table)
#network graphing and saving
library(networkD3)
library(htmlwidgets)
Next, I load the dataset from a .csv file.
#read in the original dataset and preview it
main <- read_csv("data/nms_recipes.csv")
#result
main
# A tibble: 308 × 9
Output Qty...2 Value `Input 1` Qty...5 `Input 2` Qty...7 `Input 3` Qty...9
<chr> <dbl> <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
1 Nanite C… 15 20 (… Salvaged… 1 <NA> NA <NA> NA
2 Gold 100 353 … Living P… 1 <NA> NA <NA> NA
3 Living S… 50 20 (… Hypnotic… 1 <NA> NA <NA> NA
4 Nanite C… 50 20 (… Hadal Co… 1 <NA> NA <NA> NA
5 Sodium N… 50 82 (… Crystal … 1 <NA> NA <NA> NA
6 Nanite C… 50 20 (… Larval C… 1 <NA> NA <NA> NA
7 Condense… 1 24 (… Cyto-Pho… 1 <NA> NA <NA> NA
8 Gold 1 353 … Hexite 1 <NA> NA <NA> NA
9 Glass 1 200 … Frost Cr… 40 <NA> NA <NA> NA
10 Pyrite 1 62 (… Gold 1 <NA> NA <NA> NA
# ℹ 298 more rows
Then, I select specific columns because I don’t need all of the information for the diagram.
#select certain columns and give them simpler names
selected <- main %>%
  select(c("Output", "Input 1", "Input 2", "Input 3")) %>%
  rename(output = "Output", input_1 = "Input 1", input_2 = "Input 2", input_3 = "Input 3")
#result
selected
# A tibble: 308 × 4
output input_1 input_2 input_3
<chr> <chr> <chr> <chr>
1 Nanite Cluster Salvaged Data <NA> <NA>
2 Gold Living Pearl <NA> <NA>
3 Living Slime Hypnotic Eye <NA> <NA>
4 Nanite Cluster Hadal Core <NA> <NA>
5 Sodium Nitrate Crystal Sulphide <NA> <NA>
6 Nanite Cluster Larval Core <NA> <NA>
7 Condensed Carbon Cyto-Phosphate <NA> <NA>
8 Gold Hexite <NA> <NA>
9 Glass Frost Crystal <NA> <NA>
10 Pyrite Gold <NA> <NA>
# ℹ 298 more rows
Here, I want to think about my diagram a bit. Some of these recipes are for individual items that are also themselves used as components in other recipes. So what I want to do now is create recipe groups by generating new nodes using the output item names and an index value. So, recipe 1, recipe 2, etc. The str_c function is used to concatenate the output name and the index with an underscore between them. The index comes from rowid (from data.table), which numbers repeated values in order of appearance. The mutate function is used to create the new column.
#add an indexed id column to indicate each unique recipe
mutated <- selected %>% mutate(id = str_c(output, "_", rowid(output)))
#result
mutated
# A tibble: 308 × 5
output input_1 input_2 input_3 id
<chr> <chr> <chr> <chr> <chr>
1 Nanite Cluster Salvaged Data <NA> <NA> Nanite Cluster_1
2 Gold Living Pearl <NA> <NA> Gold_1
3 Living Slime Hypnotic Eye <NA> <NA> Living Slime_1
4 Nanite Cluster Hadal Core <NA> <NA> Nanite Cluster_2
5 Sodium Nitrate Crystal Sulphide <NA> <NA> Sodium Nitrate_1
6 Nanite Cluster Larval Core <NA> <NA> Nanite Cluster_3
7 Condensed Carbon Cyto-Phosphate <NA> <NA> Condensed Carbon_1
8 Gold Hexite <NA> <NA> Gold_2
9 Glass Frost Crystal <NA> <NA> Glass_1
10 Pyrite Gold <NA> <NA> Pyrite_1
# ℹ 298 more rows
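To see the indexing step in isolation, here is a small sketch on a made-up vector of output names. It only assumes the stringr and data.table packages already loaded above.

```r
library(stringr)
library(data.table)

# Hypothetical output names; "Gold" appears three times
outputs <- c("Gold", "Glass", "Gold", "Gold")

# rowid() counts occurrences of each value in order of appearance,
# and str_c() glues the name and the counter together with an underscore
ids <- str_c(outputs, "_", rowid(outputs))
ids
```

Each repeated output gets its own counter, which is what makes every recipe a distinct node later on.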
Now I’m going to split my data so I can do a couple of things with it. First, I get distinct combinations of the output and id columns which contain the base elements and the unique recipe identifiers, respectively. Then, I add the string _recipes to the base elements. This sets up the top-level node labels. I want all of the unique recipes to connect up to a main node for that type of recipe. You might think of this like a top-level director over all the regional directors in a business context. I’m doing this to add some organization to the graph even though it doesn’t exist in the original data.
#get unique to-from pairs, sort, rename columns as to and from
df_1 <- mutated %>%
  distinct(output, id) %>%
  arrange(output) %>%
  rename(to = output, from = id)
#tack on "recipes" to the top-level node for each group
df_1$to <- paste(df_1$to, "recipes", sep = "_")
#result
df_1
# A tibble: 308 × 2
to from
<chr> <chr>
1 Ammonia_recipes Ammonia_1
2 Ammonia_recipes Ammonia_2
3 Ammonia_recipes Ammonia_3
4 Ammonia_recipes Ammonia_4
5 Aronium_recipes Aronium_1
6 Aronium_recipes Aronium_2
7 Aronium_recipes Aronium_3
8 Aronium_recipes Aronium_4
9 Aronium_recipes Aronium_5
10 Aronium_recipes Aronium_6
# ℹ 298 more rows
Next, I take the specific recipes and all the input lists of ingredients and transform the arrangement by pivoting the “wide” data to a “long” format. You can read more about it in the pivot_longer documentation.
#there is a lot going on here:
#select everything except the output column (so, id and three ingredient lists)
#use pivot_longer to rearrange the ingredient data (ignoring the id column) into a type and input column
#drop the type column because it's not needed
#drop any NA values
#rename the remaining id and input lists as to and from
df_2 <- mutated %>%
  select(!output) %>%
  pivot_longer(!id, names_to = "type", values_to = "input") %>%
  select(!type) %>%
  drop_na() %>%
  rename(to = id, from = input)
#result
df_2
# A tibble: 629 × 2
to from
<chr> <chr>
1 Nanite Cluster_1 Salvaged Data
2 Gold_1 Living Pearl
3 Living Slime_1 Hypnotic Eye
4 Nanite Cluster_2 Hadal Core
5 Sodium Nitrate_1 Crystal Sulphide
6 Nanite Cluster_3 Larval Core
7 Condensed Carbon_1 Cyto-Phosphate
8 Gold_2 Hexite
9 Glass_1 Frost Crystal
10 Pyrite_1 Gold
# ℹ 619 more rows
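The wide-to-long step can be easier to follow on a tiny table. The sketch below uses two invented recipes (the ingredient names are placeholders, not taken from the dataset) and the same chain of calls.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical wide data: one row per recipe, one column per ingredient slot
wide <- tibble(
  id      = c("Glass_1", "Gold_1"),
  input_1 = c("Frost Crystal", "Pyrite"),
  input_2 = c(NA, "Oxygen")
)

# One row per (recipe, ingredient) pair; empty ingredient slots are dropped
long <- wide %>%
  pivot_longer(!id, names_to = "type", values_to = "input") %>%
  select(!type) %>%
  drop_na() %>%
  rename(to = id, from = input)

long
```

A recipe with two ingredients becomes two edges and a recipe with one ingredient becomes one, which is why the long table has more rows than the original.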
Now that I’ve sorted out the relationships of the top-level nodes, the specific recipes, and the ingredients, I can stack the two datasets since they both contain only two columns named to and from. I’ll use bind_rows from dplyr.
#stack the two to-from pair datasets using bind_rows
#they need to have the same column names
data <- bind_rows(df_1, df_2)
#result
data
# A tibble: 937 × 2
to from
<chr> <chr>
1 Ammonia_recipes Ammonia_1
2 Ammonia_recipes Ammonia_2
3 Ammonia_recipes Ammonia_3
4 Ammonia_recipes Ammonia_4
5 Aronium_recipes Aronium_1
6 Aronium_recipes Aronium_2
7 Aronium_recipes Aronium_3
8 Aronium_recipes Aronium_4
9 Aronium_recipes Aronium_5
10 Aronium_recipes Aronium_6
# ℹ 927 more rows
Finally, we can use the simple to-from formatted data to generate a network diagram with the simpleNetwork function from networkD3. You’ll notice that this graph is a bit difficult to read in part due to the lack of color groups. This is where the increased control of the forceNetwork function is useful and that’s where Part 2 picks up.
# Plot
p <- simpleNetwork(data, # data source
  height = "", # output height
  width = "", # output width
  Source = 2, # source column number ("from")
  Target = 1, # target column number ("to")
  linkDistance = 4, # distance between nodes
  charge = -300, # force affecting the nodes (repulsion (-) or attraction (+))
  fontSize = 14, # node label font size
  fontFamily = "serif", # node label font family
  linkColour = "#cccccc", # edge color (applies to all edges)
  nodeColour = "#22a1ab", # node color (applies to all nodes)
  opacity = 0.8, # node opacity (0 to 1)
  zoom = TRUE # zoom allowed or not
)
#result
p
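For experimenting with the arguments, it can help to try simpleNetwork on a tiny table first. This is only a sketch with two invented edges, using the same to-from column order as the recipe data.

```r
library(networkD3)

# Hypothetical to-from pairs, same column order as the recipe data
toy <- data.frame(
  to   = c("Gold_1", "Gold_recipes"),
  from = c("Pyrite", "Gold_1"),
  stringsAsFactors = FALSE
)

# Column 2 holds the source ("from") and column 1 the target ("to")
p_toy <- simpleNetwork(toy,
  Source = 2,
  Target = 1,
  zoom = TRUE
)

# Printing p_toy in an interactive session renders the widget
```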
In part one, you learned how to use simple to-from data pairs to generate a simple network diagram. In this next part, you’ll see how to generate a more sophisticated diagram which uses two-part data as input. That complicates things a bit but ultimately offers more control over the appearance of the diagram.
First, I create a vector containing all unique elements of the to-from data. Note that they are being converted to factors.
#generate a vector of all unique elements in the list of to and from elements
#convert it to a factor
#store it in a tibble
nodes <- tibble(name = factor(sort(unique(c(data$to, data$from)))))
nodes
# A tibble: 449 × 1
name
<fct>
1 Activated Cadmium
2 Activated Copper
3 Activated Emeril
4 Activated Indium
5 Ammonia
6 Ammonia_1
7 Ammonia_2
8 Ammonia_3
9 Ammonia_4
10 Ammonia_recipes
# ℹ 439 more rows
Next, I split the data so I can do group assignment. Any item with an underscore (specific recipes and top-level nodes) goes in one pile and everything else (base nodes) goes into the other. The only difference between the two filters is the use of ! to say “not”. The base elements are first. They will all be grouped together by assigning a group value of 1 and they will eventually receive the same color.
#logic to reassign group values so each recipe cluster has the same group value
#subset nodes data to return the base ingredients/items
#these will all be group 1 (one color)
nodes_base <- nodes %>%
  filter(!grepl('_', name)) %>%
  mutate(group = 1, node_size = 1)
nodes_base
# A tibble: 77 × 3
name group node_size
<fct> <dbl> <dbl>
1 Activated Cadmium 1 1
2 Activated Copper 1 1
3 Activated Emeril 1 1
4 Activated Indium 1 1
5 Ammonia 1 1
6 Aronium 1 1
7 Atlantideum 1 1
8 Cactus Flesh 1 1
9 Cadmium 1 1
10 Carbon 1 1
# ℹ 67 more rows
Next I split the top-level nodes containing an underscore in their name. These will be grouped separately and a different color will be applied to each group (including the base elements).
#logic to get the higher-level clusters
#these will all be different colors
#filter 'nodes' to only the names with underscores (the clustering groups)
nodes_clusters <- nodes %>% filter(grepl('_', name))
nodes_clusters
# A tibble: 372 × 1
name
<fct>
1 Ammonia_1
2 Ammonia_2
3 Ammonia_3
4 Ammonia_4
5 Ammonia_recipes
6 Aronium_1
7 Aronium_2
8 Aronium_3
9 Aronium_4
10 Aronium_5
# ℹ 362 more rows
So, here’s my logic to get the top-level node groups. First, I grab the first five characters from each string. That is, what letters they start with. It’s a unique string for each element and I can use it to group them. Then, I group those mini-strings and add a group identifier as a column. I add one because the base elements will all be in group 1 so these will be group 2, 3, … n.
#extract the first five characters from each string
first_chars <- as_tibble(str_sub(nodes_clusters$name, 1, 5))
#set the group value based on the current group id and *add one* to make these different from the base elements
first_chars <- first_chars %>%
  group_by(value) %>%
  mutate(group = cur_group_id() + 1)
first_chars
# A tibble: 372 × 2
# Groups: value [61]
value group
<chr> <dbl>
1 Ammon 2
2 Ammon 2
3 Ammon 2
4 Ammon 2
5 Ammon 2
6 Aroni 3
7 Aroni 3
8 Aroni 3
9 Aroni 3
10 Aroni 3
# ℹ 362 more rows
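Here is a small sketch of the prefix grouping on four made-up node names. It also shows one possible alternative key, which is my own assumption rather than part of the original method: stripping the suffix after the underscore with str_remove, so that two different items sharing their first five characters would not end up in the same group.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical cluster node names
node_names <- c("Ammonia_1", "Ammonia_recipes", "Gold_1", "Gold_recipes")

# Group by the first five characters; cur_group_id() numbers the groups
# in sorted order, and adding 1 keeps group 1 free for the base elements
grouped <- tibble(name = node_names) %>%
  mutate(prefix = str_sub(name, 1, 5)) %>%
  group_by(prefix) %>%
  mutate(group = cur_group_id() + 1) %>%
  ungroup()

# Alternative key (an assumption, not the original approach):
# drop everything from the last underscore onward
full_key <- str_remove(node_names, "_[^_]+$")
```

Either key could be passed to group_by; the prefix version simply matches what the walkthrough does.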
I add that group value to the nodes information so I now have the node names and their appropriate group identifier. I also use some logic here to style the very highest “recipe” nodes with a large node_size value and give the specific recipe nodes a value that falls between the base elements and the recipe nodes.
#differential styling for node sizes
nodes_clusters <- nodes_clusters %>%
  mutate(group = first_chars$group, node_size = 10)
#use if_else and text matching with grepl
#documentation:
# The name “grepl” stands for “grep logical”.
# grepl() returns TRUE or FALSE for each string depending on whether the pattern is found in it.
# fixed: a logical value. If TRUE, the pattern is matched as a literal string instead of a regular expression.
nodes_clusters <- nodes_clusters %>%
  mutate(node_size = if_else(grepl("recipe", name, fixed = TRUE), 100, node_size))
nodes_clusters
# A tibble: 372 × 3
name group node_size
<fct> <dbl> <dbl>
1 Ammonia_1 2 10
2 Ammonia_2 2 10
3 Ammonia_3 2 10
4 Ammonia_4 2 10
5 Ammonia_recipes 2 100
6 Aronium_1 3 10
7 Aronium_2 3 10
8 Aronium_3 3 10
9 Aronium_4 3 10
10 Aronium_5 3 10
# ℹ 362 more rows
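To make the role of fixed = TRUE concrete, here is a short sketch on invented names. The "." comparison at the end is only there to show the difference between regular-expression and literal matching.

```r
library(dplyr)

# Hypothetical node names
node_names <- c("Gold_1", "Gold_recipes", "Gold")

# TRUE only where the literal text "recipe" appears in the name
is_top <- grepl("recipe", node_names, fixed = TRUE)

# Large size for top-level nodes, smaller size for everything else
node_size <- if_else(is_top, 100, 10)

# As a regular expression "." matches any character;
# with fixed = TRUE it only matches an actual period
regex_match   <- grepl(".", "abc")
literal_match <- grepl(".", "abc", fixed = TRUE)
```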
Now I can bind the split data back together and sort based on the name. To recap, all base elements are in group 1 now while all higher-level nodes like specific recipes or recipe groups are all separately grouped by the type of element they generate. This will be the basis of the color styling.
#merge nodes_base and nodes_clusters to create group list
nodes_groups <- bind_rows(nodes_base, nodes_clusters) %>% arrange(name)
nodes_groups
# A tibble: 449 × 3
name group node_size
<fct> <dbl> <dbl>
1 Activated Cadmium 1 1
2 Activated Copper 1 1
3 Activated Emeril 1 1
4 Activated Indium 1 1
5 Ammonia 1 1
6 Ammonia_1 2 10
7 Ammonia_2 2 10
8 Ammonia_3 2 10
9 Ammonia_4 2 10
10 Ammonia_recipes 2 100
# ℹ 439 more rows
Now the data must be converted a bit for the two-part input needed for the forceNetwork function. According to the documentation, it needs separate data frames for nodes and links. That is, what is in the network and how it is connected. It also requires that you specify the source (from) and target (to) columns. Compare this to simpleNetwork, which handles all of that for you. This function also uses the Value argument for the edge (line) thickness and Group to help with style.
Here, I use the method suggested by CJ Yetman to convert the existing to-from data to zero-based numeric indices as required by forceNetwork. I also add a default value of 1 for the value so all edges will look the same.
#convert the existing to-from data so it can be used in the networkD3 graph
#use 'match' which gives the positions of (first) matches of its first argument in its second.
#subtract 1 so the indices start at zero
to <- match(data$to, levels(nodes$name)) - 1
from <- match(data$from, levels(nodes$name)) - 1
links <- tibble(to, from, value = 1)
links
# A tibble: 937 × 3
to from value
<dbl> <dbl> <dbl>
1 9 5 1
2 9 6 1
3 9 7 1
4 9 8 1
5 19 11 1
6 19 12 1
7 19 13 1
8 19 14 1
9 19 15 1
10 19 16 1
# ℹ 927 more rows
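The index conversion is easy to check by hand on a tiny example. This sketch uses three invented node names and plain base R.

```r
# Hypothetical sorted node names; positions are 1, 2, 3 in R
node_names <- c("Gold", "Gold_1", "Pyrite")

# Hypothetical to-from pairs referring to those names
edges <- data.frame(
  to   = c("Gold_1", "Gold"),
  from = c("Pyrite", "Gold_1"),
  stringsAsFactors = FALSE
)

# match() returns the 1-based position of each name in node_names;
# subtracting 1 turns that into the zero-based index
to_idx   <- match(edges$to, node_names) - 1
from_idx <- match(edges$from, node_names) - 1
```

Here "Gold" maps to 0, "Gold_1" to 1, and "Pyrite" to 2, so the row order of the nodes table has to stay the same as the order used for matching.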
Now I can generate the final plot. My links are contained in the links tibble while the nodes are described in the nodes_groups tibble. There are many groups here and the colors are assigned automatically. I didn’t bother trying to create unique colors for each group but that’s something that can be done using additional JavaScript to style the nodes.
#create the plot
p <- forceNetwork(Links = links,
  Nodes = nodes_groups,
  Source = 'from',
  Target = 'to',
  Value = 'value',
  NodeID = 'name',
  Group = 'group',
  Nodesize = 'node_size',
  charge = -20,
  zoom = TRUE,
  opacity = 0.8)
Links is a tbl_df. Converting to a plain data frame.
Nodes is a tbl_df. Converting to a plain data frame.
p
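As a rough sketch of how the pieces fit together, here is the same call on a four-node toy network with invented data. The colourScale line is an assumption about how one might swap the palette using a JavaScript snippet; it was not used in the plot above.

```r
library(networkD3)

# Hypothetical nodes; row order defines the zero-based index (0, 1, 2, 3)
toy_nodes <- data.frame(
  name      = c("Gold", "Gold_1", "Gold_recipes", "Pyrite"),
  group     = c(1, 2, 2, 1),
  node_size = c(1, 10, 100, 1),
  stringsAsFactors = FALSE
)

# Hypothetical links: Pyrite -> Gold_1 and Gold_1 -> Gold_recipes
toy_links <- data.frame(
  from  = c(3, 1),
  to    = c(1, 2),
  value = 1
)

p_toy <- forceNetwork(
  Links = toy_links,
  Nodes = toy_nodes,
  Source = "from",
  Target = "to",
  Value = "value",
  NodeID = "name",
  Group = "group",
  Nodesize = "node_size",
  # assumed palette choice, passed as a JavaScript expression
  colourScale = htmlwidgets::JS("d3.scaleOrdinal(d3.schemeCategory10);"),
  zoom = TRUE,
  opacity = 0.8
)
```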
You can save the file as its own .html file if you want to keep it separate.
#save the output as its own file if you like
#saveWidget(p, file="nms_recipes_nodegraph.html")
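For a quick trial of the save step, the sketch below writes a one-edge toy widget to a temporary folder. The file name and the choice of selfcontained = FALSE (which writes the supporting files next to the HTML page instead of bundling them, and so does not need pandoc) are assumptions for illustration.

```r
library(networkD3)
library(htmlwidgets)

# Hypothetical one-edge network, just to have a widget to save
toy <- data.frame(from = "Pyrite", to = "Gold_1", stringsAsFactors = FALSE)
w <- simpleNetwork(toy)

# Write to a temporary location; supporting files go in a sibling folder
out_file <- file.path(tempdir(), "toy_nodegraph.html")
saveWidget(w, file = out_file, selfcontained = FALSE)
```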