This lesson will teach you how to work with and analyze cultural domain data using R. This lesson draws in part on datasets and formulas from the Anthrotools package. You can get the data for this lesson here and here.
# Load packages install.packages('devtools') #devtools lets us install packages
# from github
library("devtools")
# install_github('alastair-JL/AnthroTools')
library(AnthroTools)
library(tidyverse)
library(ggplot2)
library(tidyr)
Second, let’s load the freelist data and assign a name to the
dataframe object. Here we will load the FruitList data from the
Anthrotools package, which has been made available to you as a .csv
file. Take a look at the head() and str() of
this spreadsheet. How do you think the data have been organized? What do
the rows and columns represent?
FruitList <- read.csv("https://maddiebrown.github.io/ethnoecology/FruitList.csv")
head(FruitList)
X Subj Order CODE
1 1 1 1 pear
2 2 1 2 orange
3 3 1 3 apple
4 4 2 1 apple
5 5 2 2 apple
6 6 2 3 strawberry
str(FruitList)
'data.frame': 75 obs. of 4 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ Subj : int 1 1 1 2 2 2 3 3 4 4 ...
$ Order: int 1 2 3 1 2 3 1 2 1 2 ...
$ CODE : chr "pear" "orange" "apple" "apple" ...
How many respondents and unique fruits (‘items’) are in the
FruitList data?
# How many respondents are there?
length(unique(FruitList$Subj))
# how many unique fruits are named and what are they?
length(unique(FruitList$CODE))
unique(FruitList$CODE)
From these few operations we now know a bit more about the dataset.
There are 20 individuals who collectively named a total of 8 unique
fruits.
Next, let’s calculate the average number of fruits named per person and explore the distribution of list lengths across responses.
# How many responses per subject
FruitList %>%
group_by(Subj) %>%
count()
# A tibble: 20 × 2
# Groups: Subj [20]
Subj n
<int> <int>
1 1 3
2 2 3
3 3 2
4 4 5
5 5 3
6 6 3
7 7 4
8 8 4
9 9 3
10 10 3
11 11 5
12 12 5
13 13 4
14 14 5
15 15 3
16 16 5
17 17 2
18 18 5
19 19 3
20 20 5
# What is the average number of responses per subject?
FruitList %>%
group_by(Subj) %>%
count() %>%
ungroup() %>%
summarise(mean = mean(n))
# A tibble: 1 × 1
mean
<dbl>
1 3.75
We can also do this in two parts by creating a new dataframe of the summarized data and then calculating the mean value in the “n” column.
BySubject <- FruitList %>%
group_by(Subj) %>%
count()
mean(BySubject$n)
[1] 3.75
# This method also allows you to generate additional summary information
summary(BySubject$n)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 3.00 3.50 3.75 5.00 5.00
# We can also make a table showing how many respondents had lists of length 2,
# 3, 4, and 5
table(BySubject$n)
2 3 4 5
2 8 3 7
Next, let’s analyze how frequently each fruit was mentioned. For this analysis we are going to create a new dataframe object called ‘ByFruit’ that groups the data according to each unique fruit and counts the number of rows.
ByFruit <- FruitList %>%
group_by(CODE) %>%
count() %>%
arrange(desc(n))
ByFruit
# A tibble: 8 × 2
# Groups: CODE [8]
CODE n
<chr> <int>
1 apple 25
2 pear 10
3 plum 9
4 strawberry 9
5 banana 8
6 orange 7
7 lemon 5
8 peach 2
Plot the number of times each type of fruit is mentioned by interviewees. Make a bar plot with flipped coordinates so that each fruit name is on the y axis and the number of mentions are on the x axis.
ggplot(ByFruit, aes(x = reorder(CODE, n), y = n)) + geom_bar(stat = "identity") +
coord_flip() + ggtitle("Frequency of fruit mentions") + labs(x = "Fruit", y = "n")
Another way to visualize these data is through a scree plot. In these types of plots data are arranged such that each point represents a single item, the y-value for which is its frequency of mention in the dataset. These types of plots can be used to quickly identify trends and cut-off points in the data.
ggplot(ByFruit, aes(x = reorder(CODE, desc(n)), y = n, group = 1)) + geom_point() +
geom_line()
One of the main insights than can be learned from freelist data is the relative cultural salience of items in particular domains. For example, in the domain of household chores, we could determine whether vacuuming, washing dishes, shoveling the driveway, or feeding the snake are considered more salient or central to the idea of household chores compared to other tasks. We might find variations based on individual attributes or the cultural context in which the question is asked. Salience often mirrors frequency, but the calculation is a bit more complicated as it considers both an item’s frequency of mention and the order in which it is usually listed.
Luckily, AnthroTools has a built in salience calculation function that can do the math for us. The code below calculates the salience of each fruit listed in the context of each individual interviewee.
FruitListSalience <- CalculateSalience(FruitList, Order = "Order", Subj = "Subj",
CODE = "CODE")
# Note: I have included the arguments for Order, Subj and CODE for illustrative
# purposes. Because these column names match the arguments, you do not actually
# need to include them in this case. However, when you are working with your
# own datasets you may have different column names, so it is helpful to keep
# the underlying structure of functions in mind when you deploy them.
FruitListSalience
X Subj Order CODE Salience
1 1 1 1 pear 1.0000000
2 2 1 2 orange 0.6666667
3 3 1 3 apple 0.3333333
4 4 2 1 apple 1.0000000
5 5 2 2 apple 0.6666667
6 6 2 3 strawberry 0.3333333
7 7 3 1 banana 1.0000000
8 8 3 2 strawberry 0.5000000
9 9 4 1 apple 1.0000000
10 10 4 2 plum 0.8000000
11 11 4 3 banana 0.6000000
12 12 4 4 orange 0.4000000
13 13 4 5 apple 0.2000000
14 14 5 1 strawberry 1.0000000
15 15 5 2 apple 0.6666667
16 16 5 3 apple 0.3333333
17 17 6 1 apple 1.0000000
18 18 6 2 lemon 0.6666667
19 19 6 3 apple 0.3333333
20 20 7 1 banana 1.0000000
21 21 7 2 plum 0.7500000
22 22 7 3 lemon 0.5000000
23 23 7 4 lemon 0.2500000
24 24 8 1 strawberry 1.0000000
25 25 8 2 apple 0.7500000
26 26 8 3 pear 0.5000000
27 27 8 4 apple 0.2500000
28 28 9 1 apple 1.0000000
29 29 9 2 apple 0.6666667
30 30 9 3 plum 0.3333333
31 31 10 1 apple 1.0000000
32 32 10 2 pear 0.6666667
33 33 10 3 apple 0.3333333
34 34 11 1 apple 1.0000000
35 35 11 2 strawberry 0.8000000
36 36 11 3 banana 0.6000000
37 37 11 4 apple 0.4000000
38 38 11 5 lemon 0.2000000
39 39 12 1 apple 1.0000000
40 40 12 2 orange 0.8000000
41 41 12 3 pear 0.6000000
42 42 12 4 apple 0.4000000
43 43 12 5 orange 0.2000000
44 44 13 1 plum 1.0000000
45 45 13 2 peach 0.7500000
46 46 13 3 strawberry 0.5000000
47 47 13 4 banana 0.2500000
48 48 14 1 pear 1.0000000
49 49 14 2 apple 0.8000000
50 50 14 3 banana 0.6000000
51 51 14 4 pear 0.4000000
52 52 14 5 strawberry 0.2000000
53 53 15 1 plum 1.0000000
54 54 15 2 strawberry 0.6666667
55 55 15 3 apple 0.3333333
56 56 16 1 banana 1.0000000
57 57 16 2 plum 0.8000000
58 58 16 3 plum 0.6000000
59 59 16 4 pear 0.4000000
60 60 16 5 pear 0.2000000
61 61 17 1 orange 1.0000000
62 62 17 2 apple 0.5000000
63 63 18 1 peach 1.0000000
64 64 18 2 banana 0.8000000
65 65 18 3 orange 0.6000000
66 66 18 4 pear 0.4000000
67 67 18 5 apple 0.2000000
68 68 19 1 pear 1.0000000
69 69 19 2 lemon 0.6666667
70 70 19 3 strawberry 0.3333333
71 71 20 1 plum 1.0000000
72 72 20 2 apple 0.8000000
73 73 20 3 plum 0.6000000
74 74 20 4 orange 0.4000000
75 75 20 5 apple 0.2000000
The above code calculates the salience for each item by respondent. If you inspect the results, you’ll see that the first item in each list has a salience of 1, with each subsequent item decreasing in relative salience. This is useful for understanding how an individual thinks about the domain of fruits, but what if we are interested in knowing how salient apples are across all responses? We can calculate the salience of particular items as well with the SalienceByCode() function.
Try running the SalienceByCode() function on the new salience dataframe we made above.
SalienceByFruit <- SalienceByCode(FruitListSalience, dealWithDoubles = "MAX")
CODE MeanSalience SumSalience SmithsS
1 pear 0.6958333 5.566667 0.2783333
2 orange 0.6444444 3.866667 0.1933333
3 apple 0.7588889 11.383333 0.5691667
4 strawberry 0.5925926 5.333333 0.2666667
5 banana 0.7312500 5.850000 0.2925000
6 plum 0.8119048 5.683333 0.2841667
7 lemon 0.5083333 2.033333 0.1016667
8 peach 0.8750000 1.750000 0.0875000
# The dealwithdoubles argument tells R what to do if a respondent lists the
# same item twice. There are a few different options available for this, the
# right one to pick will depend on your data and research question.
From this analysis we can see that there are slight differences in the mean salience and Smith’s S. Smith’s S considers the length of lists in its calculation of salience (More info here).
Now let’s plot the Smith’s S results in decreasing order and add a vertical line at the 0.1 mark. This range is generally considered a benchmark level for assessing item salience in freelists.
ggplot(SalienceByFruit, aes(x = reorder(CODE, SmithsS), y = SmithsS)) + geom_bar(stat = "identity") +
coord_flip() + ggtitle("Fruit Salience") + labs(x = "Fruit", y = "Smith's S") +
geom_hline(yintercept = 0.1)
From this plot, it looks like most of the fruits could be considered salient in this dataset. However, although pears were mentioned slightly more frequently than bananas and plums, their salience is lower when their overall order within lists and other factors are taken into account.
We can use multidimensional scaling to determine which items are more central or peripheral in the domain of “fruits” for interviewees. First we need to make a presence absence table to compare the fruits to one another.
Using FreeListTable() make a table indicating whether or not each
respondent mentioned a particular fruit. Hint: This is a
presence/absence table. Hint: Run ?FreeListTable() if you
aren’t sure where to start.
FruitListTable <- FreeListTable(FruitList, CODE = "CODE", Salience = "Salience",
Subj = "Subj", tableType = "PRESENCE")
Now we will transform the data to be in the right format for MDS.
# add new count column
FruitList$present <- rep(1)
# Spread into wide datatable. Note: spread function requires unique identifiers
# for rows, so here we remove any duplicate rows
FruitListWide <- FruitList %>%
select(Subj, CODE, present) %>%
unique %>%
spread(CODE, present)
# convert NAs to 0
FruitListWide[is.na(FruitListWide)] <- 0
head(FruitListWide)
Subj apple banana lemon orange peach pear plum strawberry
1 1 1 0 0 1 0 1 0 0
2 2 1 0 0 0 0 0 0 1
3 3 0 1 0 0 0 0 0 1
4 4 1 1 0 1 0 0 1 0
5 5 1 0 0 0 0 0 0 1
6 6 1 0 1 0 0 0 0 0
## Convert dataframe into presence/absence only by removing the first column.
## First we can assign the Subj column to the rownames
rownames(FruitListWide) <- FruitListWide$Subj
# Remove the first column
FruitListWide <- FruitListWide[, -1]
# Convert to matrix
FruitListWide <- as.matrix(FruitListWide)
head(FruitListWide)
apple banana lemon orange peach pear plum strawberry
1 1 0 0 1 0 1 0 0
2 1 0 0 0 0 0 0 1
3 0 1 0 0 0 0 0 1
4 1 1 0 1 0 0 1 0
5 1 0 0 0 0 0 0 1
6 1 0 1 0 0 0 0 0
# this is a people by fruit table. now we want to make a fruit by fruit table
FruitbyFruit <- crossprod(FruitListWide)
There are many ways to conduct MDS in R. Some examples and tutorials on completing MDS in R are included below.
We can use the matrix we just created in our multidimensional scaling analysis. First, let’s make a distance/similarity matrix.
Fruit_mds <- cmdscale(proxy::dist(FruitbyFruit, convert_similarities = TRUE), k = 2)
# add in labels and dimension names
Fruit_mds_df <- as.data.frame(Fruit_mds)
Fruit_mds_df$FruitName <- row.names(Fruit_mds)
# plot the results
ggplot(Fruit_mds_df, aes(x = V1, y = V2, label = FruitName)) + geom_text(size = 2.5,
position = position_jitter(width = 1, height = 2)) + labs(x = "Dimension 1",
y = "Dimension 2") + ggtitle("MDS of Fruit Pilesort")
# https://stackoverflow.com/questions/40542685/how-to-jitter-remove-overlap-for-geom-text-labels
It is also possible to compare item salience across different groups of respondents. The following example comes from the AnthroTools package which includes a sample grouping of the FruitList data. First, let’s load the new dataset. Because it is included in the AnthroTools package, we can load it directly with the data() function.
data("WorldList")
head(WorldList)
Subj Order CODE GROUPING
1 1 1 plum MAINLAND
2 1 2 strawberry MAINLAND
3 1 3 pear MAINLAND
4 1 4 peach MAINLAND
5 2 1 apple MAINLAND
6 2 2 orange MAINLAND
Try calculating the salience for each fruit, adding in an argument to
the CalculateSalience() function that differentiates
responses by GROUPING. The argument is conveniently called GROUPING.
First calculate the salience of each item by response
(CalculateSalience()), then calculate the salience of each
item (SalienceByCode()). Refer to the code we ran earlier
in this lesson as a template to write your solution.
FL1 <- CalculateSalience(WorldList, Order = "Order", Subj = "Subj", CODE = "CODE",
GROUPING = "GROUPING")
# Note, this function will not run without adding in the GROUPING argument.
FL2 <- SalienceByCode(FL1, GROUPING = "GROUPING", dealWithDoubles = "MAX")
[1] "plum"
[1] "strawberry"
[1] "pear"
[1] "peach"
[1] "apple"
[1] "orange"
[1] "lemon"
[1] "banana"
[1] "plum"
[1] "strawberry"
[1] "pear"
[1] "peach"
[1] "apple"
[1] "orange"
[1] "lemon"
[1] "banana"
[1] "plum"
[1] "strawberry"
[1] "pear"
[1] "peach"
[1] "apple"
[1] "orange"
[1] "lemon"
[1] "banana"
GROUPING CODE MeanSalience SumSalience SmithsS
1 MAINLAND plum 0.8333333 5.000000 0.29411765
2 ISLAND plum 0.7729167 6.183333 0.47564103
3 MOON plum 0.5833333 1.750000 0.12500000
4 MAINLAND strawberry 0.6875000 2.750000 0.16176471
5 ISLAND strawberry 0.5000000 2.000000 0.15384615
6 MOON strawberry 0.4722222 1.416667 0.10119048
7 MAINLAND pear 0.5833333 2.916667 0.17156863
8 ISLAND pear 0.5733333 2.866667 0.22051282
9 MOON pear 0.3500000 1.400000 0.10000000
10 MAINLAND peach 0.6333333 3.166667 0.18627451
11 ISLAND peach 0.7708333 3.083333 0.23717949
12 MOON peach 0.6800000 3.400000 0.24285714
13 MAINLAND apple 0.7320513 9.516667 0.55980392
14 ISLAND apple 0.6500000 5.850000 0.45000000
15 MOON apple 0.8555556 10.266667 0.73333333
16 MAINLAND orange 0.6861111 4.116667 0.24215686
17 ISLAND orange 0.2500000 0.250000 0.01923077
18 MOON orange 0.5000000 1.000000 0.07142857
19 MAINLAND lemon 0.7000000 1.400000 0.08235294
20 ISLAND lemon 0.8000000 2.400000 0.18461538
21 MOON lemon 0.7200000 3.600000 0.25714286
22 MAINLAND banana 0.8888889 2.666667 0.15686275
23 ISLAND banana 0.6708333 5.366667 0.41282051
24 MOON banana 0.8000000 2.400000 0.17142857
In addition to evaluating salience, we might be interested in the percentage of respondents from each group who named a particular item. The code below creates a new object where the data are grouped according to the GROUPING variable.
frequencybygroup <- WorldList %>%
group_by(GROUPING) %>%
mutate(GroupN = length(unique(Subj))) %>%
ungroup %>%
group_by(GROUPING, CODE, GroupN) %>%
summarise(totalResponses = n(), nRespondents = length(unique(Subj)), percentRespondents = round(length(unique(Subj))/first(GroupN) *
100, 2)) %>%
arrange(GROUPING, desc(percentRespondents))
We can then plot the results of this grouping with a facet wrap graph.
ggplot(frequencybygroup, aes(x = reorder(CODE, desc(percentRespondents)), percentRespondents)) +
geom_bar(stat = "identity") + coord_flip() + ggtitle("Frequency of fruit mentions by site") +
labs(x = "Fruit", y = "n") + facet_wrap(vars(GROUPING))
It looks like in all cases, apples were mentioned more frequently, but the different sample populations differ in their rates of mentioning oranges and bananas. One thing you will notice in this graph is that although the fruits are listed in descending order for the “ISLAND” sample, the order is not meaningful for the other sites. This is because in the faceted plot, the y axis is the same for each site, making it easier to compare across samples.
We can also compare the lengths of lists across each group. This shows us that the average list lengths were not very different across the groups, but were slightly higher in the Island population.
WorldList %>%
group_by(GROUPING, Subj) %>%
summarise(n = n()) %>%
ungroup %>%
group_by(GROUPING) %>%
summarize(nResponse = n(), avgLength = mean(n), maxLength = max(n), minLength = min(n),
medianLength = median(n))
# A tibble: 3 × 6
GROUPING nResponse avgLength maxLength minLength medianLength
<chr> <int> <dbl> <int> <int> <dbl>
1 ISLAND 13 3.54 5 2 3
2 MAINLAND 17 3.35 5 2 3
3 MOON 14 3.29 5 2 3
Pile sorts are used to collect data about how people categorize a set of things in a particular cultural/cognitive domain. In this example, I’ve created a sample pile sort dataset of tree species.
Let’s read in and take a look at the data.
treepilesort <- read.csv("https://maddiebrown.github.io/ANTH475/data/tree_pilesort_sampledata_2026.csv")
head(treepilesort)
Person Pile_Num Pile_Name Card_Num Tree_Name
1 Alice 1 nonfood 1 maple
2 Alice 1 nonfood 2 oak
3 Alice 2 fruit 3 apple
4 Alice 2 fruit 4 orange
5 Alice 2 fruit 5 pear
6 Alice 1 nonfood 6 palm
One challenge you might face with pile sort data is that each pile is only given a number, rather than a name. For this example, I already included the tree name in the pile sort data. Sometimes however, you might only have the card number in your observational data. In this case, you will first need to link the card number with the card name. The code below shows how to link these two dataframes.
treecardID <- read.csv("https://maddiebrown.github.io/ANTH475/data/pilesortcards_sampledata.csv")
tree_join <- left_join(treepilesort, treecardID, by = c(Card_Num = "Card_Num"))
str(tree_join)
'data.frame': 380 obs. of 6 variables:
$ Person : chr "Alice" "Alice" "Alice" "Alice" ...
$ Pile_Num : int 1 1 2 2 2 1 1 1 1 1 ...
$ Pile_Name: chr "nonfood" "nonfood" "fruit" "fruit" ...
$ Card_Num : int 1 2 3 4 5 6 7 8 9 10 ...
$ Tree_Name: chr "maple" "oak" "apple" "orange" ...
$ Card_Name: chr "maple" "oak" "apple" "orange" ...
To start working with the pile sort data, we first add in a new unique identifier for each person_pile combination. This step is required because the piles have numeric names that repeat across respondents. Next we can make a presence/absence column that will be useful for creating a new matrix based on whether or not a tree occurs in a particular pile.
treepilesort <- treepilesort %>%
mutate(PersonPile = paste(Person, Pile_Num, sep = "_"))
# make a 'presence' column
treepilesort$present <- rep(1)
Now we can convert the data back into a wide format, where each column is a different tree species and the values represent presence or absence of the tree.
# Spread into wide data table. Note: spread function requires unique
# identifiers for rows, so here we remove any duplicate rows
TreeWide <- treepilesort %>%
select(PersonPile, Tree_Name, present) %>%
unique %>%
spread(Tree_Name, present)
# Convert NAs to 0
TreeWide[is.na(TreeWide)] <- 0
head(TreeWide)
PersonPile alder apple aspen banana beech birch black walnut cedar chestnut
1 Alice_1 1 0 1 0 1 1 0 1 0
2 Alice_2 0 1 0 1 0 0 0 0 0
3 Alice_3 0 0 0 0 0 0 1 0 1
4 Bob_1 1 1 1 0 1 1 1 0 1
5 Bob_2 0 0 0 1 0 0 0 0 0
6 Bob_3 0 0 0 0 0 0 0 1 0
coconut dawn redwood douglas fir gingko horse chestnut japanese maple
1 0 1 1 1 0 1
2 1 0 0 0 0 0
3 0 0 0 0 1 0
4 0 0 0 1 1 1
5 1 0 0 0 0 0
6 0 1 1 0 0 0
laurel oak live oak lodgepole pine longleaf pine mango maple oak orange palm
1 1 1 1 1 0 1 1 0 1
2 0 0 0 0 1 0 0 1 0
3 0 0 0 0 0 0 0 0 0
4 1 1 0 0 1 1 1 1 0
5 0 0 0 0 0 0 0 0 1
6 0 0 1 1 0 0 0 0 0
pear pecan pine ponderosa pine red oak redwood sequioa silver maple spruce
1 0 0 1 1 1 1 1 1 1
2 1 0 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0 0
4 1 1 0 0 1 0 0 1 0
5 0 0 0 0 0 0 0 0 0
6 0 0 1 1 0 1 1 0 1
sugar maple sycamore walnut white oak white pine
1 1 1 0 1 1
2 0 0 0 0 0
3 0 0 1 0 0
4 1 1 1 1 0
5 0 0 0 0 0
6 0 0 0 0 1
## Convert dataframe into presence/absence only by removing the first column.
## First we can assign the PersonPile column to the rownames
rownames(TreeWide) <- TreeWide$PersonPile
# Remove the PersonPile column
TreeWide <- TreeWide[, -1]
# Convert to matrix
TreeWide <- as.matrix(TreeWide)
head(TreeWide)
alder apple aspen banana beech birch black walnut cedar chestnut
Alice_1 1 0 1 0 1 1 0 1 0
Alice_2 0 1 0 1 0 0 0 0 0
Alice_3 0 0 0 0 0 0 1 0 1
Bob_1 1 1 1 0 1 1 1 0 1
Bob_2 0 0 0 1 0 0 0 0 0
Bob_3 0 0 0 0 0 0 0 1 0
coconut dawn redwood douglas fir gingko horse chestnut japanese maple
Alice_1 0 1 1 1 0 1
Alice_2 1 0 0 0 0 0
Alice_3 0 0 0 0 1 0
Bob_1 0 0 0 1 1 1
Bob_2 1 0 0 0 0 0
Bob_3 0 1 1 0 0 0
laurel oak live oak lodgepole pine longleaf pine mango maple oak orange
Alice_1 1 1 1 1 0 1 1 0
Alice_2 0 0 0 0 1 0 0 1
Alice_3 0 0 0 0 0 0 0 0
Bob_1 1 1 0 0 1 1 1 1
Bob_2 0 0 0 0 0 0 0 0
Bob_3 0 0 1 1 0 0 0 0
palm pear pecan pine ponderosa pine red oak redwood sequioa
Alice_1 1 0 0 1 1 1 1 1
Alice_2 0 1 0 0 0 0 0 0
Alice_3 0 0 1 0 0 0 0 0
Bob_1 0 1 1 0 0 1 0 0
Bob_2 1 0 0 0 0 0 0 0
Bob_3 0 0 0 1 1 0 1 1
silver maple spruce sugar maple sycamore walnut white oak white pine
Alice_1 1 1 1 1 0 1 1
Alice_2 0 0 0 0 0 0 0
Alice_3 0 0 0 0 1 0 0
Bob_1 1 0 1 1 1 1 0
Bob_2 0 0 0 0 0 0 0
Bob_3 0 1 0 0 0 0 1
The matrix above is a pile-by-tree matrix. Now let’s create a tree-by-tree co-occurence matrix.
TreebyTree <- crossprod(TreeWide)
Now we can analyze our pile sort data with MDS. This will show us the emergent clusters of how similar or dissimilar the trees are in among interview respondents.
cmdscale() function we used to analyze
the freelist data, conduct a MDS analysis on the tree pile sort
data.tree_mds <- cmdscale(proxy::dist(TreebyTree, convert_similarities = TRUE), k = 2)
# add in labels and dimension names
tree_mds_df <- as.data.frame(tree_mds)
tree_mds_df$TreeName <- row.names(tree_mds)
ggplot(tree_mds_df, aes(x = V1, y = V2, label = TreeName)) + geom_text(size = 2.5,
position = position_jitter(width = 1, height = 2)) + labs(x = "Dimension 1",
y = "Dimension 2") + ggtitle("MDS of Tree Pilesort")
# https://stackoverflow.com/questions/40542685/how-to-jitter-remove-overlap-for-geom-text-labels
Sometimes it can be difficult to prevent the text labels from
overlapping. Here we can turn to the ggrepel package to
prevent overlapping.
# https://www.r-bloggers.com/2016/01/repel-overlapping-text-labels-in-ggplot2-2/
library(ggrepel)
ggplot(tree_mds_df, aes(x = V1, y = V2, label = TreeName)) + geom_text_repel(max.overlaps = 25) +
geom_point()
We can analyze pile sort data as a network, considering how the cards are linked to one another through co-membership in piles. Our original data already comes in a format similar to to a bipartite edgelist, where the Ego is the Tree_Name and the PersonPile is the group in which they are a member. We can also consider the pile names to be shared categories across respondents. In this case, we treat the Tree_Name values as members in the Pile_Name groups.
Let’s create a network graph of the tree pile sort based on the
Pile_Name column. First we select only the columns needed for the
edgelist. Then we can install and load the igraph library
and use graph.data.frame() to make a graph object. Finally,
we can plot the network with the trees in one color and the piles in
another color.
library(igraph)
treeedgelist <- treepilesort %>%
select("Tree_Name", "Pile_Name")
treegraph <- graph.data.frame(treeedgelist, directed = F)
V(treegraph)$Color <- ifelse(V(treegraph)$name %in% treeedgelist$Tree_Name, "lightpink",
"goldenrod")
plot(treegraph, vertex.label.family = "Helvetica", vertex.label.cex = 0.7, vertex.color = V(treegraph)$Color,
vertex.label.color = "navy", edge.arrow.size = 0, vertex.frame.color = "bisque",
edge.color = adjustcolor("grey"), vertex.size = 7)
We can make the graph look a bit nicer by simplifying the edges and adding in weights.
# adding edge weights and simplifying graph:
# https://stackoverflow.com/questions/48698126/adding-edge-weights-from-dataframe-with-igraph-in-r
E(treegraph)$weight <- 1
treegraph2 <- simplify(treegraph, edge.attr.comb = "sum")
E(treegraph2)$weight
[1] 5 5 5 1 1 1 1 1 5 5 5 5 1 1 1 1 1 5 1 1 1 1 1 5 1 1 1 1 1 5 1 1 1 1 1 2 1
[38] 1 1 3 1 1 2 3 1 1 1 1 2 3 1 1 1 1 2 3 1 1 1 1 2 3 1 1 1 1 2 1 1 1 2 1 1 1
[75] 2 1 1 1 2 1 1 1 6 1 1 1 6 1 1 1 1 6 1 1 1 4 1 1 2 1 1 2 1 1 1 2 1 1 1 2 1
[112] 1 3 1 3 1 1 3 1 3 1 1 1 2 5 1 1 1 2 5 1 1 1 2 5 1 1 4 4 4 4 1 3 1 1 1 3 2
[149] 1 1 1 3 2 1 1 1 3 1 1 1 3 1 1 1 3 2 2 1 1 1 3 1 1 1 3 1 1 1 2 1 1 1 2 1 1
[186] 2 1 6 1 1 1 3 1 3 1
plot(treegraph2, vertex.label.family = "Helvetica", vertex.label.cex = 0.7, vertex.color = V(treegraph2)$Color,
vertex.label.color = "navy", edge.arrow.size = 0, vertex.frame.color = "bisque",
edge.color = "lightgrey", vertex.size = 7, edge.width = E(treegraph2)$weight *
1.5)
Looking good, but there’s an issue. igraph is treating each node as the same type, when in fact this is a bipartite network. There are great examples in this tutorial about how to work with bipartite data.
# Select columns of interest
treeedgelist <- treepilesort %>%
select("Tree_Name", "Pile_Name")
# first make sure that the piles are labeled as pile type nodes
treeedgelist <- treeedgelist %>%
mutate(Pile_Name = paste(treeedgelist$Pile_Name, "pile"))
treegraph <- graph.data.frame(treeedgelist, directed = F)
# now we can use igraph's built in bipartite detector to assign type attributes
# to the network
bipartite.mapping(treegraph)
$res
[1] TRUE
$type
maple oak apple
FALSE FALSE FALSE
orange pear palm
FALSE FALSE FALSE
beech aspen birch
FALSE FALSE FALSE
alder white oak red oak
FALSE FALSE FALSE
live oak laurel oak black walnut
FALSE FALSE FALSE
walnut chestnut horse chestnut
FALSE FALSE FALSE
sugar maple silver maple japanese maple
FALSE FALSE FALSE
banana mango redwood
FALSE FALSE FALSE
sequioa dawn redwood pine
FALSE FALSE FALSE
cedar spruce longleaf pine
FALSE FALSE FALSE
white pine douglas fir ponderosa pine
FALSE FALSE FALSE
lodgepole pine gingko sycamore
FALSE FALSE FALSE
pecan coconut nonfood pile
FALSE FALSE TRUE
fruit pile nut pile deciduous pile
TRUE TRUE TRUE
monocot pile conifer pile pile
TRUE TRUE TRUE
maple pile oak pile tropical pile
TRUE TRUE TRUE
ovate leaf pile redwoods pile pine pile
TRUE TRUE TRUE
other conifer pile gingko pile sycamore pile
TRUE TRUE TRUE
street tree pile rosaceae pile citrus pile
TRUE TRUE TRUE
palm pile beech pile aspen pile
TRUE TRUE TRUE
birch pile alder pile walnut pile
TRUE TRUE TRUE
chestnut pile banana pile mango pile
TRUE TRUE TRUE
cedar pile spruce pile fir pile
TRUE TRUE TRUE
pecan pile edible pile nonedible pile
TRUE TRUE TRUE
temperate fruit pile other deciduous pile tropical fruit pile
TRUE TRUE TRUE
temperate pile
TRUE
V(treegraph)$type <- bipartite.mapping(treegraph)$type
treegraph
IGRAPH 46234d2 UN-B 76 380 --
+ attr: name (v/c), type (v/l)
+ edges from 46234d2 (vertex names):
[1] maple --nonfood pile oak --nonfood pile
[3] apple --fruit pile orange --fruit pile
[5] pear --fruit pile palm --nonfood pile
[7] beech --nonfood pile aspen --nonfood pile
[9] birch --nonfood pile alder --nonfood pile
[11] white oak --nonfood pile red oak --nonfood pile
[13] live oak --nonfood pile laurel oak --nonfood pile
[15] black walnut --nut pile walnut --nut pile
+ ... omitted several edges
Now that we have a bipartite network, we can plot the tree pile sort
network with different colors and shapes for the nodes according to
their type. First, we assign a new vertex attribute called
TypeColor using the same ifelse() syntax used
previously to assign vertex colors. Then we add in vertex shapes and
simplify the graph.
V(treegraph)$TypeColor <- ifelse(V(treegraph)$type == TRUE, "lightpink", "goldenrod")
V(treegraph)$TypeShape <- ifelse(V(treegraph)$type == TRUE, "square", "circle")
# adding edge weights and simplifying graph:
# https://stackoverflow.com/questions/48698126/adding-edge-weights-from-dataframe-with-igraph-in-r
E(treegraph)$weight <- 1
treegraph <- simplify(treegraph, edge.attr.comb = "sum")
E(treegraph)$weight
[1] 1 1 1 5 1 1 1 1 1 5 1 1 5 1 1 1 1 1 5 1 1 1 1 1 5 1 1 1 1 1 1 1 1 3 3 1 1
[38] 2 3 1 1 1 1 1 2 3 1 1 1 1 1 2 3 1 1 1 1 1 2 3 1 1 1 1 1 2 5 1 1 1 2 5 1 1
[75] 1 2 5 1 1 1 2 5 1 1 6 1 1 1 1 6 1 1 1 1 6 1 1 1 1 4 1 1 1 2 1 1 2 5 1 1 1
[112] 2 5 1 1 1 2 5 1 1 3 1 3 1 1 1 3 1 3 1 1 1 1 2 5 1 1 1 2 5 1 1 1 2 5 1 1 1
[149] 3 4 1 1 1 3 2 2 1 1 1 3 2 2 1 1 1 3 4 1 1 1 3 4 1 1 1 3 2 2 1 1 1 3 4 1 1
[186] 1 3 4 1 1 1 2 5 1 1 1 2 2 1 1 2 1 6 1 1 1 1 3 1 3 2 1
plot(treegraph, vertex.label.family = "Helvetica", vertex.label.cex = 0.7, vertex.color = V(treegraph)$TypeColor,
vertex.label.color = "navy", vertex.shape = V(treegraph)$TypeShape, edge.arrow.size = 0,
vertex.frame.color = "bisque", edge.color = "lightgrey", vertex.size = 7, edge.width = E(treegraph)$weight *
2)
Here we make an interactive chord diagram showing the links between the pile sort groups and the individual tree species.
More on chord diagrams at the R graph gallery.
# devtools::install_github('mattflor/chorddiag')
library(chorddiag)
# create a dummy variable
treepilesort$count <- rep(1, nrow(treepilesort))
# subset just the columns we want to use
treepilesort2 <- treepilesort[, c("Pile_Name", "Tree_Name", "count")]
# pivot data into wide format
treepilesort2 <- treepilesort2 %>%
pivot_wider(names_from = Tree_Name, values_from = count, values_fn = sum)
# remove first column
treepilesort3 <- treepilesort2 %>%
select(-Pile_Name)
# set rownamees to Pile_Name
row.names(treepilesort3) <- treepilesort2$Pile_Name
# convert to matrix
treepilesort.mat <- as.matrix(treepilesort3)
# make a palette
library(RColorBrewer)
greenspal <- brewer.pal(9, "Greens")[2:9]
purplespal <- brewer.pal(9, "Purples")[2:9]
purples2 <- colorRampPalette(purplespal)(length(unique(row.names(treepilesort.mat))))
greens2 <- colorRampPalette(greenspal)(length(unique(colnames(treepilesort.mat))))
allpal <- c(purples2, greens2)
# now we can create a chord diagram linking func.grp to decades by count.
chord <- chorddiag(treepilesort.mat, type = "bipartite", showTicks = F, groupnameFontsize = 10,
groupnamePadding = 10, margin = 150, groupColors = allpal)
chord
To export your interactive graphic, we can use the
htmlwidgets package.
library(htmlwidgets)
# htmlwidgets::saveWidget(chord, 'treechord.html') see where you graphic output
# to getwd()