ANTH630 Lecture Series Home

Getting started

This lesson will teach you how to work with and analyze cultural domain data using R. This lesson draws in part on datasets and formulas from the Anthrotools package. You can get the data for this lesson here and here.

Load packages and data

# Load packages install.packages('devtools') #devtools lets us install packages
# from github
library("devtools")
# install_github('alastair-JL/AnthroTools')
library(AnthroTools)
library(tidyverse)
library(ggplot2)
library(tidyr)

Second, let’s load the freelist data and assign a name to the dataframe object. Here we will load the FruitList data from the Anthrotools package, which has been made available to you as a .csv file. Take a look at the head() and str() of this spreadsheet. How do you think the data have been organized? What do the rows and columns represent?

FruitList <- read.csv("https://maddiebrown.github.io/ethnoecology/FruitList.csv")
head(FruitList)
  X Subj Order       CODE
1 1    1     1       pear
2 2    1     2     orange
3 3    1     3      apple
4 4    2     1      apple
5 5    2     2      apple
6 6    2     3 strawberry
str(FruitList)
'data.frame':   75 obs. of  4 variables:
 $ X    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Subj : int  1 1 1 2 2 2 3 3 4 4 ...
 $ Order: int  1 2 3 1 2 3 1 2 1 2 ...
 $ CODE : chr  "pear" "orange" "apple" "apple" ...
Click for solution We can see that we have 75 rows with 4 variables. The ‘X’ variable is a numeric index autocreated by excel. The ‘Subj’ variable is a numeric identifier for each interview respondent. The ‘Order’ variable represents the order in which each fruit was named by each respondent. The ‘CODE’ variable contains each fruit named in the freelists.

Try it

How many respondents and unique fruits (‘items’) are in the FruitList data?

Click for solution
# How many respondents are there?
length(unique(FruitList$Subj))
# how many unique fruits are named and what are they?
length(unique(FruitList$CODE))
unique(FruitList$CODE)
From these few operations we now know a bit more about the dataset. There are 20 individuals who collectively named a total of 8 unique fruits.

Next, let’s calculate the average number of fruits named per person and explore the distribution of list lengths across responses.

# How many responses per subject
FruitList %>%
    group_by(Subj) %>%
    count()
# A tibble: 20 × 2
# Groups:   Subj [20]
    Subj     n
   <int> <int>
 1     1     3
 2     2     3
 3     3     2
 4     4     5
 5     5     3
 6     6     3
 7     7     4
 8     8     4
 9     9     3
10    10     3
11    11     5
12    12     5
13    13     4
14    14     5
15    15     3
16    16     5
17    17     2
18    18     5
19    19     3
20    20     5
# What is the average number of responses per subject?
FruitList %>%
    group_by(Subj) %>%
    count() %>%
    ungroup() %>%
    summarise(mean = mean(n))
# A tibble: 1 × 1
   mean
  <dbl>
1  3.75

We can also do this in two parts by creating a new dataframe of the summarized data and then calculating the mean value in the “n” column.

BySubject <- FruitList %>%
    group_by(Subj) %>%
    count()
mean(BySubject$n)
[1] 3.75
# This method also allows you to generate additional summary information
summary(BySubject$n)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.00    3.00    3.50    3.75    5.00    5.00 
# We can also make a table showing how many respondents had lists of length 2,
# 3, 4, and 5
table(BySubject$n)

2 3 4 5 
2 8 3 7 

Item frequency analysis

Next, let’s analyze how frequently each fruit was mentioned. For this analysis we are going to create a new dataframe object called ‘ByFruit’ that groups the data according to each unique fruit and counts the number of rows.

ByFruit <- FruitList %>%
    group_by(CODE) %>%
    count() %>%
    arrange(desc(n))
ByFruit
# A tibble: 8 × 2
# Groups:   CODE [8]
  CODE           n
  <chr>      <int>
1 apple         25
2 pear          10
3 plum           9
4 strawberry     9
5 banana         8
6 orange         7
7 lemon          5
8 peach          2

Try it

Plot the number of times each type of fruit is mentioned by interviewees. Make a bar plot with flipped coordinates so that each fruit name is on the y axis and the number of mentions are on the x axis.

Click for solution
ggplot(ByFruit, aes(x = reorder(CODE, n), y = n)) + geom_bar(stat = "identity") +
    coord_flip() + ggtitle("Frequency of fruit mentions") + labs(x = "Fruit", y = "n")

Scree Plots

Another way to visualize these data is through a scree plot. In these types of plots data are arranged such that each point represents a single item, the y-value for which is its frequency of mention in the dataset. These types of plots can be used to quickly identify trends and cut-off points in the data.

ggplot(ByFruit, aes(x = reorder(CODE, desc(n)), y = n, group = 1)) + geom_point() +
    geom_line()

Salience calculations

One of the main insights than can be learned from freelist data is the relative cultural salience of items in particular domains. For example, in the domain of household chores, we could determine whether vacuuming, washing dishes, shoveling the driveway, or feeding the snake are considered more salient or central to the idea of household chores compared to other tasks. We might find variations based on individual attributes or the cultural context in which the question is asked. Salience often mirrors frequency, but the calculation is a bit more complicated as it considers both an item’s frequency of mention and the order in which it is usually listed.

Luckily, AnthroTools has a built in salience calculation function that can do the math for us. The code below calculates the salience of each fruit listed in the context of each individual interviewee.

FruitListSalience <- CalculateSalience(FruitList, Order = "Order", Subj = "Subj",
    CODE = "CODE")
# Note: I have included the arguments for Order, Subj and CODE for illustrative
# purposes. Because these column names match the arguments, you do not actually
# need to include them in this case. However, when you are working with your
# own datasets you may have different column names, so it is helpful to keep
# the underlying structure of functions in mind when you deploy them.
FruitListSalience
    X Subj Order       CODE  Salience
1   1    1     1       pear 1.0000000
2   2    1     2     orange 0.6666667
3   3    1     3      apple 0.3333333
4   4    2     1      apple 1.0000000
5   5    2     2      apple 0.6666667
6   6    2     3 strawberry 0.3333333
7   7    3     1     banana 1.0000000
8   8    3     2 strawberry 0.5000000
9   9    4     1      apple 1.0000000
10 10    4     2       plum 0.8000000
11 11    4     3     banana 0.6000000
12 12    4     4     orange 0.4000000
13 13    4     5      apple 0.2000000
14 14    5     1 strawberry 1.0000000
15 15    5     2      apple 0.6666667
16 16    5     3      apple 0.3333333
17 17    6     1      apple 1.0000000
18 18    6     2      lemon 0.6666667
19 19    6     3      apple 0.3333333
20 20    7     1     banana 1.0000000
21 21    7     2       plum 0.7500000
22 22    7     3      lemon 0.5000000
23 23    7     4      lemon 0.2500000
24 24    8     1 strawberry 1.0000000
25 25    8     2      apple 0.7500000
26 26    8     3       pear 0.5000000
27 27    8     4      apple 0.2500000
28 28    9     1      apple 1.0000000
29 29    9     2      apple 0.6666667
30 30    9     3       plum 0.3333333
31 31   10     1      apple 1.0000000
32 32   10     2       pear 0.6666667
33 33   10     3      apple 0.3333333
34 34   11     1      apple 1.0000000
35 35   11     2 strawberry 0.8000000
36 36   11     3     banana 0.6000000
37 37   11     4      apple 0.4000000
38 38   11     5      lemon 0.2000000
39 39   12     1      apple 1.0000000
40 40   12     2     orange 0.8000000
41 41   12     3       pear 0.6000000
42 42   12     4      apple 0.4000000
43 43   12     5     orange 0.2000000
44 44   13     1       plum 1.0000000
45 45   13     2      peach 0.7500000
46 46   13     3 strawberry 0.5000000
47 47   13     4     banana 0.2500000
48 48   14     1       pear 1.0000000
49 49   14     2      apple 0.8000000
50 50   14     3     banana 0.6000000
51 51   14     4       pear 0.4000000
52 52   14     5 strawberry 0.2000000
53 53   15     1       plum 1.0000000
54 54   15     2 strawberry 0.6666667
55 55   15     3      apple 0.3333333
56 56   16     1     banana 1.0000000
57 57   16     2       plum 0.8000000
58 58   16     3       plum 0.6000000
59 59   16     4       pear 0.4000000
60 60   16     5       pear 0.2000000
61 61   17     1     orange 1.0000000
62 62   17     2      apple 0.5000000
63 63   18     1      peach 1.0000000
64 64   18     2     banana 0.8000000
65 65   18     3     orange 0.6000000
66 66   18     4       pear 0.4000000
67 67   18     5      apple 0.2000000
68 68   19     1       pear 1.0000000
69 69   19     2      lemon 0.6666667
70 70   19     3 strawberry 0.3333333
71 71   20     1       plum 1.0000000
72 72   20     2      apple 0.8000000
73 73   20     3       plum 0.6000000
74 74   20     4     orange 0.4000000
75 75   20     5      apple 0.2000000

The above code calculates the salience for each item by respondent. If you inspect the results, you’ll see that the first item in each list has a salience of 1, with each subsequent item decreasing in relative salience. This is useful for understanding how an individual thinks about the domain of fruits, but what if we are interested in knowing how salient apples are across all responses? We can calculate the salience of particular items as well with the SalienceByCode() function.

Try it

Try running the SalienceByCode() function on the new salience dataframe we made above.

Click for solution
SalienceByFruit <- SalienceByCode(FruitListSalience, dealWithDoubles = "MAX")
        CODE MeanSalience SumSalience   SmithsS
1       pear    0.6958333    5.566667 0.2783333
2     orange    0.6444444    3.866667 0.1933333
3      apple    0.7588889   11.383333 0.5691667
4 strawberry    0.5925926    5.333333 0.2666667
5     banana    0.7312500    5.850000 0.2925000
6       plum    0.8119048    5.683333 0.2841667
7      lemon    0.5083333    2.033333 0.1016667
8      peach    0.8750000    1.750000 0.0875000
# The dealwithdoubles argument tells R what to do if a respondent lists the
# same item twice. There are a few different options available for this, the
# right one to pick will depend on your data and research question.

From this analysis we can see that there are slight differences in the mean salience and Smith’s S. Smith’s S considers the length of lists in its calculation of salience (More info here).

Now let’s plot the Smith’s S results in decreasing order and add a vertical line at the 0.1 mark. This range is generally considered a benchmark level for assessing item salience in freelists.

ggplot(SalienceByFruit, aes(x = reorder(CODE, SmithsS), y = SmithsS)) + geom_bar(stat = "identity") +
    coord_flip() + ggtitle("Fruit Salience") + labs(x = "Fruit", y = "Smith's S") +
    geom_hline(yintercept = 0.1)

From this plot, it looks like most of the fruits could be considered salient in this dataset. However, although pears were mentioned slightly more frequently than bananas and plums, their salience is lower when their overall order within lists and other factors are taken into account.

Multidimensional scaling of freelist data

We can use multidimensional scaling to determine which items are more central or peripheral in the domain of “fruits” for interviewees. First we need to make a presence absence table to compare the fruits to one another.

Try it

Using FreeListTable() make a table indicating whether or not each respondent mentioned a particular fruit. Hint: This is a presence/absence table. Hint: Run ?FreeListTable() if you aren’t sure where to start.

Click for solution
FruitListTable <- FreeListTable(FruitList, CODE = "CODE", Salience = "Salience",
    Subj = "Subj", tableType = "PRESENCE")

Now we will transform the data to be in the right format for MDS.

# add new count column
FruitList$present <- rep(1)
# Spread into wide datatable. Note: spread function requires unique identifiers
# for rows, so here we remove any duplicate rows
FruitListWide <- FruitList %>%
    select(Subj, CODE, present) %>%
    unique %>%
    spread(CODE, present)
# convert NAs to 0
FruitListWide[is.na(FruitListWide)] <- 0
head(FruitListWide)
  Subj apple banana lemon orange peach pear plum strawberry
1    1     1      0     0      1     0    1    0          0
2    2     1      0     0      0     0    0    0          1
3    3     0      1     0      0     0    0    0          1
4    4     1      1     0      1     0    0    1          0
5    5     1      0     0      0     0    0    0          1
6    6     1      0     1      0     0    0    0          0

## Convert dataframe into presence/absence only by removing the first column.
## First we can assign the Subj column to the rownames
rownames(FruitListWide) <- FruitListWide$Subj
# Remove the first column
FruitListWide <- FruitListWide[, -1]
# Convert to matrix
FruitListWide <- as.matrix(FruitListWide)
head(FruitListWide)
  apple banana lemon orange peach pear plum strawberry
1     1      0     0      1     0    1    0          0
2     1      0     0      0     0    0    0          1
3     0      1     0      0     0    0    0          1
4     1      1     0      1     0    0    1          0
5     1      0     0      0     0    0    0          1
6     1      0     1      0     0    0    0          0

# this is a people by fruit table. now we want to make a fruit by fruit table
FruitbyFruit <- crossprod(FruitListWide)

There are many ways to conduct MDS in R. Some examples and tutorials on completing MDS in R are included below.

We can use the matrix we just created in our multidimensional scaling analysis. First, let’s make a distance/similarity matrix.

Fruit_mds <- cmdscale(proxy::dist(FruitbyFruit, convert_similarities = TRUE), k = 2)

# add in labels and dimension names
Fruit_mds_df <- as.data.frame(Fruit_mds)
Fruit_mds_df$FruitName <- row.names(Fruit_mds)

Plotting freelist MDS

# plot the results
ggplot(Fruit_mds_df, aes(x = V1, y = V2, label = FruitName)) + geom_text(size = 2.5,
    position = position_jitter(width = 1, height = 2)) + labs(x = "Dimension 1",
    y = "Dimension 2") + ggtitle("MDS of Fruit Pilesort")


# https://stackoverflow.com/questions/40542685/how-to-jitter-remove-overlap-for-geom-text-labels

More details on MDS and freelist analysis here.

Comparing across groups

It is also possible to compare item salience across different groups of respondents. The following example comes from the AnthroTools package which includes a sample grouping of the FruitList data. First, let’s load the new dataset. Because it is included in the AnthroTools package, we can load it directly with the data() function.

data("WorldList")
head(WorldList)
  Subj Order       CODE GROUPING
1    1     1       plum MAINLAND
2    1     2 strawberry MAINLAND
3    1     3       pear MAINLAND
4    1     4      peach MAINLAND
5    2     1      apple MAINLAND
6    2     2     orange MAINLAND

Try it

Try calculating the salience for each fruit, adding in an argument to the CalculateSalience() function that differentiates responses by GROUPING. The argument is conveniently called GROUPING. First calculate the salience of each item by response (CalculateSalience()), then calculate the salience of each item (SalienceByCode()). Refer to the code we ran earlier in this lesson as a template to write your solution.

Click for solution
FL1 <- CalculateSalience(WorldList, Order = "Order", Subj = "Subj", CODE = "CODE",
    GROUPING = "GROUPING")
# Note, this function will not run without adding in the GROUPING argument.
FL2 <- SalienceByCode(FL1, GROUPING = "GROUPING", dealWithDoubles = "MAX")
[1] "plum"
[1] "strawberry"
[1] "pear"
[1] "peach"
[1] "apple"
[1] "orange"
[1] "lemon"
[1] "banana"
[1] "plum"
[1] "strawberry"
[1] "pear"
[1] "peach"
[1] "apple"
[1] "orange"
[1] "lemon"
[1] "banana"
[1] "plum"
[1] "strawberry"
[1] "pear"
[1] "peach"
[1] "apple"
[1] "orange"
[1] "lemon"
[1] "banana"
   GROUPING       CODE MeanSalience SumSalience    SmithsS
1  MAINLAND       plum    0.8333333    5.000000 0.29411765
2    ISLAND       plum    0.7729167    6.183333 0.47564103
3      MOON       plum    0.5833333    1.750000 0.12500000
4  MAINLAND strawberry    0.6875000    2.750000 0.16176471
5    ISLAND strawberry    0.5000000    2.000000 0.15384615
6      MOON strawberry    0.4722222    1.416667 0.10119048
7  MAINLAND       pear    0.5833333    2.916667 0.17156863
8    ISLAND       pear    0.5733333    2.866667 0.22051282
9      MOON       pear    0.3500000    1.400000 0.10000000
10 MAINLAND      peach    0.6333333    3.166667 0.18627451
11   ISLAND      peach    0.7708333    3.083333 0.23717949
12     MOON      peach    0.6800000    3.400000 0.24285714
13 MAINLAND      apple    0.7320513    9.516667 0.55980392
14   ISLAND      apple    0.6500000    5.850000 0.45000000
15     MOON      apple    0.8555556   10.266667 0.73333333
16 MAINLAND     orange    0.6861111    4.116667 0.24215686
17   ISLAND     orange    0.2500000    0.250000 0.01923077
18     MOON     orange    0.5000000    1.000000 0.07142857
19 MAINLAND      lemon    0.7000000    1.400000 0.08235294
20   ISLAND      lemon    0.8000000    2.400000 0.18461538
21     MOON      lemon    0.7200000    3.600000 0.25714286
22 MAINLAND     banana    0.8888889    2.666667 0.15686275
23   ISLAND     banana    0.6708333    5.366667 0.41282051
24     MOON     banana    0.8000000    2.400000 0.17142857

Frequency by group

In addition to evaluating salience, we might be interested in the percentage of respondents from each group who named a particular item. The code below creates a new object where the data are grouped according to the GROUPING variable.

frequencybygroup <- WorldList %>%
    group_by(GROUPING) %>%
    mutate(GroupN = length(unique(Subj))) %>%
    ungroup %>%
    group_by(GROUPING, CODE, GroupN) %>%
    summarise(totalResponses = n(), nRespondents = length(unique(Subj)), percentRespondents = round(length(unique(Subj))/first(GroupN) *
        100, 2)) %>%
    arrange(GROUPING, desc(percentRespondents))

We can then plot the results of this grouping with a facet wrap graph.

ggplot(frequencybygroup, aes(x = reorder(CODE, desc(percentRespondents)), percentRespondents)) +
    geom_bar(stat = "identity") + coord_flip() + ggtitle("Frequency of fruit mentions by site") +
    labs(x = "Fruit", y = "n") + facet_wrap(vars(GROUPING))

It looks like in all cases, apples were mentioned more frequently, but the different sample populations differ in their rates of mentioning oranges and bananas. One thing you will notice in this graph is that although the fruits are listed in descending order for the “ISLAND” sample, the order is not meaningful for the other sites. This is because in the faceted plot, the y axis is the same for each site, making it easier to compare across samples.

Comparing freelist lengths across groups

We can also compare the lengths of lists across each group. This shows us that the average list lengths were not very different across the groups, but were slightly higher in the Island population.

WorldList %>%
    group_by(GROUPING, Subj) %>%
    summarise(n = n()) %>%
    ungroup %>%
    group_by(GROUPING) %>%
    summarize(nResponse = n(), avgLength = mean(n), maxLength = max(n), minLength = min(n),
        medianLength = median(n))
# A tibble: 3 × 6
  GROUPING nResponse avgLength maxLength minLength medianLength
  <chr>        <int>     <dbl>     <int>     <int>        <dbl>
1 ISLAND          13      3.54         5         2            3
2 MAINLAND        17      3.35         5         2            3
3 MOON            14      3.29         5         2            3

Pile Sorts

Pile sorts are used to collect data about how people categorize a set of things in a particular cultural/cognitive domain. In this example, I’ve created a sample pile sort dataset of tree species.

Let’s read in and take a look at the data.

treepilesort <- read.csv("https://maddiebrown.github.io/ANTH475/data/tree_pilesort_sampledata_2026.csv")

head(treepilesort)
  Person Pile_Num Pile_Name Card_Num Tree_Name
1  Alice        1   nonfood        1     maple
2  Alice        1   nonfood        2       oak
3  Alice        2     fruit        3     apple
4  Alice        2     fruit        4    orange
5  Alice        2     fruit        5      pear
6  Alice        1   nonfood        6      palm

Linking pile numbers to pile names

One challenge you might face with pile sort data is that each pile is only given a number, rather than a name. For this example, I already included the tree name in the pile sort data. Sometimes however, you might only have the card number in your observational data. In this case, you will first need to link the card number with the card name. The code below shows how to link these two dataframes.

treecardID <- read.csv("https://maddiebrown.github.io/ANTH475/data/pilesortcards_sampledata.csv")

tree_join <- left_join(treepilesort, treecardID, by = c(Card_Num = "Card_Num"))
str(tree_join)
'data.frame':   380 obs. of  6 variables:
 $ Person   : chr  "Alice" "Alice" "Alice" "Alice" ...
 $ Pile_Num : int  1 1 2 2 2 1 1 1 1 1 ...
 $ Pile_Name: chr  "nonfood" "nonfood" "fruit" "fruit" ...
 $ Card_Num : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Tree_Name: chr  "maple" "oak" "apple" "orange" ...
 $ Card_Name: chr  "maple" "oak" "apple" "orange" ...

Data wrangling

To start working with the pile sort data, we first add in a new unique identifier for each person_pile combination. This step is required because the piles have numeric names that repeat across respondents. Next we can make a presence/absence column that will be useful for creating a new matrix based on whether or not a tree occurs in a particular pile.

treepilesort <- treepilesort %>%
    mutate(PersonPile = paste(Person, Pile_Num, sep = "_"))
# make a 'presence' column
treepilesort$present <- rep(1)

Now we can convert the data back into a wide format, where each column is a different tree species and the values represent presence or absence of the tree.

# Spread into wide data table. Note: spread function requires unique
# identifiers for rows, so here we remove any duplicate rows
TreeWide <- treepilesort %>%
    select(PersonPile, Tree_Name, present) %>%
    unique %>%
    spread(Tree_Name, present)
# Convert NAs to 0
TreeWide[is.na(TreeWide)] <- 0
head(TreeWide)
  PersonPile alder apple aspen banana beech birch black walnut cedar chestnut
1    Alice_1     1     0     1      0     1     1            0     1        0
2    Alice_2     0     1     0      1     0     0            0     0        0
3    Alice_3     0     0     0      0     0     0            1     0        1
4      Bob_1     1     1     1      0     1     1            1     0        1
5      Bob_2     0     0     0      1     0     0            0     0        0
6      Bob_3     0     0     0      0     0     0            0     1        0
  coconut dawn redwood douglas fir gingko horse chestnut japanese maple
1       0            1           1      1              0              1
2       1            0           0      0              0              0
3       0            0           0      0              1              0
4       0            0           0      1              1              1
5       1            0           0      0              0              0
6       0            1           1      0              0              0
  laurel oak live oak lodgepole pine longleaf pine mango maple oak orange palm
1          1        1              1             1     0     1   1      0    1
2          0        0              0             0     1     0   0      1    0
3          0        0              0             0     0     0   0      0    0
4          1        1              0             0     1     1   1      1    0
5          0        0              0             0     0     0   0      0    1
6          0        0              1             1     0     0   0      0    0
  pear pecan pine ponderosa pine red oak redwood sequioa silver maple spruce
1    0     0    1              1       1       1       1            1      1
2    1     0    0              0       0       0       0            0      0
3    0     1    0              0       0       0       0            0      0
4    1     1    0              0       1       0       0            1      0
5    0     0    0              0       0       0       0            0      0
6    0     0    1              1       0       1       1            0      1
  sugar maple sycamore walnut white oak white pine
1           1        1      0         1          1
2           0        0      0         0          0
3           0        0      1         0          0
4           1        1      1         1          0
5           0        0      0         0          0
6           0        0      0         0          1
## Convert dataframe into presence/absence only by removing the first column.
## First we can assign the PersonPile column to the rownames
rownames(TreeWide) <- TreeWide$PersonPile
# Remove the PersonPile column
TreeWide <- TreeWide[, -1]
# Convert to matrix
TreeWide <- as.matrix(TreeWide)
head(TreeWide)
        alder apple aspen banana beech birch black walnut cedar chestnut
Alice_1     1     0     1      0     1     1            0     1        0
Alice_2     0     1     0      1     0     0            0     0        0
Alice_3     0     0     0      0     0     0            1     0        1
Bob_1       1     1     1      0     1     1            1     0        1
Bob_2       0     0     0      1     0     0            0     0        0
Bob_3       0     0     0      0     0     0            0     1        0
        coconut dawn redwood douglas fir gingko horse chestnut japanese maple
Alice_1       0            1           1      1              0              1
Alice_2       1            0           0      0              0              0
Alice_3       0            0           0      0              1              0
Bob_1         0            0           0      1              1              1
Bob_2         1            0           0      0              0              0
Bob_3         0            1           1      0              0              0
        laurel oak live oak lodgepole pine longleaf pine mango maple oak orange
Alice_1          1        1              1             1     0     1   1      0
Alice_2          0        0              0             0     1     0   0      1
Alice_3          0        0              0             0     0     0   0      0
Bob_1            1        1              0             0     1     1   1      1
Bob_2            0        0              0             0     0     0   0      0
Bob_3            0        0              1             1     0     0   0      0
        palm pear pecan pine ponderosa pine red oak redwood sequioa
Alice_1    1    0     0    1              1       1       1       1
Alice_2    0    1     0    0              0       0       0       0
Alice_3    0    0     1    0              0       0       0       0
Bob_1      0    1     1    0              0       1       0       0
Bob_2      1    0     0    0              0       0       0       0
Bob_3      0    0     0    1              1       0       1       1
        silver maple spruce sugar maple sycamore walnut white oak white pine
Alice_1            1      1           1        1      0         1          1
Alice_2            0      0           0        0      0         0          0
Alice_3            0      0           0        0      1         0          0
Bob_1              1      0           1        1      1         1          0
Bob_2              0      0           0        0      0         0          0
Bob_3              0      1           0        0      0         0          1

The matrix above is a pile-by-tree matrix. Now let’s create a tree-by-tree co-occurence matrix.

TreebyTree <- crossprod(TreeWide)

MDS on pile sort data

Now we can analyze our pile sort data with MDS. This will show us the emergent clusters of how similar or dissimilar the trees are in among interview respondents.

Try it

  1. Using the same cmdscale() function we used to analyze the freelist data, conduct a MDS analysis on the tree pile sort data.
  2. Plot the MDS results using ggplot. Hint: See the code used for the freelist data above and modify it for the pile sort dataset.
Click for solution
tree_mds <- cmdscale(proxy::dist(TreebyTree, convert_similarities = TRUE), k = 2)

# add in labels and dimension names
tree_mds_df <- as.data.frame(tree_mds)
tree_mds_df$TreeName <- row.names(tree_mds)

ggplot(tree_mds_df, aes(x = V1, y = V2, label = TreeName)) + geom_text(size = 2.5,
    position = position_jitter(width = 1, height = 2)) + labs(x = "Dimension 1",
    y = "Dimension 2") + ggtitle("MDS of Tree Pilesort")


# https://stackoverflow.com/questions/40542685/how-to-jitter-remove-overlap-for-geom-text-labels

Sometimes it can be difficult to prevent the text labels from overlapping. Here we can turn to the ggrepel package to prevent overlapping.

# https://www.r-bloggers.com/2016/01/repel-overlapping-text-labels-in-ggplot2-2/
library(ggrepel)
ggplot(tree_mds_df, aes(x = V1, y = V2, label = TreeName)) + geom_text_repel(max.overlaps = 25) +
    geom_point()

Network analysis on pile sort data

We can analyze pile sort data as a network, considering how the cards are linked to one another through co-membership in piles. Our original data already comes in a format similar to to a bipartite edgelist, where the Ego is the Tree_Name and the PersonPile is the group in which they are a member. We can also consider the pile names to be shared categories across respondents. In this case, we treat the Tree_Name values as members in the Pile_Name groups.

Let’s create a network graph of the tree pile sort based on the Pile_Name column. First we select only the columns needed for the edgelist. Then we can install and load the igraph library and use graph.data.frame() to make a graph object. Finally, we can plot the network with the trees in one color and the piles in another color.

library(igraph)

treeedgelist <- treepilesort %>%
    select("Tree_Name", "Pile_Name")
treegraph <- graph.data.frame(treeedgelist, directed = F)

V(treegraph)$Color <- ifelse(V(treegraph)$name %in% treeedgelist$Tree_Name, "lightpink",
    "goldenrod")

plot(treegraph, vertex.label.family = "Helvetica", vertex.label.cex = 0.7, vertex.color = V(treegraph)$Color,
    vertex.label.color = "navy", edge.arrow.size = 0, vertex.frame.color = "bisque",
    edge.color = adjustcolor("grey"), vertex.size = 7)

We can make the graph look a bit nicer by simplifying the edges and adding in weights.

# adding edge weights and simplifying graph:
# https://stackoverflow.com/questions/48698126/adding-edge-weights-from-dataframe-with-igraph-in-r
E(treegraph)$weight <- 1
treegraph2 <- simplify(treegraph, edge.attr.comb = "sum")
E(treegraph2)$weight
  [1] 5 5 5 1 1 1 1 1 5 5 5 5 1 1 1 1 1 5 1 1 1 1 1 5 1 1 1 1 1 5 1 1 1 1 1 2 1
 [38] 1 1 3 1 1 2 3 1 1 1 1 2 3 1 1 1 1 2 3 1 1 1 1 2 3 1 1 1 1 2 1 1 1 2 1 1 1
 [75] 2 1 1 1 2 1 1 1 6 1 1 1 6 1 1 1 1 6 1 1 1 4 1 1 2 1 1 2 1 1 1 2 1 1 1 2 1
[112] 1 3 1 3 1 1 3 1 3 1 1 1 2 5 1 1 1 2 5 1 1 1 2 5 1 1 4 4 4 4 1 3 1 1 1 3 2
[149] 1 1 1 3 2 1 1 1 3 1 1 1 3 1 1 1 3 2 2 1 1 1 3 1 1 1 3 1 1 1 2 1 1 1 2 1 1
[186] 2 1 6 1 1 1 3 1 3 1

plot(treegraph2, vertex.label.family = "Helvetica", vertex.label.cex = 0.7, vertex.color = V(treegraph2)$Color,
    vertex.label.color = "navy", edge.arrow.size = 0, vertex.frame.color = "bisque",
    edge.color = "lightgrey", vertex.size = 7, edge.width = E(treegraph2)$weight *
        1.5)

Bipartite network

Looking good, but there’s an issue. igraph is treating each node as the same type, when in fact this is a bipartite network. There are great examples in this tutorial about how to work with bipartite data.

# Select columns of interest
treeedgelist <- treepilesort %>%
    select("Tree_Name", "Pile_Name")
# first make sure that the piles are labeled as pile type nodes
treeedgelist <- treeedgelist %>%
    mutate(Pile_Name = paste(treeedgelist$Pile_Name, "pile"))
treegraph <- graph.data.frame(treeedgelist, directed = F)
# now we can use igraph's built in bipartite detector to assign type attributes
# to the network
bipartite.mapping(treegraph)
$res
[1] TRUE

$type
               maple                  oak                apple 
               FALSE                FALSE                FALSE 
              orange                 pear                 palm 
               FALSE                FALSE                FALSE 
               beech                aspen                birch 
               FALSE                FALSE                FALSE 
               alder            white oak              red oak 
               FALSE                FALSE                FALSE 
            live oak           laurel oak         black walnut 
               FALSE                FALSE                FALSE 
              walnut             chestnut       horse chestnut 
               FALSE                FALSE                FALSE 
         sugar maple         silver maple       japanese maple 
               FALSE                FALSE                FALSE 
              banana                mango              redwood 
               FALSE                FALSE                FALSE 
             sequioa         dawn redwood                 pine 
               FALSE                FALSE                FALSE 
               cedar               spruce        longleaf pine 
               FALSE                FALSE                FALSE 
          white pine          douglas fir       ponderosa pine 
               FALSE                FALSE                FALSE 
      lodgepole pine               gingko             sycamore 
               FALSE                FALSE                FALSE 
               pecan              coconut         nonfood pile 
               FALSE                FALSE                 TRUE 
          fruit pile             nut pile       deciduous pile 
                TRUE                 TRUE                 TRUE 
        monocot pile         conifer pile                 pile 
                TRUE                 TRUE                 TRUE 
          maple pile             oak pile        tropical pile 
                TRUE                 TRUE                 TRUE 
     ovate leaf pile        redwoods pile            pine pile 
                TRUE                 TRUE                 TRUE 
  other conifer pile          gingko pile        sycamore pile 
                TRUE                 TRUE                 TRUE 
    street tree pile        rosaceae pile          citrus pile 
                TRUE                 TRUE                 TRUE 
           palm pile           beech pile           aspen pile 
                TRUE                 TRUE                 TRUE 
          birch pile           alder pile          walnut pile 
                TRUE                 TRUE                 TRUE 
       chestnut pile          banana pile           mango pile 
                TRUE                 TRUE                 TRUE 
          cedar pile          spruce pile             fir pile 
                TRUE                 TRUE                 TRUE 
          pecan pile          edible pile       nonedible pile 
                TRUE                 TRUE                 TRUE 
temperate fruit pile other deciduous pile  tropical fruit pile 
                TRUE                 TRUE                 TRUE 
      temperate pile 
                TRUE 
V(treegraph)$type <- bipartite.mapping(treegraph)$type
treegraph
IGRAPH 46234d2 UN-B 76 380 -- 
+ attr: name (v/c), type (v/l)
+ edges from 46234d2 (vertex names):
 [1] maple         --nonfood pile oak           --nonfood pile
 [3] apple         --fruit pile   orange        --fruit pile  
 [5] pear          --fruit pile   palm          --nonfood pile
 [7] beech         --nonfood pile aspen         --nonfood pile
 [9] birch         --nonfood pile alder         --nonfood pile
[11] white oak     --nonfood pile red oak       --nonfood pile
[13] live oak      --nonfood pile laurel oak    --nonfood pile
[15] black walnut  --nut pile     walnut        --nut pile    
+ ... omitted several edges

Now that we have a bipartite network, we can plot the tree pile sort network with different colors and shapes for the nodes according to their type. First, we assign a new vertex attribute called TypeColor using the same ifelse() syntax used previously to assign vertex colors. Then we add in vertex shapes and simplify the graph.

V(treegraph)$TypeColor <- ifelse(V(treegraph)$type == TRUE, "lightpink", "goldenrod")
V(treegraph)$TypeShape <- ifelse(V(treegraph)$type == TRUE, "square", "circle")

# adding edge weights and simplifying graph:
# https://stackoverflow.com/questions/48698126/adding-edge-weights-from-dataframe-with-igraph-in-r
E(treegraph)$weight <- 1
treegraph <- simplify(treegraph, edge.attr.comb = "sum")
E(treegraph)$weight
  [1] 1 1 1 5 1 1 1 1 1 5 1 1 5 1 1 1 1 1 5 1 1 1 1 1 5 1 1 1 1 1 1 1 1 3 3 1 1
 [38] 2 3 1 1 1 1 1 2 3 1 1 1 1 1 2 3 1 1 1 1 1 2 3 1 1 1 1 1 2 5 1 1 1 2 5 1 1
 [75] 1 2 5 1 1 1 2 5 1 1 6 1 1 1 1 6 1 1 1 1 6 1 1 1 1 4 1 1 1 2 1 1 2 5 1 1 1
[112] 2 5 1 1 1 2 5 1 1 3 1 3 1 1 1 3 1 3 1 1 1 1 2 5 1 1 1 2 5 1 1 1 2 5 1 1 1
[149] 3 4 1 1 1 3 2 2 1 1 1 3 2 2 1 1 1 3 4 1 1 1 3 4 1 1 1 3 2 2 1 1 1 3 4 1 1
[186] 1 3 4 1 1 1 2 5 1 1 1 2 2 1 1 2 1 6 1 1 1 1 3 1 3 2 1

plot(treegraph, vertex.label.family = "Helvetica", vertex.label.cex = 0.7, vertex.color = V(treegraph)$TypeColor,
    vertex.label.color = "navy", vertex.shape = V(treegraph)$TypeShape, edge.arrow.size = 0,
    vertex.frame.color = "bisque", edge.color = "lightgrey", vertex.size = 7, edge.width = E(treegraph)$weight *
        2)

Chord diagram of pile sort data

Here we make an interactive chord diagram showing the links between the pile sort groups and the individual tree species.

More on chord diagrams at the R graph gallery.

# devtools::install_github('mattflor/chorddiag')
library(chorddiag)

# create a dummy variable
treepilesort$count <- rep(1, nrow(treepilesort))

# subset just the columns we want to use
treepilesort2 <- treepilesort[, c("Pile_Name", "Tree_Name", "count")]

# pivot data into wide format
treepilesort2 <- treepilesort2 %>%
    pivot_wider(names_from = Tree_Name, values_from = count, values_fn = sum)

# remove first column
treepilesort3 <- treepilesort2 %>%
    select(-Pile_Name)

# set rownamees to Pile_Name
row.names(treepilesort3) <- treepilesort2$Pile_Name

# convert to matrix
treepilesort.mat <- as.matrix(treepilesort3)

# make a palette
library(RColorBrewer)
greenspal <- brewer.pal(9, "Greens")[2:9]
purplespal <- brewer.pal(9, "Purples")[2:9]
purples2 <- colorRampPalette(purplespal)(length(unique(row.names(treepilesort.mat))))
greens2 <- colorRampPalette(greenspal)(length(unique(colnames(treepilesort.mat))))
allpal <- c(purples2, greens2)


# now we can create a chord diagram linking func.grp to decades by count.
chord <- chorddiag(treepilesort.mat, type = "bipartite", showTicks = F, groupnameFontsize = 10,
    groupnamePadding = 10, margin = 150, groupColors = allpal)
chord

To export your interactive graphic, we can use the htmlwidgets package.

library(htmlwidgets)
# htmlwidgets::saveWidget(chord, 'treechord.html') see where you graphic output
# to getwd()