This week we learn how to work with text data in R: how to turn documents into word lists, analyze frequency counts, extract bigrams, analyze sentiment and parts of speech, and visualize text analyses. We follow the procedures outlined in Silge and Robinson’s Text Mining with R: A Tidy Approach.
Project Gutenberg is an awesome source for text data. For this
tutorial we will analyze Darwin’s On the Origin of
Species. The text can be downloaded from the Project Gutenberg website or,
for simplicity, retrieved directly with the
gutenbergr package.
First, load all of the libraries we will be using today.
# install.packages('gutenbergr')
library(gutenbergr)
library(tidyverse)
library(wordcloud)
library(wordcloud2)
library(tidytext)
library(stringr)
library(topicmodels)
library(data.table)
library(textdata)
library(cleanNLP)
cnlp_init_udpipe()
Download Darwin’s text and examine the structure of this dataset. How are the data organized?
darwin <- gutenberg_download(1228)
str(darwin)
tibble [16,188 × 2] (S3: tbl_df/tbl/data.frame)
$ gutenberg_id: int [1:16188] 1228 1228 1228 1228 1228 1228 1228 1228 1228 1228 ...
$ text : chr [1:16188] "Click on any of the filenumbers below to quickly view each ebook." "" "1228 1859, First Edition" "22764 1860, Second Edition" ...
Word frequencies can tell us more about the common topics appearing in a text as well as the type of language used. When working with texts from multiple sources or with multiple sections, you can also compare word frequency across these different sources.
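As a sketch of the cross-source comparison mentioned above, the code below compares word proportions between two tokenized word lists. The toy tibbles text_a and text_b are stand-ins for real data, such as the output of unnest_tokens() on two different books.

```r
library(dplyr)
library(tidyr)

# toy word lists standing in for two tokenized texts
text_a <- tibble(word = c("species", "species", "selection", "islands"))
text_b <- tibble(word = c("species", "culture", "culture", "islands"))

# stack the sources, count words within each, convert counts to
# proportions, and spread the sources into side-by-side columns
frequency <- bind_rows(
    mutate(text_a, source = "a"),
    mutate(text_b, source = "b")
) %>%
    count(source, word) %>%
    group_by(source) %>%
    mutate(proportion = n / sum(n)) %>%
    ungroup() %>%
    select(-n) %>%
    pivot_wider(names_from = source, values_from = proportion)

frequency
```

Words that appear in only one source show up as NA in the other column, which is itself informative when comparing vocabularies.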
Using the unnest_tokens() function, extract the
individual words from Darwin and create a table sorting the top 50 words
by count. Hint: Try the count() function.
What do you notice about the top words? Why do you think these words appear at the top of the list?
## make into individual words
allwords <- darwin %>%
unnest_tokens(output = word, input = text)
# notice that this also converts the text to lowercase and removes punctuation
# look at the top 50 words in the document
allwords %>%
count(word, sort = T) %>%
top_n(50)
# A tibble: 50 × 2
word n
<chr> <int>
1 the 10294
2 of 7856
3 and 4440
4 in 4015
5 to 3606
6 a 2469
7 that 2083
8 have 1763
9 be 1656
10 as 1591
# ℹ 40 more rows
Many of the top words are what are called stop words:
words that add little information to our analysis. These include words
like “is”, “the”, and “so” that do not improve our
understanding of the overall topics or themes in a text.
Tidytext has a built-in dictionary of stop words, making it
easy to quickly remove them from the text.
# look at the words in the stop_words dataset
data(stop_words)
stop_words %>%
top_n(50)
# A tibble: 174 × 2
word lexicon
<chr> <chr>
1 i snowball
2 me snowball
3 my snowball
4 myself snowball
5 we snowball
6 our snowball
7 ours snowball
8 ourselves snowball
9 you snowball
10 your snowball
# ℹ 164 more rows
# remove stop words from the text
darwin_tidy <- allwords %>%
anti_join(stop_words)
# look at the structure
str(darwin_tidy)
tibble [57,877 × 2] (S3: tbl_df/tbl/data.frame)
$ gutenberg_id: int [1:57877] 1228 1228 1228 1228 1228 1228 1228 1228 1228 1228 ...
$ word : chr [1:57877] "click" "filenumbers" "quickly" "view" ...
Now we can look at the number of unique words and their counts in Darwin, without interference from stop words.
# how many unique words are there?
length(unique(darwin_tidy$word))
[1] 6910
## look at top 50 words
darwin_tidy %>%
count(word, sort = TRUE) %>%
top_n(50) %>%
mutate(word = reorder(word, n)) %>%
data.frame()
word n
1 species 1540
2 varieties 434
3 selection 412
4 forms 401
5 natural 384
6 plants 335
7 life 306
8 animals 297
9 nature 261
10 distinct 254
11 period 244
12 structure 224
13 conditions 222
14 genera 220
15 common 197
16 time 194
17 degree 191
18 characters 183
19 view 169
20 birds 167
21 instance 165
22 modification 165
23 intermediate 160
24 closely 158
25 modified 153
26 genus 149
27 islands 144
28 individuals 143
29 inhabitants 143
30 produced 143
31 organic 142
32 domestic 141
33 parent 141
34 character 139
35 world 138
36 theory 137
37 organs 136
38 differences 135
39 breeds 133
40 hybrids 133
41 allied 131
42 manner 127
43 found 126
44 habits 125
45 slight 120
46 seeds 117
47 land 116
48 water 115
49 productions 114
50 amount 113
51 widely 113
darwin_tidy %>%
count(word, sort = TRUE) %>%
top_n(50) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) + geom_col() + xlab(NULL) + coord_flip() + labs(y = "Count",
x = "Unique words", title = "Top Words in On the Origin of Species")
Wordclouds can quickly illustrate common themes in text. Here we use
the wordcloud
and wordcloud2
packages to make wordclouds.
darwin_tidy %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
darwin_wordcounts <- darwin_tidy %>%
count(word, sort = T)
wordcloud2(data = darwin_wordcounts)
We can also analyze pairs of words (bigrams). This can be useful for understanding the context around particular words, as well as for identifying themes that span multiple words (e.g. “climate change”, “public health”).
darwin_bigrams <- darwin %>%
unnest_tokens(output = bigrams, input = text, token = "ngrams", n = 2)
str(darwin_bigrams)
tibble [143,751 × 2] (S3: tbl_df/tbl/data.frame)
$ gutenberg_id: int [1:143751] 1228 1228 1228 1228 1228 1228 1228 1228 1228 1228 ...
$ bigrams : chr [1:143751] "click on" "on any" "any of" "of the" ...
## look at counts for each pair
darwin_bigrams %>%
count(bigrams, sort = TRUE) %>%
top_n(50)
# A tibble: 50 × 2
bigrams n
<chr> <int>
1 of the 1937
2 <NA> 1663
3 in the 1042
4 the same 700
5 on the 627
6 to the 552
7 that the 424
8 have been 398
9 it is 303
10 in a 266
# ℹ 40 more rows
Again, we have a challenge with stop words here. One way to remove them is to take out any bigram that contains a stop word:
# separate words to pull out stop words
separated_words <- darwin_bigrams %>%
separate(bigrams, c("word1", "word2"), sep = " ")
# filter out stop words
darwin_bigrams <- separated_words %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
Make a table of the top 100 bigrams sorted from most to least
frequent. Hint: You can apply count() to multiple
columns.
Pull out all bigrams where “species” is the second term and make a table of the most common bigrams in this subset.
Pull out all bigrams where “water” is either the first or second term and make a table of the most common bigrams in this subset.
What does this analysis tell you about this text? Can you think of any data in your own research that would benefit from ngram analysis?
darwin_bigrams_count <- darwin_bigrams %>%
count(word1, word2, sort = TRUE)
# top 100 pairs of words
head(darwin_bigrams_count, 100) %>%
data.frame()
word1 word2 n
1 <NA> <NA> 1663
2 natural selection 250
3 distinct species 65
4 closely allied 59
5 fresh water 45
6 allied species 37
7 glacial period 36
8 south america 35
9 physical conditions 34
10 domestic animals 28
11 common parent 27
12 geological record 27
13 modified descendants 27
14 rudimentary organs 27
15 intermediate varieties 26
16 rock pigeon 26
17 parent species 25
18 north america 24
19 oceanic islands 24
20 hive bee 23
21 geographical distribution 22
22 natural history 21
23 secondary sexual 21
24 closely related 19
25 allied forms 18
26 land shells 18
27 larger genera 18
28 sexual selection 18
29 successive generations 18
30 domestic breeds 17
31 independently created 17
32 dr hooker 16
33 individual differences 16
34 reproductive system 16
35 species belonging 16
36 common progenitor 15
37 domestic varieties 15
38 natural system 15
39 de candolle 14
40 doubtful forms 14
41 extinct species 14
42 incipient species 14
43 direct action 13
44 dominant species 13
45 marked varieties 13
46 reciprocal crosses 13
47 representative species 13
48 sexual characters 13
49 alph de 12
50 consecutive formations 12
51 distinct genera 12
52 domestic productions 12
53 external conditions 12
54 geological period 12
55 la plata 12
56 malay archipelago 12
57 occasional means 12
58 parent form 12
59 physiological importance 12
60 reproductive systems 12
61 slight modifications 12
62 specific forms 12
63 systematic affinity 12
64 domestic races 11
65 extinct forms 11
66 humble bees 11
67 lesser degree 11
68 ordinary view 11
69 recent period 11
70 short faced 11
71 single species 11
72 specifically distinct 11
73 strongly marked 11
74 galapagos archipelago 10
75 geological formations 10
76 intermediate links 10
77 mountain ranges 10
78 perfectly fertile 10
79 sea water 10
80 slight degree 10
81 southern hemisphere 10
82 terrestrial mammals 10
83 complex relations 9
84 degree intermediate 9
85 existing species 9
86 favourable variations 9
87 geological succession 9
88 glacial epoch 9
89 intermediate gradations 9
90 living species 9
91 organic remains 9
92 selection acts 9
93 separated sexes 9
94 slight differences 9
95 slow process 9
96 successive slight 9
97 tail feathers 9
98 tierra del 9
99 water productions 9
100 webbed feet 9
# where 'species' is the second term
speciesbigram <- darwin_bigrams %>%
filter(word2 == "species")
speciesbigram %>%
count(word1, word2, sort = TRUE) %>%
top_n(20)
# A tibble: 20 × 3
word1 word2 n
<chr> <chr> <int>
1 distinct species 65
2 allied species 37
3 parent species 25
4 extinct species 14
5 incipient species 14
6 dominant species 13
7 representative species 13
8 single species 11
9 existing species 9
10 living species 9
11 peculiar species 7
12 pure species 7
13 varying species 7
14 aboriginal species 6
15 alpine species 6
16 doubtful species 6
17 fossil species 6
18 original species 6
19 ranging species 6
20 true species 6
## look at words that appear next to the word 'water'
waterbigram <- darwin_bigrams %>%
filter(word1 == "water" | word2 == "water")
waterbigram %>%
count(word1, word2, sort = TRUE) %>%
top_n(20)
# A tibble: 43 × 3
word1 word2 n
<chr> <chr> <int>
1 fresh water 45
2 sea water 10
3 water productions 9
4 salt water 8
5 water fish 6
6 water shells 5
7 water plants 4
8 water beetle 3
9 water birds 3
10 water lily 3
# ℹ 33 more rows
Texts often contain certain emotions, feelings, or sentiments that can tell us more about what they mean. In a way, coding sentiments is similar to the qualitative research method of coding fieldnotes for themes. Because of this, you can develop your own custom lexicon for your research context.
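A custom lexicon is just a two-column table of words and codes, which can then be joined to a word list the same way we join the NRC lexicon below. The words and codes here are invented for illustration; in practice you would build the table from your own codebook of themes.

```r
library(dplyr)

# a tiny custom lexicon for illustration; in practice this would come
# from your own codebook for the research context
my_lexicon <- tribble(
    ~word,       ~sentiment,
    "species",   "taxonomy",
    "varieties", "taxonomy",
    "islands",   "geography",
    "water",     "geography"
)

# tag a word list with the custom codes via inner_join(); words not in
# the lexicon (here, "theory") simply drop out
words <- tibble(word = c("species", "water", "theory", "islands"))
tagged <- words %>%
    inner_join(my_lexicon, by = "word")

tagged
```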
We’ll work with the NRC Emotion Lexicon. First, we load the NRC lexicon and look at the different types of sentiments it contains.
# load the nrc sentiment dictionary
get_sentiments("nrc")
# A tibble: 13,872 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# ℹ 13,862 more rows
nrcdf <- get_sentiments("nrc")
# take a look at the top sentiments that occur in the lexicon
nrcdf %>%
count(sentiment, sort = T)
# A tibble: 10 × 2
sentiment n
<chr> <int>
1 negative 3316
2 positive 2308
3 fear 1474
4 anger 1245
5 trust 1230
6 sadness 1187
7 disgust 1056
8 anticipation 837
9 joy 687
10 surprise 532
Using inner_join() we can combine the sentiments with
the words from Darwin, effectively “tagging” each word with a particular
sentiment.
# merge sentiments to Darwin data
darwin_sentiment <- darwin_tidy %>%
inner_join(get_sentiments("nrc"))
With the new merged and tagged dataframe, make a table of the top words in Darwin that are associated with the sentiment “trust” and one other sentiment of choice. Reflect on how you might interpret these results. Do you find this information useful? Is there any place you could see sentiment analysis being useful in your own research?
# look at the top words associated with trust
darwin_sentiment %>%
filter(sentiment == "trust") %>%
count(word, sort = T) %>%
top_n(20)
# A tibble: 21 × 2
word n
<chr> <int>
1 structure 224
2 theory 137
3 allied 131
4 found 126
5 doubt 108
6 related 82
7 perfect 79
8 inheritance 63
9 system 62
10 true 60
# ℹ 11 more rows
# pick another sentiment and pull out the top 20 words associated with this
# sentiment.
darwin_sentiment %>%
filter(sentiment == "surprise") %>%
count(word, sort = T) %>%
top_n(20)
# A tibble: 21 × 2
word n
<chr> <int>
1 variable 72
2 inheritance 63
3 larger 59
4 chance 54
5 wild 51
6 remarkable 48
7 occasional 43
8 tree 38
9 expect 31
10 suddenly 31
# ℹ 11 more rows
darwin_sentiment %>%
filter(sentiment == "positive") %>%
count(word, sort = T) %>%
top_n(20)
# A tibble: 20 × 2
word n
<chr> <int>
1 structure 224
2 degree 191
3 organic 142
4 allied 131
5 found 126
6 land 116
7 sea 107
8 importance 103
9 action 84
10 perfect 79
11 existence 78
12 increase 76
13 reason 73
14 advantage 67
15 inheritance 63
16 marked 62
17 true 60
18 growth 58
19 improved 57
20 fertile 55
Now that we’ve learned a bit about text analysis, let’s test our skills on a real-world dataset. Here we will use data from a survey in two Inupiaq villages in Alaska to examine how individuals in these communities feel about climate change and thawing permafrost. These data are drawn from William B. Bowden (2013), “Perceptions and implications of thawing permafrost and climate change in two Inupiaq villages of arctic Alaska”. Let’s examine the responses to two open-ended questions: (Q5) “What is causing it [permafrost around X village] to change?” and (Q69) “What feelings do you have when thinking about the possibility of future climate change in and around [village name]?”
First we load the data and subset out the columns of interest.
# we will work with the permafrost survey data.
surv <- read.csv("https://maddiebrown.github.io/ANTH630/data/Survey_AKP-SEL.csv",
stringsAsFactors = F)
surv_subset <- surv %>%
select(Village, Survey.Respondent, Age.Group, X69..Feelings, X5..PF.Cause.)
Then we can quickly calculate the most frequent terms across all 80 responses.
class(surv$X69..Feelings) #make sure your column is a character variable
[1] "character"
surv_tidy <- surv_subset %>%
unnest_tokens(word, X69..Feelings) %>%
anti_join(stop_words)
# what are most common words?
feelingswordcount <- surv_tidy %>%
count(word, sort = T)
Make wordclouds of the word frequency in responses about feelings related to climate change using two different methods.
surv_tidy %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
wordcloud2(data = feelingswordcount)
Topic modeling allows us to identify themes in text without needing to know in advance which themes or groupings we expect to emerge. This can be very useful when you have large volumes of messy data, or data from multiple sources, that you need to parse. We will use Latent Dirichlet allocation (LDA), following the explanation in Text Mining with R.
Before we can identify themes across responses, however, we need to make sure each “document” (or “response”, in our case) has a unique identifier. We will make a new primary key called “ID” with a different value for each unique response.
surv_subset %>%
select(Village, Survey.Respondent)
Village Survey.Respondent
1 AKP 1
2 AKP 2
3 AKP 3
4 AKP 4
5 AKP 5
6 AKP 6
7 AKP 7
8 AKP 8
9 AKP 9
10 AKP 10
11 AKP 11
12 AKP 12
13 AKP 13
14 AKP 14
15 AKP 15
16 AKP 16
17 AKP 17
18 AKP 18
19 AKP 19
20 AKP 20
21 AKP 21
22 AKP 22
23 AKP 23
24 AKP 24
25 AKP 25
26 AKP 26
27 AKP 27
28 AKP 28
29 AKP 29
30 AKP 30
31 AKP 31
32 AKP 32
33 AKP 33
34 AKP 34
35 AKP 35
36 AKP 36
37 AKP 37
38 AKP 38
39 AKP 39
40 SEL 1
41 SEL 2
42 SEL 3
43 SEL 4
44 SEL 5
45 SEL 6
46 SEL 7
47 SEL 8
48 SEL 9
49 SEL 10
50 SEL 11
51 SEL 12
52 SEL 13
53 SEL 14
54 SEL 15
55 SEL 16
56 SEL 17
57 SEL 18
58 SEL 19
59 SEL 20
60 SEL 21
61 SEL 22
62 SEL 23
63 SEL 24
64 SEL 25
65 SEL 26
66 SEL 27
67 SEL 28
68 SEL 29
69 SEL 30
70 SEL 31
71 SEL 32
72 SEL 33
73 SEL 34
74 SEL 35
75 SEL 36
76 SEL 37
77 SEL 38
78 SEL 39
79 SEL 40
80 SEL 41
surv_subset <- surv_subset %>%
mutate(ID = paste(Village, Survey.Respondent, sep = "_"))
### let's make a new surv_tidy object that incorporates the new ID we made
surv_tidy <- surv_subset %>%
unnest_tokens(word, X69..Feelings) %>%
anti_join(stop_words)
surv_tidy %>%
count(word, ID, sort = T) %>%
top_n(5)
word ID n
1 days AKP_26 3
2 april AKP_17 2
3 caribou AKP_32 2
4 change AKP_5 2
5 cold AKP_1 2
6 future AKP_18 2
7 march AKP_17 2
8 months AKP_17 2
9 permafrost AKP_3 2
10 weather AKP_17 2
11 weather SEL_21 2
12 worry AKP_18 2
The first step in creating a topic model is to count the number of
times each word appears in each individual document (or response in our
case). Luckily, we can count by two variables using the
count() function. Let’s create a new
byresponse variable.
byresponse <- surv_tidy %>%
count(ID, word, sort = T) %>%
ungroup()
# check how many responses are included in the analysis. this allows you to
# double check that the new unique identifier we made worked as expected.
unique(byresponse$ID)
[1] "AKP_26" "AKP_1" "AKP_17" "AKP_18" "AKP_3" "AKP_32" "AKP_5" "SEL_21"
[9] "AKP_10" "AKP_11" "AKP_12" "AKP_13" "AKP_14" "AKP_15" "AKP_16" "AKP_19"
[17] "AKP_2" "AKP_20" "AKP_22" "AKP_23" "AKP_24" "AKP_25" "AKP_27" "AKP_28"
[25] "AKP_29" "AKP_30" "AKP_31" "AKP_33" "AKP_34" "AKP_35" "AKP_36" "AKP_37"
[33] "AKP_38" "AKP_4" "AKP_6" "AKP_7" "AKP_8" "AKP_9" "SEL_1" "SEL_10"
[41] "SEL_11" "SEL_13" "SEL_14" "SEL_15" "SEL_17" "SEL_18" "SEL_19" "SEL_2"
[49] "SEL_20" "SEL_22" "SEL_24" "SEL_25" "SEL_26" "SEL_27" "SEL_29" "SEL_3"
[57] "SEL_30" "SEL_31" "SEL_32" "SEL_33" "SEL_34" "SEL_35" "SEL_36" "SEL_37"
[65] "SEL_38" "SEL_39" "SEL_4" "SEL_40" "SEL_5" "SEL_6" "SEL_7" "SEL_8"
[73] "SEL_9"
length(unique(byresponse$ID))
[1] 73
Now we can convert our long-form word list into a document-term matrix.
surv_dtm <- byresponse %>%
cast_dtm(ID, word, n)
# ?cast_dtm #read up on how this function works
Run the LDA() function and choose a number of topics. In this case, let’s try it with k = 2.
surv_lda <- LDA(surv_dtm, k = 2, control = list(seed = 9999))
# look at our output
# str(surv_lda)
# examine the probability that each word is in a particular topic group
surv_topics <- tidy(surv_lda, matrix = "beta")
head(surv_topics)
# A tibble: 6 × 3
topic term beta
<int> <chr> <dbl>
1 1 days 0.00885
2 2 days 0.00864
3 1 cold 0.00540
4 2 cold 0.0296
5 1 april 0.0102
6 2 april 0.00140
Examine the top words for each topic identified by the model.
top_words <- surv_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, desc(beta))
top_words
# A tibble: 20 × 3
topic term beta
<int> <chr> <dbl>
1 1 climate 0.0446
2 1 change 0.0363
3 1 animals 0.0324
4 1 sad 0.0258
5 1 adapt 0.0204
6 1 caribou 0.0196
7 1 scary 0.0180
8 1 blank 0.0173
9 1 river 0.0159
10 1 future 0.0145
11 2 cold 0.0296
12 2 weather 0.0276
13 2 worried 0.0248
14 2 nc 0.0224
15 2 change 0.0220
16 2 future 0.0205
17 2 dk 0.0181
18 2 people 0.0175
19 2 land 0.0172
20 2 scary 0.0170
We can also examine the results graphically.
# plot these top words for each topic (adapted from
# https://www.tidytextmining.com/topicmodeling.html)
top_words %>%
group_by(topic) %>%
mutate(term = fct_reorder(term, beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) +
facet_wrap(~topic, scales = "free") + theme_minimal()
Repeat the topic modeling analysis, but using six topics instead of two.
## repeat the analysis but with 6 topics
surv_lda6 <- LDA(surv_dtm, k = 6, control = list(seed = 9999))
# examine the probability that each word is in a particular topic group
surv_topics6 <- tidy(surv_lda6, matrix = "beta")
surv_topics6
# A tibble: 1,260 × 3
topic term beta
<int> <chr> <dbl>
1 1 days 4.88e- 2
2 2 days 9.19e-262
3 3 days 1.69e-269
4 4 days 1.17e-270
5 5 days 5.77e-267
6 6 days 4.77e-268
7 1 cold 1.63e- 2
8 2 cold 5.81e- 2
9 3 cold 1.68e- 2
10 4 cold 4.86e-235
# ℹ 1,250 more rows
# examine top words for each topic
top_words6 <- surv_topics6 %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, desc(beta))
# plot these top words for each topic (adapted from
# https://www.tidytextmining.com/topicmodeling.html)
top_words6 %>%
group_by(topic) %>%
mutate(term = fct_reorder(term, beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) +
facet_wrap(~topic, scales = "free_y") + theme_minimal()
In this case, our sample is small, so topic modeling is not necessarily the best method to use. However, even from this small sample, you can see that some topics emerge from the text that were not previously apparent.
Sometimes you’ll need to edit text or strings manually. For example, you may find that for your research question you are less interested in differentiating between the terms running, run, and runner than in identifying clusters of beliefs about running as a more general concept. On the other hand, you might want to differentiate between runners and running as beliefs about groups of people versus the act of running. How you choose to transform text data depends on your research questions and your understanding of the cultural context.
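If you do want to collapse inflected forms like running and runs to a common stem, a stemmer is one option. Here is a minimal sketch assuming the SnowballC package (not loaded above) is installed:

```r
library(SnowballC)

# stem each word down to a common root with the Snowball (Porter2) stemmer
words <- c("running", "runs", "run", "runner")
wordStem(words, language = "english")
# note: stemming is mechanical; "runner" may not collapse to "run",
# so inspect the results before relying on them
```

Whether stemming helps depends on your question: it merges inflections of the same concept, but it can also erase distinctions (like runner vs. running) that matter to your analysis.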
R has a number of helpful functions for manually adjusting strings. We’ll cover a few to get you started. Let’s go back to the permafrost and climate change survey and look at responses to: (Q5) What is causing it [permafrost around X village] to change?.
First let’s look at the raw data. What are some potential issues in the strings below that might make text analysis difficult or ineffective?
surv$X5..PF.Cause.[10:30]
[1] "climate change"
[2] "too warm"
[3] "warmer weather, shorter winters, lack of snow, fast springs. [will affect AKP because we] use Argos to go out."
[4] "weather warming"
[5] "temperature"
[6] "heat. A lot of heat."
[7] "melting of the ground - goes down"
[8] "spirited answer. Her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[9] "probably global warming"
[10] "temperature outside is not steady"
[11] "(blank)"
[12] "N/A"
[13] "most likely war weather or global warming"
[14] "not much winter - hardly get snow. Always wind blown. Summer be rain, rain, rain. Late snow."
[15] "warmer winters"
[16] "global warming"
[17] "warming weather, longer summer/fall season"
[18] "I have no idea"
[19] "warmer climate"
[20] "Seems like there's lots of rain & water causes ground to thaw. So maybe accumulated water? Maybe warm weather?"
[21] "the heat wave - winter frosts"
Luckily, we can manually adjust the strings to make them easier to analyze systematically. For example, we might convert characters to lowercase, trim white space, and remove any empty or missing rows.
# make a new column to hold the tidy data
surv$cause_tidy <- surv$X5..PF.Cause.
# make lower case
surv$cause_tidy <- tolower(surv$cause_tidy)
# remove white space at beginning and end of string
surv$cause_tidy <- trimws(surv$cause_tidy)
# filter out blank or empty rows
surv <- surv %>%
filter(surv$cause_tidy != "")
surv <- surv %>%
filter(surv$cause_tidy != "(blank)")
surv <- surv %>%
filter(surv$cause_tidy != "n/a")
We can also directly replace particular strings. Here we change some strings with typos.
surv$cause_tidy <- surv$cause_tidy %>%
str_replace("wamer", "warmer")
surv$cause_tidy <- surv$cause_tidy %>%
str_replace("lnoger", "longer")
Another common string transformation involves grouping responses
into more standardized categories. You can transform values
individually or based on exact string matches. In addition, using
%like% (from the data.table package) we can transform any string
where part of it matches a particular pattern. For example, we might
decide that any time the string “warm” appears in a response, the
overall theme of the response is “global warming”. Or, based on our
ethnographic understanding of the context, we might know that “seasonal
changes” are important causes of permafrost thaw in local cultural
models. We can then look for key terms, in this case “late” and
“early”, that allow us to rapidly recode multiple responses likely to
fit this category.
# group some responses together based on the presence of a particular string
surv <- surv %>%
mutate(cause_tidy = replace(cause_tidy, cause_tidy %like% "warm", "global warming"))
surv$cause_tidy[1:30]
[1] "environment"
[2] "exhaust"
[3] "global warming"
[4] "global warming"
[5] "hot summers, early springs. in super cold winters the ground comes up & cracks and water comes out."
[6] "global warming"
[7] "climate change"
[8] "freezing & thawing in fall & spring"
[9] "global warming"
[10] "climate change"
[11] "global warming"
[12] "global warming"
[13] "global warming"
[14] "temperature"
[15] "heat. a lot of heat."
[16] "melting of the ground - goes down"
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"
[19] "temperature outside is not steady"
[20] "global warming"
[21] "not much winter - hardly get snow. always wind blown. summer be rain, rain, rain. late snow."
[22] "global warming"
[23] "global warming"
[24] "global warming"
[25] "i have no idea"
[26] "global warming"
[27] "global warming"
[28] "the heat wave - winter frosts"
[29] "global warming"
[30] "global warming"
surv <- surv %>%
mutate(cause_tidy = replace(cause_tidy, cause_tidy %like% "early" | cause_tidy %like%
"late", "seasonal changes"))
surv$cause_tidy[1:30]
[1] "environment"
[2] "exhaust"
[3] "global warming"
[4] "global warming"
[5] "seasonal changes"
[6] "global warming"
[7] "climate change"
[8] "freezing & thawing in fall & spring"
[9] "global warming"
[10] "climate change"
[11] "global warming"
[12] "global warming"
[13] "global warming"
[14] "temperature"
[15] "heat. a lot of heat."
[16] "melting of the ground - goes down"
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"
[19] "temperature outside is not steady"
[20] "global warming"
[21] "seasonal changes"
[22] "global warming"
[23] "global warming"
[24] "global warming"
[25] "i have no idea"
[26] "global warming"
[27] "global warming"
[28] "the heat wave - winter frosts"
[29] "global warming"
[30] "global warming"
# compare the original with your categorizations
# surv %>%
#     select(X5..PF.Cause., cause_tidy)
We won’t get into too much detail today, but you can also search and
select string data using regular
expressions. You can read more in R4DS.
Here let’s use str_detect() to pull out some strings with
regular expressions.
# any responses ending in 'ing'
surv$cause_tidy[str_detect(surv$cause_tidy, "ing$")]
[1] "global warming" "global warming"
[3] "global warming" "freezing & thawing in fall & spring"
[5] "global warming" "global warming"
[7] "global warming" "global warming"
[9] "global warming" "global warming"
[11] "global warming" "global warming"
[13] "global warming" "global warming"
[15] "global warming" "global warming"
[17] "global warming" "global warming"
[19] "global warming" "global warming"
[21] "global warming" "global warming"
[23] "global warming" "global warming"
[25] "global warming" "global warming"
[27] "global warming" "global warming"
[29] "global warming" "global warming"
[31] "global warming" "global warming"
[33] "global warming" "global warming"
[35] "global warming"
# any responses that contain a 'w' followed by either an 'e' or an 'a'
surv$cause_tidy[str_detect(surv$cause_tidy, "w[ea]")]
[1] "global warming"
[2] "global warming"
[3] "global warming"
[4] "global warming"
[5] "global warming"
[6] "global warming"
[7] "global warming"
[8] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[9] "global warming"
[10] "global warming"
[11] "global warming"
[12] "global warming"
[13] "global warming"
[14] "global warming"
[15] "global warming"
[16] "the heat wave - winter frosts"
[17] "global warming"
[18] "global warming"
[19] "weather"
[20] "global warming"
[21] "global warming"
[22] "global warming"
[23] "global warming"
[24] "global warming"
[25] "weather."
[26] "global warming"
[27] "global warming"
[28] "global warming"
[29] "global warming"
[30] "global warming"
[31] "global warming"
[32] "global warming"
[33] "global warming"
[34] "global warming"
[35] "global warming"
[36] "global warming"
[37] "global warming"
[38] "global warming"
# any responses that contain the string erosion
surv$cause_tidy[str_detect(surv$cause_tidy, "erosion")]
[1] "erosion, and real hot summers and a lot of snow & rain."
[2] "mud goes down river, cracking all along & falling in - erosion"
[3] "ground erosion"
[4] "erosion"
# any responses that contain the string erosion, but which have any character
# occurring before the word erosion.
surv$cause_tidy[str_detect(surv$cause_tidy, ".erosion")]
[1] "mud goes down river, cracking all along & falling in - erosion"
[2] "ground erosion"
Regular expressions are hugely useful for quickly searching through and transforming large volumes of string data; we’ve only scratched the surface today.
Whenever you transform large volumes of data using string detection and regular expressions, it is critical to double-check that each operation actually works as you expect. Pay attention to the order of transformations as well, so that later steps do not overwrite earlier ones.
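One lightweight way to double-check a recoding step is to keep the original column, apply the transformation to a copy, and compare the two side by side. A minimal sketch, using toy responses rather than the survey data:

```r
library(dplyr)
library(stringr)

# toy responses standing in for a raw survey column
responses <- tibble(answer = c("warmer weather", "erosion", "too warm"))

# recode into a copy so the original values are preserved
responses <- responses %>%
    mutate(answer_tidy = if_else(str_detect(answer, "warm"),
                                 "global warming", answer))

# a side-by-side comparison shows exactly which rows were recoded
responses %>%
    filter(answer != answer_tidy)
```

Reviewing only the changed rows makes it easy to spot false matches (e.g. a pattern catching a word you did not intend) before the original column is discarded.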
Sometimes it is useful to create flags or indicator variables in your data. These allow you to quickly filter out rows with particular characteristics. For example, we can create a new binary column that indicates whether or not a response refers to global warming. This variable can then be used for further grouping, data visualization, or other tasks.
surv <- surv %>%
mutate(GlobalWarmingYN = str_detect(cause_tidy, "global warming"))
table(surv$GlobalWarmingYN) # how many responses contain the string global warming?
FALSE TRUE
37 34
Now that we’ve learned a bit about wrangling text data in R, let’s combine these skills with what we learned previously about cultural domain analysis.
For this example, we will use Q8: “What do you think are the 3 biggest issues that will result from thawing permafrost in this area?” from the Permafrost dataset.
The first thing to consider is how to transform the dataset into the right format for conducting freelist analysis. These responses are already ordered from 1-3, but they are organized into three different columns instead of one. We can pivot the data from wide to long format and then add in a new column that accounts for the rank of the responses.
# first create a unique identifier
surv <- surv %>%
mutate(ID = paste(Village, Survey.Respondent, sep = "_"))
# select out only the columns related to the permafrost question
pfissues <- surv %>%
select(ID, Village, X8.1..PF.Issue, X8.2..PF.Issue, X8.3..PF.Issue)
# pivot the data from wide to long format
pfissues_long <- pivot_longer(pfissues, c(X8.1..PF.Issue, X8.2..PF.Issue, X8.3..PF.Issue),
names_to = "rank", values_to = "issue")
# add in a new column for the rank of each response.
pfissues_long <- pfissues_long %>%
mutate(rank = case_when(rank == "X8.1..PF.Issue" ~ "1", rank == "X8.2..PF.Issue" ~
"2", TRUE ~ "3"))
Now that our data are in a long, ranked format, we are ready to analyze the responses.
Make a table of the number of responses for each issue resulting from permafrost thaw. Why is this difficult to interpret?
List all the unique issues mentioned. Can any of these be meaningfully combined?
# table(pfissues_long$issue)
# look at all the unique issues mentioned
# unique(pfissues_long$issue)
We can clean up the text responses in several ways. First, we can remove missing values, convert text to lowercase and remove extra whitespace. Next, we will begin interpreting which responses can be meaningfully grouped together.
The code below begins the text cleaning process by removing missing values.
# set missing values to NA
pfissues_long$issue[pfissues_long$issue == ""] <- NA
pfissues_long$issue[pfissues_long$issue == "(blank)"] <- NA
pfissues_long$issue[pfissues_long$issue == "DK"] <- NA
# filter out rows with missing values
pfissues_long <- pfissues_long %>%
filter(!is.na(issue))
# make a new column to hold the tidy data
pfissues_long$issue_tidy <- pfissues_long$issue
Let’s perform several more text cleaning operations. If you are unfamiliar with the functions used, take a look at their help files to figure out their syntax.
tolower() to convert responses to lowercase.
trimws() to remove extra white space in responses.
count() and arrange() to view the counts of responses.
# make lower case
pfissues_long$issue_tidy <- tolower(pfissues_long$issue_tidy)
# remove white space at beginning and end of string
pfissues_long$issue_tidy <- trimws(pfissues_long$issue_tidy)
# look at unique values and their counts
pfissues_long %>%
count(issue_tidy) %>%
arrange(desc(n))
# A tibble: 164 × 2
issue_tidy n
<chr> <int>
1 erosion 7
2 flooding 7
3 water levels rising 3
4 berry picking 2
5 housing 2
6 rivers will change course 2
7 water 2
8 [that's all] 1
9 a lot of land slides and erosion - especially with rain. putuniq used … 1
10 affect caribou 1
# ℹ 154 more rows
Now the text of the permafrost issue responses is cleaned, but we still have many different responses getting at the same underlying issues. Some of this is due to parts of speech, but much of it comes from the open-ended nature of the responses. This is where the researcher can start to group responses together using their knowledge of the research context.
For example, the code below groups all responses with the word “caribou” into a single word response. Below that is a more complicated version converting many different strings into the single “subsistence” category.
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "caribou", "caribou"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "hunt" | issue_tidy %like%
"subsistence" | issue_tidy %like% "food" | issue_tidy %like% "berr" | issue_tidy %like%
"fishing" | issue_tidy %like% "eat" | issue_tidy %like% "hungry", "subsistence"))
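The %like% operator comes from data.table and is a thin wrapper around grepl(), so it performs a case-sensitive regular-expression match on each element. A quick sketch with invented strings shows why we lowercased the responses before this recoding step:

```r
library(data.table)

# hypothetical responses
x <- c("caribou hunting", "Caribou migration", "fishing")

# only the all-lowercase string matches "caribou";
# "Caribou migration" would be missed without tolower() first
x %like% "caribou"
```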
Take a look at the permafrost issue responses and consider which
major categories exist across the responses. Following the syntax
outlined above, make 3-4 more categories to group the responses
together. Finally, take a look at the results by modifying the following
code:
pfissues_long %>% count(issue_tidy) %>% arrange(desc(n)) %>% filter(n>1).
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "erosion" | issue_tidy %like%
"erode" | issue_tidy %like% "eroding" | issue_tidy %like% "land slid", "erosion"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "migration" | issue_tidy %like%
"migrating", "migration"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "flood" | issue_tidy %like%
"water" | issue_tidy %like% "level", "water"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "animal" | issue_tidy %like%
"wildlife", "animals"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "weather" | issue_tidy %like%
"wind" | issue_tidy %like% "rain" | issue_tidy %like% "temperature" | issue_tidy %like%
"snow" | issue_tidy %like% "wet" | issue_tidy %like% "dry" | issue_tidy %like%
"hot", "weather and climate"))
# look at counts
pfissues_long %>%
count(issue_tidy) %>%
arrange(desc(n)) %>%
filter(n > 1)
# A tibble: 9 × 2
issue_tidy n
<chr> <int>
1 water 30
2 subsistence 22
3 erosion 16
4 weather and climate 10
5 caribou 6
6 animals 4
7 migration 4
8 housing 2
9 rivers will change course 2
Now that we’ve cleaned up the data and grouped responses into
categories, we can conduct a salience analysis. The code below follows
the same format we learned previously to analyze freelists with
AnthroTools. Your results will look different from the ones
below since you will have created different protocols for combining
responses.
pfissues_long.df <- as.data.frame(pfissues_long)
pfissues_long.df$rank <- as.numeric(pfissues_long.df$rank)
library(AnthroTools)
IssuesSalience <- CalculateSalience(pfissues_long.df, Order = "rank", Subj = "ID",
CODE = "issue_tidy", Salience = "Salience")
# look at results
# IssuesSalience
SalienceByIssue <- SalienceByCode(IssuesSalience, Salience = "Salience", Subj = "ID",
CODE = "issue_tidy", dealWithDoubles = "MAX")
CODE
1 water
2 hard to travel
3 migration
4 stress holes
5 more permafrost fractures
6 building issues
7 travel
8 housing
9 release of methane gas & particles
10 no more stable ground
11 rivers will change course
12 transportation
13 weather and climate
14 subsistence
15 changing roadways
16 might have to fix the river if it flows again
17 caribou
18 thaws too early in springtime & hard to travel
19 [that's all]
20 overflow
21 dangerous glaciers
22 our argo trails will be affected
23 sinking land - have to look for different trails that are more solid.
24 methane escaping
25 animals
26 travel & trails
27 travel - trails
28 hard time walking
29 changing of landscape
30 changing of the seasons
31 tundra might be not as sturdy on top, maybe more soft or swampy
32 ground coming up off the ground
33 ground cave-ins, sink-holes
34 sinkholes
35 affect vegetations or surface
36 methane gas
37 sink holes will be more
38 vegetation will change
39 ice changes
40 affecting plants
41 might change some feeding sources
42 river patterns would be changing
43 maybe ground would be changing
44 air
45 sunlight
46 our building, homes, runways, bridge always be monitored
47 the more studies and how it will affect our community on thawing permafrost
48 ground shift
49 houses - foundations have to be worked on
50 bacteria & viruses that will thaw out of it' different from what we're used to
51 erosion
52 old bacteria rising from the ground
53 bodies buried in permafrost lead to disease
54 both sides of island narrowing
55 repair homes & foundations
56 houses buckling
57 camps falling into river
58 building houses - now on marshy ground
59 some extent whitefish, sheefish - not so much pike or lingcod. burbot about 1/10 what they used to be in fish creek.
60 pollution
61 digging graves is easier.
62 we're going to sink - lik putuniq.
63 bad all over - sinking.
64 bridge stability - won't be there in a couple years, and other infrastructure.
65 not enough good ground for new homes.
66 unpredictable conditions - learn new ways to protect ourselves. rhubarb picking areas are falling into river. rhubarb only grows where it is protected from the severe cold.
67 plan for moving, maybe. i'm not worried.
68 whatever happen will happen.
69 move to higher ground. road conditions & trails used to be firmer. also 4-wheelers tearing up trails.
70 by seining time the small creeks are muddy the past 2-3 years - maybe because of beavers? used to be real clear - even like soap bubbles.
71 fish changing - strong mud taste the past 3-4 years. used to have a lot of grayling, and now almost none.
72 shifting earth
73 shifting
74 my house foundation is sinking into the ground
75 selawik bridge foundation sinking also
76 soft tundra like sinking sand but tundra
77 houses are sinking
78 sidewalks are sinking
79 ground lost
80 ruin our boardwalks and homes
81 soft ground
82 trails are getting bad
83 ground is unstable
84 boardwalks sinking
85 more muddy areas
86 sinking in the tundra
87 climate changing faster every year
88 early spring
89 summer
90 shifting land
91 moving community
92 houses will become unstable
93 shifting houses
94 sun
95 land
MeanSalience SumSalience SmithsS
1 0.7816092 22.6666667 0.338308458
2 0.6666667 0.6666667 0.009950249
3 0.3333333 1.3333333 0.019900498
4 1.0000000 1.0000000 0.014925373
5 0.5000000 0.5000000 0.007462687
6 1.0000000 1.0000000 0.014925373
7 0.6666667 0.6666667 0.009950249
8 0.3333333 0.6666667 0.009950249
9 1.0000000 1.0000000 0.014925373
10 0.6666667 0.6666667 0.009950249
11 0.5000000 1.0000000 0.014925373
12 1.0000000 1.0000000 0.014925373
13 0.6296296 5.6666667 0.084577114
14 0.6491228 12.3333333 0.184079602
15 1.0000000 1.0000000 0.014925373
16 0.6666667 0.6666667 0.009950249
17 0.6111111 3.6666667 0.054726368
18 0.6666667 0.6666667 0.009950249
19 0.3333333 0.3333333 0.004975124
20 1.0000000 1.0000000 0.014925373
21 0.3333333 0.3333333 0.004975124
22 0.5000000 0.5000000 0.007462687
23 1.0000000 1.0000000 0.014925373
24 0.6666667 0.6666667 0.009950249
25 0.5833333 2.3333333 0.034825871
26 0.3333333 0.3333333 0.004975124
27 1.0000000 1.0000000 0.014925373
28 0.3333333 0.3333333 0.004975124
29 1.0000000 1.0000000 0.014925373
30 0.3333333 0.3333333 0.004975124
31 1.0000000 1.0000000 0.014925373
32 1.0000000 1.0000000 0.014925373
33 0.3333333 0.3333333 0.004975124
34 1.0000000 1.0000000 0.014925373
35 0.6666667 0.6666667 0.009950249
36 0.3333333 0.3333333 0.004975124
37 0.6666667 0.6666667 0.009950249
38 0.3333333 0.3333333 0.004975124
39 1.0000000 1.0000000 0.014925373
40 0.3333333 0.3333333 0.004975124
41 1.0000000 1.0000000 0.014925373
42 1.0000000 1.0000000 0.014925373
43 0.6666667 0.6666667 0.009950249
44 1.0000000 1.0000000 0.014925373
45 0.3333333 0.3333333 0.004975124
46 1.0000000 1.0000000 0.014925373
47 0.5000000 0.5000000 0.007462687
48 1.0000000 1.0000000 0.014925373
49 1.0000000 1.0000000 0.014925373
50 0.6666667 0.6666667 0.009950249
51 0.7291667 11.6666667 0.174129353
52 1.0000000 1.0000000 0.014925373
53 0.6666667 0.6666667 0.009950249
54 0.3333333 0.3333333 0.004975124
55 0.6666667 0.6666667 0.009950249
56 0.6666667 0.6666667 0.009950249
57 0.3333333 0.3333333 0.004975124
58 0.6666667 0.6666667 0.009950249
59 0.3333333 0.3333333 0.004975124
60 0.3333333 0.3333333 0.004975124
61 1.0000000 1.0000000 0.014925373
62 1.0000000 1.0000000 0.014925373
63 1.0000000 1.0000000 0.014925373
64 0.6666667 0.6666667 0.009950249
65 0.3333333 0.3333333 0.004975124
66 0.3333333 0.3333333 0.004975124
67 1.0000000 1.0000000 0.014925373
68 0.3333333 0.3333333 0.004975124
69 1.0000000 1.0000000 0.014925373
70 1.0000000 1.0000000 0.014925373
71 0.6666667 0.6666667 0.009950249
72 0.6666667 0.6666667 0.009950249
73 0.6666667 0.6666667 0.009950249
74 1.0000000 1.0000000 0.014925373
75 0.6666667 0.6666667 0.009950249
76 0.3333333 0.3333333 0.004975124
77 1.0000000 1.0000000 0.014925373
78 0.3333333 0.3333333 0.004975124
79 0.6666667 0.6666667 0.009950249
80 1.0000000 1.0000000 0.014925373
81 1.0000000 1.0000000 0.014925373
82 1.0000000 1.0000000 0.014925373
83 0.6666667 0.6666667 0.009950249
84 0.3333333 0.3333333 0.004975124
85 1.0000000 1.0000000 0.014925373
86 1.0000000 1.0000000 0.014925373
87 0.3333333 0.3333333 0.004975124
88 1.0000000 1.0000000 0.014925373
89 0.6666667 0.6666667 0.009950249
90 0.6666667 0.6666667 0.009950249
91 0.3333333 0.3333333 0.004975124
92 1.0000000 1.0000000 0.014925373
93 0.6666667 0.6666667 0.009950249
94 0.6666667 0.6666667 0.009950249
95 0.3333333 0.3333333 0.004975124
# SalienceByIssue
SalienceByIssueS <- SalienceByIssue %>%
filter(SmithsS >= 0.015)
# SalienceByIssueS
ggplot(SalienceByIssueS, aes(x = reorder(CODE, SmithsS), y = SmithsS)) + geom_bar(stat = "identity") +
coord_flip() + ggtitle("Issue Salience") + labs(x = "Issue", y = "Smith's S") +
geom_hline(yintercept = 0.1)
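AnthroTools handles the salience arithmetic for us, but the underlying idea is simple: an item at rank r in a list of length L gets salience (L - r + 1)/L, and Smith's S for an item is the sum of its saliences across respondents divided by the number of respondents. A hand-rolled sketch on a made-up two-person freelist (this is a simplified illustration, not a substitute for CalculateSalience()):

```r
library(dplyr)

# hypothetical freelist: two respondents, ranked responses
fl <- data.frame(Subj = c("a", "a", "a", "b", "b"),
                 rank = c(1, 2, 3, 1, 2),
                 code = c("water", "erosion", "housing", "erosion", "water"))

N <- n_distinct(fl$Subj)  # number of respondents

S <- fl %>%
  group_by(Subj) %>%
  mutate(salience = (n() - rank + 1) / n()) %>%  # (L - r + 1) / L within each list
  group_by(code) %>%
  summarise(SmithsS = sum(salience) / N) %>%
  arrange(desc(SmithsS))
S
```

Here "erosion" scores highest (5/6): both respondents listed it, one at rank 1. Frequency and rank both feed into the score, which is why Smith's S is more informative than a raw count.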
We can also tag the parts of speech in a text. This allows us to focus an analysis on verbs, nouns, or other parts of speech that may be of interest. For example, in a study on sentiments, we might want to pull out adjectives in order to understand how people feel or describe a particular phenomenon. On the other hand, we might also pull out verbs in order to understand the types of actions people describe as associated with certain cultural practices or beliefs. Let’s tag the parts of speech in Darwin to learn more about the places and cultural practices documented in this book.
Using the cnlp_annotate() function we can tag the parts
of speech in Darwin. This function can take a long time to run. This is
the last thing we will do today, so feel free to let it run and then
take a break and come back to finish these problems.
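Because the annotation is slow, it is worth caching its output so you only compute it once; saveRDS() and readRDS() round-trip any R object to disk. A self-contained sketch with a stand-in object (in practice you would save the cnlp_annotate() result instead):

```r
# stand-in for a slow-to-compute result
slow_result <- list(token = data.frame(token = c("species", "variation")))

# write it once, then reload in later sessions instead of recomputing
path <- tempfile(fileext = ".rds")
saveRDS(slow_result, path)
reloaded <- readRDS(path)

identical(slow_result, reloaded)
```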
Make a new object using only the token part of the output from
cnlp_annotate() and then examine the $upos
column. What are all the unique parts of speech in this
dataset?
Select and examine the top 30 nouns and verbs in this dataset. Do any of the terms surprise you? How might this level of analysis of the text be meaningful for your interpretation of its themes?
library(cleanNLP)
cnlp_init_udpipe()
# tag parts of speech. this takes a long time
darwin_annotatedtext <- cnlp_annotate(darwin_tidy$word)
# look at the structure. the result is a list, so we pull out the token element
str(darwin_annotatedtext)
darwin_annotatedtextfull <- data.frame(darwin_annotatedtext$token)
str(darwin_annotatedtextfull)
# what are all the different parts of speech that have been tagged?
unique(darwin_annotatedtextfull$upos)
# verb analysis. first look at some of the verbs that occur in the book
# darwin_annotatedtextfull %>% filter(upos == "VERB") %>% select(token, lemma) %>%
#     data.frame() %>% top_n(30)
# top 30 verbs
darwin_annotatedtextfull %>%
filter(upos == "VERB") %>%
count(token, sort = T) %>%
top_n(30)
# what are the top 30 nouns?
darwin_annotatedtextfull %>%
filter(upos == "NOUN") %>%
count(lemma, sort = T) %>%
top_n(30) %>%
data.frame()