This week we learn how to work with text data in R: how to turn documents into word lists, analyze frequency counts, extract bigrams, analyze sentiment and parts of speech, and visualize text analyses. We follow the procedures outlined in Silge and Robinson’s Text Mining with R: A Tidy Approach.
Project Gutenberg is an awesome source for text data. For this
tutorial we will analyze Darwin’s On the Origin of
Species. The text can be downloaded from the Project Gutenberg website or,
for simplicity, retrieved directly with the
gutenbergr package.
First, load all of the libraries we will be using today.
# install.packages('gutenbergr')
library(gutenbergr)
library(tidyverse)
library(wordcloud)
library(wordcloud2)
library(tidytext)
library(stringr)
library(topicmodels)
library(data.table)
library(textdata)
library(cleanNLP)
cnlp_init_udpipe()
Download Darwin’s text and examine the structure of this dataset. How are the data organized?
darwin <- gutenberg_download(1228)
str(darwin)
tibble [16,188 × 2] (S3: tbl_df/tbl/data.frame)
$ gutenberg_id: int [1:16188] 1228 1228 1228 1228 1228 1228 1228 1228 1228 1228 ...
$ text : chr [1:16188] "Click on any of the filenumbers below to quickly view each ebook." "" "1228 1859, First Edition" "22764 1860, Second Edition" ...
Word frequencies can tell us more about the common topics appearing in a text as well as the type of language used. When working with texts from multiple sources or with multiple sections, you can also compare word frequency across these different sources.
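As a sketch of the cross-source comparison mentioned above, the code below compares word proportions between two tokenized word lists. The toy tibbles text_a and text_b are stand-ins for real data, such as the output of unnest_tokens() on two different books.

```r
library(dplyr)
library(tidyr)

# toy word lists standing in for two tokenized texts
text_a <- tibble(word = c("species", "species", "selection", "islands"))
text_b <- tibble(word = c("species", "culture", "culture", "islands"))

# stack the sources, count words within each, convert counts to
# proportions, and spread the sources into side-by-side columns
frequency <- bind_rows(
    mutate(text_a, source = "a"),
    mutate(text_b, source = "b")
) %>%
    count(source, word) %>%
    group_by(source) %>%
    mutate(proportion = n / sum(n)) %>%
    ungroup() %>%
    select(-n) %>%
    pivot_wider(names_from = source, values_from = proportion)

frequency
```

Words that appear in only one source show up as NA in the other column, which is itself informative when comparing vocabularies.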
Using the unnest_tokens() function, extract the
individual words from Darwin and create a table sorting the top 50 words
by count. Hint: Try the count() function.
What do you notice about the top words? Why do you think these words appear at the top of the list?
## make into individual words
allwords <- darwin %>%
unnest_tokens(output = word, input = text)
# notice that this also converts the text to lowercase and removes punctuation
# look at the top 50 words in the document
allwords %>%
count(word, sort = T) %>%
top_n(50)
# A tibble: 50 × 2
word n
<chr> <int>
1 the 10294
2 of 7856
3 and 4440
4 in 4015
5 to 3606
6 a 2469
7 that 2083
8 have 1763
9 be 1656
10 as 1591
# ℹ 40 more rows
Many of the top words are what are called stop words:
words that add little information to our analysis. These include words
like “is”, “the”, and “so” that do not improve our
understanding of the overall topics or themes in a text.
Tidytext has a built-in dictionary of stop words, making it
easy to quickly remove them from the text.
# look at the words in the stop_words dataset
data(stop_words)
stop_words %>%
top_n(50)
# A tibble: 174 × 2
word lexicon
<chr> <chr>
1 i snowball
2 me snowball
3 my snowball
4 myself snowball
5 we snowball
6 our snowball
7 ours snowball
8 ourselves snowball
9 you snowball
10 your snowball
# ℹ 164 more rows
# remove stop words from the text
darwin_tidy <- allwords %>%
anti_join(stop_words)
# look at the structure
str(darwin_tidy)
tibble [57,877 × 2] (S3: tbl_df/tbl/data.frame)
$ gutenberg_id: int [1:57877] 1228 1228 1228 1228 1228 1228 1228 1228 1228 1228 ...
$ word : chr [1:57877] "click" "filenumbers" "quickly" "view" ...
Now we can look at the number of unique words and their counts in Darwin, without interference from stop words.
# how many unique words are there?
length(unique(darwin_tidy$word))
[1] 6910
## look at top 50 words
darwin_tidy %>%
count(word, sort = TRUE) %>%
top_n(50) %>%
mutate(word = reorder(word, n)) %>%
data.frame()
word n
1 species 1540
2 varieties 434
3 selection 412
4 forms 401
5 natural 384
6 plants 335
7 life 306
8 animals 297
9 nature 261
10 distinct 254
11 period 244
12 structure 224
13 conditions 222
14 genera 220
15 common 197
16 time 194
17 degree 191
18 characters 183
19 view 169
20 birds 167
21 instance 165
22 modification 165
23 intermediate 160
24 closely 158
25 modified 153
26 genus 149
27 islands 144
28 individuals 143
29 inhabitants 143
30 produced 143
31 organic 142
32 domestic 141
33 parent 141
34 character 139
35 world 138
36 theory 137
37 organs 136
38 differences 135
39 breeds 133
40 hybrids 133
41 allied 131
42 manner 127
43 found 126
44 habits 125
45 slight 120
46 seeds 117
47 land 116
48 water 115
49 productions 114
50 amount 113
51 widely 113
darwin_tidy %>%
count(word, sort = TRUE) %>%
top_n(50) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) + geom_col() + xlab(NULL) + coord_flip() + labs(y = "Count",
x = "Unique words", title = "Top Words in On the Origin of Species")
Wordclouds can quickly illustrate common themes in text. Here we use
the wordcloud
and wordcloud2
packages to make wordclouds.
darwin_tidy %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
darwin_wordcounts <- darwin_tidy %>%
count(word, sort = T)
wordcloud2(data = darwin_wordcounts)
We can also analyze pairs of words (bigrams). This can be useful for understanding the context around particular words, as well as for identifying themes that span multiple words (e.g. “climate change”, “public health”).
darwin_bigrams <- darwin %>%
unnest_tokens(output = bigrams, input = text, token = "ngrams", n = 2)
str(darwin_bigrams)
tibble [143,751 × 2] (S3: tbl_df/tbl/data.frame)
$ gutenberg_id: int [1:143751] 1228 1228 1228 1228 1228 1228 1228 1228 1228 1228 ...
$ bigrams : chr [1:143751] "click on" "on any" "any of" "of the" ...
## look at counts for each pair
darwin_bigrams %>%
count(bigrams, sort = TRUE) %>%
top_n(50)
# A tibble: 50 × 2
bigrams n
<chr> <int>
1 of the 1937
2 <NA> 1663
3 in the 1042
4 the same 700
5 on the 627
6 to the 552
7 that the 424
8 have been 398
9 it is 303
10 in a 266
# ℹ 40 more rows
Again, we have a challenge with stop words here. One way to remove them is to take out any bigram that contains a stop word:
# separate words to pull out stop words
separated_words <- darwin_bigrams %>%
separate(bigrams, c("word1", "word2"), sep = " ")
# filter out stop words
darwin_bigrams <- separated_words %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
Make a table of the top 100 bigrams sorted from most to least
frequent. Hint: You can apply count() to multiple
columns.
Pull out all bigrams where “species” is the second term and make a table of the most common bigrams in this subset.
Pull out all bigrams where “water” is either the first or second term and make a table of the most common bigrams in this subset.
What does this analysis tell you about this text? Can you think of any data in your own research that would benefit from ngram analysis?
darwin_bigrams_count <- darwin_bigrams %>%
count(word1, word2, sort = TRUE)
# top 100 pairs of words
head(darwin_bigrams_count, 100) %>%
data.frame()
word1 word2 n
1 <NA> <NA> 1663
2 natural selection 250
3 distinct species 65
4 closely allied 59
5 fresh water 45
6 allied species 37
7 glacial period 36
8 south america 35
9 physical conditions 34
10 domestic animals 28
11 common parent 27
12 geological record 27
13 modified descendants 27
14 rudimentary organs 27
15 intermediate varieties 26
16 rock pigeon 26
17 parent species 25
18 north america 24
19 oceanic islands 24
20 hive bee 23
21 geographical distribution 22
22 natural history 21
23 secondary sexual 21
24 closely related 19
25 allied forms 18
26 land shells 18
27 larger genera 18
28 sexual selection 18
29 successive generations 18
30 domestic breeds 17
31 independently created 17
32 dr hooker 16
33 individual differences 16
34 reproductive system 16
35 species belonging 16
36 common progenitor 15
37 domestic varieties 15
38 natural system 15
39 de candolle 14
40 doubtful forms 14
41 extinct species 14
42 incipient species 14
43 direct action 13
44 dominant species 13
45 marked varieties 13
46 reciprocal crosses 13
47 representative species 13
48 sexual characters 13
49 alph de 12
50 consecutive formations 12
51 distinct genera 12
52 domestic productions 12
53 external conditions 12
54 geological period 12
55 la plata 12
56 malay archipelago 12
57 occasional means 12
58 parent form 12
59 physiological importance 12
60 reproductive systems 12
61 slight modifications 12
62 specific forms 12
63 systematic affinity 12
64 domestic races 11
65 extinct forms 11
66 humble bees 11
67 lesser degree 11
68 ordinary view 11
69 recent period 11
70 short faced 11
71 single species 11
72 specifically distinct 11
73 strongly marked 11
74 galapagos archipelago 10
75 geological formations 10
76 intermediate links 10
77 mountain ranges 10
78 perfectly fertile 10
79 sea water 10
80 slight degree 10
81 southern hemisphere 10
82 terrestrial mammals 10
83 complex relations 9
84 degree intermediate 9
85 existing species 9
86 favourable variations 9
87 geological succession 9
88 glacial epoch 9
89 intermediate gradations 9
90 living species 9
91 organic remains 9
92 selection acts 9
93 separated sexes 9
94 slight differences 9
95 slow process 9
96 successive slight 9
97 tail feathers 9
98 tierra del 9
99 water productions 9
100 webbed feet 9
# where 'species' is the second term
speciesbigram <- darwin_bigrams %>%
filter(word2 == "species")
speciesbigram %>%
count(word1, word2, sort = TRUE) %>%
top_n(20)
# A tibble: 20 × 3
word1 word2 n
<chr> <chr> <int>
1 distinct species 65
2 allied species 37
3 parent species 25
4 extinct species 14
5 incipient species 14
6 dominant species 13
7 representative species 13
8 single species 11
9 existing species 9
10 living species 9
11 peculiar species 7
12 pure species 7
13 varying species 7
14 aboriginal species 6
15 alpine species 6
16 doubtful species 6
17 fossil species 6
18 original species 6
19 ranging species 6
20 true species 6
## look at words that appear next to the word 'water'
waterbigram <- darwin_bigrams %>%
filter(word1 == "water" | word2 == "water")
waterbigram %>%
count(word1, word2, sort = TRUE) %>%
top_n(20)
# A tibble: 43 × 3
word1 word2 n
<chr> <chr> <int>
1 fresh water 45
2 sea water 10
3 water productions 9
4 salt water 8
5 water fish 6
6 water shells 5
7 water plants 4
8 water beetle 3
9 water birds 3
10 water lily 3
# ℹ 33 more rows
Texts often contain certain emotions, feelings, or sentiments that can tell us more about what they mean. In a way, coding sentiments is similar to the qualitative research method of coding fieldnotes for themes. Because of this, you can develop your own custom lexicon for your research context.
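A custom lexicon is just a two-column table of words and codes, which can then be joined to a word list the same way we join the NRC lexicon below. The words and codes here are invented for illustration; in practice you would build the table from your own codebook of themes.

```r
library(dplyr)

# a tiny custom lexicon for illustration; in practice this would come
# from your own codebook for the research context
my_lexicon <- tribble(
    ~word,       ~sentiment,
    "species",   "taxonomy",
    "varieties", "taxonomy",
    "islands",   "geography",
    "water",     "geography"
)

# tag a word list with the custom codes via inner_join(); words not in
# the lexicon (here, "theory") simply drop out
words <- tibble(word = c("species", "water", "theory", "islands"))
tagged <- words %>%
    inner_join(my_lexicon, by = "word")

tagged
```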
We’ll work with the NRC Emotion Lexicon. First, we load the NRC lexicon and look at the different types of sentiments it contains.
# load the nrc sentiment dictionary
get_sentiments("nrc")
# A tibble: 13,872 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# ℹ 13,862 more rows
nrcdf <- get_sentiments("nrc")
# take a look at the top sentiments that occur in the lexicon
nrcdf %>%
count(sentiment, sort = T)
# A tibble: 10 × 2
sentiment n
<chr> <int>
1 negative 3316
2 positive 2308
3 fear 1474
4 anger 1245
5 trust 1230
6 sadness 1187
7 disgust 1056
8 anticipation 837
9 joy 687
10 surprise 532
Using inner_join() we can combine the sentiments with
the words from Darwin, effectively “tagging” each word with a particular
sentiment.
# merge sentiments to Darwin data
darwin_sentiment <- darwin_tidy %>%
inner_join(get_sentiments("nrc"))
With the new merged and tagged dataframe, make a table of the top words in Darwin that are associated with the sentiment “trust” and one other sentiment of choice. Reflect on how you might interpret these results. Do you find this information useful? Is there any place you could see sentiment analysis being useful in your own research?
# look at the top words associated with trust
darwin_sentiment %>%
filter(sentiment == "trust") %>%
count(word, sort = T) %>%
top_n(20)
# A tibble: 21 × 2
word n
<chr> <int>
1 structure 224
2 theory 137
3 allied 131
4 found 126
5 doubt 108
6 related 82
7 perfect 79
8 inheritance 63
9 system 62
10 true 60
# ℹ 11 more rows
# pick another sentiment and pull out the top 20 words associated with this
# sentiment.
darwin_sentiment %>%
filter(sentiment == "surprise") %>%
count(word, sort = T) %>%
top_n(20)
# A tibble: 21 × 2
word n
<chr> <int>
1 variable 72
2 inheritance 63
3 larger 59
4 chance 54
5 wild 51
6 remarkable 48
7 occasional 43
8 tree 38
9 expect 31
10 suddenly 31
# ℹ 11 more rows
darwin_sentiment %>%
filter(sentiment == "positive") %>%
count(word, sort = T) %>%
top_n(20)
# A tibble: 20 × 2
word n
<chr> <int>
1 structure 224
2 degree 191
3 organic 142
4 allied 131
5 found 126
6 land 116
7 sea 107
8 importance 103
9 action 84
10 perfect 79
11 existence 78
12 increase 76
13 reason 73
14 advantage 67
15 inheritance 63
16 marked 62
17 true 60
18 growth 58
19 improved 57
20 fertile 55
Now that we’ve learned a bit about text analysis, let’s test our skills on a real-world dataset. Here we will use data from a survey in two Inupiaq villages in Alaska to examine how individuals in these communities feel about climate change and thawing permafrost. These data are drawn from William B. Bowden (2013), “Perceptions and implications of thawing permafrost and climate change in two Inupiaq villages of arctic Alaska”. Let’s examine the responses to two open-ended questions: (Q5) “What is causing it [permafrost around X village] to change?” and (Q69) “What feelings do you have when thinking about the possibility of future climate change in and around [village name]?”
First we load the data and subset out the columns of interest.
# we will work with the permafrost survey data.
surv <- read.csv("https://maddiebrown.github.io/ANTH630/data/Survey_AKP-SEL.csv",
stringsAsFactors = F)
surv_subset <- surv %>%
select(Village, Survey.Respondent, Age.Group, X69..Feelings, X5..PF.Cause.)
Then we can quickly calculate the most frequent terms across all 80 responses.
class(surv$X69..Feelings) #make sure your column is a character variable
[1] "character"
surv_tidy <- surv_subset %>%
unnest_tokens(word, X69..Feelings) %>%
anti_join(stop_words)
# what are most common words?
feelingswordcount <- surv_tidy %>%
count(word, sort = T)
Make wordclouds of the word frequency in responses about feelings related to climate change using two different methods.
surv_tidy %>%
count(word) %>%
with(wordcloud(word, n, max.words = 100))
wordcloud2(data = feelingswordcount)
Topic modeling allows us to identify themes in text without needing to know in advance which themes or groupings we expect to emerge. This can be very useful when you have large volumes of messy data, or data from multiple sources, that you need to parse. We will use Latent Dirichlet allocation (LDA), following the explanation in Text Mining with R.
Before we can identify themes across responses, however, we need to make sure each “document” (or “response”, in our case) has a unique identifier. We will make a new primary key called “ID” with a different value for each unique response.
surv_subset %>%
select(Village, Survey.Respondent)
Village Survey.Respondent
1 AKP 1
2 AKP 2
3 AKP 3
4 AKP 4
5 AKP 5
6 AKP 6
7 AKP 7
8 AKP 8
9 AKP 9
10 AKP 10
11 AKP 11
12 AKP 12
13 AKP 13
14 AKP 14
15 AKP 15
16 AKP 16
17 AKP 17
18 AKP 18
19 AKP 19
20 AKP 20
21 AKP 21
22 AKP 22
23 AKP 23
24 AKP 24
25 AKP 25
26 AKP 26
27 AKP 27
28 AKP 28
29 AKP 29
30 AKP 30
31 AKP 31
32 AKP 32
33 AKP 33
34 AKP 34
35 AKP 35
36 AKP 36
37 AKP 37
38 AKP 38
39 AKP 39
40 SEL 1
41 SEL 2
42 SEL 3
43 SEL 4
44 SEL 5
45 SEL 6
46 SEL 7
47 SEL 8
48 SEL 9
49 SEL 10
50 SEL 11
51 SEL 12
52 SEL 13
53 SEL 14
54 SEL 15
55 SEL 16
56 SEL 17
57 SEL 18
58 SEL 19
59 SEL 20
60 SEL 21
61 SEL 22
62 SEL 23
63 SEL 24
64 SEL 25
65 SEL 26
66 SEL 27
67 SEL 28
68 SEL 29
69 SEL 30
70 SEL 31
71 SEL 32
72 SEL 33
73 SEL 34
74 SEL 35
75 SEL 36
76 SEL 37
77 SEL 38
78 SEL 39
79 SEL 40
80 SEL 41
surv_subset <- surv_subset %>%
mutate(ID = paste(Village, Survey.Respondent, sep = "_"))
### let's make a new surv_tidy object that incorporates the new ID we made
surv_tidy <- surv_subset %>%
unnest_tokens(word, X69..Feelings) %>%
anti_join(stop_words)
surv_tidy %>%
count(word, ID, sort = T) %>%
top_n(5)
word ID n
1 days AKP_26 3
2 april AKP_17 2
3 caribou AKP_32 2
4 change AKP_5 2
5 cold AKP_1 2
6 future AKP_18 2
7 march AKP_17 2
8 months AKP_17 2
9 permafrost AKP_3 2
10 weather AKP_17 2
11 weather SEL_21 2
12 worry AKP_18 2
The first step in creating a topic model is to count the number of
times each word appears in each individual document (or response in our
case). Luckily, we can count by two variables using the
count() function. Let’s create a new
byresponse variable.
byresponse <- surv_tidy %>%
count(ID, word, sort = T) %>%
ungroup()
# check how many responses are included in the analysis. this allows you to
# double check that the new unique identifier we made worked as expected.
unique(byresponse$ID)
[1] "AKP_26" "AKP_1" "AKP_17" "AKP_18" "AKP_3" "AKP_32" "AKP_5" "SEL_21"
[9] "AKP_10" "AKP_11" "AKP_12" "AKP_13" "AKP_14" "AKP_15" "AKP_16" "AKP_19"
[17] "AKP_2" "AKP_20" "AKP_22" "AKP_23" "AKP_24" "AKP_25" "AKP_27" "AKP_28"
[25] "AKP_29" "AKP_30" "AKP_31" "AKP_33" "AKP_34" "AKP_35" "AKP_36" "AKP_37"
[33] "AKP_38" "AKP_4" "AKP_6" "AKP_7" "AKP_8" "AKP_9" "SEL_1" "SEL_10"
[41] "SEL_11" "SEL_13" "SEL_14" "SEL_15" "SEL_17" "SEL_18" "SEL_19" "SEL_2"
[49] "SEL_20" "SEL_22" "SEL_24" "SEL_25" "SEL_26" "SEL_27" "SEL_29" "SEL_3"
[57] "SEL_30" "SEL_31" "SEL_32" "SEL_33" "SEL_34" "SEL_35" "SEL_36" "SEL_37"
[65] "SEL_38" "SEL_39" "SEL_4" "SEL_40" "SEL_5" "SEL_6" "SEL_7" "SEL_8"
[73] "SEL_9"
length(unique(byresponse$ID))
[1] 73
Now we can convert our long-form word list into a document-term matrix.
surv_dtm <- byresponse %>%
cast_dtm(ID, word, n)
# ?cast_dtm #read up on how this function works
Run the LDA() function and choose a number of topics. In this case, let’s try it with k = 2.
surv_lda <- LDA(surv_dtm, k = 2, control = list(seed = 9999))
# look at our output
# str(surv_lda)
# examine the probability that each word is in a particular topic group
surv_topics <- tidy(surv_lda, matrix = "beta")
head(surv_topics)
# A tibble: 6 × 3
topic term beta
<int> <chr> <dbl>
1 1 days 0.00885
2 2 days 0.00864
3 1 cold 0.00540
4 2 cold 0.0296
5 1 april 0.0102
6 2 april 0.00140
Examine the top words for each topic identified by the model.
top_words <- surv_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, desc(beta))
top_words
# A tibble: 20 × 3
topic term beta
<int> <chr> <dbl>
1 1 climate 0.0446
2 1 change 0.0363
3 1 animals 0.0324
4 1 sad 0.0258
5 1 adapt 0.0204
6 1 caribou 0.0196
7 1 scary 0.0180
8 1 blank 0.0173
9 1 river 0.0159
10 1 future 0.0145
11 2 cold 0.0296
12 2 weather 0.0276
13 2 worried 0.0248
14 2 nc 0.0224
15 2 change 0.0220
16 2 future 0.0205
17 2 dk 0.0181
18 2 people 0.0175
19 2 land 0.0172
20 2 scary 0.0170
We can also examine the results graphically.
# plot these top words for each topic (adapted from
# https://www.tidytextmining.com/topicmodeling.html)
top_words %>%
group_by(topic) %>%
mutate(term = fct_reorder(term, beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) +
facet_wrap(~topic, scales = "free") + theme_minimal()
Repeat the topic modeling analysis, but using six topics instead of two.
## repeat the analysis but with 6 topics
surv_lda6 <- LDA(surv_dtm, k = 6, control = list(seed = 9999))
# examine the probability that each word is in a particular topic group
surv_topics6 <- tidy(surv_lda6, matrix = "beta")
surv_topics6
# A tibble: 1,260 × 3
topic term beta
<int> <chr> <dbl>
1 1 days 4.88e- 2
2 2 days 9.19e-262
3 3 days 1.69e-269
4 4 days 1.17e-270
5 5 days 5.77e-267
6 6 days 4.77e-268
7 1 cold 1.63e- 2
8 2 cold 5.81e- 2
9 3 cold 1.68e- 2
10 4 cold 4.86e-235
# ℹ 1,250 more rows
# examine top words for each topic
top_words6 <- surv_topics6 %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, desc(beta))
# plot these top words for each topic (adapted from
# https://www.tidytextmining.com/topicmodeling.html)
top_words6 %>%
group_by(topic) %>%
mutate(term = fct_reorder(term, beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) +
facet_wrap(~topic, scales = "free_y") + theme_minimal()
In this case, our sample is small, so topic modeling is not necessarily the best method to use. However, even from this small sample, you can see that some topics emerge from the text that were not previously apparent.
Sometimes you’ll need to edit text or strings manually. For example, you may find that for your research question you are less interested in differentiating between the terms running, run, and runner than in identifying clusters of beliefs about running as a more general concept. On the other hand, you might want to differentiate between runners and running as beliefs about groups of people versus the act of running. How you choose to transform text data depends on your research questions and your understanding of the cultural context.
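If you do want to collapse inflected forms like running and runs to a common stem, a stemmer is one option. Here is a minimal sketch assuming the SnowballC package (not loaded above) is installed:

```r
library(SnowballC)

# stem each word down to a common root with the Snowball (Porter2) stemmer
words <- c("running", "runs", "run", "runner")
wordStem(words, language = "english")
# note: stemming is mechanical; "runner" may not collapse to "run",
# so inspect the results before relying on them
```

Whether stemming helps depends on your question: it merges inflections of the same concept, but it can also erase distinctions (like runner vs. running) that matter to your analysis.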
R has a number of helpful functions for manually adjusting strings. We’ll cover a few to get you started. Let’s go back to the permafrost and climate change survey and look at responses to: (Q5) What is causing it [permafrost around X village] to change?.
First let’s look at the raw data. What are some potential issues in the strings below that might make text analysis difficult or ineffective?
surv$X5..PF.Cause.[10:30]
[1] "climate change"
[2] "too warm"
[3] "warmer weather, shorter winters, lack of snow, fast springs. [will affect AKP because we] use Argos to go out."
[4] "weather warming"
[5] "temperature"
[6] "heat. A lot of heat."
[7] "melting of the ground - goes down"
[8] "spirited answer. Her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[9] "probably global warming"
[10] "temperature outside is not steady"
[11] "(blank)"
[12] "N/A"
[13] "most likely war weather or global warming"
[14] "not much winter - hardly get snow. Always wind blown. Summer be rain, rain, rain. Late snow."
[15] "warmer winters"
[16] "global warming"
[17] "warming weather, longer summer/fall season"
[18] "I have no idea"
[19] "warmer climate"
[20] "Seems like there's lots of rain & water causes ground to thaw. So maybe accumulated water? Maybe warm weather?"
[21] "the heat wave - winter frosts"
Luckily, we can manually adjust the strings to make them easier to analyze systematically. For example, we might convert characters to lowercase, trim white space, and remove any empty or missing rows.
# make a new column to hold the tidy data
surv$cause_tidy <- surv$X5..PF.Cause.
# make lower case
surv$cause_tidy <- tolower(surv$cause_tidy)
# remove white space at beginning and end of string
surv$cause_tidy <- trimws(surv$cause_tidy)
# filter out blank or empty rows
surv <- surv %>%
filter(surv$cause_tidy != "")
surv <- surv %>%
filter(surv$cause_tidy != "(blank)")
surv <- surv %>%
filter(surv$cause_tidy != "n/a")
We can also directly replace particular strings. Here we change some strings with typos.
surv$cause_tidy <- surv$cause_tidy %>%
str_replace("wamer", "warmer")
surv$cause_tidy <- surv$cause_tidy %>%
str_replace("lnoger", "longer")
Another common string transformation involves grouping responses
into more standardized categories. You can transform values
individually or based on exact string matches. In addition, using
%like% (from the data.table package) we can transform any string
where part of it matches a particular pattern. For example, we might
decide that any time the string “warm” appears in a response, the
overall theme of the response is “global warming”. Or, based on our
ethnographic understanding of the context, we might know that “seasonal
changes” are important causes of permafrost thaw in local cultural
models. We can then look for key terms, in this case “late” and
“early”, that allow us to rapidly recode multiple responses likely to
fit this category.
# group some responses together based on the presence of a particular string
surv <- surv %>%
mutate(cause_tidy = replace(cause_tidy, cause_tidy %like% "warm", "global warming"))
surv$cause_tidy[1:30]
[1] "environment"
[2] "exhaust"
[3] "global warming"
[4] "global warming"
[5] "hot summers, early springs. in super cold winters the ground comes up & cracks and water comes out."
[6] "global warming"
[7] "climate change"
[8] "freezing & thawing in fall & spring"
[9] "global warming"
[10] "climate change"
[11] "global warming"
[12] "global warming"
[13] "global warming"
[14] "temperature"
[15] "heat. a lot of heat."
[16] "melting of the ground - goes down"
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"
[19] "temperature outside is not steady"
[20] "global warming"
[21] "not much winter - hardly get snow. always wind blown. summer be rain, rain, rain. late snow."
[22] "global warming"
[23] "global warming"
[24] "global warming"
[25] "i have no idea"
[26] "global warming"
[27] "global warming"
[28] "the heat wave - winter frosts"
[29] "global warming"
[30] "global warming"
surv <- surv %>%
mutate(cause_tidy = replace(cause_tidy, cause_tidy %like% "early" | cause_tidy %like%
"late", "seasonal changes"))
surv$cause_tidy[1:30]
[1] "environment"
[2] "exhaust"
[3] "global warming"
[4] "global warming"
[5] "seasonal changes"
[6] "global warming"
[7] "climate change"
[8] "freezing & thawing in fall & spring"
[9] "global warming"
[10] "climate change"
[11] "global warming"
[12] "global warming"
[13] "global warming"
[14] "temperature"
[15] "heat. a lot of heat."
[16] "melting of the ground - goes down"
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"
[19] "temperature outside is not steady"
[20] "global warming"
[21] "seasonal changes"
[22] "global warming"
[23] "global warming"
[24] "global warming"
[25] "i have no idea"
[26] "global warming"
[27] "global warming"
[28] "the heat wave - winter frosts"
[29] "global warming"
[30] "global warming"
# compare the original with your categorizations
# surv %>%
#     select(X5..PF.Cause., cause_tidy)
We won’t get into too much detail today, but you can also search and
select string data using regular
expressions. You can read more in R4DS.
Here let’s use str_detect() to pull out some strings with
regular expressions.
# any responses ending in 'ing'
surv$cause_tidy[str_detect(surv$cause_tidy, "ing$")]
[1] "global warming" "global warming"
[3] "global warming" "freezing & thawing in fall & spring"
[5] "global warming" "global warming"
[7] "global warming" "global warming"
[9] "global warming" "global warming"
[11] "global warming" "global warming"
[13] "global warming" "global warming"
[15] "global warming" "global warming"
[17] "global warming" "global warming"
[19] "global warming" "global warming"
[21] "global warming" "global warming"
[23] "global warming" "global warming"
[25] "global warming" "global warming"
[27] "global warming" "global warming"
[29] "global warming" "global warming"
[31] "global warming" "global warming"
[33] "global warming" "global warming"
[35] "global warming"
# any responses that contain a 'w' followed by either an 'e' or an 'a'
surv$cause_tidy[str_detect(surv$cause_tidy, "w[ea]")]
[1] "global warming"
[2] "global warming"
[3] "global warming"
[4] "global warming"
[5] "global warming"
[6] "global warming"
[7] "global warming"
[8] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[9] "global warming"
[10] "global warming"
[11] "global warming"
[12] "global warming"
[13] "global warming"
[14] "global warming"
[15] "global warming"
[16] "the heat wave - winter frosts"
[17] "global warming"
[18] "global warming"
[19] "weather"
[20] "global warming"
[21] "global warming"
[22] "global warming"
[23] "global warming"
[24] "global warming"
[25] "weather."
[26] "global warming"
[27] "global warming"
[28] "global warming"
[29] "global warming"
[30] "global warming"
[31] "global warming"
[32] "global warming"
[33] "global warming"
[34] "global warming"
[35] "global warming"
[36] "global warming"
[37] "global warming"
[38] "global warming"
# any responses that contain the string erosion
surv$cause_tidy[str_detect(surv$cause_tidy, "erosion")]
[1] "erosion, and real hot summers and a lot of snow & rain."
[2] "mud goes down river, cracking all along & falling in - erosion"
[3] "ground erosion"
[4] "erosion"
# any responses that contain the string erosion, but which have any character
# occurring before the word erosion.
surv$cause_tidy[str_detect(surv$cause_tidy, ".erosion")]
[1] "mud goes down river, cracking all along & falling in - erosion"
[2] "ground erosion"
Regular expressions are hugely useful for quickly searching through and transforming large volumes of string data; we’ve only scratched the surface today.
Whenever you transform large volumes of data using string detection and regular expressions, it is critical to double-check that each operation actually works as you expect. Pay attention to the order of transformations as well, so that later steps do not overwrite earlier ones.
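One lightweight way to double-check a recoding step is to keep the original column, apply the transformation to a copy, and compare the two side by side. A minimal sketch, using toy responses rather than the survey data:

```r
library(dplyr)
library(stringr)

# toy responses standing in for a raw survey column
responses <- tibble(answer = c("warmer weather", "erosion", "too warm"))

# recode into a copy so the original values are preserved
responses <- responses %>%
    mutate(answer_tidy = if_else(str_detect(answer, "warm"),
                                 "global warming", answer))

# a side-by-side comparison shows exactly which rows were recoded
responses %>%
    filter(answer != answer_tidy)
```

Reviewing only the changed rows makes it easy to spot false matches (e.g. a pattern catching a word you did not intend) before the original column is discarded.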
Sometimes it is useful to create flags or indicator variables in your data. These allow you to quickly filter out rows with particular characteristics. For example, we can create a new binary column that indicates whether or not a response refers to global warming. This variable can then be used for further grouping, data visualization, or other tasks.
surv <- surv %>%
mutate(GlobalWarmingYN = str_detect(cause_tidy, "global warming"))
table(surv$GlobalWarmingYN) # how many responses contain the string global warming?
FALSE TRUE
37 34
Now that we’ve learned a bit about wrangling text data in R, let’s combine these skills with what we learned previously about cultural domain analysis.
For this example, we will use Q8: “What do you think are the 3 biggest issues that will result from thawing permafrost in this area?” from the Permafrost dataset.
The first thing to consider is how to transform the dataset into the right format for conducting freelist analysis. These responses are already ordered from 1-3, but they are organized into three different columns instead of one. We can pivot the data from wide to long format and then add in a new column that accounts for the rank of the responses.
# first create a unique identifier
surv <- surv %>%
mutate(ID = paste(Village, Survey.Respondent, sep = "_"))
# select out only the columns related to the permafrost question
pfissues <- surv %>%
select(ID, Village, X8.1..PF.Issue, X8.2..PF.Issue, X8.3..PF.Issue)
# pivot the data from wide to long format
pfissues_long <- pivot_longer(pfissues, c(X8.1..PF.Issue, X8.2..PF.Issue, X8.3..PF.Issue),
names_to = "rank", values_to = "issue")
# add in a new column for the rank of each response.
pfissues_long <- pfissues_long %>%
mutate(rank = case_when(rank == "X8.1..PF.Issue" ~ "1", rank == "X8.2..PF.Issue" ~
"2", TRUE ~ "3"))
Now that our data are in a long, ranked format, we are ready to analyze the responses.
Make a table of the number of responses for each issue resulting from permafrost thaw. Why is this difficult to interpret?
List all the unique issues mentioned. Can any of these be meaningfully combined?
# table(pfissues_long$issue)
# look at all the unique issues mentioned
# unique(pfissues_long$issue)
We can clean up the text responses in several ways. First, we can remove missing values, convert text to lowercase and remove extra whitespace. Next, we will begin interpreting which responses can be meaningfully grouped together.
The code below begins the text cleaning process by removing missing values.
# set missing values to NA
pfissues_long$issue[pfissues_long$issue == ""] <- NA
pfissues_long$issue[pfissues_long$issue == "(blank)"] <- NA
pfissues_long$issue[pfissues_long$issue == "DK"] <- NA
# filter out rows with missing values
pfissues_long <- pfissues_long %>%
filter(!is.na(issue))
# make a new column to hold the tidy data
pfissues_long$issue_tidy <- pfissues_long$issue
Let’s perform several more text cleaning operations. If you are unfamiliar with the functions used, take a look at their help files to figure out their syntax.
tolower() to convert responses to lowercase.
trimws() to remove extra white space in responses.
count() and arrange() to view the counts of responses.
# make lower case
pfissues_long$issue_tidy <- tolower(pfissues_long$issue_tidy)
# remove white space at beginning and end of string
pfissues_long$issue_tidy <- trimws(pfissues_long$issue_tidy)
# look at unique values and their counts
pfissues_long %>%
count(issue_tidy) %>%
arrange(desc(n))
# A tibble: 164 × 2
issue_tidy n
<chr> <int>
1 erosion 7
2 flooding 7
3 water levels rising 3
4 berry picking 2
5 housing 2
6 rivers will change course 2
7 water 2
8 [that's all] 1
9 a lot of land slides and erosion - especially with rain. putuniq used … 1
10 affect caribou 1
# ℹ 154 more rows
Now the text of the permafrost issue responses is cleaned, but we still have many different responses getting at the same underlying issues. Some of this is due to parts of speech, but much of it comes from the open-ended nature of the responses. This is where the researcher can start to group responses together using their knowledge of the research context.
For example, the code below groups all responses with the word “caribou” into a single word response. Below that is a more complicated version converting many different strings into the single “subsistence” category.
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "caribou", "caribou"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "hunt" | issue_tidy %like%
"subsistence" | issue_tidy %like% "food" | issue_tidy %like% "berr" | issue_tidy %like%
"fishing" | issue_tidy %like% "eat" | issue_tidy %like% "hungry", "subsistence"))
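The %like% operator comes from data.table and is a thin wrapper around grepl(), so it performs a case-sensitive regular-expression match on each element. A quick sketch with invented strings shows why we lowercased the responses before this recoding step:

```r
library(data.table)

# hypothetical responses
x <- c("caribou hunting", "Caribou migration", "fishing")

# only the all-lowercase string matches "caribou";
# "Caribou migration" would be missed without tolower() first
x %like% "caribou"
```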
Take a look at the permafrost issue responses and consider which
major categories exist across the responses. Following the syntax
outlined above, make 3-4 more categories to group the responses
together. Finally, take a look at the results by modifying the following
code:
pfissues_long %>% count(issue_tidy) %>% arrange(desc(n)) %>% filter(n>1).
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "erosion" | issue_tidy %like%
"erode" | issue_tidy %like% "eroding" | issue_tidy %like% "land slid", "erosion"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "migration" | issue_tidy %like%
"migrating", "migration"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "flood" | issue_tidy %like%
"water" | issue_tidy %like% "level", "water"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "animal" | issue_tidy %like%
"wildlife", "animals"))
pfissues_long <- pfissues_long %>%
mutate(issue_tidy = replace(issue_tidy, issue_tidy %like% "weather" | issue_tidy %like%
"wind" | issue_tidy %like% "rain" | issue_tidy %like% "temperature" | issue_tidy %like%
"snow" | issue_tidy %like% "wet" | issue_tidy %like% "dry" | issue_tidy %like%
"hot", "weather and climate"))
# look at counts
pfissues_long %>%
count(issue_tidy) %>%
arrange(desc(n)) %>%
filter(n > 1)
# A tibble: 9 × 2
issue_tidy n
<chr> <int>
1 water 30
2 subsistence 22
3 erosion 16
4 weather and climate 10
5 caribou 6
6 animals 4
7 migration 4
8 housing 2
9 rivers will change course 2
Now that we’ve cleaned up the data and grouped responses into
categories, we can conduct a salience analysis. The code below follows
the same format we learned previously to analyze freelists with
AnthroTools. Your results will look different from the ones
below since you will have created different protocols for combining
responses.
pfissues_long.df <- as.data.frame(pfissues_long)
pfissues_long.df$rank <- as.numeric(pfissues_long.df$rank)
library(AnthroTools)
IssuesSalience <- CalculateSalience(pfissues_long.df, Order = "rank", Subj = "ID",
CODE = "issue_tidy", Salience = "Salience")
# look at results
# IssuesSalience
SalienceByIssue <- SalienceByCode(IssuesSalience, Salience = "Salience", Subj = "ID",
CODE = "issue_tidy", dealWithDoubles = "MAX")
CODE
1 water
2 hard to travel
3 migration
4 stress holes
5 more permafrost fractures
6 building issues
7 travel
8 housing
9 release of methane gas & particles
10 no more stable ground
11 rivers will change course
12 transportation
13 weather and climate
14 subsistence
15 changing roadways
16 might have to fix the river if it flows again
17 caribou
18 thaws too early in springtime & hard to travel
19 [that's all]
20 overflow
21 dangerous glaciers
22 our argo trails will be affected
23 sinking land - have to look for different trails that are more solid.
24 methane escaping
25 animals
26 travel & trails
27 travel - trails
28 hard time walking
29 changing of landscape
30 changing of the seasons
31 tundra might be not as sturdy on top, maybe more soft or swampy
32 ground coming up off the ground
33 ground cave-ins, sink-holes
34 sinkholes
35 affect vegetations or surface
36 methane gas
37 sink holes will be more
38 vegetation will change
39 ice changes
40 affecting plants
41 might change some feeding sources
42 river patterns would be changing
43 maybe ground would be changing
44 air
45 sunlight
46 our building, homes, runways, bridge always be monitored
47 the more studies and how it will affect our community on thawing permafrost
48 ground shift
49 houses - foundations have to be worked on
50 bacteria & viruses that will thaw out of it' different from what we're used to
51 erosion
52 old bacteria rising from the ground
53 bodies buried in permafrost lead to disease
54 both sides of island narrowing
55 repair homes & foundations
56 houses buckling
57 camps falling into river
58 building houses - now on marshy ground
59 some extent whitefish, sheefish - not so much pike or lingcod. burbot about 1/10 what they used to be in fish creek.
60 pollution
61 digging graves is easier.
62 we're going to sink - lik putuniq.
63 bad all over - sinking.
64 bridge stability - won't be there in a couple years, and other infrastructure.
65 not enough good ground for new homes.
66 unpredictable conditions - learn new ways to protect ourselves. rhubarb picking areas are falling into river. rhubarb only grows where it is protected from the severe cold.
67 plan for moving, maybe. i'm not worried.
68 whatever happen will happen.
69 move to higher ground. road conditions & trails used to be firmer. also 4-wheelers tearing up trails.
70 by seining time the small creeks are muddy the past 2-3 years - maybe because of beavers? used to be real clear - even like soap bubbles.
71 fish changing - strong mud taste the past 3-4 years. used to have a lot of grayling, and now almost none.
72 shifting earth
73 shifting
74 my house foundation is sinking into the ground
75 selawik bridge foundation sinking also
76 soft tundra like sinking sand but tundra
77 houses are sinking
78 sidewalks are sinking
79 ground lost
80 ruin our boardwalks and homes
81 soft ground
82 trails are getting bad
83 ground is unstable
84 boardwalks sinking
85 more muddy areas
86 sinking in the tundra
87 climate changing faster every year
88 early spring
89 summer
90 shifting land
91 moving community
92 houses will become unstable
93 shifting houses
94 sun
95 land
MeanSalience SumSalience SmithsS
1 0.7816092 22.6666667 0.338308458
2 0.6666667 0.6666667 0.009950249
3 0.3333333 1.3333333 0.019900498
4 1.0000000 1.0000000 0.014925373
5 0.5000000 0.5000000 0.007462687
6 1.0000000 1.0000000 0.014925373
7 0.6666667 0.6666667 0.009950249
8 0.3333333 0.6666667 0.009950249
9 1.0000000 1.0000000 0.014925373
10 0.6666667 0.6666667 0.009950249
11 0.5000000 1.0000000 0.014925373
12 1.0000000 1.0000000 0.014925373
13 0.6296296 5.6666667 0.084577114
14 0.6491228 12.3333333 0.184079602
15 1.0000000 1.0000000 0.014925373
16 0.6666667 0.6666667 0.009950249
17 0.6111111 3.6666667 0.054726368
18 0.6666667 0.6666667 0.009950249
19 0.3333333 0.3333333 0.004975124
20 1.0000000 1.0000000 0.014925373
21 0.3333333 0.3333333 0.004975124
22 0.5000000 0.5000000 0.007462687
23 1.0000000 1.0000000 0.014925373
24 0.6666667 0.6666667 0.009950249
25 0.5833333 2.3333333 0.034825871
26 0.3333333 0.3333333 0.004975124
27 1.0000000 1.0000000 0.014925373
28 0.3333333 0.3333333 0.004975124
29 1.0000000 1.0000000 0.014925373
30 0.3333333 0.3333333 0.004975124
31 1.0000000 1.0000000 0.014925373
32 1.0000000 1.0000000 0.014925373
33 0.3333333 0.3333333 0.004975124
34 1.0000000 1.0000000 0.014925373
35 0.6666667 0.6666667 0.009950249
36 0.3333333 0.3333333 0.004975124
37 0.6666667 0.6666667 0.009950249
38 0.3333333 0.3333333 0.004975124
39 1.0000000 1.0000000 0.014925373
40 0.3333333 0.3333333 0.004975124
41 1.0000000 1.0000000 0.014925373
42 1.0000000 1.0000000 0.014925373
43 0.6666667 0.6666667 0.009950249
44 1.0000000 1.0000000 0.014925373
45 0.3333333 0.3333333 0.004975124
46 1.0000000 1.0000000 0.014925373
47 0.5000000 0.5000000 0.007462687
48 1.0000000 1.0000000 0.014925373
49 1.0000000 1.0000000 0.014925373
50 0.6666667 0.6666667 0.009950249
51 0.7291667 11.6666667 0.174129353
52 1.0000000 1.0000000 0.014925373
53 0.6666667 0.6666667 0.009950249
54 0.3333333 0.3333333 0.004975124
55 0.6666667 0.6666667 0.009950249
56 0.6666667 0.6666667 0.009950249
57 0.3333333 0.3333333 0.004975124
58 0.6666667 0.6666667 0.009950249
59 0.3333333 0.3333333 0.004975124
60 0.3333333 0.3333333 0.004975124
61 1.0000000 1.0000000 0.014925373
62 1.0000000 1.0000000 0.014925373
63 1.0000000 1.0000000 0.014925373
64 0.6666667 0.6666667 0.009950249
65 0.3333333 0.3333333 0.004975124
66 0.3333333 0.3333333 0.004975124
67 1.0000000 1.0000000 0.014925373
68 0.3333333 0.3333333 0.004975124
69 1.0000000 1.0000000 0.014925373
70 1.0000000 1.0000000 0.014925373
71 0.6666667 0.6666667 0.009950249
72 0.6666667 0.6666667 0.009950249
73 0.6666667 0.6666667 0.009950249
74 1.0000000 1.0000000 0.014925373
75 0.6666667 0.6666667 0.009950249
76 0.3333333 0.3333333 0.004975124
77 1.0000000 1.0000000 0.014925373
78 0.3333333 0.3333333 0.004975124
79 0.6666667 0.6666667 0.009950249
80 1.0000000 1.0000000 0.014925373
81 1.0000000 1.0000000 0.014925373
82 1.0000000 1.0000000 0.014925373
83 0.6666667 0.6666667 0.009950249
84 0.3333333 0.3333333 0.004975124
85 1.0000000 1.0000000 0.014925373
86 1.0000000 1.0000000 0.014925373
87 0.3333333 0.3333333 0.004975124
88 1.0000000 1.0000000 0.014925373
89 0.6666667 0.6666667 0.009950249
90 0.6666667 0.6666667 0.009950249
91 0.3333333 0.3333333 0.004975124
92 1.0000000 1.0000000 0.014925373
93 0.6666667 0.6666667 0.009950249
94 0.6666667 0.6666667 0.009950249
95 0.3333333 0.3333333 0.004975124
# SalienceByIssue
SalienceByIssueS <- SalienceByIssue %>%
filter(SmithsS >= 0.015)
# SalienceByIssueS
ggplot(SalienceByIssueS, aes(x = reorder(CODE, SmithsS), y = SmithsS)) + geom_bar(stat = "identity") +
coord_flip() + ggtitle("Issue Salience") + labs(x = "Issue", y = "Smith's S") +
geom_hline(yintercept = 0.1)
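AnthroTools handles the salience arithmetic for us, but the underlying idea is simple: an item at rank r in a list of length L gets salience (L - r + 1)/L, and Smith's S for an item is the sum of its saliences across respondents divided by the number of respondents. A hand-rolled sketch on a made-up two-person freelist (this is a simplified illustration, not a substitute for CalculateSalience()):

```r
library(dplyr)

# hypothetical freelist: two respondents, ranked responses
fl <- data.frame(Subj = c("a", "a", "a", "b", "b"),
                 rank = c(1, 2, 3, 1, 2),
                 code = c("water", "erosion", "housing", "erosion", "water"))

N <- n_distinct(fl$Subj)  # number of respondents

S <- fl %>%
  group_by(Subj) %>%
  mutate(salience = (n() - rank + 1) / n()) %>%  # (L - r + 1) / L within each list
  group_by(code) %>%
  summarise(SmithsS = sum(salience) / N) %>%
  arrange(desc(SmithsS))
S
```

Here "erosion" scores highest (5/6): both respondents listed it, one at rank 1. Frequency and rank both feed into the score, which is why Smith's S is more informative than a raw count.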
We can also tag the parts of speech in a text. This allows us to focus an analysis on verbs, nouns, or other parts of speech that may be of interest. For example, in a study on sentiments, we might want to pull out adjectives in order to understand how people feel or describe a particular phenomenon. On the other hand, we might also pull out verbs in order to understand the types of actions people describe as associated with certain cultural practices or beliefs. Let’s tag the parts of speech in Darwin to learn more about the places and cultural practices documented in this book.
Using the cnlp_annotate() function we can tag the parts
of speech in Darwin. This function can take a long time to run. This is
the last thing we will do today, so feel free to let it run and then
take a break and come back to finish these problems.
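Because the annotation is slow, it is worth caching its output so you only compute it once; saveRDS() and readRDS() round-trip any R object to disk. A self-contained sketch with a stand-in object (in practice you would save the cnlp_annotate() result instead):

```r
# stand-in for a slow-to-compute result
slow_result <- list(token = data.frame(token = c("species", "variation")))

# write it once, then reload in later sessions instead of recomputing
path <- tempfile(fileext = ".rds")
saveRDS(slow_result, path)
reloaded <- readRDS(path)

identical(slow_result, reloaded)
```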
Make a new object using only the token part of the output from
cnlp_annotate() and then examine the $upos
column. What are all the unique parts of speech in this
dataset?
Select and examine the top 30 nouns and verbs in this dataset. Do any of the terms surprise you? How might this level of analysis of the text be meaningful for your interpretation of its themes?
library(cleanNLP)
cnlp_init_udpipe()
# tag parts of speech. this takes a long time
darwin_annotatedtext <- cnlp_annotate(darwin_tidy$word)
# look at the structure. the result is a list, so we pull out the token element
str(darwin_annotatedtext)
darwin_annotatedtextfull <- data.frame(darwin_annotatedtext$token)
str(darwin_annotatedtextfull)
# what are all the different parts of speech that have been tagged?
unique(darwin_annotatedtextfull$upos)
# verb analysis. first look at some of the verbs that occur in the book
# darwin_annotatedtextfull %>% filter(upos == "VERB") %>% select(token, lemma) %>%
#     data.frame() %>% top_n(30)
# top 30 verbs
darwin_annotatedtextfull %>%
filter(upos == "VERB") %>%
count(token, sort = T) %>%
top_n(30)
# what are the top 30 nouns?
darwin_annotatedtextfull %>%
filter(upos == "NOUN") %>%
count(lemma, sort = T) %>%
top_n(30) %>%
data.frame()