This week we learn how to work with text data in R: how to turn documents into word lists, analyze frequency counts, extract bigrams, analyze sentiment and parts of speech, and visualize text analyses.
To get started, we will analyze the classic Malinowski (1922) text Argonauts of the Western Pacific. The text can be downloaded from Project Gutenberg, or, for simplicity, we can download it directly using the gutenbergr package.
First, let’s load all of the libraries we will be using today.
#install.packages("gutenbergr")
library(gutenbergr)
gutenberg_metadata
# A tibble: 51,997 × 8
gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
<int> <chr> <chr> <int> <chr> <chr> <chr>
1 0 <NA> <NA> NA en <NA> Publi…
2 1 "The D… Jeffe… 1638 en United States L… Publi…
3 2 "The U… Unite… 1 en American Revolu… Publi…
4 3 "John … Kenne… 1666 en <NA> Publi…
5 4 "Linco… Linco… 3 en US Civil War Publi…
6 5 "The U… Unite… 1 en American Revolu… Publi…
7 6 "Give … Henry… 4 en American Revolu… Publi…
8 7 "The M… <NA> NA en <NA> Publi…
9 8 "Abrah… Linco… 3 en US Civil War Publi…
10 9 "Abrah… Linco… 3 en US Civil War Publi…
# … with 51,987 more rows, and 1 more variable: has_text <lgl>
library(tidyverse)
library(wordcloud)
library(tidytext)
library(stringr)
library(topicmodels)
library(data.table)
library(textdata)
library(cleanNLP)
cnlp_init_udpipe()
Download the Malinowski book text and examine the structure. How are the data organized?
malinowski1922 <- gutenberg_download(55822)
str(malinowski1922)
tibble [22,219 × 2] (S3: tbl_df/tbl/data.frame)
$ gutenberg_id: int [1:22219] 55822 55822 55822 55822 55822 55822 55822 55822 55822 55822 ...
$ text : chr [1:22219] " ARGONAUTS OF THE WESTERN PACIFIC" "" " An Account of Native" " Enterprise and Adventure" ...
One of the first ways we can explore a text is by looking at word frequencies. With multiple samples from different people or sites, comparing word frequencies can reveal differences across populations, while within a single text, word frequencies can highlight key issues, people, or places.
Using the unnest_tokens() function, extract the individual words from Malinowski and create a table sorting the top words by count. What do you notice about the top words? Why do you think these words appear at the top of the list?
## make into individual words
words<- malinowski1922 %>% unnest_tokens(output=word,input=text)
#notice that this also converts to lowercase and removes punctuation
#look at the top 50 words in the document
words %>% count(word,sort=T) %>% top_n(50)
# A tibble: 50 × 2
word n
<chr> <int>
1 the 18607
2 of 10254
3 and 6625
4 in 5296
5 to 4950
6 a 4732
7 is 3554
8 it 2070
9 as 1795
10 on 1746
# … with 40 more rows
Many of these top words are what we call stop words: words like is, the, and so that add little to our understanding of the overall topics or themes in a text. The tidytext package has a built-in dictionary of stop words, making it easy to quickly remove them from the text.
# look at the words in the stop_words dataset
data(stop_words)
stop_words %>% top_n(50)
# A tibble: 174 × 2
word lexicon
<chr> <chr>
1 i snowball
2 me snowball
3 my snowball
4 myself snowball
5 we snowball
6 our snowball
7 ours snowball
8 ourselves snowball
9 you snowball
10 your snowball
# … with 164 more rows
#remove stop words from the text
malinowski1922tidy <- words %>% anti_join(stop_words)
#look at the structure
str(malinowski1922tidy)
tibble [81,199 × 2] (S3: tbl_df/tbl/data.frame)
$ gutenberg_id: int [1:81199] 55822 55822 55822 55822 55822 55822 55822 55822 55822 55822 ...
$ word : chr [1:81199] "argonauts" "western" "pacific" "account" ...
Now we can look at the number of unique words and their counts in Malinowski, without interference from stop words.
#how many unique words are there?
length(unique(malinowski1922tidy$word))
[1] 10925
#make a table of the top words with stop words removed
malinowski1922tidy_wordcounts <- malinowski1922tidy %>% count(word, sort=T)
##look at top 50 words
malinowski1922tidy %>% count(word,sort=TRUE) %>% top_n(50) %>% mutate(word=reorder(word,n)) %>% data.frame()
word n
1 kula 932
2 magic 880
3 canoe 814
4 natives 596
5 village 405
6 spell 346
7 native 336
8 magical 307
9 canoes 304
10 island 268
11 dobu 267
12 time 253
13 called 248
14 chapter 243
15 food 228
16 sea 227
17 main 221
18 villages 219
19 words 217
20 sinaketa 208
21 beach 204
22 trobriands 199
23 chief 197
24 people 186
25 gifts 184
26 ceremonial 182
27 shell 181
28 word 180
29 district 176
30 expedition 172
31 nut 167
32 life 162
33 social 162
34 form 156
35 kitava 153
36 exchange 149
37 myth 147
38 day 143
39 sailing 143
40 found 141
41 south 140
42 means 139
43 sail 137
44 spells 137
45 islands 135
46 manner 135
47 trade 133
48 amphletts 132
49 gift 131
50 community 128
Make a plot of these top words. What do you make of these new top words?
#plot top words from the tokenized text
top50wordsplot <- malinowski1922tidy %>% count(word,sort=TRUE) %>% top_n(50) %>% mutate(word=reorder(word,n))%>% ggplot(aes(x=word,y=n))+ geom_col()+xlab(NULL)+coord_flip()+labs(y="Count",x="Unique words", title="Malinowski 1922")
top50wordsplot
Wordclouds are often avoided in scientific research because the arrangement and relative sizes of words can be misleading and hard to interpret precisely. At the same time, wordclouds can be useful in exploratory data analysis or applied research for quickly showing the main themes in a text, which can then be explored for further contextual information. Here we will make wordclouds of Malinowski's text using two different methods.
malinowski1922tidy %>% count(word) %>% with(wordcloud(word, n, max.words = 100))
Another way to make wordclouds is with the wordcloud2 package.
#install the package
#require(devtools)
#install_github("lchiffon/wordcloud2")
#load package
library(wordcloud2)
#make wordcloud. you may want to expand out the figure for the full effect.
wordcloud2(data = malinowski1922tidy_wordcounts)
We can also analyze pairs of words (bigrams). This can be useful for understanding the context around particular words as well as for identifying themes that are made up of multiple strings (e.g. “climate change”, “public health”).
bigrams<- malinowski1922 %>% unnest_tokens(output=bigrams,input=text, token="ngrams",n=2)
str(bigrams)
tibble [197,633 × 2] (S3: tbl_df/tbl/data.frame)
$ gutenberg_id: int [1:197633] 55822 55822 55822 55822 55822 55822 55822 55822 55822 55822 ...
$ bigrams : chr [1:197633] "argonauts of" "of the" "the western" "western pacific" ...
##look at counts for each pair
bigrams %>% count(bigrams, sort = TRUE) %>% top_n(20)
# A tibble: 20 × 2
bigrams n
<chr> <int>
1 <NA> 3192
2 of the 2896
3 in the 1551
4 to the 912
5 on the 721
6 and the 559
7 it is 504
8 the kula 486
9 of a 421
10 with the 418
11 the natives 405
12 to be 399
13 by the 391
14 from the 340
15 is the 330
16 the canoe 329
17 there is 282
18 that the 275
19 one of 268
20 in a 267
One challenge here is that, again, the stop words rise to the top of the frequencies. There are multiple ways we can handle this, but here we will remove any bigram where either the first or second word is a stop word.
#separate words to pull out stop words
separated_words <- bigrams %>% separate(bigrams, c("word1", "word2"), sep = " ")
#filter out stop words
malinowski_bigrams <- separated_words %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word)
Make a table of the top 100 bigrams sorted from most to least frequent.
Pull out all bigrams where “island” is the second term and make a table of the most common bigrams in this subset.
Pull out all bigrams where “canoe” is either the first or second term and make a table of the most common bigrams in this subset.
What does this analysis tell you about this text? Can you think of any data in your own research that would benefit from ngram analysis?
malinowski_bigrams_count <- malinowski_bigrams %>% count(word1, word2, sort = TRUE)
malinowski_bigrams_count %>% top_n(20)
# A tibble: 21 × 3
word1 word2 n
<chr> <chr> <int>
1 <NA> <NA> 3192
2 betel nut 80
3 coco nut 74
4 olden days 59
5 conch shell 55
6 tribal life 46
7 canoe building 45
8 kula magic 45
9 woodlark island 44
10 coco nuts 33
# … with 11 more rows
#top 100 pairs of words
bigram100 <- head(malinowski_bigrams_count, 100) %>% data.frame()
bigram100
word1 word2 n
1 <NA> <NA> 3192
2 betel nut 80
3 coco nut 74
4 olden days 59
5 conch shell 55
6 tribal life 46
7 canoe building 45
8 kula magic 45
9 woodlark island 44
10 coco nuts 33
11 canoe magic 30
12 flying witches 29
13 kula expedition 29
14 chapter ii 27
15 communal labour 26
16 southern massim 26
17 arm shells 25
18 magical rites 25
19 op cit 24
20 prow boards 22
21 spondylus shell 22
22 garden magic 20
23 key words 20
24 kula ring 20
25 native belief 20
26 inland kula 19
27 key word 18
28 kula communities 18
29 kula community 18
30 magical formul 18
31 evil magic 17
32 kula valuables 17
33 lime pot 17
34 social organisation 17
35 south coast 17
36 chapter vi 16
37 main island 16
38 sugar cane 16
39 village community 16
40 ginger root 15
41 ii division 15
42 inter tribal 15
43 nut oil 15
44 professor seligman 15
45 division ii 14
46 free translation 14
47 maternal uncle 14
48 pandanus streamers 14
49 trobriand islands 14
50 uvalaku expedition 14
51 white man's 14
52 banana leaf 13
53 beauty magic 13
54 generations ago 13
55 lime pots 13
56 native ideas 13
57 pandanus streamer 13
58 prow board 13
59 super normal 13
60 chapter xiii 12
61 conch shells 12
62 fergusson island 12
63 kula expeditions 12
64 magical rite 12
65 mwasila magic 12
66 native life 12
67 overseas expedition 12
68 port moresby 12
69 seligman op 12
70 areca nut 11
71 clay pots 11
72 dawson straits 11
73 division vi 11
74 folk lore 11
75 love magic 11
76 magical bundle 11
77 mint plant 11
78 normanby island 11
79 north west 11
80 primitive economics 11
81 trial run 11
82 turtle shell 11
83 axe blades 10
84 ceremonial distribution 10
85 cf chapter 10
86 chapter iii 10
87 chapter vii 10
88 counter gift 10
89 division iii 10
90 elder brother 10
91 fish hawk 10
92 flying canoe 10
93 kula district 10
94 kula exchange 10
95 kula gifts 10
96 lashing creeper 10
97 mental attitude 10
98 nut betel 10
99 overseas expeditions 10
100 red paint 10
##look at words that appear next to the word "island"
islandbigram <- malinowski_bigrams %>% filter(word1 == "island" | word2 == "island")
islandbigram %>% count(word1, word2, sort = TRUE) %>% top_n(20)
# A tibble: 62 × 3
word1 word2 n
<chr> <chr> <int>
1 woodlark island 44
2 main island 16
3 fergusson island 12
4 normanby island 11
5 coral island 4
6 dobu island 4
7 neighbouring island 4
8 island called 3
9 rossel island 3
10 aignan island 2
# … with 52 more rows
#just where island is the second term
islandbigram <- malinowski_bigrams %>% filter(word2 == "island")
islandbigram %>% count(word1, word2, sort = TRUE) %>% top_n(20)
# A tibble: 28 × 3
word1 word2 n
<chr> <chr> <int>
1 woodlark island 44
2 main island 16
3 fergusson island 12
4 normanby island 11
5 coral island 4
6 dobu island 4
7 neighbouring island 4
8 rossel island 3
9 aignan island 2
10 amphlett island 2
# … with 18 more rows
##look at words that appear next to the word "canoe"
canoebigram <- malinowski_bigrams %>% filter(word1 == "canoe" | word2 == "canoe")
canoebigram %>% count(word1, word2, sort = TRUE) %>% top_n(20)
# A tibble: 39 × 3
word1 word2 n
<chr> <chr> <int>
1 canoe building 45
2 canoe magic 30
3 flying canoe 10
4 canoe flies 7
5 canoe builder 5
6 masawa canoe 5
7 canoe body 4
8 canoe spells 4
9 canoe thou 4
10 chief's canoe 4
# … with 29 more rows
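The same unnest_tokens() call generalizes beyond pairs: setting n = 3 extracts trigrams. A quick sketch (not part of the original exercise), reusing the objects above:
#extract trigrams (three-word sequences) instead of bigrams
trigrams <- malinowski1922 %>% unnest_tokens(output=trigrams, input=text, token="ngrams", n=3)
#count the most frequent trigrams
trigrams %>% count(trigrams, sort=TRUE) %>% top_n(10)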
Texts often contain certain emotions, feelings, or sentiments that can tell us more about what they mean. In a way, coding text data for sentiments is similar to the qualitative research method of coding fieldnotes for themes. Because of this, you can develop your own custom lexicon for your research context. However, because this is a popular methodology, many sentiment analysis dictionaries have already been developed and publicly shared.
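If you do build a custom lexicon, it can be as simple as a small word-to-code table that you inner_join() to your tokenized text. A minimal sketch (the words and category labels below are invented for illustration, not a published coding scheme):
#a toy custom lexicon: each word is tagged with a researcher-defined code
my_lexicon <- tibble(word=c("canoe","sail","sailing","kula","gift","exchange"), code=c("seafaring","seafaring","seafaring","exchange","exchange","exchange"))
#tag the tokenized text with these codes and count them
malinowski1922tidy %>% inner_join(my_lexicon, by="word") %>% count(code, sort=TRUE)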
We’ll work with the NRC Emotion Lexicon. First, we can load the NRC lexicon and look at the different types of sentiments that it contains.
# load the nrc sentiment dictionary
get_sentiments("nrc")
# A tibble: 13,875 × 2
word sentiment
<chr> <chr>
1 abacus trust
2 abandon fear
3 abandon negative
4 abandon sadness
5 abandoned anger
6 abandoned fear
7 abandoned negative
8 abandoned sadness
9 abandonment anger
10 abandonment fear
# … with 13,865 more rows
nrcdf <- get_sentiments("nrc")
#take a look at the top sentiments that occur in the lexicon
nrcdf %>% count(sentiment,sort=T)
# A tibble: 10 × 2
sentiment n
<chr> <int>
1 negative 3318
2 positive 2308
3 fear 1474
4 anger 1246
5 trust 1230
6 sadness 1187
7 disgust 1056
8 anticipation 837
9 joy 687
10 surprise 532
Using inner_join(), we can combine the sentiments with the words from Malinowski, effectively "tagging" each word with a particular sentiment.
#merge sentiments to malinowski data
malinowski1922_sentiment <- malinowski1922tidy %>% inner_join(get_sentiments("nrc"))
With the new merged and tagged dataframe, make a table of the top words in Malinowski that are associated with the sentiment "trust" and one other sentiment of your choice. Reflect on how you might interpret these results. Do you find this information useful? Is there any place you could see sentiment analysis being useful in your own research?
# look at the top words associated with trust
malinowski1922_sentiment %>% filter(sentiment=="trust") %>% count(word,sort=T)
# A tibble: 480 × 2
word n
<chr> <int>
1 food 228
2 word 180
3 exchange 149
4 found 141
5 trade 133
6 rule 124
7 clan 104
8 account 101
9 formula 92
10 real 92
# … with 470 more rows
#pick another sentiment and pull out the top 20 words associated with it
malinowski1922_sentiment %>% filter(sentiment=="surprise") %>% count(word,sort=T) %>% top_n(20)
# A tibble: 20 × 2
word n
<chr> <int>
1 magical 307
2 shell 181
3 gift 131
4 magician 100
5 spirits 64
6 tree 55
7 finally 54
8 sorcery 46
9 death 44
10 ceremony 37
11 deal 37
12 leave 32
13 hero 30
14 remarkable 29
15 break 27
16 sun 27
17 feeling 25
18 mouth 23
19 catch 22
20 art 21
malinowski1922_sentiment %>% filter(sentiment=="sadness") %>% count(word,sort=T) %>% top_n(20)
# A tibble: 20 × 2
word n
<chr> <int>
1 shell 181
2 mother 54
3 evil 53
4 doubt 50
5 death 44
6 bad 40
7 mortuary 37
8 bottom 36
9 leave 32
10 broken 31
11 shipwreck 28
12 danger 27
13 feeling 25
14 fall 23
15 sentence 23
16 art 21
17 hut 20
18 die 19
19 disease 18
20 lie 18
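Beyond word-level tables, we can also summarize the join at the level of whole sentiments, which gives a rough overall emotional profile of the text. A quick extension of the code above (not part of the original exercise):
#count how many tagged words fall under each sentiment
malinowski1922_sentiment %>% count(sentiment, sort=T)
#or plot the totals
malinowski1922_sentiment %>% count(sentiment, sort=T) %>% ggplot(aes(x=reorder(sentiment,n), y=n)) + geom_col() + coord_flip() + labs(x="Sentiment", y="Count", title="Malinowski 1922 sentiment profile")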
Now that we've learned a bit about text analysis using Malinowski, let's test our skills on a real-world dataset. Here we will use data from a survey in two Inupiaq villages in Alaska to examine how individuals in these communities feel about climate change and thawing permafrost. These data are drawn from William B. Bowden 2013, Perceptions and implications of thawing permafrost and climate change in two Inupiaq villages of arctic Alaska. Let's examine the responses to two open-ended questions: (Q5) "What is causing it [permafrost around X village] to change?" and (Q69) "What feelings do you have when thinking about the possibility of future climate change in and around [village name]?"
First we load the data and subset out the columns of interest.
#we will work with the permafrost survey data.
surv<- read.csv("https://maddiebrown.github.io/ANTH630/data/Survey_AKP-SEL.csv", stringsAsFactors = F)
surv_subset <- surv %>% select(Village, Survey.Respondent, Age.Group, X69..Feelings, X5..PF.Cause.)
Then we can quickly calculate the most frequent terms across all 80 responses.
class(surv$X69..Feelings) #make sure your column is a character variable
[1] "character"
surv_tidy <- surv_subset %>% unnest_tokens(word, X69..Feelings) %>% anti_join(stop_words)
#what are most common words?
feelingswordcount <- surv_tidy %>% count(word,sort=T)
Make wordclouds of the word frequency in responses about feelings related to climate change using two different methods.
surv_tidy %>% count(word) %>% with(wordcloud(word, n, max.words = 100))
#wordcloud2(data = feelingswordcount)
Are there noticeable differences in responses across individuals from different sites? We can compare the responses about “What feelings do you have when thinking about the possibility of future climate change in and around [village name]?” from the permafrost survey, based on which village the respondent lives in.
#word frequency by village
surv_tidy <- surv_subset %>% unnest_tokens(word, X69..Feelings) %>% anti_join(stop_words)
#what are most common words?
surv_tidy %>% count(word,sort=T) %>% top_n(20)
word n
1 change 10
2 climate 8
3 sad 7
4 animals 6
5 cold 6
6 future 6
7 scary 6
8 weather 6
9 adapt 5
10 caribou 5
11 nc 5
12 worried 5
13 changing 4
14 dk 4
15 ground 4
16 move 4
17 people 4
18 worry 4
19 affect 3
20 blank 3
21 concerned 3
22 days 3
23 farther 3
24 feel 3
25 food 3
26 land 3
27 river 3
28 scared 3
29 time 3
30 water 3
#we can also look at the top words by village
byvillage <- surv_tidy %>% count(Village,word,sort=T) %>% ungroup()
byvillage %>% top_n(20)
Village word n
1 AKP animals 6
2 AKP change 6
3 SEL scary 6
4 AKP future 5
5 AKP nc 5
6 AKP caribou 4
7 AKP climate 4
8 AKP cold 4
9 AKP sad 4
10 AKP weather 4
11 AKP worry 4
12 SEL change 4
13 SEL climate 4
14 SEL dk 4
15 SEL ground 4
16 SEL move 4
17 AKP days 3
18 AKP people 3
19 AKP worried 3
20 SEL adapt 3
21 SEL blank 3
22 SEL river 3
23 SEL sad 3
24 SEL water 3
top_10 <- byvillage %>%
group_by(Village) %>%
top_n(10, n) %>%
ungroup() %>%
arrange(Village, desc(n))
ggplot(top_10, aes(x=reorder(word,n),y=n)) + geom_bar(stat="identity") +coord_flip() +ggtitle("Top terms by village") + labs(x="Word", y="Count") +facet_wrap(~ Village, scales = "free_y")
In addition to analyzing word and bigram frequencies, we can also analyze texts using topic modeling. Topic modeling allows us to identify themes in the text without needing to know in advance which themes or groupings we expect to emerge. This can be very useful when you have large volumes of messy data, or data from multiple diverse sources, that you need to parse. We will use Latent Dirichlet allocation (LDA), following the explanation in Text Mining with R.
Before we can identify themes across responses, however, we need to make sure each "document" (in our case, each response) has a unique identifier.
What is the primary key or unique identifier for this dataset? How do you know? Why can’t you use Survey.Respondent as a unique identifier?
Make a new primary key called “ID” that has a different value for each unique response.
surv_subset %>% select(Village, Survey.Respondent)
Village Survey.Respondent
1 AKP 1
2 AKP 2
3 AKP 3
4 AKP 4
5 AKP 5
6 AKP 6
7 AKP 7
8 AKP 8
9 AKP 9
10 AKP 10
11 AKP 11
12 AKP 12
13 AKP 13
14 AKP 14
15 AKP 15
16 AKP 16
17 AKP 17
18 AKP 18
19 AKP 19
20 AKP 20
21 AKP 21
22 AKP 22
23 AKP 23
24 AKP 24
25 AKP 25
26 AKP 26
27 AKP 27
28 AKP 28
29 AKP 29
30 AKP 30
31 AKP 31
32 AKP 32
33 AKP 33
34 AKP 34
35 AKP 35
36 AKP 36
37 AKP 37
38 AKP 38
39 AKP 39
40 SEL 1
41 SEL 2
42 SEL 3
43 SEL 4
44 SEL 5
45 SEL 6
46 SEL 7
47 SEL 8
48 SEL 9
49 SEL 10
50 SEL 11
51 SEL 12
52 SEL 13
53 SEL 14
54 SEL 15
55 SEL 16
56 SEL 17
57 SEL 18
58 SEL 19
59 SEL 20
60 SEL 21
61 SEL 22
62 SEL 23
63 SEL 24
64 SEL 25
65 SEL 26
66 SEL 27
67 SEL 28
68 SEL 29
69 SEL 30
70 SEL 31
71 SEL 32
72 SEL 33
73 SEL 34
74 SEL 35
75 SEL 36
76 SEL 37
77 SEL 38
78 SEL 39
79 SEL 40
80 SEL 41
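Because Survey.Respondent restarts at 1 in each village, the same number appears in both villages, so it cannot uniquely identify a response on its own. A quick way to confirm this programmatically (an optional check, not in the original exercise):
#values of Survey.Respondent that appear more than once are not unique identifiers
surv_subset %>% count(Survey.Respondent) %>% filter(n > 1)
#the Village / Survey.Respondent combination should return no duplicates
surv_subset %>% count(Village, Survey.Respondent) %>% filter(n > 1)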
surv_subset<- surv_subset %>% mutate(ID=paste(Village,Survey.Respondent, sep="_"))
#look at the bigrams in these responses.
surv_subset %>% unnest_tokens(output=bigrams,input=X69..Feelings, token="ngrams",n=2) %>% count(bigrams,sort=T) %>% top_n(20)
bigrams n
1 <NA> 15
2 have to 9
3 don't know 4
4 it will 4
5 need to 4
6 to adapt 4
7 to be 4
8 to move 4
9 we have 4
10 we will 4
11 able to 3
12 about it 3
13 be able 3
14 had to 3
15 higher ground 3
16 if we 3
17 scary thought 3
18 the climate 3
19 the weather 3
20 think about 3
21 to higher 3
22 used to 3
23 will have 3
24 won't be 3
#look at pairwise counts per response. how often do two words show up together in one person's response?
library(widyr)
surv_tidy %>% pairwise_count(word, Survey.Respondent,sort=T)
# A tibble: 3,322 × 3
item1 item2 n
<chr> <chr> <dbl>
1 change cold 3
2 caribou cold 3
3 change climate 3
4 cold change 3
5 climate change 3
6 adapt change 3
7 scary change 3
8 cold caribou 3
9 changing sad 3
10 sad changing 3
# … with 3,312 more rows
# so we see that "change" and "cold" appear in three responses, as do "climate" and "change". However, we previously learned that the Survey.Respondent column is not a unique identifier for the responses. Let's run the same code, but with the new ID column we created.
###let's make a new surv_tidy object that incorporates the new ID we made
surv_tidy <- surv_subset %>% unnest_tokens(word, X69..Feelings) %>% anti_join(stop_words)
surv_tidy %>% pairwise_count(word, ID,sort=T) %>% top_n(20)
# A tibble: 58 × 3
item1 item2 n
<chr> <chr> <dbl>
1 adapt change 3
2 scary change 3
3 change adapt 3
4 change scary 3
5 climate cold 2
6 change cold 2
7 weather cold 2
8 cold climate 2
9 change climate 2
10 relocate climate 2
# … with 48 more rows
#in this case the output is nearly the same, but in other cases this distinction can make a significant difference.
The first step in creating a topic model is to count the number of times each word appears in each individual document (or response, in our case). Luckily, we can count by two variables using the count() function. Let's create a new byresponse object.
byresponse <- surv_tidy %>% count(ID,word,sort=T) %>% ungroup()
#check how many responses are included in the analysis. this allows you to double check that the new unique identifier we made worked as expected.
unique(byresponse$ID)
[1] "AKP_26" "AKP_1" "AKP_17" "AKP_18" "AKP_3" "AKP_32" "AKP_5" "SEL_21"
[9] "AKP_10" "AKP_11" "AKP_12" "AKP_13" "AKP_14" "AKP_15" "AKP_16" "AKP_19"
[17] "AKP_2" "AKP_20" "AKP_22" "AKP_23" "AKP_24" "AKP_25" "AKP_27" "AKP_28"
[25] "AKP_29" "AKP_30" "AKP_31" "AKP_33" "AKP_34" "AKP_35" "AKP_36" "AKP_37"
[33] "AKP_38" "AKP_4" "AKP_6" "AKP_7" "AKP_8" "AKP_9" "SEL_1" "SEL_10"
[41] "SEL_11" "SEL_13" "SEL_14" "SEL_15" "SEL_17" "SEL_18" "SEL_19" "SEL_2"
[49] "SEL_20" "SEL_22" "SEL_24" "SEL_25" "SEL_26" "SEL_27" "SEL_29" "SEL_3"
[57] "SEL_30" "SEL_31" "SEL_32" "SEL_33" "SEL_34" "SEL_35" "SEL_36" "SEL_37"
[65] "SEL_38" "SEL_39" "SEL_4" "SEL_40" "SEL_5" "SEL_6" "SEL_7" "SEL_8"
[73] "SEL_9"
length(unique(byresponse$ID))
[1] 73
Now we can convert our long-form word list into a document-term matrix using cast_dtm(); see the help page below for more detail.
surv_dtm <- byresponse %>% cast_dtm(ID, word, n)
?cast_dtm #read up on how this function works
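It is worth glancing at the resulting object before modeling; a document-term matrix has one row per response, one column per word, and is mostly zeroes. A quick check (assuming the objects created above):
#printing the object reports the number of documents, terms, and the sparsity
surv_dtm
#peek at a small corner as a regular matrix
as.matrix(surv_dtm)[1:5, 1:5]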
Run the LDA() function and choose a number of topics (k). In this case, let's try it with k = 2.
surv_lda <- LDA(surv_dtm, k = 2, control = list(seed = 9999))
#look at our output
str(surv_lda)
Formal class 'LDA_VEM' [package "topicmodels"] with 14 slots
..@ alpha : num 35.6
..@ call : language LDA(x = surv_dtm, k = 2, control = list(seed = 9999))
..@ Dim : int [1:2] 73 210
..@ control :Formal class 'LDA_VEMcontrol' [package "topicmodels"] with 13 slots
.. .. ..@ estimate.alpha: logi TRUE
.. .. ..@ alpha : num 25
.. .. ..@ seed : int 9999
.. .. ..@ verbose : int 0
.. .. ..@ prefix : chr "/var/folders/l_/hl0qrh9535l691r67nxlvd9c0000gn/T//RtmpfCj5rL/file46556f3259ed"
.. .. ..@ save : int 0
.. .. ..@ nstart : int 1
.. .. ..@ best : logi TRUE
.. .. ..@ keep : int 0
.. .. ..@ estimate.beta : logi TRUE
.. .. ..@ var :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
.. .. .. .. ..@ iter.max: int 500
.. .. .. .. ..@ tol : num 1e-06
.. .. ..@ em :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
.. .. .. .. ..@ iter.max: int 1000
.. .. .. .. ..@ tol : num 1e-04
.. .. ..@ initialize : chr "random"
..@ k : int 2
..@ terms : chr [1:210] "days" "cold" "april" "march" ...
..@ documents : chr [1:73] "AKP_26" "AKP_1" "AKP_17" "AKP_18" ...
..@ beta : num [1:2, 1:210] -4.73 -4.75 -5.22 -3.52 -4.58 ...
..@ gamma : num [1:73, 1:2] 0.505 0.491 0.476 0.498 0.516 ...
..@ wordassignments:List of 5
.. ..$ i : int [1:330] 1 1 1 1 1 1 1 1 1 1 ...
.. ..$ j : int [1:330] 1 11 31 32 48 50 62 63 64 65 ...
.. ..$ v : num [1:330] 1 1 1 1 2 2 2 1 1 2 ...
.. ..$ nrow: int 73
.. ..$ ncol: int 210
.. ..- attr(*, "class")= chr "simple_triplet_matrix"
..@ loglikelihood : num [1:73] -64.3 -41.4 -66.3 -50.1 -49.8 ...
..@ iter : int 8
..@ logLiks : num(0)
..@ n : int 343
#examine the probability that each word is in a particular topic group
surv_topics <- tidy(surv_lda, matrix = "beta")
surv_topics
# A tibble: 420 × 3
topic term beta
<int> <chr> <dbl>
1 1 days 0.00885
2 2 days 0.00864
3 1 cold 0.00540
4 2 cold 0.0296
5 1 april 0.0102
6 2 april 0.00140
7 1 march 0.00346
8 2 march 0.00821
9 1 months 0.000975
10 2 months 0.0107
# … with 410 more rows
Examine the top words for each topic identified by the model.
top_words <- surv_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
ungroup() %>%
arrange(topic, desc(beta))
top_words
# A tibble: 20 × 3
topic term beta
<int> <chr> <dbl>
1 1 climate 0.0446
2 1 change 0.0363
3 1 animals 0.0324
4 1 sad 0.0258
5 1 adapt 0.0204
6 1 caribou 0.0196
7 1 scary 0.0180
8 1 blank 0.0173
9 1 river 0.0159
10 1 future 0.0145
11 2 cold 0.0296
12 2 weather 0.0276
13 2 worried 0.0248
14 2 nc 0.0224
15 2 change 0.0220
16 2 future 0.0205
17 2 dk 0.0181
18 2 people 0.0175
19 2 land 0.0172
20 2 scary 0.0170
We can also examine the results graphically.
#plot these top words for each topic (adapted from https://www.tidytextmining.com/topicmodeling.html)
top_words %>% group_by(topic) %>%
mutate(term = fct_reorder(term, beta)) %>% ungroup() %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") + theme_minimal()
Repeat the topic modeling analysis but using 6 topics instead of two.
##repeat the analysis but with 6 topics
surv_lda6 <- LDA(surv_dtm, k = 6, control = list(seed = 9999))
#examine the probability that each word is in a particular topic group
surv_topics6 <- tidy(surv_lda6, matrix = "beta")
surv_topics6
# A tibble: 1,260 × 3
topic term beta
<int> <chr> <dbl>
1 1 days 4.88e- 2
2 2 days 9.19e-262
3 3 days 1.69e-269
4 4 days 1.17e-270
5 5 days 5.77e-267
6 6 days 4.77e-268
7 1 cold 1.63e- 2
8 2 cold 5.81e- 2
9 3 cold 1.68e- 2
10 4 cold 4.86e-235
# … with 1,250 more rows
#examine top words for each topic
top_words6 <- surv_topics6 %>%
group_by(topic) %>%
top_n(5, beta) %>%
ungroup() %>%
arrange(topic, desc(beta))
#plot these top words for each topic (adapted from https://www.tidytextmining.com/topicmodeling.html)
top_words6 %>% group_by(topic) %>%
mutate(term = fct_reorder(term, beta)) %>% ungroup() %>%
ggplot(aes(beta, term, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") + theme_minimal()
In this case, our sample is small, so topic modeling is not necessarily the best method to use. However, even from this small sample, you can see that some topics emerge from the text that were not previously apparent.
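There is no single correct number of topics. One rough diagnostic (a sketch, not part of the original exercise) is to compare the fitted models' log-likelihoods with logLik() from topicmodels; keep in mind that on the training data alone the likelihood tends to improve as k grows, so interpretability and held-out comparisons usually matter more.
#compare the fit of the two models on the training data
logLik(surv_lda)
logLik(surv_lda6)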
Sometimes you'll need to edit text or strings manually. For example, you may find that, for your research question, you are less interested in differentiating between the terms running, run, and runner than in identifying clusters of beliefs about running as a more general concept. On the other hand, you might want to differentiate between runners and running as beliefs about a group of people versus the act of running. How you choose to transform text data depends on your research questions and your understanding of the cultural context.
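If you do decide to collapse related word forms, stemming is one option. A minimal sketch using the Porter stemmer from the SnowballC package (not loaded above; whether collapsing forms is appropriate depends on your research question):
#install.packages("SnowballC")
library(SnowballC)
#reduce words to their stems, e.g. "running" and "runs" both become "run"
wordStem(c("running","runs","run","runner"))
#the same idea applied to a tokenized text column
malinowski1922tidy %>% mutate(stem=wordStem(word)) %>% count(stem, sort=T) %>% top_n(10)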
R has a number of helpful functions for manually adjusting strings. We'll cover a few to get you started. Let's go back to the permafrost and climate change survey and look at responses to (Q5): "What is causing it [permafrost around X village] to change?"
First let’s look at the raw data. What are some potential issues in the strings below that might make text analysis difficult or ineffective?
surv$X5..PF.Cause.[10:30]
[1] "climate change"
[2] "too warm"
[3] "warmer weather, shorter winters, lack of snow, fast springs. [will affect AKP because we] use Argos to go out."
[4] "weather warming"
[5] "temperature"
[6] "heat. A lot of heat."
[7] "melting of the ground - goes down"
[8] "spirited answer. Her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[9] "probably global warming"
[10] "temperature outside is not steady"
[11] "(blank)"
[12] "N/A"
[13] "most likely war weather or global warming"
[14] "not much winter - hardly get snow. Always wind blown. Summer be rain, rain, rain. Late snow."
[15] "warmer winters"
[16] "global warming"
[17] "warming weather, longer summer/fall season"
[18] "I have no idea"
[19] "warmer climate"
[20] "Seems like there's lots of rain & water causes ground to thaw. So maybe accumulated water? Maybe warm weather?"
[21] "the heat wave - winter frosts"
Luckily, we can manually adjust the strings to make them easier to analyze systematically. For example, we might convert characters to lowercase, trim whitespace, and remove any empty or missing rows.
#make a new column to hold the tidy data
surv$cause_tidy <- surv$X5..PF.Cause.
#make lower case
surv$cause_tidy <- tolower(surv$cause_tidy)
#remove white space at beginning and end of string
surv$cause_tidy<- trimws(surv$cause_tidy)
#filter out blank or empty rows
surv<- surv %>% filter(surv$cause_tidy!="")
surv <- surv %>% filter(surv$cause_tidy!="(blank)")
surv <- surv %>% filter(surv$cause_tidy!="n/a")
We can also directly replace particular strings. Here we change some strings with typos.
surv$cause_tidy<- surv$cause_tidy %>% str_replace("wamer", "warmer")
surv$cause_tidy<- surv$cause_tidy %>% str_replace("lnoger", "longer")
Another common string data transformation involves grouping responses together into more standardized categories. You can transform cell values individually or based on exact string matches. In addition, using %like% (from data.table) we can transform any string where part of it matches a particular pattern. For example, we might decide that any time the string "warm" appears in a response, the overall theme of the response is associated with "global warming". Or, based on our ethnographic understanding of the context, we might know that "seasonal changes" are important causes of permafrost change in local cultural models. We can then look for key terms that allow us to rapidly recode multiple responses that are likely to fit in this category; in this case, "late" and "early".
#group some responses together based on the presence of a particular string
surv <- surv %>% mutate(cause_tidy=replace(cause_tidy,cause_tidy %like% "warm","global warming"))
surv$cause_tidy[1:30]
[1] "environment"
[2] "exhaust"
[3] "global warming"
[4] "global warming"
[5] "hot summers, early springs. in super cold winters the ground comes up & cracks and water comes out."
[6] "global warming"
[7] "climate change"
[8] "freezing & thawing in fall & spring"
[9] "global warming"
[10] "climate change"
[11] "global warming"
[12] "global warming"
[13] "global warming"
[14] "temperature"
[15] "heat. a lot of heat."
[16] "melting of the ground - goes down"
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"
[19] "temperature outside is not steady"
[20] "global warming"
[21] "not much winter - hardly get snow. always wind blown. summer be rain, rain, rain. late snow."
[22] "global warming"
[23] "global warming"
[24] "global warming"
[25] "i have no idea"
[26] "global warming"
[27] "global warming"
[28] "the heat wave - winter frosts"
[29] "global warming"
[30] "global warming"
surv <- surv %>% mutate(cause_tidy=replace(cause_tidy,cause_tidy %like% "early"|cause_tidy %like% "late","seasonal changes"))
surv$cause_tidy[1:30]
[1] "environment"
[2] "exhaust"
[3] "global warming"
[4] "global warming"
[5] "seasonal changes"
[6] "global warming"
[7] "climate change"
[8] "freezing & thawing in fall & spring"
[9] "global warming"
[10] "climate change"
[11] "global warming"
[12] "global warming"
[13] "global warming"
[14] "temperature"
[15] "heat. a lot of heat."
[16] "melting of the ground - goes down"
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"
[19] "temperature outside is not steady"
[20] "global warming"
[21] "seasonal changes"
[22] "global warming"
[23] "global warming"
[24] "global warming"
[25] "i have no idea"
[26] "global warming"
[27] "global warming"
[28] "the heat wave - winter frosts"
[29] "global warming"
[30] "global warming"
#compare the original with your categorizations
#surv %>% select(X5..PF.Cause.,cause_tidy)
We won't get into too much detail today, but you can also search and select string data using regular expressions. You can read more in R4DS. Here, let's use str_detect() to pull out some strings with regular expressions.
#any responses ending in "ing"
surv$cause_tidy[str_detect(surv$cause_tidy,"ing$")]
[1] "global warming" "global warming"
[3] "global warming" "freezing & thawing in fall & spring"
[5] "global warming" "global warming"
[7] "global warming" "global warming"
[9] "global warming" "global warming"
[11] "global warming" "global warming"
[13] "global warming" "global warming"
[15] "global warming" "global warming"
[17] "global warming" "global warming"
[19] "global warming" "global warming"
[21] "global warming" "global warming"
[23] "global warming" "global warming"
[25] "global warming" "global warming"
[27] "global warming" "global warming"
[29] "global warming" "global warming"
[31] "global warming" "global warming"
[33] "global warming" "global warming"
[35] "global warming"
#any responses that contain a 'w' followed by either an 'e' or an 'a'
surv$cause_tidy[str_detect(surv$cause_tidy,"w[ea]")]
[1] "global warming"
[2] "global warming"
[3] "global warming"
[4] "global warming"
[5] "global warming"
[6] "global warming"
[7] "global warming"
[8] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[9] "global warming"
[10] "global warming"
[11] "global warming"
[12] "global warming"
[13] "global warming"
[14] "global warming"
[15] "global warming"
[16] "the heat wave - winter frosts"
[17] "global warming"
[18] "global warming"
[19] "weather"
[20] "global warming"
[21] "global warming"
[22] "global warming"
[23] "global warming"
[24] "global warming"
[25] "weather."
[26] "global warming"
[27] "global warming"
[28] "global warming"
[29] "global warming"
[30] "global warming"
[31] "global warming"
[32] "global warming"
[33] "global warming"
[34] "global warming"
[35] "global warming"
[36] "global warming"
[37] "global warming"
[38] "global warming"
#any responses that contain the string erosion
surv$cause_tidy[str_detect(surv$cause_tidy,"erosion")]
[1] "erosion, and real hot summers and a lot of snow & rain."
[2] "mud goes down river, cracking all along & falling in - erosion"
[3] "ground erosion"
[4] "erosion"
# any responses that contain the string erosion, but which have any character occurring before the word erosion.
surv$cause_tidy[str_detect(surv$cause_tidy,".erosion")]
[1] "mud goes down river, cracking all along & falling in - erosion"
[2] "ground erosion"
Regular expressions are enormously useful for quickly searching through and transforming large volumes of string data. We've only scratched the surface today.
Whenever you transform large volumes of data using string detection and regular expressions, it is critical to double check that each operation is in fact working as you expected. Paying attention to the order of transformations is also important so that you do not overwrite earlier transformations.
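One concrete way to audit a batch of replacements is to cross-tabulate the original responses against the recoded categories (a quick check, echoing the commented-out select() above), and to preview what a pattern matches before replacing anything:
#see which original responses were mapped into each tidy category
surv %>% count(cause_tidy, X5..PF.Cause., sort=T) %>% head(20)
#preview exactly which strings a pattern matches before recoding them
str_subset(surv$X5..PF.Cause., regex("warm", ignore_case=TRUE))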
Sometimes it is useful to create flags or indicator variables in your data; str_detect() makes this easy. These allow you to quickly filter out rows that have particular characteristics. For example, we can create a new binary column that indicates whether or not a response refers to global warming. This variable can then be used for further grouping, data visualization, or other tasks.
surv <- surv %>% mutate(GlobalWarmingYN=str_detect(cause_tidy,"global warming"))
table(surv$GlobalWarmingYN) # how many responses contain the string global warming?
FALSE TRUE
37 34
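The flag can then feed directly into grouping or plotting, for example a quick comparison of how often responses mention global warming in each village (a sketch):
#cross-tabulate the flag by village
surv %>% count(Village, GlobalWarmingYN)
#or plot it
surv %>% count(Village, GlobalWarmingYN) %>% ggplot(aes(x=Village, y=n, fill=GlobalWarmingYN)) + geom_col(position="dodge") + labs(y="Number of responses")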
We can also tag the parts of speech in a text. This allows us to focus an analysis on verbs, nouns, or other parts of speech that may be of interest. For example, in a study on sentiments, we might want to pull out adjectives in order to understand how people feel or describe a particular phenomenon. On the other hand, we might also pull out verbs in order to understand the types of actions people describe as associated with certain cultural practices or beliefs. Let’s tag the parts of speech in Malinowski 1922 to learn more about the places and cultural practices documented in this book.
Using the cnlp_annotate() function, we can tag the parts of speech in Malinowski 1922. This function can take a long time to run. This is the last thing we will do today, so feel free to let it run and then take a break and come back to finish these problems.
Make a new object using only the token part of the output from cnlp_annotate() and then examine the $upos column. What are all the unique parts of speech in this dataset?
Select and examine the top 30 nouns and verbs in this dataset. Do any of the terms surprise you? How might this level of analysis of the text be meaningful for your interpretation of its themes?
library(cleanNLP)
cnlp_init_udpipe()
#tag parts of speech. takes a long time
malinowksiannotatedtext <- cnlp_annotate(malinowski1922tidy$word)
str(malinowksiannotatedtext) # look at the structure. because it is a list we have to pull out that particular section of the list
List of 2
$ token : tibble [81,939 × 11] (S3: tbl_df/tbl/data.frame)
..$ doc_id : int [1:81939] 1 2 3 4 5 6 7 8 9 10 ...
..$ sid : int [1:81939] 1 1 1 1 1 1 1 1 1 1 ...
..$ tid : chr [1:81939] "1" "1" "1" "1" ...
..$ token : chr [1:81939] "argonauts" "western" "pacific" "account" ...
..$ token_with_ws: chr [1:81939] "argonauts" "western" "pacific" "account" ...
..$ lemma : chr [1:81939] "argonaut" "western" "pacific" "account" ...
..$ upos : chr [1:81939] "NOUN" "ADJ" "ADJ" "NOUN" ...
..$ xpos : chr [1:81939] "NNS" "JJ" "JJ" "NN" ...
..$ feats : chr [1:81939] "Number=Plur" "Degree=Pos" "Degree=Pos" "Number=Sing" ...
..$ tid_source : chr [1:81939] "0" "0" "0" "0" ...
..$ relation : chr [1:81939] "root" "root" "root" "root" ...
$ document:'data.frame': 81199 obs. of 1 variable:
..$ doc_id: int [1:81199] 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, "class")= chr [1:2] "cnlp_annotation" "list"
malinowskiannotatedtextfull <-data.frame(malinowksiannotatedtext$token)
str(malinowskiannotatedtextfull)
'data.frame': 81939 obs. of 11 variables:
$ doc_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ sid : int 1 1 1 1 1 1 1 1 1 1 ...
$ tid : chr "1" "1" "1" "1" ...
$ token : chr "argonauts" "western" "pacific" "account" ...
$ token_with_ws: chr "argonauts" "western" "pacific" "account" ...
$ lemma : chr "argonaut" "western" "pacific" "account" ...
$ upos : chr "NOUN" "ADJ" "ADJ" "NOUN" ...
$ xpos : chr "NNS" "JJ" "JJ" "NN" ...
$ feats : chr "Number=Plur" "Degree=Pos" "Degree=Pos" "Number=Sing" ...
$ tid_source : chr "0" "0" "0" "0" ...
$ relation : chr "root" "root" "root" "root" ...
# what are all the different parts of speech that have been tagged?
unique(malinowskiannotatedtextfull$upos)
[1] "NOUN" "ADJ" "X" "NUM" "VERB" "ADV" "PRON" "PART" "INTJ"
[10] "PROPN" "AUX" "SYM" "SCONJ" "ADP" "DET"
#verb analysis. first look at some of the verbs that occur in the book
#malinowskiannotatedtextfull %>% filter(upos=="VERB") %>% select(token,lemma) %>% data.frame() %>% top_n(30)
#top 30 verbs
malinowskiannotatedtextfull %>% filter(upos=="VERB") %>% count(token, sort=T) %>% top_n(30)
token n
1 called 248
2 found 141
3 means 139
4 carried 105
5 mentioned 101
6 meaning 99
7 performed 98
8 brought 91
9 obtained 89
10 flying 81
11 objects 76
12 makes 69
13 received 68
14 repeated 66
15 told 66
16 spoken 65
17 takes 61
18 left 60
19 bring 57
20 receive 56
21 consists 55
22 giving 55
23 visit 53
24 carry 52
25 eat 52
26 log 52
27 remain 52
28 connected 51
29 beginning 50
30 leaves 50
#what are the top 30 nouns?
malinowskiannotatedtextfull %>% filter(upos=="NOUN") %>% count(lemma, sort=T) %>% top_n(30) %>% data.frame()
lemma n
1 canoe 1118
2 kula 942
3 magic 880
4 village 624
5 native 596
6 spell 484
7 island 403
8 word 402
9 time 316
10 gift 315
11 sail 307
12 trobriand 307
13 chief 301
14 day 267
15 expedition 265
16 sea 257
17 form 254
18 chapter 243
19 shell 241
20 rite 232
21 food 228
22 beach 209
23 sinaketa 208
24 woman 208
25 myth 198
26 partner 194
27 people 189
28 rule 184
29 life 180
30 district 176
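The introduction to part-of-speech tagging above also mentioned adjectives; the same filter-and-count pattern applies. A quick extension of the exercise:
#top 30 adjectives, using the lemmatized form
malinowskiannotatedtextfull %>% filter(upos=="ADJ") %>% count(lemma, sort=T) %>% top_n(30)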