Text mining with R

This week we learn how to work with text data in R. We will learn how to turn documents into word lists, analyze frequency counts, extract bigrams, analyze sentiment and parts of speech, and visualize text analyses.

Analyzing Malinowski

To get started, we will analyze the classic Malinowski (1922) text Argonauts of the Western Pacific. The text can be downloaded from Project Gutenberg, or for simplicity, we can download the text directly using the gutenbergr package.

First, let's load all of the libraries we will be using today.

# install.packages('gutenbergr')
library(gutenbergr)
gutenberg_metadata
# A tibble: 51,997 x 8
   gutenberg_id title author gutenberg_autho… language gutenberg_books… rights
          <int> <chr> <chr>             <int> <chr>    <chr>            <chr> 
 1            0  <NA> <NA>                 NA en       <NA>             Publi…
 2            1 "The… Jeffe…             1638 en       United States L… Publi…
 3            2 "The… Unite…                1 en       American Revolu… Publi…
 4            3 "Joh… Kenne…             1666 en       <NA>             Publi…
 5            4 "Lin… Linco…                3 en       US Civil War     Publi…
 6            5 "The… Unite…                1 en       American Revolu… Publi…
 7            6 "Giv… Henry…                4 en       American Revolu… Publi…
 8            7 "The… <NA>                 NA en       <NA>             Publi…
 9            8 "Abr… Linco…                3 en       US Civil War     Publi…
10            9 "Abr… Linco…                3 en       US Civil War     Publi…
# … with 51,987 more rows, and 1 more variable: has_text <lgl>
library(tidyverse)
library(wordcloud)
library(tidytext)
library(stringr)
library(topicmodels)
library(data.table)
library(cleanNLP)
cnlp_init_udpipe()
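
If you do not already know a book's ID, you can search the gutenberg_metadata tibble directly. A quick sketch (the exact title string used for matching is an assumption about how the book appears in the catalog):

# find the gutenberg_id for Malinowski's book by searching the catalog titles
gutenberg_metadata %>% filter(str_detect(title, "Argonauts of the Western Pacific"))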

Download the Malinowski book text and examine the structure. How are the data organized?

malinowski1922 <- gutenberg_download(55822)
str(malinowski1922)
tibble [22,219 × 2] (S3: tbl_df/tbl/data.frame)
 $ gutenberg_id: int [1:22219] 55822 55822 55822 55822 55822 55822 55822 55822 55822 55822 ...
 $ text        : chr [1:22219] "                    ARGONAUTS OF THE WESTERN PACIFIC" "" "                          An Account of Native" "                        Enterprise and Adventure" ...

Analyzing individual word frequencies

One of the first ways we can explore a text is by looking at word frequencies. With multiple samples from different people or sites, comparing word frequencies can reveal differences across populations, while within a single text, word frequencies can highlight key issues, people, or places.

Try it

Using the unnest_tokens() function, extract the individual words from Malinowski and create a table of the top words sorted by count. What do you notice about the top words? Why do you think these words appear at the top of the list?

Click for solution

## make into individual words
words <- malinowski1922 %>% unnest_tokens(output = word, input = text)
# notice that this also converts to lowercase and removes punctuation

# look at the top 50 words in the document
words %>% count(word, sort = T) %>% top_n(50)
# A tibble: 50 x 2
   word      n
   <chr> <int>
 1 the   18607
 2 of    10254
 3 and    6625
 4 in     5296
 5 to     4950
 6 a      4732
 7 is     3554
 8 it     2070
 9 as     1795
10 on     1746
# … with 40 more rows

Stop words

Many of these top words are what we call stop words: words such as is, the, and so that add little to our understanding of the overall topics or themes in a text. The tidytext package has a built-in dictionary of stop words, making it easy to quickly remove them from the text.

# look at the words in the stop_words dataset
data(stop_words)
stop_words %>% top_n(50)
# A tibble: 174 x 2
   word      lexicon 
   <chr>     <chr>   
 1 i         snowball
 2 me        snowball
 3 my        snowball
 4 myself    snowball
 5 we        snowball
 6 our       snowball
 7 ours      snowball
 8 ourselves snowball
 9 you       snowball
10 your      snowball
# … with 164 more rows
# remove stop words from the text
malinowski1922tidy <- words %>% anti_join(stop_words)
# look at the structure
str(malinowski1922tidy)
tibble [81,199 × 2] (S3: tbl_df/tbl/data.frame)
 $ gutenberg_id: int [1:81199] 55822 55822 55822 55822 55822 55822 55822 55822 55822 55822 ...
 $ word        : chr [1:81199] "argonauts" "western" "pacific" "account" ...

Now we can look at the number of unique words and their counts in Malinowski, without interference from stop words.

# how many unique words are there?
length(unique(malinowski1922tidy$word))
[1] 10925
# make a table of the top words with stop words removed
malinowski1922tidy_wordcounts <- malinowski1922tidy %>% count(word, sort = T)

## look at top 50 words
malinowski1922tidy %>% count(word, sort = TRUE) %>% top_n(50) %>% mutate(word = reorder(word, 
    n)) %>% data.frame()
         word   n
1        kula 932
2       magic 880
3       canoe 814
4     natives 596
5     village 405
6       spell 346
7      native 336
8     magical 307
9      canoes 304
10     island 268
11       dobu 267
12       time 253
13     called 248
14    chapter 243
15       food 228
16        sea 227
17       main 221
18   villages 219
19      words 217
20   sinaketa 208
21      beach 204
22 trobriands 199
23      chief 197
24     people 186
25      gifts 184
26 ceremonial 182
27      shell 181
28       word 180
29   district 176
30 expedition 172
31        nut 167
32       life 162
33     social 162
34       form 156
35     kitava 153
36   exchange 149
37       myth 147
38        day 143
39    sailing 143
40      found 141
41      south 140
42      means 139
43       sail 137
44     spells 137
45    islands 135
46     manner 135
47      trade 133
48  amphletts 132
49       gift 131
50  community 128

Make a plot of these top words. What do you make of these new top words?

# plot top words from the tokenized text
top50wordsplot <- malinowski1922tidy %>% count(word, sort = TRUE) %>% top_n(50) %>% 
    mutate(word = reorder(word, n)) %>% ggplot(aes(x = word, y = n)) + geom_col() + 
    xlab(NULL) + coord_flip() + labs(y = "Count", x = "Unique words", title = "Malinowski 1922")
top50wordsplot

Wordclouds

Wordclouds are often avoided in scientific research because the arrangement and sizing of words can be misleading and difficult to interpret. At the same time, wordclouds can be useful in exploratory data analysis or applied research for quickly showing the main themes in a text, which can then be explored for further contextual information. Here we will make wordclouds of Malinowski's text using two different methods.

malinowski1922tidy %>% count(word) %>% with(wordcloud(word, n, max.words = 100))

Another way to make wordclouds is with the wordcloud2 package.

# install the package:
# require(devtools)
# install_github('lchiffon/wordcloud2')
# load package
library(wordcloud2)
# make wordcloud. you may want to expand out the figure for the full effect.
wordcloud2(data = malinowski1922tidy_wordcounts)

Analyzing pairs of words

We can also analyze pairs of words (bigrams). This can be useful for understanding the context around particular words as well as for identifying themes that are made up of multiple strings (e.g. "climate change", "public health").

bigrams <- malinowski1922 %>% unnest_tokens(output = bigrams, input = text, token = "ngrams", 
    n = 2)
str(bigrams)
tibble [213,666 × 2] (S3: tbl_df/tbl/data.frame)
 $ gutenberg_id: int [1:213666] 55822 55822 55822 55822 55822 55822 55822 55822 55822 55822 ...
 $ bigrams     : chr [1:213666] "argonauts of" "of the" "the western" "western pacific" ...
## look at counts for each pair
bigrams %>% count(bigrams, sort = TRUE) %>% top_n(20)
# A tibble: 20 x 2
   bigrams         n
   <chr>       <int>
 1 of the       3078
 2 in the       1647
 3 to the        968
 4 on the        763
 5 and the       596
 6 it is         540
 7 the kula      523
 8 of a          447
 9 with the      447
10 the natives   446
11 by the        424
12 to be         423
13 from the      361
14 the canoe     361
15 is the        351
16 there is      302
17 that the      295
18 one of        287
19 all the       284
20 the same      282

One challenge here is that, once again, stop words rise to the top of the frequency counts. There are multiple ways we can handle this, but here we will remove any bigrams where either the first or second word is a stop word.

# separate words to pull out stop words
separated_words <- bigrams %>% separate(bigrams, c("word1", "word2"), sep = " ")
# filter out stop words
malinowski_bigrams <- separated_words %>% filter(!word1 %in% stop_words$word) %>% 
    filter(!word2 %in% stop_words$word)

Try it

  1. Make a table of the top 100 bigrams sorted from most to least frequent.

  2. Pull out all bigrams where "island" is the second term and make a table of the most common bigrams in this subset.

  3. Pull out all bigrams where "canoe" is either the first or second term and make a table of the most common bigrams in this subset.

  4. What does this analysis tell you about this text? Can you think of any data in your own research that would benefit from ngram analysis?

Click for solution

malinowski_bigrams_count <- malinowski_bigrams %>% count(word1, word2, sort = TRUE)
malinowski_bigrams_count %>% top_n(20)
# A tibble: 20 x 3
   word1    word2          n
   <chr>    <chr>      <int>
 1 betel    nut           81
 2 coco     nut           74
 3 olden    days          61
 4 conch    shell         60
 5 tribal   life          48
 6 canoe    building      47
 7 woodlark island        47
 8 kula     magic         46
 9 kula     expedition    37
10 coco     nuts          33
11 southern massim        32
12 communal labour        31
13 flying   witches       31
14 canoe    magic         30
15 chapter  ii            28
16 magical  rites         27
17 arm      shells        25
18 op       cit           24
19 kula     community     23
20 kula     ring          23

# top 100 pairs of words
bigram100 <- head(malinowski_bigrams_count, 100) %>% data.frame()
bigram100
          word1        word2  n
1         betel          nut 81
2          coco          nut 74
3         olden         days 61
4         conch        shell 60
5        tribal         life 48
6         canoe     building 47
7      woodlark       island 47
8          kula        magic 46
9          kula   expedition 37
10         coco         nuts 33
11     southern       massim 32
12     communal       labour 31
13       flying      witches 31
14        canoe        magic 30
15      chapter           ii 28
16      magical        rites 27
17          arm       shells 25
18           op          cit 24
19         kula    community 23
20         kula         ring 23
21       inland         kula 22
22         prow       boards 22
23    spondylus        shell 22
24       garden        magic 21
25       native       belief 21
26          key        words 20
27         kula  communities 20
28      magical       formul 20
29    professor     seligman 20
30       social organisation 20
31        south        coast 19
32         evil        magic 18
33          key         word 18
34         lime          pot 18
35         main       island 18
36      village    community 18
37         kula    valuables 17
38          nut          oil 17
39        sugar         cane 17
40      chapter           vi 16
41       ginger         root 16
42    trobriand      islands 16
43      uvalaku   expedition 16
44        white        man's 16
45         free  translation 15
46           ii     division 15
47        inter       tribal 15
48     maternal        uncle 15
49       native        ideas 15
50       banana         leaf 14
51       beauty        magic 14
52      chapter         xiii 14
53     division           ii 14
54  generations          ago 14
55         lime         pots 14
56       native         life 14
57     pandanus    streamers 14
58       dawson      straits 13
59    fergusson       island 13
60       mental     attitude 13
61     pandanus     streamer 13
62         port      moresby 13
63         prow        board 13
64        super       normal 13
65        black        magic 12
66         clay         pots 12
67        conch       shells 12
68     division          iii 12
69       flying        canoe 12
70         kula  expeditions 12
71      magical         rite 12
72         mint        plant 12
73      mwasila        magic 12
74     overseas   expedition 12
75     overseas  expeditions 12
76    primitive    economics 12
77     seligman           op 12
78    ancestral      spirits 11
79        areca          nut 11
80          axe       blades 11
81      chapter          iii 11
82      chapter          vii 11
83     division           vi 11
84         folk         lore 11
85         kula     articles 11
86         kula     district 11
87         kula     exchange 11
88         kula        gifts 11
89      lashing      creeper 11
90         love        magic 11
91      magical       bundle 11
92      magical      formula 11
93      mwasila         kula 11
94     normanby       island 11
95        north         west 11
96          red        paint 11
97       return      journey 11
98     southern       boyowa 11
99   systematic        magic 11
100       trial          run 11

## look at words that appear next to the word 'island'
islandbigram <- malinowski_bigrams %>% filter(word1 == "island" | word2 == "island")
islandbigram %>% count(word1, word2, sort = TRUE) %>% top_n(20)
# A tibble: 70 x 3
   word1        word2      n
   <chr>        <chr>  <int>
 1 woodlark     island    47
 2 main         island    18
 3 fergusson    island    13
 4 normanby     island    11
 5 coral        island     5
 6 neighbouring island     5
 7 dobu         island     4
 8 rossel       island     4
 9 island       called     3
10 aignan       island     2
# … with 60 more rows

# just where island is the second term
islandbigram <- malinowski_bigrams %>% filter(word2 == "island")
islandbigram %>% count(word1, word2, sort = TRUE) %>% top_n(20)
# A tibble: 32 x 3
   word1        word2      n
   <chr>        <chr>  <int>
 1 woodlark     island    47
 2 main         island    18
 3 fergusson    island    13
 4 normanby     island    11
 5 coral        island     5
 6 neighbouring island     5
 7 dobu         island     4
 8 rossel       island     4
 9 aignan       island     2
10 amphlett     island     2
# … with 22 more rows

## look at words that appear next to the word 'canoe'
canoebigram <- malinowski_bigrams %>% filter(word1 == "canoe" | word2 == "canoe")
canoebigram %>% count(word1, word2, sort = TRUE) %>% top_n(20)
# A tibble: 20 x 3
   word1    word2        n
   <chr>    <chr>    <int>
 1 canoe    building    47
 2 canoe    magic       30
 3 flying   canoe       12
 4 canoe    flies        7
 5 native   canoe        6
 6 canoe    builder      5
 7 canoe    spells       5
 8 masawa   canoe        5
 9 canoe    body         4
10 canoe    thou         4
11 chief's  canoe        4
12 kudayuri canoe        4
13 canoe    anchored     3
14 canoe    belongs      3
15 canoe    fleet        3
16 canoe    flew         3
17 canoe    makes        3
18 canoe    myth         3
19 canoe    ready        3
20 canoe    speed        3

Sentiment analysis

Texts often contain certain emotions, feelings, or sentiments that can tell us more about what they mean. In a way, coding text data for sentiments is similar to the qualitative research method of coding fieldnotes for themes. Because of this, you can develop your own custom lexicon for your research context. However, because this is a popular methodology, many existing sentiment analysis dictionaries have been developed and publicly shared.
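
If you did build your own lexicon, a minimal sketch might look like the following. The words and sentiment labels here are hypothetical examples chosen for illustration, not an established dictionary:

# a small custom lexicon as a two-column tibble (hypothetical entries)
my_lexicon <- tibble(word = c("kula", "sorcery", "gift", "feast"),
    sentiment = c("exchange", "ritual", "exchange", "ritual"))
# tag the tokenized text with the custom sentiments using a join,
# just as we do with the NRC lexicon below
malinowski1922tidy %>% inner_join(my_lexicon) %>% count(sentiment, sort = TRUE)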

We'll work with the NRC Emotion Lexicon. First, we can load the NRC lexicon and look at the different types of sentiments that it contains.

# load the nrc sentiment dictionary
get_sentiments("nrc")
# A tibble: 13,901 x 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# … with 13,891 more rows
nrcdf <- get_sentiments("nrc")
# take a look at the top sentiments that occur in the lexicon
nrcdf %>% count(sentiment, sort = T)
# A tibble: 10 x 2
   sentiment        n
   <chr>        <int>
 1 negative      3324
 2 positive      2312
 3 fear          1476
 4 anger         1247
 5 trust         1231
 6 sadness       1191
 7 disgust       1058
 8 anticipation   839
 9 joy            689
10 surprise       534

Using inner_join() we can combine the sentiments with the words from Malinowski, effectively "tagging" each word with a particular sentiment.

# merge sentiments to malinowski data
malinowski1922_sentiment <- malinowski1922tidy %>% inner_join(get_sentiments("nrc"))

Try it

With the new merged and tagged dataframe, make a table of the top words in Malinowski that are associated with the sentiment "trust" and one other sentiment of your choice. Reflect on how you might interpret these results. Do you find this information useful? Is there any place you could see sentiment analysis being useful in your own research?

Click for solution

# look at the top words associated with trust
malinowski1922_sentiment %>% filter(sentiment == "trust") %>% count(word, sort = T)
# A tibble: 481 x 2
   word         n
   <chr>    <int>
 1 food       228
 2 word       180
 3 exchange   149
 4 found      141
 5 trade      133
 6 rule       124
 7 clan       104
 8 account    101
 9 formula     92
10 real        92
# … with 471 more rows

# pick another sentiment and pull out the top 20 words associated with this
# sentiment.
malinowski1922_sentiment %>% filter(sentiment == "surprise") %>% count(word, sort = T) %>% 
    top_n(20)
# A tibble: 20 x 2
   word           n
   <chr>      <int>
 1 magical      307
 2 shell        181
 3 gift         131
 4 magician     100
 5 spirits       64
 6 tree          55
 7 finally       54
 8 sorcery       46
 9 death         44
10 ceremony      37
11 deal          37
12 leave         32
13 hero          30
14 remarkable    29
15 break         27
16 sun           27
17 feeling       25
18 mouth         23
19 catch         22
20 art           21
malinowski1922_sentiment %>% filter(sentiment == "sadness") %>% count(word, sort = T) %>% 
    top_n(20)
# A tibble: 21 x 2
   word         n
   <chr>    <int>
 1 shell      181
 2 mother      54
 3 evil        53
 4 doubt       50
 5 black       45
 6 death       44
 7 bad         40
 8 mortuary    37
 9 bottom      36
10 leave       32
# … with 11 more rows

Case study: Permafrost and climate change survey

Now that we've learned a bit about text analysis using Malinowski, let's test our skills on a real-world dataset. Here we will use data from a survey in two Inupiaq villages in Alaska to examine how individuals in these communities feel about climate change and thawing permafrost. These data are drawn from William B. Bowden (2013), Perceptions and implications of thawing permafrost and climate change in two Inupiaq villages of arctic Alaska. We will examine the responses to two open-ended questions: (Q5) "What is causing it [permafrost around X village] to change?" and (Q69) "What feelings do you have when thinking about the possibility of future climate change in and around [village name]?"

First we load the data and subset out the columns of interest.

# we will work with the permafrost survey data.
surv <- read.csv("https://maddiebrown.github.io/ANTH630/data/Survey_AKP-SEL.csv", 
    stringsAsFactors = F)
surv_subset <- surv %>% select(Village, Survey.Respondent, Age.Group, X69..Feelings, 
    X5..PF.Cause.)

Then we can quickly calculate the most frequent terms across all 80 responses.

class(surv$X69..Feelings)  #make sure your column is a character variable
[1] "character"
surv_tidy <- surv_subset %>% unnest_tokens(word, X69..Feelings) %>% anti_join(stop_words)
# what are most common words?
feelingswordcount <- surv_tidy %>% count(word, sort = T)

Try it

Make wordclouds of the word frequency in responses about feelings related to climate change using two different methods.

Click for solution

surv_tidy %>% count(word) %>% with(wordcloud(word, n, max.words = 100))

# wordcloud2(data = feelingswordcount)

Comparing word frequency across samples

Are there noticeable differences in responses across individuals from different sites? We can compare the responses about "What feelings do you have when thinking about the possibility of future climate change in and around [village name]?" from the permafrost survey, based on which village the respondent lives in.

# word frequency by village
surv_tidy <- surv_subset %>% unnest_tokens(word, X69..Feelings) %>% anti_join(stop_words)

# what are most common words?
surv_tidy %>% count(word, sort = T) %>% top_n(20)
        word  n
1     change 10
2    climate  8
3        sad  7
4    animals  6
5       cold  6
6     future  6
7      scary  6
8    weather  6
9      adapt  5
10   caribou  5
11        nc  5
12   worried  5
13  changing  4
14        dk  4
15    ground  4
16      move  4
17    people  4
18     worry  4
19    affect  3
20     blank  3
21 concerned  3
22      days  3
23   farther  3
24      feel  3
25      food  3
26      land  3
27     river  3
28    scared  3
29      time  3
30     water  3

# we can also look at the top words by village
byvillage <- surv_tidy %>% count(Village, word, sort = T) %>% ungroup()
byvillage %>% top_n(20)
   Village    word n
1      AKP animals 6
2      AKP  change 6
3      SEL   scary 6
4      AKP  future 5
5      AKP      nc 5
6      AKP caribou 4
7      AKP climate 4
8      AKP    cold 4
9      AKP     sad 4
10     AKP weather 4
11     AKP   worry 4
12     SEL  change 4
13     SEL climate 4
14     SEL      dk 4
15     SEL  ground 4
16     SEL    move 4
17     AKP    days 3
18     AKP  people 3
19     AKP worried 3
20     SEL   adapt 3
21     SEL   blank 3
22     SEL   river 3
23     SEL     sad 3
24     SEL   water 3

top_10 <- byvillage %>% group_by(Village) %>% top_n(10, n) %>% ungroup() %>% arrange(Village, 
    desc(n))

ggplot(top_10, aes(x = reorder(word, n), y = n)) + geom_bar(stat = "identity") + 
    coord_flip() + ggtitle("Top terms by village") + labs(x = "Word", y = "Count") + 
    facet_wrap(~Village, scales = "free_y")

Topic Modeling

In addition to analyzing word and bigram frequencies, we can also analyze texts using topic modeling. Topic modeling allows us to identify themes in a text without needing to know in advance which themes or groupings we expect to emerge. This can be very useful when you have large volumes of messy data, or data from multiple diverse sources, that you need to parse. We will use Latent Dirichlet allocation (LDA), following the explanation in Text Mining with R.

Before we can identify themes across responses however, we need to make sure each "document" or "response" has a unique identifier.

Try it

What is the primary key or unique identifier for this dataset? How do you know? Why can't you use Survey.Respondent as a unique identifier?

Make a new primary key called "ID" that has a different value for each unique response.

Click for solution

surv_subset %>% select(Village, Survey.Respondent)
   Village Survey.Respondent
1      AKP                 1
2      AKP                 2
3      AKP                 3
4      AKP                 4
5      AKP                 5
6      AKP                 6
7      AKP                 7
8      AKP                 8
9      AKP                 9
10     AKP                10
11     AKP                11
12     AKP                12
13     AKP                13
14     AKP                14
15     AKP                15
16     AKP                16
17     AKP                17
18     AKP                18
19     AKP                19
20     AKP                20
21     AKP                21
22     AKP                22
23     AKP                23
24     AKP                24
25     AKP                25
26     AKP                26
27     AKP                27
28     AKP                28
29     AKP                29
30     AKP                30
31     AKP                31
32     AKP                32
33     AKP                33
34     AKP                34
35     AKP                35
36     AKP                36
37     AKP                37
38     AKP                38
39     AKP                39
40     SEL                 1
41     SEL                 2
42     SEL                 3
43     SEL                 4
44     SEL                 5
45     SEL                 6
46     SEL                 7
47     SEL                 8
48     SEL                 9
49     SEL                10
50     SEL                11
51     SEL                12
52     SEL                13
53     SEL                14
54     SEL                15
55     SEL                16
56     SEL                17
57     SEL                18
58     SEL                19
59     SEL                20
60     SEL                21
61     SEL                22
62     SEL                23
63     SEL                24
64     SEL                25
65     SEL                26
66     SEL                27
67     SEL                28
68     SEL                29
69     SEL                30
70     SEL                31
71     SEL                32
72     SEL                33
73     SEL                34
74     SEL                35
75     SEL                36
76     SEL                37
77     SEL                38
78     SEL                39
79     SEL                40
80     SEL                41
surv_subset <- surv_subset %>% mutate(ID = paste(Village, Survey.Respondent, sep = "_"))

Frequency of word pairs per response

# look at the bigrams in these responses.
surv_subset %>% unnest_tokens(output = bigrams, input = X69..Feelings, token = "ngrams", 
    n = 2) %>% count(bigrams, sort = T) %>% top_n(20)
         bigrams  n
1           <NA> 15
2        have to  9
3     don't know  4
4        it will  4
5        need to  4
6       to adapt  4
7          to be  4
8        to move  4
9        we have  4
10       we will  4
11       able to  3
12      about it  3
13       be able  3
14        had to  3
15 higher ground  3
16         if we  3
17 scary thought  3
18   the climate  3
19   the weather  3
20   think about  3
21     to higher  3
22       used to  3
23     will have  3
24      won't be  3

# look at pairwise counts per response. how often do two words show up together
# in one person's response?
library(widyr)
surv_tidy %>% pairwise_count(word, Survey.Respondent, sort = T)
# A tibble: 3,322 x 3
   item1    item2        n
   <chr>    <chr>    <dbl>
 1 change   cold         3
 2 caribou  cold         3
 3 change   climate      3
 4 cold     change       3
 5 climate  change       3
 6 adapt    change       3
 7 scary    change       3
 8 cold     caribou      3
 9 changing sad          3
10 sad      changing     3
# … with 3,312 more rows
# so we see that 'change' and 'cold' appear in three responses, as do climate and
# change. however, we previously learned that the Survey.Respondent column is not
# a unique identifier for the responses. Let's run the same code, but with the
# new ID column we created.

### let's make a new surv_tidy object that incorporates the new ID we made
surv_tidy <- surv_subset %>% unnest_tokens(word, X69..Feelings) %>% anti_join(stop_words)
surv_tidy %>% pairwise_count(word, ID, sort = T) %>% top_n(20)
# A tibble: 58 x 3
   item1    item2       n
   <chr>    <chr>   <dbl>
 1 adapt    change      3
 2 scary    change      3
 3 change   adapt       3
 4 change   scary       3
 5 climate  cold        2
 6 change   cold        2
 7 weather  cold        2
 8 cold     climate     2
 9 change   climate     2
10 relocate climate     2
# … with 48 more rows
# in this case the output is nearly the same, but in other cases this distinction
# can make a significant difference.

The first step in creating a topic model is to count the number of times each word appears in each individual document (or response in our case). Luckily, we can count by two variables using the count() function. Let's create a new byresponse variable.

byresponse <- surv_tidy %>% count(ID, word, sort = T) %>% ungroup()

# check how many responses are included in the analysis. this allows you to
# double check that the new unique identifier we made worked as expected.
unique(byresponse$ID)
 [1] "AKP_26" "AKP_1"  "AKP_17" "AKP_18" "AKP_3"  "AKP_32" "AKP_5"  "SEL_21"
 [9] "AKP_10" "AKP_11" "AKP_12" "AKP_13" "AKP_14" "AKP_15" "AKP_16" "AKP_19"
[17] "AKP_2"  "AKP_20" "AKP_22" "AKP_23" "AKP_24" "AKP_25" "AKP_27" "AKP_28"
[25] "AKP_29" "AKP_30" "AKP_31" "AKP_33" "AKP_34" "AKP_35" "AKP_36" "AKP_37"
[33] "AKP_38" "AKP_4"  "AKP_6"  "AKP_7"  "AKP_8"  "AKP_9"  "SEL_1"  "SEL_10"
[41] "SEL_11" "SEL_13" "SEL_14" "SEL_15" "SEL_17" "SEL_18" "SEL_19" "SEL_2" 
[49] "SEL_20" "SEL_22" "SEL_24" "SEL_25" "SEL_26" "SEL_27" "SEL_29" "SEL_3" 
[57] "SEL_30" "SEL_31" "SEL_32" "SEL_33" "SEL_34" "SEL_35" "SEL_36" "SEL_37"
[65] "SEL_38" "SEL_39" "SEL_4"  "SEL_40" "SEL_5"  "SEL_6"  "SEL_7"  "SEL_8" 
[73] "SEL_9" 
length(unique(byresponse$ID))
[1] 73

Now we can convert our long-form word list into a document-term matrix. You can read more about this format in the cast_dtm() help page.

surv_dtm <- byresponse %>% cast_dtm(ID, word, n)
?cast_dtm  # read up on how this function works

Run the LDA() function and choose the number of topics (k). In this case, let's try it with 2 topics.

surv_lda <- LDA(surv_dtm, k = 2, control = list(seed = 9999))
# look at our output
str(surv_lda)
Formal class 'LDA_VEM' [package "topicmodels"] with 14 slots
  ..@ alpha          : num 35.6
  ..@ call           : language LDA(x = surv_dtm, k = 2, control = list(seed = 9999))
  ..@ Dim            : int [1:2] 73 210
  ..@ control        :Formal class 'LDA_VEMcontrol' [package "topicmodels"] with 13 slots
  .. .. ..@ estimate.alpha: logi TRUE
  .. .. ..@ alpha         : num 25
  .. .. ..@ seed          : int 9999
  .. .. ..@ verbose       : int 0
  .. .. ..@ prefix        : chr "/var/folders/l_/hl0qrh9535l691r67nxlvd9c0000gn/T//RtmpIDhnpY/file8d974ee52c3d"
  .. .. ..@ save          : int 0
  .. .. ..@ nstart        : int 1
  .. .. ..@ best          : logi TRUE
  .. .. ..@ keep          : int 0
  .. .. ..@ estimate.beta : logi TRUE
  .. .. ..@ var           :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
  .. .. .. .. ..@ iter.max: int 500
  .. .. .. .. ..@ tol     : num 1e-06
  .. .. ..@ em            :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
  .. .. .. .. ..@ iter.max: int 1000
  .. .. .. .. ..@ tol     : num 1e-04
  .. .. ..@ initialize    : chr "random"
  ..@ k              : int 2
  ..@ terms          : chr [1:210] "days" "cold" "april" "march" ...
  ..@ documents      : chr [1:73] "AKP_26" "AKP_1" "AKP_17" "AKP_18" ...
  ..@ beta           : num [1:2, 1:210] -4.73 -4.75 -5.22 -3.52 -4.58 ...
  ..@ gamma          : num [1:73, 1:2] 0.505 0.491 0.476 0.498 0.516 ...
  ..@ wordassignments:List of 5
  .. ..$ i   : int [1:330] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ j   : int [1:330] 1 11 31 32 48 50 62 63 64 65 ...
  .. ..$ v   : num [1:330] 1 1 1 1 2 2 2 1 1 2 ...
  .. ..$ nrow: int 73
  .. ..$ ncol: int 210
  .. ..- attr(*, "class")= chr "simple_triplet_matrix"
  ..@ loglikelihood  : num [1:73] -64.3 -41.4 -66.3 -50.1 -49.8 ...
  ..@ iter           : int 8
  ..@ logLiks        : num(0) 
  ..@ n              : int 343

# examine the probability that each word is in a particular topic group
surv_topics <- tidy(surv_lda, matrix = "beta")
surv_topics
# A tibble: 420 x 3
   topic term       beta
   <int> <chr>     <dbl>
 1     1 days   0.00885 
 2     2 days   0.00864 
 3     1 cold   0.00540 
 4     2 cold   0.0296  
 5     1 april  0.0102  
 6     2 april  0.00140 
 7     1 march  0.00346 
 8     2 march  0.00821 
 9     1 months 0.000975
10     2 months 0.0107  
# … with 410 more rows

Examine the top words for each topic identified by the model.

top_words <- surv_topics %>% group_by(topic) %>% top_n(10, beta) %>% ungroup() %>% 
    arrange(topic, desc(beta))
top_words
# A tibble: 20 x 3
   topic term      beta
   <int> <chr>    <dbl>
 1     1 climate 0.0446
 2     1 change  0.0363
 3     1 animals 0.0324
 4     1 sad     0.0258
 5     1 adapt   0.0204
 6     1 caribou 0.0196
 7     1 scary   0.0180
 8     1 blank   0.0173
 9     1 river   0.0159
10     1 future  0.0145
11     2 cold    0.0296
12     2 weather 0.0276
13     2 worried 0.0248
14     2 nc      0.0224
15     2 change  0.0220
16     2 future  0.0205
17     2 dk      0.0181
18     2 people  0.0175
19     2 land    0.0172
20     2 scary   0.0170

We can also examine the results graphically.

# plot these top words for each topic (adapted from
# https://www.tidytextmining.com/topicmodeling.html)
top_words %>% group_by(topic) %>% mutate(term = fct_reorder(term, beta)) %>% ungroup() %>% 
    ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + 
    facet_wrap(~topic, scales = "free") + theme_minimal()

Try it

Repeat the topic modeling analysis but using 6 topics instead of two.

Click for solution

## repeat the analysis but with 6 topics
surv_lda6 <- LDA(surv_dtm, k = 6, control = list(seed = 9999))
# examine the probability that each word is in a particular topic group
surv_topics6 <- tidy(surv_lda6, matrix = "beta")
surv_topics6
# A tibble: 1,260 x 3
   topic term       beta
   <int> <chr>     <dbl>
 1     1 days  4.88e-  2
 2     2 days  9.19e-262
 3     3 days  1.69e-269
 4     4 days  1.17e-270
 5     5 days  5.77e-267
 6     6 days  4.77e-268
 7     1 cold  1.63e-  2
 8     2 cold  5.81e-  2
 9     3 cold  1.68e-  2
10     4 cold  4.86e-235
# … with 1,250 more rows
# examine top words for each topic
top_words6 <- surv_topics6 %>% group_by(topic) %>% top_n(5, beta) %>% ungroup() %>% 
    arrange(topic, desc(beta))
# plot these top words for each topic (adapted from
# https://www.tidytextmining.com/topicmodeling.html)
top_words6 %>% group_by(topic) %>% mutate(term = fct_reorder(term, beta)) %>% ungroup() %>% 
    ggplot(aes(beta, term, fill = factor(topic))) + geom_col(show.legend = FALSE) + 
    facet_wrap(~topic, scales = "free_y") + theme_minimal()

In this case, our sample is small, so topic modeling is not necessarily the best method to use. However, even from this small sample, you can see that some topics emerge from the text that were not previously apparent.

Manual text wrangling

Sometimes you'll need to edit text or strings manually. For example, you may find that for your research question you are less interested in differentiating between the terms running, run, and runner than in identifying clusters of beliefs about running as a more general concept. On the other hand, you might want to differentiate between runners and running as beliefs about groups of people versus the act of running. How you choose to transform text data depends on your research questions and your understanding of the cultural context.
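
As a quick sketch of what this might look like (the words and category labels here are hypothetical, not drawn from the survey data), you could collapse or split related word forms with stringr and dplyr functions:

# hypothetical example: collapse "run", "running", "runner" into one concept...
run_words <- tibble(word = c("run", "running", "runner", "runners"))
run_words %>% mutate(concept = if_else(str_detect(word, "^run"), "running_general", word))
# ...or keep "runner(s)" separate to distinguish people from the activity
run_words %>% mutate(concept = case_when(str_detect(word, "^runner") ~ "runners_people", 
    str_detect(word, "^run") ~ "running_activity", TRUE ~ word))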

R has a number of helpful functions for manually adjusting strings. We'll cover a few to get you started. Let's go back to the permafrost and climate change survey and look at responses to (Q5): "What is causing it [permafrost around X village] to change?"

First let's look at the raw data. What are some potential issues in the strings below that might make text analysis difficult or ineffective?

surv$X5..PF.Cause.[10:30]
 [1] "climate change"                                                                                                                                  
 [2] "too warm"                                                                                                                                        
 [3] "warmer weather, shorter winters, lack of snow, fast springs. [will affect AKP because we] use Argos to go out."                                  
 [4] "weather warming"                                                                                                                                 
 [5] "temperature"                                                                                                                                     
 [6] "heat. A lot of heat."                                                                                                                            
 [7] "melting of the ground - goes down"                                                                                                               
 [8] "spirited answer. Her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
 [9] "probably global warming"                                                                                                                         
[10] "temperature outside is not steady"                                                                                                               
[11] "(blank)"                                                                                                                                         
[12] "N/A"                                                                                                                                             
[13] "most likely war weather or global warming"                                                                                                       
[14] "not much winter - hardly get snow. Always wind blown. Summer be rain, rain, rain. Late snow."                                                    
[15] "warmer winters"                                                                                                                                  
[16] "global warming"                                                                                                                                  
[17] "warming weather, longer summer/fall season"                                                                                                      
[18] "I have no idea"                                                                                                                                  
[19] "warmer climate"                                                                                                                                  
[20] "Seems like there's lots of rain & water causes ground to thaw. So maybe accumulated water? Maybe warm weather?"                                  
[21] "the heat wave - winter frosts"                                                                                                                   

Luckily we can manually adjust the strings to make them easier to analyze systematically. For example we might set characters to lowercase, trim whitespace and remove any empty or missing rows.

# make a new column to hold the tidy data
surv$cause_tidy <- surv$X5..PF.Cause.
# make lower case
surv$cause_tidy <- tolower(surv$cause_tidy)
# remove white space at beginning and end of string
surv$cause_tidy <- trimws(surv$cause_tidy)
# filter out blank or empty rows
surv <- surv %>% filter(surv$cause_tidy != "")
surv <- surv %>% filter(surv$cause_tidy != "(blank)")
surv <- surv %>% filter(surv$cause_tidy != "n/a")

We can also directly replace particular strings. Here we change some strings with typos.

surv$cause_tidy <- surv$cause_tidy %>% str_replace("wamer", "warmer")
surv$cause_tidy <- surv$cause_tidy %>% str_replace("lnoger", "longer")

Another common string data transformation involves grouping responses together into more standardized categories. You can transform cell values individually or based on exact string matches. In addition, using %like% (from the data.table package), we can transform any value where just part of the string matches a particular pattern. For example, we might decide that any time the string "warm" appears in a response, the overall theme of the response is associated with "global warming". Or, based on our ethnographic understanding of the context, we might know that "seasonal changes" are important causes of permafrost change in local cultural models. We can then look for key terms that allow us to rapidly recode multiple responses likely to fit in this category: in this case, "late" and "early".

# group some responses together based on the presence of a particular string
surv <- surv %>% mutate(cause_tidy = replace(cause_tidy, cause_tidy %like% "warm", 
    "global warming"))
surv$cause_tidy[1:30]
 [1] "environment"                                                                                                                                     
 [2] "exhaust"                                                                                                                                         
 [3] "global warming"                                                                                                                                  
 [4] "global warming"                                                                                                                                  
 [5] "hot summers, early springs. in super cold winters the ground comes up & cracks and water comes out."                                             
 [6] "global warming"                                                                                                                                  
 [7] "climate change"                                                                                                                                  
 [8] "freezing & thawing in fall & spring"                                                                                                             
 [9] "global warming"                                                                                                                                  
[10] "climate change"                                                                                                                                  
[11] "global warming"                                                                                                                                  
[12] "global warming"                                                                                                                                  
[13] "global warming"                                                                                                                                  
[14] "temperature"                                                                                                                                     
[15] "heat. a lot of heat."                                                                                                                            
[16] "melting of the ground - goes down"                                                                                                               
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"                                                                                                                                  
[19] "temperature outside is not steady"                                                                                                               
[20] "global warming"                                                                                                                                  
[21] "not much winter - hardly get snow. always wind blown. summer be rain, rain, rain. late snow."                                                    
[22] "global warming"                                                                                                                                  
[23] "global warming"                                                                                                                                  
[24] "global warming"                                                                                                                                  
[25] "i have no idea"                                                                                                                                  
[26] "global warming"                                                                                                                                  
[27] "global warming"                                                                                                                                  
[28] "the heat wave - winter frosts"                                                                                                                   
[29] "global warming"                                                                                                                                  
[30] "global warming"                                                                                                                                  
surv <- surv %>% mutate(cause_tidy = replace(cause_tidy, cause_tidy %like% "early" | 
    cause_tidy %like% "late", "seasonal changes"))
surv$cause_tidy[1:30]
 [1] "environment"                                                                                                                                     
 [2] "exhaust"                                                                                                                                         
 [3] "global warming"                                                                                                                                  
 [4] "global warming"                                                                                                                                  
 [5] "seasonal changes"                                                                                                                                
 [6] "global warming"                                                                                                                                  
 [7] "climate change"                                                                                                                                  
 [8] "freezing & thawing in fall & spring"                                                                                                             
 [9] "global warming"                                                                                                                                  
[10] "climate change"                                                                                                                                  
[11] "global warming"                                                                                                                                  
[12] "global warming"                                                                                                                                  
[13] "global warming"                                                                                                                                  
[14] "temperature"                                                                                                                                     
[15] "heat. a lot of heat."                                                                                                                            
[16] "melting of the ground - goes down"                                                                                                               
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"                                                                                                                                  
[19] "temperature outside is not steady"                                                                                                               
[20] "global warming"                                                                                                                                  
[21] "seasonal changes"                                                                                                                                
[22] "global warming"                                                                                                                                  
[23] "global warming"                                                                                                                                  
[24] "global warming"                                                                                                                                  
[25] "i have no idea"                                                                                                                                  
[26] "global warming"                                                                                                                                  
[27] "global warming"                                                                                                                                  
[28] "the heat wave - winter frosts"                                                                                                                   
[29] "global warming"                                                                                                                                  
[30] "global warming"                                                                                                                                  

# compare the original with your categorizations
# surv %>% select(X5..PF.Cause., cause_tidy)

We won't get into too much detail today, but you can also search and select string data using regular expressions. You can read more in R4DS. Here let's use str_detect() to pull out some strings with regular expressions.


# any responses ending in 'ing'
surv$cause_tidy[str_detect(surv$cause_tidy, "ing$")]
 [1] "global warming"                      "global warming"                     
 [3] "global warming"                      "freezing & thawing in fall & spring"
 [5] "global warming"                      "global warming"                     
 [7] "global warming"                      "global warming"                     
 [9] "global warming"                      "global warming"                     
[11] "global warming"                      "global warming"                     
[13] "global warming"                      "global warming"                     
[15] "global warming"                      "global warming"                     
[17] "global warming"                      "global warming"                     
[19] "global warming"                      "global warming"                     
[21] "global warming"                      "global warming"                     
[23] "global warming"                      "global warming"                     
[25] "global warming"                      "global warming"                     
[27] "global warming"                      "global warming"                     
[29] "global warming"                      "global warming"                     
[31] "global warming"                      "global warming"                     
[33] "global warming"                      "global warming"                     
[35] "global warming"                     

# any responses that contain a 'w' followed by either an 'e' or an 'a'
surv$cause_tidy[str_detect(surv$cause_tidy, "w[ea]")]
 [1] "global warming"                                                                                                                                  
 [2] "global warming"                                                                                                                                  
 [3] "global warming"                                                                                                                                  
 [4] "global warming"                                                                                                                                  
 [5] "global warming"                                                                                                                                  
 [6] "global warming"                                                                                                                                  
 [7] "global warming"                                                                                                                                  
 [8] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
 [9] "global warming"                                                                                                                                  
[10] "global warming"                                                                                                                                  
[11] "global warming"                                                                                                                                  
[12] "global warming"                                                                                                                                  
[13] "global warming"                                                                                                                                  
[14] "global warming"                                                                                                                                  
[15] "global warming"                                                                                                                                  
[16] "the heat wave - winter frosts"                                                                                                                   
[17] "global warming"                                                                                                                                  
[18] "global warming"                                                                                                                                  
[19] "weather"                                                                                                                                         
[20] "global warming"                                                                                                                                  
[21] "global warming"                                                                                                                                  
[22] "global warming"                                                                                                                                  
[23] "global warming"                                                                                                                                  
[24] "global warming"                                                                                                                                  
[25] "weather."                                                                                                                                        
[26] "global warming"                                                                                                                                  
[27] "global warming"                                                                                                                                  
[28] "global warming"                                                                                                                                  
[29] "global warming"                                                                                                                                  
[30] "global warming"                                                                                                                                  
[31] "global warming"                                                                                                                                  
[32] "global warming"                                                                                                                                  
[33] "global warming"                                                                                                                                  
[34] "global warming"                                                                                                                                  
[35] "global warming"                                                                                                                                  
[36] "global warming"                                                                                                                                  
[37] "global warming"                                                                                                                                  
[38] "global warming"                                                                                                                                  

# any responses that contain the string erosion
surv$cause_tidy[str_detect(surv$cause_tidy, "erosion")]
[1] "erosion, and real hot summers and a lot of snow & rain."       
[2] "mud goes down river, cracking all along & falling in - erosion"
[3] "ground erosion"                                                
[4] "erosion"                                                       
# any responses that contain the string erosion preceded by at least one character
# (the . in the pattern matches any single character)
surv$cause_tidy[str_detect(surv$cause_tidy, ".erosion")]
[1] "mud goes down river, cracking all along & falling in - erosion"
[2] "ground erosion"                                                

Regular expressions are extremely useful for quickly searching through and transforming large volumes of string data, and we have only scratched the surface here.
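
For illustration, here are a few more patterns applied to the same cause_tidy column. This is just a sketch: str_subset() (from stringr) returns the matching responses themselves, and the patterns below are examples rather than ones used later in this tutorial.

# responses that begin with the word erosion
str_subset(surv$cause_tidy, "^erosion")
# responses that mention either warming or weather
str_subset(surv$cause_tidy, "warming|weather")
# responses containing erosion as a whole word (\\b marks a word boundary)
str_subset(surv$cause_tidy, "\\berosion\\b")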

Whenever you transform large volumes of data with string detection and regular expressions, it is critical to double-check that each operation is actually doing what you expect. Pay attention to the order of operations as well, so that a later transformation does not overwrite the results of an earlier one. One useful habit is to preview a transformation before overwriting the original column, as in the sketch below.
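
A minimal sketch of that habit, assuming we wanted to standardize a hypothetical phrase such as "warming trend" to "global warming" (both the pattern and the cause_check column name are made up for illustration):

# count how many responses the pattern would affect before transforming anything
sum(str_detect(surv$cause_tidy, "warming trend"), na.rm = TRUE)

# build the transformed values in a temporary column and compare with the original
surv %>% mutate(cause_check = str_replace_all(cause_tidy, "warming trend", "global warming")) %>% 
    filter(cause_tidy != cause_check) %>% select(cause_tidy, cause_check)

# only overwrite cause_tidy once the preview looks right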

Creating new variables with str_detect()

Sometimes it is useful to create flags, or indicator variables, in your data. These let you quickly filter to the rows that have particular characteristics. For example, we can create a new binary column indicating whether or not a response refers to global warming, and then use that variable for grouping, data visualization, or other tasks.

surv <- surv %>% mutate(GlobalWarmingYN = str_detect(cause_tidy, "global warming"))
table(surv$GlobalWarmingYN)  # how many responses contain the string global warming?

FALSE  TRUE 
   37    34 
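
Once the flag exists it behaves like any other column. As a quick sketch (using only the surv columns created above; the prop_global_warming name is arbitrary):

# look at just the responses that mention global warming
surv %>% filter(GlobalWarmingYN) %>% select(cause_tidy)

# or compute the proportion of responses that mention it
surv %>% summarise(prop_global_warming = mean(GlobalWarmingYN))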

Parts of speech tagging

We can also tag the parts of speech in a text. This allows us to focus an analysis on verbs, nouns, or other parts of speech of interest. For example, in a study of sentiment we might pull out adjectives to understand how people feel about, or describe, a particular phenomenon. Alternatively, we might pull out verbs to understand the kinds of actions people associate with certain cultural practices or beliefs. Let's tag the parts of speech in Malinowski (1922) to learn more about the places and cultural practices documented in the book.

Try it

Using the cnlp_annotate() function, we can tag the parts of speech in Malinowski (1922). This function can take a long time to run. It is the last thing we will do today, so feel free to start it, take a break, and come back to finish these problems.

  1. Make a new object using only the token part of the output from cnlp_annotate() and then examine the $upos column. What are all the unique parts of speech in this dataset?

  2. Select and examine the top 30 nouns and verbs in this dataset. Do any of the terms surprise you? How might this level of analysis of the text be meaningful for your interpretation of its themes?

Click for solution

library(cleanNLP)
cnlp_init_udpipe()
# tag parts of speech; this can take a long time
malinowskiannotatedtext <- cnlp_annotate(malinowski1922tidy$word)
str(malinowskiannotatedtext)  # look at the structure; the result is a list, so we pull out the token element below
List of 2
 $ token   : tibble [81,973 × 11] (S3: tbl_df/tbl/data.frame)
  ..$ doc_id     : int [1:81973] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ sid        : int [1:81973] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ tid        : chr [1:81973] "1" "1" "1" "1" ...
  ..$ token      : chr [1:81973] "argonauts" "western" "pacific" "account" ...
  ..$ lemma      : chr [1:81973] "argonaut" "western" "pacific" "account" ...
  ..$ space_after: chr [1:81973] "\\n" "\\n" "\\n" "\\n" ...
  ..$ upos       : chr [1:81973] "NOUN" "ADJ" "ADJ" "NOUN" ...
  ..$ xpos       : chr [1:81973] "NNS" "JJ" "JJ" "NN" ...
  ..$ feats      : chr [1:81973] "Number=Plur" "Degree=Pos" "Degree=Pos" "Number=Sing" ...
  ..$ tid_source : chr [1:81973] "0" "0" "0" "0" ...
  ..$ relation   : chr [1:81973] "root" "root" "root" "root" ...
 $ document:'data.frame':   81199 obs. of  1 variable:
  ..$ doc_id: int [1:81199] 1 2 3 4 5 6 7 8 9 10 ...
malinowskiannotatedtextfull <- data.frame(malinowskiannotatedtext$token)
str(malinowskiannotatedtextfull)
'data.frame':   81973 obs. of  11 variables:
 $ doc_id     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ sid        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ tid        : chr  "1" "1" "1" "1" ...
 $ token      : chr  "argonauts" "western" "pacific" "account" ...
 $ lemma      : chr  "argonaut" "western" "pacific" "account" ...
 $ space_after: chr  "\\n" "\\n" "\\n" "\\n" ...
 $ upos       : chr  "NOUN" "ADJ" "ADJ" "NOUN" ...
 $ xpos       : chr  "NNS" "JJ" "JJ" "NN" ...
 $ feats      : chr  "Number=Plur" "Degree=Pos" "Degree=Pos" "Number=Sing" ...
 $ tid_source : chr  "0" "0" "0" "0" ...
 $ relation   : chr  "root" "root" "root" "root" ...

# what are all the different parts of speech that have been tagged?
unique(malinowskiannotatedtextfull$upos)
 [1] "NOUN"  "ADJ"   "X"     "VERB"  "DET"   "NUM"   "ADV"   "PRON"  "PART" 
[10] "INTJ"  "PROPN" "AUX"   "SYM"   "ADP"   "SCONJ" "PUNCT"

# verb analysis: what are the top 30 verbs in the book?
# (to browse individual verb tokens and lemmas instead, you could use
#  filter(upos == 'VERB') %>% select(token, lemma))
malinowskiannotatedtextfull %>% filter(upos == "VERB") %>% count(token, sort = T) %>% 
    top_n(30)
       token   n
1     called 248
2      found 141
3      means 139
4    carried 105
5  mentioned 101
6    meaning  99
7  performed  98
8    brought  91
9   obtained  89
10    flying  81
11   objects  76
12     makes  69
13  received  68
14  repeated  66
15      told  66
16    spoken  65
17     speak  61
18     takes  61
19      left  60
20     bring  57
21   receive  56
22  consists  55
23    giving  55
24     visit  53
25     carry  52
26       log  52
27    remain  52
28 connected  51
29 beginning  50
30    leaves  50

# what are the top 30 nouns?
malinowskiannotatedtextfull %>% filter(upos == "NOUN") %>% count(lemma, sort = T) %>% 
    top_n(30) %>% data.frame()
        lemma    n
1       canoe 1121
2        kula  942
3       magic  880
4     village  624
5      native  596
6       spell  484
7      island  403
8        word  402
9        time  317
10       gift  315
11       sail  307
12  trobriand  307
13      chief  301
14        day  274
15 expedition  265
16       form  254
17    chapter  243
18      shell  241
19       rite  232
20       food  228
21        sea  227
22      beach  209
23        nut  209
24   sinaketa  208
25      woman  208
26       myth  198
27    partner  194
28     people  189
29  community  187
30       rule  184
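
The same approach extends to other parts of speech. Since the introduction above mentioned adjectives as a window into how people or places are described, here is a minimal sketch of the top adjectives; it only reuses the malinowskiannotatedtextfull object from the solution:

# what are the top 30 adjectives?
malinowskiannotatedtextfull %>% filter(upos == "ADJ") %>% count(lemma, sort = T) %>% 
    top_n(30)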