Text mining with R

This week we learn how to work with text data in R. We will learn how to turn documents into word lists, analyze word frequencies, extract bigrams, analyze sentiment and parts of speech, and visualize the results of text analyses.

Analyzing Malinowski

To get started, we will analyze the classic Malinowski (1922) text Argonauts of the Western Pacific. The text can be downloaded manually from Project Gutenberg or, for simplicity, pulled directly into R using the gutenbergr package.

First, let’s load all of the libraries we will be using today.

#install.packages("gutenbergr")
library(gutenbergr)
gutenberg_metadata
# A tibble: 51,997 × 8
   gutenberg_id title   author gutenberg_autho… language gutenberg_books… rights
          <int> <chr>   <chr>             <int> <chr>    <chr>            <chr> 
 1            0  <NA>   <NA>                 NA en       <NA>             Publi…
 2            1 "The D… Jeffe…             1638 en       United States L… Publi…
 3            2 "The U… Unite…                1 en       American Revolu… Publi…
 4            3 "John … Kenne…             1666 en       <NA>             Publi…
 5            4 "Linco… Linco…                3 en       US Civil War     Publi…
 6            5 "The U… Unite…                1 en       American Revolu… Publi…
 7            6 "Give … Henry…                4 en       American Revolu… Publi…
 8            7 "The M… <NA>                 NA en       <NA>             Publi…
 9            8 "Abrah… Linco…                3 en       US Civil War     Publi…
10            9 "Abrah… Linco…                3 en       US Civil War     Publi…
# … with 51,987 more rows, and 1 more variable: has_text <lgl>
library(tidyverse)
library(wordcloud)
library(tidytext)
library(stringr)
library(topicmodels)
library(data.table)
library(textdata)
library(cleanNLP)
cnlp_init_udpipe()

Download the Malinowski book text and examine the structure. How are the data organized?

malinowski1922 <- gutenberg_download(55822)
str(malinowski1922)
tibble [22,219 × 2] (S3: tbl_df/tbl/data.frame)
 $ gutenberg_id: int [1:22219] 55822 55822 55822 55822 55822 55822 55822 55822 55822 55822 ...
 $ text        : chr [1:22219] "                    ARGONAUTS OF THE WESTERN PACIFIC" "" "                          An Account of Native" "                        Enterprise and Adventure" ...

Analyzing individual word frequencies

One of the first ways we can explore a text is by looking at word frequencies. With multiple samples from different people or sites, comparing word frequencies can reveal differences across populations, while within a single text, word frequencies can highlight key issues, people, or places.

Try it

Using the unnest_tokens() function, extract the individual words from Malinowski and create a table of the top words sorted by count. What do you notice about the top words? Why do you think these words appear at the top of the list?

Click for solution
## make into individual words
words<- malinowski1922 %>% unnest_tokens(output=word,input=text)
#notice that this also converts to lowercase and removes punctuation

#look at the top 50 words in the document
words %>% count(word,sort=T) %>% top_n(50)
# A tibble: 50 × 2
   word      n
   <chr> <int>
 1 the   18607
 2 of    10254
 3 and    6625
 4 in     5296
 5 to     4950
 6 a      4732
 7 is     3554
 8 it     2070
 9 as     1795
10 on     1746
# … with 40 more rows

Stop words

Many of these top words are what we call stop words: words like is, the, and so that add little to our understanding of the overall topics or themes in a text. Tidytext has a built-in dictionary of stop words, making it easy to quickly remove them from the text.

# look at the words in the stop_words dataset
data(stop_words)
stop_words %>% top_n(50)
# A tibble: 174 × 2
   word      lexicon 
   <chr>     <chr>   
 1 i         snowball
 2 me        snowball
 3 my        snowball
 4 myself    snowball
 5 we        snowball
 6 our       snowball
 7 ours      snowball
 8 ourselves snowball
 9 you       snowball
10 your      snowball
# … with 164 more rows
#remove stop words from the text
malinowski1922tidy <- words %>% anti_join(stop_words)
#look at the structure 
str(malinowski1922tidy)
tibble [81,199 × 2] (S3: tbl_df/tbl/data.frame)
 $ gutenberg_id: int [1:81199] 55822 55822 55822 55822 55822 55822 55822 55822 55822 55822 ...
 $ word        : chr [1:81199] "argonauts" "western" "pacific" "account" ...
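If your text contains domain-specific terms that you also want to treat as stop words, you can extend the built-in dictionary yourself. Below is a minimal sketch; the added terms ("op" and "cit", which come from citations in the text) are only examples, and the object name my_stop_words is arbitrary.

#add custom stop words to the built-in dictionary (illustrative terms)
my_stop_words <- bind_rows(stop_words, tibble(word = c("op", "cit"), lexicon = "custom"))
#remove the expanded stop word list the same way as before
malinowski1922tidy_custom <- words %>% anti_join(my_stop_words)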

Now we can look at the number of unique words and their counts in Malinowski, without interference from stop words.

#how many unique words are there?
length(unique(malinowski1922tidy$word))
[1] 10925
#make a table of the top words with stop words removed
malinowski1922tidy_wordcounts <- malinowski1922tidy %>% count(word, sort=T) 

##look at top 50 words
malinowski1922tidy %>% count(word,sort=TRUE) %>% top_n(50) %>% mutate(word=reorder(word,n)) %>% data.frame()
         word   n
1        kula 932
2       magic 880
3       canoe 814
4     natives 596
5     village 405
6       spell 346
7      native 336
8     magical 307
9      canoes 304
10     island 268
11       dobu 267
12       time 253
13     called 248
14    chapter 243
15       food 228
16        sea 227
17       main 221
18   villages 219
19      words 217
20   sinaketa 208
21      beach 204
22 trobriands 199
23      chief 197
24     people 186
25      gifts 184
26 ceremonial 182
27      shell 181
28       word 180
29   district 176
30 expedition 172
31        nut 167
32       life 162
33     social 162
34       form 156
35     kitava 153
36   exchange 149
37       myth 147
38        day 143
39    sailing 143
40      found 141
41      south 140
42      means 139
43       sail 137
44     spells 137
45    islands 135
46     manner 135
47      trade 133
48  amphletts 132
49       gift 131
50  community 128

Make a plot of these top words. What do you make of these new top words?

#plot top words from the tokenized text
top50wordsplot <- malinowski1922tidy %>% count(word,sort=TRUE) %>% top_n(50) %>% mutate(word=reorder(word,n))%>% ggplot(aes(x=word,y=n))+ geom_col()+xlab(NULL)+coord_flip()+labs(y="Count",x="Unique words", title="Malinowski 1922")
top50wordsplot

Wordclouds

Wordclouds are often avoided in scientific research because the arrangement and sizing of words can be misleading and difficult to interpret. At the same time, wordclouds can be useful in exploratory data analysis or applied research for quickly showing the main themes in a text, which can then be explored for further contextual information. Here we will make wordclouds of Malinowski’s text using two different methods.

malinowski1922tidy %>% count(word) %>%  with(wordcloud(word, n, max.words = 100))

Another way to make wordclouds is with the wordcloud2 package.

#install the package
#require(devtools)
#install_github("lchiffon/wordcloud2")
#load package
library(wordcloud2)
#make wordcloud. you may want to expand out the figure for the full effect.
wordcloud2(data = malinowski1922tidy_wordcounts)

Analyzing pairs of words

We can also analyze pairs of words (bigrams). This can be useful for understanding the context around particular words as well as for identifying themes that are made up of multiple strings (e.g. “climate change”, “public health”).

bigrams<- malinowski1922 %>% unnest_tokens(output=bigrams,input=text, token="ngrams",n=2)
str(bigrams)
tibble [197,633 × 2] (S3: tbl_df/tbl/data.frame)
 $ gutenberg_id: int [1:197633] 55822 55822 55822 55822 55822 55822 55822 55822 55822 55822 ...
 $ bigrams     : chr [1:197633] "argonauts of" "of the" "the western" "western pacific" ...
##look at counts for each pair
bigrams %>% count(bigrams, sort = TRUE) %>% top_n(20)
# A tibble: 20 × 2
   bigrams         n
   <chr>       <int>
 1 <NA>         3192
 2 of the       2896
 3 in the       1551
 4 to the        912
 5 on the        721
 6 and the       559
 7 it is         504
 8 the kula      486
 9 of a          421
10 with the      418
11 the natives   405
12 to be         399
13 by the        391
14 from the      340
15 is the        330
16 the canoe     329
17 there is      282
18 that the      275
19 one of        268
20 in a          267

One challenge here is that, once again, stop words rise to the top of the frequency list. There are multiple ways we can handle this, but here we will remove any bigrams where either the first or second word is a stop word.

#separate words to pull out stop words
separated_words <- bigrams %>% separate(bigrams, c("word1", "word2"), sep = " ")
#filter out stop words
malinowski_bigrams <- separated_words %>% filter(!word1 %in% stop_words$word) %>% filter(!word2 %in% stop_words$word)
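If you want the filtered pairs back in a single column (for example, for counting or plotting whole bigrams), tidyr's unite() can rejoin the two word columns. A brief sketch; the object name malinowski_bigrams_united is arbitrary.

#rejoin the filtered word pairs into a single bigram column
malinowski_bigrams_united <- malinowski_bigrams %>% unite(bigram, word1, word2, sep = " ")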

Try it

  1. Make a table of the top 100 bigrams sorted from most to least frequent.

  2. Pull out all bigrams where “island” is the second term and make a table of the most common bigrams in this subset.

  3. Pull out all bigrams where “canoe” is either the first or second term and make a table of the most common bigrams in this subset.

  4. What does this analysis tell you about this text? Can you think of any data in your own research that would benefit from ngram analysis?

Click for solution
malinowski_bigrams_count <- malinowski_bigrams %>% count(word1, word2, sort = TRUE)
malinowski_bigrams_count %>% top_n(20)
# A tibble: 21 × 3
   word1    word2        n
   <chr>    <chr>    <int>
 1 <NA>     <NA>      3192
 2 betel    nut         80
 3 coco     nut         74
 4 olden    days        59
 5 conch    shell       55
 6 tribal   life        46
 7 canoe    building    45
 8 kula     magic       45
 9 woodlark island      44
10 coco     nuts        33
# … with 11 more rows

#top 100 pairs of words
bigram100 <- head(malinowski_bigrams_count, 100)  %>% data.frame()
bigram100
          word1        word2    n
1          <NA>         <NA> 3192
2         betel          nut   80
3          coco          nut   74
4         olden         days   59
5         conch        shell   55
6        tribal         life   46
7         canoe     building   45
8          kula        magic   45
9      woodlark       island   44
10         coco         nuts   33
11        canoe        magic   30
12       flying      witches   29
13         kula   expedition   29
14      chapter           ii   27
15     communal       labour   26
16     southern       massim   26
17          arm       shells   25
18      magical        rites   25
19           op          cit   24
20         prow       boards   22
21    spondylus        shell   22
22       garden        magic   20
23          key        words   20
24         kula         ring   20
25       native       belief   20
26       inland         kula   19
27          key         word   18
28         kula  communities   18
29         kula    community   18
30      magical       formul   18
31         evil        magic   17
32         kula    valuables   17
33         lime          pot   17
34       social organisation   17
35        south        coast   17
36      chapter           vi   16
37         main       island   16
38        sugar         cane   16
39      village    community   16
40       ginger         root   15
41           ii     division   15
42        inter       tribal   15
43          nut          oil   15
44    professor     seligman   15
45     division           ii   14
46         free  translation   14
47     maternal        uncle   14
48     pandanus    streamers   14
49    trobriand      islands   14
50      uvalaku   expedition   14
51        white        man's   14
52       banana         leaf   13
53       beauty        magic   13
54  generations          ago   13
55         lime         pots   13
56       native        ideas   13
57     pandanus     streamer   13
58         prow        board   13
59        super       normal   13
60      chapter         xiii   12
61        conch       shells   12
62    fergusson       island   12
63         kula  expeditions   12
64      magical         rite   12
65      mwasila        magic   12
66       native         life   12
67     overseas   expedition   12
68         port      moresby   12
69     seligman           op   12
70        areca          nut   11
71         clay         pots   11
72       dawson      straits   11
73     division           vi   11
74         folk         lore   11
75         love        magic   11
76      magical       bundle   11
77         mint        plant   11
78     normanby       island   11
79        north         west   11
80    primitive    economics   11
81        trial          run   11
82       turtle        shell   11
83          axe       blades   10
84   ceremonial distribution   10
85           cf      chapter   10
86      chapter          iii   10
87      chapter          vii   10
88      counter         gift   10
89     division          iii   10
90        elder      brother   10
91         fish         hawk   10
92       flying        canoe   10
93         kula     district   10
94         kula     exchange   10
95         kula        gifts   10
96      lashing      creeper   10
97       mental     attitude   10
98          nut        betel   10
99     overseas  expeditions   10
100         red        paint   10

##look at words that appear next to  the word "island"
islandbigram <- malinowski_bigrams %>% filter(word1 == "island" | word2 == "island")
islandbigram %>% count(word1, word2, sort = TRUE) %>% top_n(20)
# A tibble: 62 × 3
   word1        word2      n
   <chr>        <chr>  <int>
 1 woodlark     island    44
 2 main         island    16
 3 fergusson    island    12
 4 normanby     island    11
 5 coral        island     4
 6 dobu         island     4
 7 neighbouring island     4
 8 island       called     3
 9 rossel       island     3
10 aignan       island     2
# … with 52 more rows

#just where island is the second term
islandbigram <- malinowski_bigrams %>% filter(word2 == "island")
islandbigram %>% count(word1, word2, sort = TRUE) %>% top_n(20)
# A tibble: 28 × 3
   word1        word2      n
   <chr>        <chr>  <int>
 1 woodlark     island    44
 2 main         island    16
 3 fergusson    island    12
 4 normanby     island    11
 5 coral        island     4
 6 dobu         island     4
 7 neighbouring island     4
 8 rossel       island     3
 9 aignan       island     2
10 amphlett     island     2
# … with 18 more rows

##look at words that appear next to  the word "canoe"
canoebigram <- malinowski_bigrams %>% filter(word1 == "canoe" | word2 == "canoe")
canoebigram %>% count(word1, word2, sort = TRUE) %>% top_n(20)
# A tibble: 39 × 3
   word1   word2        n
   <chr>   <chr>    <int>
 1 canoe   building    45
 2 canoe   magic       30
 3 flying  canoe       10
 4 canoe   flies        7
 5 canoe   builder      5
 6 masawa  canoe        5
 7 canoe   body         4
 8 canoe   spells       4
 9 canoe   thou         4
10 chief's canoe        4
# … with 29 more rows

Sentiment analysis

Texts often contain certain emotions, feelings, or sentiments that can tell us more about what they mean. In a way, coding text data for sentiments is similar to the qualitative research method of coding fieldnotes for themes. Because of this, you can develop your own custom lexicon for your research context. However, because this is a popular methodology, many existing sentiment analysis dictionaries have been developed and publicly shared.
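A custom lexicon is simply a two-column table of words and the codes you assign to them, which can later be joined to a tokenized text in the same way as the dictionaries below. A minimal sketch, with entirely illustrative words and codes:

#a hand-built lexicon; the words and codes here are only examples
my_lexicon <- tibble(word = c("canoe", "sail", "garden"), code = c("seafaring", "seafaring", "subsistence"))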

We’ll work with the NRC Emotion Lexicon. First, we can load the NRC lexicon and look at the different types of sentiments that it contains.

# load the nrc sentiment dictionary
get_sentiments("nrc")
# A tibble: 13,875 × 2
   word        sentiment
   <chr>       <chr>    
 1 abacus      trust    
 2 abandon     fear     
 3 abandon     negative 
 4 abandon     sadness  
 5 abandoned   anger    
 6 abandoned   fear     
 7 abandoned   negative 
 8 abandoned   sadness  
 9 abandonment anger    
10 abandonment fear     
# … with 13,865 more rows
nrcdf <- get_sentiments("nrc")
#take a look at the top sentiments that occur in the lexicon
nrcdf %>% count(sentiment,sort=T)
# A tibble: 10 × 2
   sentiment        n
   <chr>        <int>
 1 negative      3318
 2 positive      2308
 3 fear          1474
 4 anger         1246
 5 trust         1230
 6 sadness       1187
 7 disgust       1056
 8 anticipation   837
 9 joy            687
10 surprise       532

Using inner_join() we can combine the sentiments with the words from Malinowski, effectively “tagging” each word with a particular sentiment.

#merge sentiments to malinowski data
malinowski1922_sentiment <- malinowski1922tidy %>% inner_join(get_sentiments("nrc"))
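Before filtering by individual sentiments, it can be helpful to check the overall distribution of tagged words across sentiments. A quick sketch (not part of the exercise below):

#count how many tagged words fall under each sentiment
malinowski1922_sentiment %>% count(sentiment, sort = TRUE)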

Try it

With the new merged and tagged dataframe, make a table of the top words in Malinowski that are associated with the sentiment "trust" and one other sentiment of your choice. Reflect on how you might interpret these results. Do you find this information useful? Is there any place you could see sentiment analysis being useful in your own research?

Click for solution
# look at the top words associated with trust
malinowski1922_sentiment %>% filter(sentiment=="trust") %>% count(word,sort=T)
# A tibble: 480 × 2
   word         n
   <chr>    <int>
 1 food       228
 2 word       180
 3 exchange   149
 4 found      141
 5 trade      133
 6 rule       124
 7 clan       104
 8 account    101
 9 formula     92
10 real        92
# … with 470 more rows

#pick another sentiment and pull out the top 20 words associated with this sentiment.
malinowski1922_sentiment %>% filter(sentiment=="surprise") %>% count(word,sort=T) %>% top_n(20)
# A tibble: 20 × 2
   word           n
   <chr>      <int>
 1 magical      307
 2 shell        181
 3 gift         131
 4 magician     100
 5 spirits       64
 6 tree          55
 7 finally       54
 8 sorcery       46
 9 death         44
10 ceremony      37
11 deal          37
12 leave         32
13 hero          30
14 remarkable    29
15 break         27
16 sun           27
17 feeling       25
18 mouth         23
19 catch         22
20 art           21
malinowski1922_sentiment %>% filter(sentiment=="sadness") %>% count(word,sort=T) %>% top_n(20)
# A tibble: 20 × 2
   word          n
   <chr>     <int>
 1 shell       181
 2 mother       54
 3 evil         53
 4 doubt        50
 5 death        44
 6 bad          40
 7 mortuary     37
 8 bottom       36
 9 leave        32
10 broken       31
11 shipwreck    28
12 danger       27
13 feeling      25
14 fall         23
15 sentence     23
16 art          21
17 hut          20
18 die          19
19 disease      18
20 lie          18

Case study: Permafrost and climate change survey

Now that we’ve learned a bit about text analysis using Malinowski, let’s test our skills on a real-world dataset. Here we will use data from a survey in two Inupiaq villages in Alaska to examine how individuals in these communities feel about climate change and thawing permafrost. These data are drawn from: William B. Bowden 2013. Perceptions and implications of thawing permafrost and climate change in two Inupiaq villages of arctic Alaska Link. We will examine the responses to two open-ended questions: (Q5) "What is causing it [permafrost around X village] to change?" and (Q69) "What feelings do you have when thinking about the possibility of future climate change in and around [village name]?".

First we load the data and subset out the columns of interest.

#we will work with the permafrost survey data.
surv<- read.csv("https://maddiebrown.github.io/ANTH630/data/Survey_AKP-SEL.csv", stringsAsFactors = F)
surv_subset <- surv %>% select(Village, Survey.Respondent, Age.Group, X69..Feelings, X5..PF.Cause.) 

Then we can quickly calculate the most frequent terms across all 80 responses.

class(surv$X69..Feelings) #make sure your column is a character variable
[1] "character"
surv_tidy <- surv_subset %>% unnest_tokens(word, X69..Feelings) %>% anti_join(stop_words)
#what are most common words?
feelingswordcount <- surv_tidy %>% count(word,sort=T)

Try it

Make wordclouds of the word frequency in responses about feelings related to climate change using two different methods.

Click for solution
surv_tidy %>% count(word) %>%  with(wordcloud(word, n, max.words = 100))

#wordcloud2(data = feelingswordcount)

Comparing word frequency across samples

Are there noticeable differences in responses across individuals from different sites? We can compare the responses about “What feelings do you have when thinking about the possibility of future climate change in and around [village name]?” from the permafrost survey, based on which village the respondent lives in.

#word frequency by village
surv_tidy <- surv_subset %>% unnest_tokens(word, X69..Feelings) %>% anti_join(stop_words)

#what are most common words?
surv_tidy %>% count(word,sort=T) %>% top_n(20)
        word  n
1     change 10
2    climate  8
3        sad  7
4    animals  6
5       cold  6
6     future  6
7      scary  6
8    weather  6
9      adapt  5
10   caribou  5
11        nc  5
12   worried  5
13  changing  4
14        dk  4
15    ground  4
16      move  4
17    people  4
18     worry  4
19    affect  3
20     blank  3
21 concerned  3
22      days  3
23   farther  3
24      feel  3
25      food  3
26      land  3
27     river  3
28    scared  3
29      time  3
30     water  3

#we can also look at the top words by village
byvillage <- surv_tidy %>% count(Village,word,sort=T) %>% ungroup()
byvillage %>% top_n(20)
   Village    word n
1      AKP animals 6
2      AKP  change 6
3      SEL   scary 6
4      AKP  future 5
5      AKP      nc 5
6      AKP caribou 4
7      AKP climate 4
8      AKP    cold 4
9      AKP     sad 4
10     AKP weather 4
11     AKP   worry 4
12     SEL  change 4
13     SEL climate 4
14     SEL      dk 4
15     SEL  ground 4
16     SEL    move 4
17     AKP    days 3
18     AKP  people 3
19     AKP worried 3
20     SEL   adapt 3
21     SEL   blank 3
22     SEL   river 3
23     SEL     sad 3
24     SEL   water 3

top_10 <- byvillage %>%
  group_by(Village) %>%
  top_n(10, n) %>%
  ungroup() %>%
  arrange(Village, desc(n))

ggplot(top_10, aes(x=reorder(word,n),y=n)) + geom_bar(stat="identity") +coord_flip() +ggtitle("Top terms by village") + labs(x="Word", y="Count") +facet_wrap(~ Village, scales = "free_y") 

Topic Modeling

In addition to analyzing word and bigram frequencies, we can also analyze texts using topic modeling. Topic modeling allows us to identify themes in a text without needing to specify in advance which themes or groupings we expect to emerge. This can be very useful when you have large volumes of messy data, or data from multiple diverse sources that you need to parse. We will use Latent Dirichlet allocation (LDA), following the explanation in Text Mining with R.

Before we can identify themes across responses, however, we need to make sure each "document" or "response" has a unique identifier.

Try it

What is the primary key or unique identifier for this dataset? How do you know? Why can’t you use Survey.Respondent as a unique identifier?

Make a new primary key called “ID” that has a different value for each unique response.

Click for solution
surv_subset %>% select(Village, Survey.Respondent)
   Village Survey.Respondent
1      AKP                 1
2      AKP                 2
3      AKP                 3
4      AKP                 4
5      AKP                 5
6      AKP                 6
7      AKP                 7
8      AKP                 8
9      AKP                 9
10     AKP                10
11     AKP                11
12     AKP                12
13     AKP                13
14     AKP                14
15     AKP                15
16     AKP                16
17     AKP                17
18     AKP                18
19     AKP                19
20     AKP                20
21     AKP                21
22     AKP                22
23     AKP                23
24     AKP                24
25     AKP                25
26     AKP                26
27     AKP                27
28     AKP                28
29     AKP                29
30     AKP                30
31     AKP                31
32     AKP                32
33     AKP                33
34     AKP                34
35     AKP                35
36     AKP                36
37     AKP                37
38     AKP                38
39     AKP                39
40     SEL                 1
41     SEL                 2
42     SEL                 3
43     SEL                 4
44     SEL                 5
45     SEL                 6
46     SEL                 7
47     SEL                 8
48     SEL                 9
49     SEL                10
50     SEL                11
51     SEL                12
52     SEL                13
53     SEL                14
54     SEL                15
55     SEL                16
56     SEL                17
57     SEL                18
58     SEL                19
59     SEL                20
60     SEL                21
61     SEL                22
62     SEL                23
63     SEL                24
64     SEL                25
65     SEL                26
66     SEL                27
67     SEL                28
68     SEL                29
69     SEL                30
70     SEL                31
71     SEL                32
72     SEL                33
73     SEL                34
74     SEL                35
75     SEL                36
76     SEL                37
77     SEL                38
78     SEL                39
79     SEL                40
80     SEL                41
surv_subset<- surv_subset %>% mutate(ID=paste(Village,Survey.Respondent, sep="_"))

Frequency of word pairs per response

#look at the bigrams in these responses.
surv_subset %>% unnest_tokens(output=bigrams,input=X69..Feelings, token="ngrams",n=2) %>% count(bigrams,sort=T) %>% top_n(20)
         bigrams  n
1           <NA> 15
2        have to  9
3     don't know  4
4        it will  4
5        need to  4
6       to adapt  4
7          to be  4
8        to move  4
9        we have  4
10       we will  4
11       able to  3
12      about it  3
13       be able  3
14        had to  3
15 higher ground  3
16         if we  3
17 scary thought  3
18   the climate  3
19   the weather  3
20   think about  3
21     to higher  3
22       used to  3
23     will have  3
24      won't be  3

#look at pairwise counts per response. how often do two words show up together in one person's response?
library(widyr)
surv_tidy %>% pairwise_count(word, Survey.Respondent,sort=T)
# A tibble: 3,322 × 3
   item1    item2        n
   <chr>    <chr>    <dbl>
 1 change   cold         3
 2 caribou  cold         3
 3 change   climate      3
 4 cold     change       3
 5 climate  change       3
 6 adapt    change       3
 7 scary    change       3
 8 cold     caribou      3
 9 changing sad          3
10 sad      changing     3
# … with 3,312 more rows
# so we see that "change" and "cold" appear in three responses, as do climate and change. however, we previously learned that the Survey.Respondent column is not a unique identifier for the responses. Let's run the same code, but with the new ID column we created.

###let's make a new surv_tidy object that incorporates the new ID we made
surv_tidy <- surv_subset %>%  unnest_tokens(word, X69..Feelings) %>% anti_join(stop_words)
surv_tidy %>% pairwise_count(word, ID,sort=T) %>% top_n(20)
# A tibble: 58 × 3
   item1    item2       n
   <chr>    <chr>   <dbl>
 1 adapt    change      3
 2 scary    change      3
 3 change   adapt       3
 4 change   scary       3
 5 climate  cold        2
 6 change   cold        2
 7 weather  cold        2
 8 cold     climate     2
 9 change   climate     2
10 relocate climate     2
# … with 48 more rows
#in this case the output is nearly the same, but in other cases this distinction can make a significant difference.

The first step in creating a topic model is to count the number of times each word appears in each individual document (or, in our case, each response). Luckily, we can count by two variables using the count() function. Let’s create a new byresponse object.

byresponse <- surv_tidy %>% count(ID,word,sort=T) %>% ungroup()

#check how many responses are included in the analysis. this allows you to double check that the new unique identifier we made worked as expected.
unique(byresponse$ID)
 [1] "AKP_26" "AKP_1"  "AKP_17" "AKP_18" "AKP_3"  "AKP_32" "AKP_5"  "SEL_21"
 [9] "AKP_10" "AKP_11" "AKP_12" "AKP_13" "AKP_14" "AKP_15" "AKP_16" "AKP_19"
[17] "AKP_2"  "AKP_20" "AKP_22" "AKP_23" "AKP_24" "AKP_25" "AKP_27" "AKP_28"
[25] "AKP_29" "AKP_30" "AKP_31" "AKP_33" "AKP_34" "AKP_35" "AKP_36" "AKP_37"
[33] "AKP_38" "AKP_4"  "AKP_6"  "AKP_7"  "AKP_8"  "AKP_9"  "SEL_1"  "SEL_10"
[41] "SEL_11" "SEL_13" "SEL_14" "SEL_15" "SEL_17" "SEL_18" "SEL_19" "SEL_2" 
[49] "SEL_20" "SEL_22" "SEL_24" "SEL_25" "SEL_26" "SEL_27" "SEL_29" "SEL_3" 
[57] "SEL_30" "SEL_31" "SEL_32" "SEL_33" "SEL_34" "SEL_35" "SEL_36" "SEL_37"
[65] "SEL_38" "SEL_39" "SEL_4"  "SEL_40" "SEL_5"  "SEL_6"  "SEL_7"  "SEL_8" 
[73] "SEL_9" 
length(unique(byresponse$ID))
[1] 73

Now we can convert our longform word list into a document-term matrix. Read more here

surv_dtm <- byresponse %>% cast_dtm(ID, word, n)
?cast_dtm #read up on how this function works

Run the LDA() function and choose a number of topics (k). In this case, let’s try it with 2.

surv_lda <- LDA(surv_dtm, k = 2, control = list(seed = 9999))
#look at our output
str(surv_lda)
Formal class 'LDA_VEM' [package "topicmodels"] with 14 slots
  ..@ alpha          : num 35.6
  ..@ call           : language LDA(x = surv_dtm, k = 2, control = list(seed = 9999))
  ..@ Dim            : int [1:2] 73 210
  ..@ control        :Formal class 'LDA_VEMcontrol' [package "topicmodels"] with 13 slots
  .. .. ..@ estimate.alpha: logi TRUE
  .. .. ..@ alpha         : num 25
  .. .. ..@ seed          : int 9999
  .. .. ..@ verbose       : int 0
  .. .. ..@ prefix        : chr "/var/folders/l_/hl0qrh9535l691r67nxlvd9c0000gn/T//RtmpfCj5rL/file46556f3259ed"
  .. .. ..@ save          : int 0
  .. .. ..@ nstart        : int 1
  .. .. ..@ best          : logi TRUE
  .. .. ..@ keep          : int 0
  .. .. ..@ estimate.beta : logi TRUE
  .. .. ..@ var           :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
  .. .. .. .. ..@ iter.max: int 500
  .. .. .. .. ..@ tol     : num 1e-06
  .. .. ..@ em            :Formal class 'OPTcontrol' [package "topicmodels"] with 2 slots
  .. .. .. .. ..@ iter.max: int 1000
  .. .. .. .. ..@ tol     : num 1e-04
  .. .. ..@ initialize    : chr "random"
  ..@ k              : int 2
  ..@ terms          : chr [1:210] "days" "cold" "april" "march" ...
  ..@ documents      : chr [1:73] "AKP_26" "AKP_1" "AKP_17" "AKP_18" ...
  ..@ beta           : num [1:2, 1:210] -4.73 -4.75 -5.22 -3.52 -4.58 ...
  ..@ gamma          : num [1:73, 1:2] 0.505 0.491 0.476 0.498 0.516 ...
  ..@ wordassignments:List of 5
  .. ..$ i   : int [1:330] 1 1 1 1 1 1 1 1 1 1 ...
  .. ..$ j   : int [1:330] 1 11 31 32 48 50 62 63 64 65 ...
  .. ..$ v   : num [1:330] 1 1 1 1 2 2 2 1 1 2 ...
  .. ..$ nrow: int 73
  .. ..$ ncol: int 210
  .. ..- attr(*, "class")= chr "simple_triplet_matrix"
  ..@ loglikelihood  : num [1:73] -64.3 -41.4 -66.3 -50.1 -49.8 ...
  ..@ iter           : int 8
  ..@ logLiks        : num(0) 
  ..@ n              : int 343

#examine the probability that each word is in a particular topic group
surv_topics <- tidy(surv_lda, matrix = "beta")
surv_topics
# A tibble: 420 × 3
   topic term       beta
   <int> <chr>     <dbl>
 1     1 days   0.00885 
 2     2 days   0.00864 
 3     1 cold   0.00540 
 4     2 cold   0.0296  
 5     1 april  0.0102  
 6     2 april  0.00140 
 7     1 march  0.00346 
 8     2 march  0.00821 
 9     1 months 0.000975
10     2 months 0.0107  
# … with 410 more rows

Examine the top words for each topic identified by the model.

top_words <- surv_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
top_words
# A tibble: 20 × 3
   topic term      beta
   <int> <chr>    <dbl>
 1     1 climate 0.0446
 2     1 change  0.0363
 3     1 animals 0.0324
 4     1 sad     0.0258
 5     1 adapt   0.0204
 6     1 caribou 0.0196
 7     1 scary   0.0180
 8     1 blank   0.0173
 9     1 river   0.0159
10     1 future  0.0145
11     2 cold    0.0296
12     2 weather 0.0276
13     2 worried 0.0248
14     2 nc      0.0224
15     2 change  0.0220
16     2 future  0.0205
17     2 dk      0.0181
18     2 people  0.0175
19     2 land    0.0172
20     2 scary   0.0170

We can also examine the results graphically.

#plot these top words for each topic (adapted from https://www.tidytextmining.com/topicmodeling.html)
top_words %>% group_by(topic) %>%
  mutate(term = fct_reorder(term, beta)) %>% ungroup() %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") + theme_minimal()
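Besides the per-topic word probabilities (beta), the fitted model also estimates per-document topic probabilities (gamma), which show how strongly each response is associated with each topic. A brief sketch; surv_gamma is an arbitrary name.

#examine the probability that each response belongs to each topic
surv_gamma <- tidy(surv_lda, matrix = "gamma")
surv_gamma %>% arrange(document, desc(gamma))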

Try it

Repeat the topic modeling analysis but using 6 topics instead of two.

Click for solution
##repeat the analysis but with 6 topics
surv_lda6 <- LDA(surv_dtm, k = 6, control = list(seed = 9999))
#examine the probability that each word is in a particular topic group
surv_topics6 <- tidy(surv_lda6, matrix = "beta")
surv_topics6
# A tibble: 1,260 × 3
   topic term       beta
   <int> <chr>     <dbl>
 1     1 days  4.88e-  2
 2     2 days  9.19e-262
 3     3 days  1.69e-269
 4     4 days  1.17e-270
 5     5 days  5.77e-267
 6     6 days  4.77e-268
 7     1 cold  1.63e-  2
 8     2 cold  5.81e-  2
 9     3 cold  1.68e-  2
10     4 cold  4.86e-235
# … with 1,250 more rows
#examine top words for each topic
top_words6 <- surv_topics6 %>%
  group_by(topic) %>%
  top_n(5, beta) %>%
  ungroup() %>%
  arrange(topic, desc(beta))
#plot these top words for each topic (adapted from https://www.tidytextmining.com/topicmodeling.html)
top_words6 %>% group_by(topic) %>%
  mutate(term = fct_reorder(term, beta)) %>% ungroup() %>%
  ggplot(aes(beta, term, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") + theme_minimal()

In this case, our sample is small, so topic modeling is not necessarily the best method to use. However, even from this small sample, you can see that some topics emerge from the text that were not previously apparent.

Manual text wrangling

Sometimes you’ll need to edit text or strings manually. For example, you may find that for your research question, you are less interested in differentiating between the terms running, run, and runner than in identifying clusters of beliefs about running as a more general concept. On the other hand, you might want to differentiate between runners and running as beliefs about groups of people vs. the act of running. How you choose to transform text data depends on your research questions and your understanding of the cultural context.
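As one sketch of collapsing such variants, str_replace_all() accepts a named vector of patterns and replacements. The example text and word choices below are purely illustrative:

#collapse word variants into a single form (illustrative example)
example_text <- c("she enjoys running", "he is a runner", "they run daily")
str_replace_all(example_text, c("running" = "run", "runner" = "run"))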

R has a number of helpful functions for manually adjusting strings. We’ll cover a few to get you started. Let’s go back to the permafrost and climate change survey and look at responses to (Q5): "What is causing it [permafrost around X village] to change?"

First let’s look at the raw data. What are some potential issues in the strings below that might make text analysis difficult or ineffective?

surv$X5..PF.Cause.[10:30]
 [1] "climate change"                                                                                                                                  
 [2] "too warm"                                                                                                                                        
 [3] "warmer weather, shorter winters, lack of snow, fast springs. [will affect AKP because we] use Argos to go out."                                  
 [4] "weather warming"                                                                                                                                 
 [5] "temperature"                                                                                                                                     
 [6] "heat. A lot of heat."                                                                                                                            
 [7] "melting of the ground - goes down"                                                                                                               
 [8] "spirited answer. Her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
 [9] "probably global warming"                                                                                                                         
[10] "temperature outside is not steady"                                                                                                               
[11] "(blank)"                                                                                                                                         
[12] "N/A"                                                                                                                                             
[13] "most likely war weather or global warming"                                                                                                       
[14] "not much winter - hardly get snow. Always wind blown. Summer be rain, rain, rain. Late snow."                                                    
[15] "warmer winters"                                                                                                                                  
[16] "global warming"                                                                                                                                  
[17] "warming weather, longer summer/fall season"                                                                                                      
[18] "I have no idea"                                                                                                                                  
[19] "warmer climate"                                                                                                                                  
[20] "Seems like there's lots of rain & water causes ground to thaw. So maybe accumulated water? Maybe warm weather?"                                  
[21] "the heat wave - winter frosts"                                                                                                                   

Luckily, we can manually adjust the strings to make them easier to analyze systematically. For example, we might convert the text to lowercase, trim whitespace, and remove any empty or missing rows.

#make a new column to hold the tidy data
surv$cause_tidy <- surv$X5..PF.Cause.
#make lower case
surv$cause_tidy <- tolower(surv$cause_tidy)
#remove white space at beginning and end of string
surv$cause_tidy<- trimws(surv$cause_tidy)
#filter out blank or empty rows
surv<- surv %>% filter(surv$cause_tidy!="") 
surv <- surv %>% filter(surv$cause_tidy!="(blank)")
surv <- surv %>% filter(surv$cause_tidy!="n/a")

We can also directly replace particular strings. Here we change some strings with typos.

surv$cause_tidy<- surv$cause_tidy %>% str_replace("wamer", "warmer")
surv$cause_tidy<- surv$cause_tidy %>% str_replace("lnoger", "longer")

Another common string data transformation involves grouping responses together into more standardized categories. You can transform cell values individually or based on exact string matches. In addition, using %like% we can transform any string where just part of it matches a particular pattern. For example, we might decide that any time the string "warm" appears in a response, the overall theme of the response is associated with "global warming". Or, based on our ethnographic understanding of the context, we might know that "seasonal changes" are an important cause of permafrost change in local cultural models. We can then look for key terms that allow us to rapidly recode multiple responses that are likely to fit in this category. In this case, "late" and "early".

#group some responses together based on the presence of a particular string
surv <- surv %>% mutate(cause_tidy=replace(cause_tidy,cause_tidy %like% "warm","global warming")) 
surv$cause_tidy[1:30]
 [1] "environment"                                                                                                                                     
 [2] "exhaust"                                                                                                                                         
 [3] "global warming"                                                                                                                                  
 [4] "global warming"                                                                                                                                  
 [5] "hot summers, early springs. in super cold winters the ground comes up & cracks and water comes out."                                             
 [6] "global warming"                                                                                                                                  
 [7] "climate change"                                                                                                                                  
 [8] "freezing & thawing in fall & spring"                                                                                                             
 [9] "global warming"                                                                                                                                  
[10] "climate change"                                                                                                                                  
[11] "global warming"                                                                                                                                  
[12] "global warming"                                                                                                                                  
[13] "global warming"                                                                                                                                  
[14] "temperature"                                                                                                                                     
[15] "heat. a lot of heat."                                                                                                                            
[16] "melting of the ground - goes down"                                                                                                               
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"                                                                                                                                  
[19] "temperature outside is not steady"                                                                                                               
[20] "global warming"                                                                                                                                  
[21] "not much winter - hardly get snow. always wind blown. summer be rain, rain, rain. late snow."                                                    
[22] "global warming"                                                                                                                                  
[23] "global warming"                                                                                                                                  
[24] "global warming"                                                                                                                                  
[25] "i have no idea"                                                                                                                                  
[26] "global warming"                                                                                                                                  
[27] "global warming"                                                                                                                                  
[28] "the heat wave - winter frosts"                                                                                                                   
[29] "global warming"                                                                                                                                  
[30] "global warming"                                                                                                                                  
surv <- surv %>% mutate(cause_tidy=replace(cause_tidy,cause_tidy %like% "early"|cause_tidy %like% "late","seasonal changes")) 
surv$cause_tidy[1:30]
 [1] "environment"                                                                                                                                     
 [2] "exhaust"                                                                                                                                         
 [3] "global warming"                                                                                                                                  
 [4] "global warming"                                                                                                                                  
 [5] "seasonal changes"                                                                                                                                
 [6] "global warming"                                                                                                                                  
 [7] "climate change"                                                                                                                                  
 [8] "freezing & thawing in fall & spring"                                                                                                             
 [9] "global warming"                                                                                                                                  
[10] "climate change"                                                                                                                                  
[11] "global warming"                                                                                                                                  
[12] "global warming"                                                                                                                                  
[13] "global warming"                                                                                                                                  
[14] "temperature"                                                                                                                                     
[15] "heat. a lot of heat."                                                                                                                            
[16] "melting of the ground - goes down"                                                                                                               
[17] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
[18] "global warming"                                                                                                                                  
[19] "temperature outside is not steady"                                                                                                               
[20] "global warming"                                                                                                                                  
[21] "seasonal changes"                                                                                                                                
[22] "global warming"                                                                                                                                  
[23] "global warming"                                                                                                                                  
[24] "global warming"                                                                                                                                  
[25] "i have no idea"                                                                                                                                  
[26] "global warming"                                                                                                                                  
[27] "global warming"                                                                                                                                  
[28] "the heat wave - winter frosts"                                                                                                                   
[29] "global warming"                                                                                                                                  
[30] "global warming"                                                                                                                                  

#compare the original with your categorizations
#surv %>% select(X5..PF.Cause.,cause_tidy)

We won’t get into too much detail today, but you can also search and select string data using regular expressions. You can read more in R4DS. Here let’s use str_detect() to pull out some strings with regular expressions.


#any responses ending in "ing"
surv$cause_tidy[str_detect(surv$cause_tidy,"ing$")]
 [1] "global warming"                      "global warming"                     
 [3] "global warming"                      "freezing & thawing in fall & spring"
 [5] "global warming"                      "global warming"                     
 [7] "global warming"                      "global warming"                     
 [9] "global warming"                      "global warming"                     
[11] "global warming"                      "global warming"                     
[13] "global warming"                      "global warming"                     
[15] "global warming"                      "global warming"                     
[17] "global warming"                      "global warming"                     
[19] "global warming"                      "global warming"                     
[21] "global warming"                      "global warming"                     
[23] "global warming"                      "global warming"                     
[25] "global warming"                      "global warming"                     
[27] "global warming"                      "global warming"                     
[29] "global warming"                      "global warming"                     
[31] "global warming"                      "global warming"                     
[33] "global warming"                      "global warming"                     
[35] "global warming"                     

#any responses that contain a 'w' followed by either an 'e' or an 'a'
surv$cause_tidy[str_detect(surv$cause_tidy,"w[ea]")]
 [1] "global warming"                                                                                                                                  
 [2] "global warming"                                                                                                                                  
 [3] "global warming"                                                                                                                                  
 [4] "global warming"                                                                                                                                  
 [5] "global warming"                                                                                                                                  
 [6] "global warming"                                                                                                                                  
 [7] "global warming"                                                                                                                                  
 [8] "spirited answer. her parents told of last people of culture to disappear - then weather and all surrounding began birthing pains for catastrophy"
 [9] "global warming"                                                                                                                                  
[10] "global warming"                                                                                                                                  
[11] "global warming"                                                                                                                                  
[12] "global warming"                                                                                                                                  
[13] "global warming"                                                                                                                                  
[14] "global warming"                                                                                                                                  
[15] "global warming"                                                                                                                                  
[16] "the heat wave - winter frosts"                                                                                                                   
[17] "global warming"                                                                                                                                  
[18] "global warming"                                                                                                                                  
[19] "weather"                                                                                                                                         
[20] "global warming"                                                                                                                                  
[21] "global warming"                                                                                                                                  
[22] "global warming"                                                                                                                                  
[23] "global warming"                                                                                                                                  
[24] "global warming"                                                                                                                                  
[25] "weather."                                                                                                                                        
[26] "global warming"                                                                                                                                  
[27] "global warming"                                                                                                                                  
[28] "global warming"                                                                                                                                  
[29] "global warming"                                                                                                                                  
[30] "global warming"                                                                                                                                  
[31] "global warming"                                                                                                                                  
[32] "global warming"                                                                                                                                  
[33] "global warming"                                                                                                                                  
[34] "global warming"                                                                                                                                  
[35] "global warming"                                                                                                                                  
[36] "global warming"                                                                                                                                  
[37] "global warming"                                                                                                                                  
[38] "global warming"                                                                                                                                  

#any responses that contain the string erosion 
surv$cause_tidy[str_detect(surv$cause_tidy,"erosion")]
[1] "erosion, and real hot summers and a lot of snow & rain."       
[2] "mud goes down river, cracking all along & falling in - erosion"
[3] "ground erosion"                                                
[4] "erosion"                                                       
# responses that contain the string erosion with at least one character before it (in a regular expression, . matches any single character)
surv$cause_tidy[str_detect(surv$cause_tidy,".erosion")]
[1] "mud goes down river, cracking all along & falling in - erosion"
[2] "ground erosion"                                                

Regular expressions are extremely useful for quickly searching through and transforming large volumes of string data, and we have only scratched the surface today.
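
For example, a few more patterns applied to the same column (a quick sketch; which rows match will depend on your own survey responses):

# responses that begin with the word erosion (^ anchors the match to the start of the string)
surv$cause_tidy[str_detect(surv$cause_tidy,"^erosion")]
# responses that end with the word erosion ($ anchors the match to the end of the string)
surv$cause_tidy[str_detect(surv$cause_tidy,"erosion$")]
# responses that mention either erosion or flooding (| means "or")
surv$cause_tidy[str_detect(surv$cause_tidy,"erosion|flood")]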

Whenever you transform large volumes of data with string detection and regular expressions, it is critical to double-check that each operation is actually working as you expect. Pay attention to the order of transformations as well, so that a later step does not overwrite the results of an earlier one.
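
As a minimal sketch of that kind of check, suppose we add a whitespace-cleaning step (str_squish() here is a hypothetical extra step, not part of the survey cleaning above); we can count matches before and after and confirm nothing changed unexpectedly:

# how many rows mention erosion before the transformation?
n_before <- sum(str_detect(surv$cause_tidy,"erosion"))
# hypothetical transformation: collapse repeated whitespace in the responses
cleaned <- str_squish(surv$cause_tidy)
# the same count afterwards -- if it differs, something unexpected happened
n_after <- sum(str_detect(cleaned,"erosion"))
n_before == n_after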

Creating new variables with str_detect()

Sometimes it is useful to create flags or indicator variables in your data. These let you quickly pull out rows with particular characteristics. For example, we can create a new binary column indicating whether or not a response refers to global warming. This variable can then be used for further grouping, data visualization, or other tasks, as sketched below.

surv <- surv %>% mutate(GlobalWarmingYN=str_detect(cause_tidy,"global warming"))
table(surv$GlobalWarmingYN) # how many responses contain the string global warming?

FALSE  TRUE 
   37    34 
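
Once the flag exists it plugs straight into the usual dplyr verbs; for instance (a sketch using only the columns created above):

# pull out just the responses that mention global warming
surv %>% filter(GlobalWarmingYN) %>% select(cause_tidy)
# or summarize the split in tidyverse style rather than with table()
surv %>% count(GlobalWarmingYN)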

Parts of speech tagging

We can also tag the parts of speech in a text. This lets us focus an analysis on verbs, nouns, or other parts of speech of interest. For example, in a study of sentiment we might pull out adjectives to understand how people feel about or describe a particular phenomenon, or pull out verbs to understand the kinds of actions people associate with certain cultural practices or beliefs. Let's tag the parts of speech in Malinowski 1922 to learn more about the places and cultural practices documented in this book.

Try it

Using the cnlp_annotate() function we can tag the parts of speech in Malinowski 1922. This function can take a long time to run. This is the last thing we will do today, so feel free to let it run and then take a break and come back to finish these problems.

  1. Make a new object using only the token part of the output from cnlp_annotate() and then examine the $upos column. What are all the unique parts of speech in this dataset?

  2. Select and examine the top 30 nouns and verbs in this dataset. Do any of the terms surprise you? How might this level of analysis of the text be meaningful for your interpretation of its themes?

Click for solution
library(cleanNLP)
cnlp_init_udpipe()
#tag parts of speech (this can take a long time to run)
malinowskiannotatedtext <- cnlp_annotate(malinowski1922tidy$word)
str(malinowskiannotatedtext) # look at the structure; the result is a list, so we need to pull out the token element
List of 2
 $ token   : tibble [81,939 × 11] (S3: tbl_df/tbl/data.frame)
  ..$ doc_id       : int [1:81939] 1 2 3 4 5 6 7 8 9 10 ...
  ..$ sid          : int [1:81939] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ tid          : chr [1:81939] "1" "1" "1" "1" ...
  ..$ token        : chr [1:81939] "argonauts" "western" "pacific" "account" ...
  ..$ token_with_ws: chr [1:81939] "argonauts" "western" "pacific" "account" ...
  ..$ lemma        : chr [1:81939] "argonaut" "western" "pacific" "account" ...
  ..$ upos         : chr [1:81939] "NOUN" "ADJ" "ADJ" "NOUN" ...
  ..$ xpos         : chr [1:81939] "NNS" "JJ" "JJ" "NN" ...
  ..$ feats        : chr [1:81939] "Number=Plur" "Degree=Pos" "Degree=Pos" "Number=Sing" ...
  ..$ tid_source   : chr [1:81939] "0" "0" "0" "0" ...
  ..$ relation     : chr [1:81939] "root" "root" "root" "root" ...
 $ document:'data.frame':   81199 obs. of  1 variable:
  ..$ doc_id: int [1:81199] 1 2 3 4 5 6 7 8 9 10 ...
 - attr(*, "class")= chr [1:2] "cnlp_annotation" "list"
malinowskiannotatedtextfull <- data.frame(malinowskiannotatedtext$token)
str(malinowskiannotatedtextfull)
'data.frame':   81939 obs. of  11 variables:
 $ doc_id       : int  1 2 3 4 5 6 7 8 9 10 ...
 $ sid          : int  1 1 1 1 1 1 1 1 1 1 ...
 $ tid          : chr  "1" "1" "1" "1" ...
 $ token        : chr  "argonauts" "western" "pacific" "account" ...
 $ token_with_ws: chr  "argonauts" "western" "pacific" "account" ...
 $ lemma        : chr  "argonaut" "western" "pacific" "account" ...
 $ upos         : chr  "NOUN" "ADJ" "ADJ" "NOUN" ...
 $ xpos         : chr  "NNS" "JJ" "JJ" "NN" ...
 $ feats        : chr  "Number=Plur" "Degree=Pos" "Degree=Pos" "Number=Sing" ...
 $ tid_source   : chr  "0" "0" "0" "0" ...
 $ relation     : chr  "root" "root" "root" "root" ...

# what are all the different parts of speech that have been tagged?
unique(malinowskiannotatedtextfull$upos)
 [1] "NOUN"  "ADJ"   "X"     "NUM"   "VERB"  "ADV"   "PRON"  "PART"  "INTJ" 
[10] "PROPN" "AUX"   "SYM"   "SCONJ" "ADP"   "DET"  

#verb analysis: first look at some of the verbs that occur in the book
#malinowskiannotatedtextfull %>% filter(upos=="VERB") %>% select(token,lemma) %>% data.frame() %>% top_n(30)
#top 30 verbs
malinowskiannotatedtextfull %>% filter(upos=="VERB") %>% count(token, sort=T) %>% top_n(30)
       token   n
1     called 248
2      found 141
3      means 139
4    carried 105
5  mentioned 101
6    meaning  99
7  performed  98
8    brought  91
9   obtained  89
10    flying  81
11   objects  76
12     makes  69
13  received  68
14  repeated  66
15      told  66
16    spoken  65
17     takes  61
18      left  60
19     bring  57
20   receive  56
21  consists  55
22    giving  55
23     visit  53
24     carry  52
25       eat  52
26       log  52
27    remain  52
28 connected  51
29 beginning  50
30    leaves  50

#what are the top 30 nouns?
malinowskiannotatedtextfull %>% filter(upos=="NOUN") %>% count(lemma, sort=T) %>% top_n(30) %>% data.frame()
        lemma    n
1       canoe 1118
2        kula  942
3       magic  880
4     village  624
5      native  596
6       spell  484
7      island  403
8        word  402
9        time  316
10       gift  315
11       sail  307
12  trobriand  307
13      chief  301
14        day  267
15 expedition  265
16        sea  257
17       form  254
18    chapter  243
19      shell  241
20       rite  232
21       food  228
22      beach  209
23   sinaketa  208
24      woman  208
25       myth  198
26    partner  194
27     people  189
28       rule  184
29       life  180
30   district  176
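
Because adjectives were mentioned above as a way of seeing how phenomena are described, the same pattern extends directly to them (a sketch reusing the objects from the solution; the output is left for you to explore):

#top 30 adjectives, following the same pattern as the noun counts
malinowskiannotatedtextfull %>% filter(upos=="ADJ") %>% count(lemma, sort=T) %>% top_n(30)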