
Working with messy data

This lesson covers how to work with messy, real-world data from the web. We will work with pre-scraped datasets in class. Resources for web scraping with R are linked above. Even data that comes from existing databases can be messy.

Load in libraries

library(tidyverse)
library(DescTools)
library(tidytext)
library(ggpubr)

Goodreads data

First, we will work with the Best Books Ever Dataset drawn from GoodReads.

Lorena Casanova Lozano, & Sergio Costa Planells. (2020). Best Books Ever Dataset (1.0.0) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.4265096

Load in the data. This is a big dataset, so loading it may take a minute depending on your computer.

#https://zenodo.org/record/4265096
books <- read.csv("https://maddiebrown.github.io/ANTH630/data/books_1.Best_Books_Ever.csv")
#take a look at the data
str(books)
'data.frame':   52478 obs. of  25 variables:
 $ bookId          : chr  "2767052-the-hunger-games" "2.Harry_Potter_and_the_Order_of_the_Phoenix" "2657.To_Kill_a_Mockingbird" "1885.Pride_and_Prejudice" ...
 $ title           : chr  "The Hunger Games" "Harry Potter and the Order of the Phoenix" "To Kill a Mockingbird" "Pride and Prejudice" ...
 $ series          : chr  "The Hunger Games #1" "Harry Potter #5" "To Kill a Mockingbird" "" ...
 $ author          : chr  "Suzanne Collins" "J.K. Rowling, Mary GrandPré (Illustrator)" "Harper Lee" "Jane Austen, Anna Quindlen (Introduction)" ...
 $ rating          : num  4.33 4.5 4.28 4.26 3.6 4.37 3.95 4.26 4.6 4.3 ...
 $ description     : chr  "WINNING MEANS FAME AND FORTUNE.LOSING MEANS CERTAIN DEATH.THE HUNGER GAMES HAVE BEGUN. . . .In the ruins of a p"| __truncated__ "There is a door at the end of a silent corridor. And it’s haunting Harry Pottter’s dreams. Why else would he be"| __truncated__ "The unforgettable novel of a childhood in a sleepy Southern town and the crisis of conscience that rocked it, T"| __truncated__ "Alternate cover edition of ISBN 9780679783268Since its immediate success in 1813, Pride and Prejudice has remai"| __truncated__ ...
 $ language        : chr  "English" "English" "English" "English" ...
 $ isbn            : chr  "9780439023481" "9780439358071" "9999999999999" "9999999999999" ...
 $ genres          : chr  "['Young Adult', 'Fiction', 'Dystopia', 'Fantasy', 'Science Fiction', 'Romance', 'Adventure', 'Teen', 'Post Apoc"| __truncated__ "['Fantasy', 'Young Adult', 'Fiction', 'Magic', 'Childrens', 'Adventure', 'Audiobook', 'Middle Grade', 'Classics"| __truncated__ "['Classics', 'Fiction', 'Historical Fiction', 'School', 'Literature', 'Young Adult', 'Historical', 'Novels', 'R"| __truncated__ "['Classics', 'Fiction', 'Romance', 'Historical Fiction', 'Literature', 'Historical', 'Novels', 'Historical Roma"| __truncated__ ...
 $ characters      : chr  "['Katniss Everdeen', 'Peeta Mellark', 'Cato (Hunger Games)', 'Primrose Everdeen', 'Gale Hawthorne', 'Effie Trin"| __truncated__ "['Sirius Black', 'Draco Malfoy', 'Ron Weasley', 'Petunia Dursley', 'Vernon Dursley', 'Dudley Dursley', 'Severus"| __truncated__ "['Scout Finch', 'Atticus Finch', 'Jem Finch', 'Arthur Radley', 'Mayella Ewell', 'Aunt Alexandra', 'Bob Ewell', "| __truncated__ "['Mr. Bennet', 'Mrs. Bennet', 'Jane Bennet', 'Elizabeth Bennet', 'Mary Bennet', 'Kitty Bennet', 'Lydia Bennet',"| __truncated__ ...
 $ bookFormat      : chr  "Hardcover" "Paperback" "Paperback" "Paperback" ...
 $ edition         : chr  "First Edition" "US Edition" "" "Modern Library Classics, USA / CAN" ...
 $ pages           : chr  "374" "870" "324" "279" ...
 $ publisher       : chr  "Scholastic Press" "Scholastic Inc." "Harper Perennial Modern Classics" "Modern Library" ...
 $ publishDate     : chr  "09/14/08" "09/28/04" "05/23/06" "10/10/00" ...
 $ firstPublishDate: chr  "" "06/21/03" "07/11/60" "01/28/13" ...
 $ awards          : chr  "['Locus Award Nominee for Best Young Adult Book (2009)', 'Georgia Peach Book Award (2009)', 'Buxtehuder Bulle ("| __truncated__ "['Bram Stoker Award for Works for Young Readers (2003)', 'Anthony Award for Young Adult (2004)', \"Mythopoeic F"| __truncated__ "['Pulitzer Prize for Fiction (1961)', 'Audie Award for Classic (2007)', 'National Book Award Finalist for Ficti"| __truncated__ "[]" ...
 $ numRatings      : int  6376780 2507623 4501075 2998241 4964519 1834276 2740713 517740 110146 1074620 ...
 $ ratingsByStars  : chr  "['3444695', '1921313', '745221', '171994', '93557']" "['1593642', '637516', '222366', '39573', '14526']" "['2363896', '1333153', '573280', '149952', '80794']" "['1617567', '816659', '373311', '113934', '76770']" ...
 $ likedPercent    : int  96 98 95 94 78 96 91 96 98 94 ...
 $ setting         : chr  "['District 12, Panem', 'Capitol, Panem', 'Panem (United States)']" "['Hogwarts School of Witchcraft and Wizardry (United Kingdom)', 'London, England']" "['Maycomb, Alabama (United States)']" "['United Kingdom', 'Derbyshire, England (United Kingdom)', 'England', 'Hertfordshire, England (United Kingdom)']" ...
 $ coverImg        : chr  "https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1586722975l/2767052.jpg" "https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1546910265l/2.jpg" "https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1553383690l/2657.jpg" "https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1320399351l/1885.jpg" ...
 $ bbeScore        : int  2993816 2632233 2269402 1983116 1459448 1372809 1276599 1238556 1159802 1087732 ...
 $ bbeVotes        : int  30516 26923 23328 20452 14874 14168 13264 12949 12111 11211 ...
 $ price           : chr  "5.09" "7.38" "" "" ...

These data seem promising, but a few transformations would make them easier to work with. For example, take a look at the genres column: each cell lists multiple genres in a single string. This format is common in data from the web. There are several ways we could extract the genres into a tidy data format.

Try it

In your group, discuss how you might approach separating the genres in this dataset so that you can clearly compare the genres across books. Don’t scroll ahead to the next section until you’ve taken time to explore several approaches.

Genre analysis

First, with a limited number of genres, we can create a new column for each genre. The cells can be populated with Y/N in this case.

Second, with a large number of genres, we might make a long form dataset rather than a wide one. In this case, the data would be organized in two columns: Book, and Genre. This allows for multiple rows per book, depending on how many genres the book is in.

# Take a look at the genres
head(books$genres)
[1] "['Young Adult', 'Fiction', 'Dystopia', 'Fantasy', 'Science Fiction', 'Romance', 'Adventure', 'Teen', 'Post Apocalyptic', 'Action']"                 
[2] "['Fantasy', 'Young Adult', 'Fiction', 'Magic', 'Childrens', 'Adventure', 'Audiobook', 'Middle Grade', 'Classics', 'Science Fiction Fantasy']"       
[3] "['Classics', 'Fiction', 'Historical Fiction', 'School', 'Literature', 'Young Adult', 'Historical', 'Novels', 'Read For School', 'High School']"     
[4] "['Classics', 'Fiction', 'Romance', 'Historical Fiction', 'Literature', 'Historical', 'Novels', 'Historical Romance', 'Classic Literature', 'Adult']"
[5] "['Young Adult', 'Fantasy', 'Romance', 'Vampires', 'Fiction', 'Paranormal', 'Paranormal Romance', 'Supernatural', 'Teen', 'Urban Fantasy']"          
[6] "['Historical Fiction', 'Fiction', 'Young Adult', 'Historical', 'Classics', 'War', 'Holocaust', 'World War II', 'Books About Books', 'Audiobook']"   

# Select out only the `Science Fiction` books.
books %>% filter(genres == 'Science Fiction') # Why doesn't this work?
 [1] bookId           title            series           author          
 [5] rating           description      language         isbn            
 [9] genres           characters       bookFormat       edition         
[13] pages            publisher        publishDate      firstPublishDate
[17] awards           numRatings       ratingsByStars   likedPercent    
[21] setting          coverImg         bbeScore         bbeVotes        
[25] price           
<0 rows> (or 0-length row.names)
#try this way
#the % symbols denote that the string can occur anywhere within the broader string
books %>% filter( genres %like% "%Science Fiction%") %>% sample_n(20) %>% select(title)
                                  title
1                                  iBoy
2                           Genius Loci
3                 The Stepsister Scheme
4                             Turnabout
5                                Legacy
6                The Spine of the World
7                        Circle of Five
8                                 Ruins
9                          Exile's Song
10                            Redshirts
11 Song for the Unraveling of the World
12                           Roverandom
13            800 Leagues on the Amazon
14           There Will Come Soft Rains
15      One of Our Thursdays Is Missing
16                    Midnight's Choice
17                        Restless Soul
18                            Recursion
19                        Clan Daughter
20                             Das Buch
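
To see why the exact comparison returns zero rows while the pattern match succeeds, here is a minimal sketch on a made-up vector (the `toy_genres` values are illustrative, not taken from the dataset). It uses `str_detect()` with `fixed()` from stringr as a stand-in for `%like%`; this is an alternative technique, not the one used in the lesson.

```r
library(stringr)

# Hypothetical genre strings in the same format as books$genres
toy_genres <- c(
  "['Young Adult', 'Science Fiction', 'Dystopia']",
  "['Classics', 'Fiction']",
  "['Science Fiction']"
)

# == compares the whole string, brackets and quotes included, so nothing matches
exact_match <- toy_genres == "Science Fiction"

# str_detect() asks whether the text occurs anywhere inside each string
substring_match <- str_detect(toy_genres, fixed("Science Fiction"))

exact_match
substring_match
```

Even the third string, which holds a single genre, fails the `==` test because the cell still contains the surrounding `['` and `']`.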

#make a new object with mutated columns
books2 <- books %>% mutate(SciFi=ifelse(genres %like% '%Science Fiction%',"Y","N"), Fantasy=ifelse(genres %like% '%Fantasy%',"Y","N"), Classics=ifelse(genres %like% '%Classics%',"Y","N"))

#look at handiwork
books2 %>% select(genres, SciFi,Fantasy,Classics) %>% head()
                                                                                                                                               genres
1                  ['Young Adult', 'Fiction', 'Dystopia', 'Fantasy', 'Science Fiction', 'Romance', 'Adventure', 'Teen', 'Post Apocalyptic', 'Action']
2        ['Fantasy', 'Young Adult', 'Fiction', 'Magic', 'Childrens', 'Adventure', 'Audiobook', 'Middle Grade', 'Classics', 'Science Fiction Fantasy']
3      ['Classics', 'Fiction', 'Historical Fiction', 'School', 'Literature', 'Young Adult', 'Historical', 'Novels', 'Read For School', 'High School']
4 ['Classics', 'Fiction', 'Romance', 'Historical Fiction', 'Literature', 'Historical', 'Novels', 'Historical Romance', 'Classic Literature', 'Adult']
5           ['Young Adult', 'Fantasy', 'Romance', 'Vampires', 'Fiction', 'Paranormal', 'Paranormal Romance', 'Supernatural', 'Teen', 'Urban Fantasy']
6    ['Historical Fiction', 'Fiction', 'Young Adult', 'Historical', 'Classics', 'War', 'Holocaust', 'World War II', 'Books About Books', 'Audiobook']
  SciFi Fantasy Classics
1     Y       Y        N
2     Y       Y        Y
3     N       N        Y
4     N       N        Y
5     N       Y        N
6     N       N        Y
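
Note that row 2 above is tagged `Y` for SciFi even though its only related tag is 'Science Fiction Fantasy': a substring pattern matches inside longer tags too. One possible way to match whole tags only is to include the surrounding single quotes in the pattern. The sketch below uses made-up data (`toy_books`) and stringr's `str_detect()` in place of `%like%`, and assumes every tag is wrapped in single quotes, which may not hold for every row of the real data.

```r
library(dplyr)
library(stringr)

# Hypothetical two-book table in the same format as the genres column
toy_books <- tibble(
  title = c("Book A", "Book B"),
  genres = c(
    "['Fiction', 'Science Fiction', 'Adventure']",
    "['Fantasy', 'Science Fiction Fantasy']"
  )
)

tagged <- toy_books %>%
  mutate(
    # Loose match: also fires on 'Science Fiction Fantasy'
    SciFi_loose = if_else(str_detect(genres, fixed("Science Fiction")), "Y", "N"),
    # Strict match: the quotes force the tag to begin and end exactly here
    SciFi_strict = if_else(str_detect(genres, fixed("'Science Fiction'")), "Y", "N")
  )

tagged %>% select(title, SciFi_loose, SciFi_strict)
```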

This works well when we know in advance which genres we want to examine. If the categories are unknown, we can use string patterns and regular expressions to split each genre string into its individual tags.

# Separate the genres with str_split(), then unnest() so that each genre gets its own row
# https://stringr.tidyverse.org/reference/str_split.html
# https://stackoverflow.com/questions/71915593/unlist-and-split-a-column-to-add-to-rows-without-losing-information-of-other-col
list <- books %>% mutate(genre2 = str_split(genres, "', '")) %>% unnest(genre2)
# look at handiwork
list %>% select(title, genre2) %>% head()
# A tibble: 6 × 2
  title            genre2         
  <chr>            <chr>          
1 The Hunger Games ['Young Adult  
2 The Hunger Games Fiction        
3 The Hunger Games Dystopia       
4 The Hunger Games Fantasy        
5 The Hunger Games Science Fiction
6 The Hunger Games Romance        
#pretty good, but there are still some extra characters to remove: a leading [' on each book's first genre and a trailing '] on its last. The \\ escapes the square bracket so that it is matched literally
list$genre2 <- list$genre2 %>% str_remove_all("^\\['|'\\]$")
# examine the result
list %>% select(title, genre2) %>% head()
# A tibble: 6 × 2
  title            genre2         
  <chr>            <chr>          
1 The Hunger Games Young Adult    
2 The Hunger Games Fiction        
3 The Hunger Games Dystopia       
4 The Hunger Games Fantasy        
5 The Hunger Games Science Fiction
6 The Hunger Games Romance        

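#a solution using separate_rows(). With no sep argument it splits on every run of non-alphanumeric characters, spaces included, so multi-word genres such as 'Young Adult' are broken apart in the output below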
separated <- separate_rows(books,genres)
#look at handiwork
separated %>% select(title, genres) %>% head()
# A tibble: 6 × 2
  title            genres    
  <chr>            <chr>     
1 The Hunger Games ""        
2 The Hunger Games "Young"   
3 The Hunger Games "Adult"   
4 The Hunger Games "Fiction" 
5 The Hunger Games "Dystopia"
6 The Hunger Games "Fantasy" 
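
The output above breaks 'Young Adult' into two rows because the default separator of `separate_rows()` treats spaces as delimiters. A sketch of one way around this, on a made-up two-book table (`toy_books` is illustrative): pass the same `"', '"` separator used with `str_split()`, strip the leftover brackets and quotes, and then count books per genre.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical data in the same format as the genres column
toy_books <- tibble(
  title = c("Book A", "Book B"),
  genres = c("['Young Adult', 'Fiction']", "['Fiction', 'Historical Fiction']")
)

genre_long <- toy_books %>%
  # Split only on the delimiter between tags so multi-word genres stay whole
  separate_rows(genres, sep = "', '") %>%
  # Remove a leading [' or a trailing '] left over from the list syntax
  mutate(genres = str_remove_all(genres, "^\\['|'\\]$"))

# Long format makes counting books per genre a single call
genre_counts <- genre_long %>% count(genres, sort = TRUE)
genre_counts
```

Once the data are in this long form, the same `count()` call works no matter how many distinct genres exist, which is the main advantage over creating one Y/N column per genre.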

Identifying settings of the books

How might we analyze the settings of the books? These data are stored in the setting column. Notably, a book may list multiple settings, and these include both fictional places and real geographic locations.

Try it

  1. Let’s create a new column with a simplified version of the book settings. For example, let’s find all the books that are set in England. Make a new column that tags the books based on whether or not they are in England. Try using mutate(), replace(), %like%, and str_replace(). Which one(s) work in this context and which do not?
Click for solution
## Make a new column
books$setting2 <- books$setting
## Check out the first 10 rows
books$setting2[1:10]
 [1] "['District 12, Panem', 'Capitol, Panem', 'Panem (United States)']"                                               
 [2] "['Hogwarts School of Witchcraft and Wizardry (United Kingdom)', 'London, England']"                              
 [3] "['Maycomb, Alabama (United States)']"                                                                            
 [4] "['United Kingdom', 'Derbyshire, England (United Kingdom)', 'England', 'Hertfordshire, England (United Kingdom)']"
 [5] "['Forks, Washington (United States)', 'Phoenix, Arizona (United States)', 'Washington (state) (United States)']" 
 [6] "['Molching (Germany)', 'Germany']"                                                                               
 [7] "['England', 'United Kingdom']"                                                                                   
 [8] "['London, England']"                                                                                             
 [9] "['Middle-earth']"                                                                                                
[10] "['Atlanta, Georgia (United States)']"                                                                            
# Replace any setting that mentions England with the single value "England"
books2 <- books %>% mutate(setting2=replace(setting2,setting2 %like% "%England%","England"))
# check out handiwork
books2 %>% select(setting, setting2) %>% head()
                                                                                                           setting
1                                                ['District 12, Panem', 'Capitol, Panem', 'Panem (United States)']
2                               ['Hogwarts School of Witchcraft and Wizardry (United Kingdom)', 'London, England']
3                                                                             ['Maycomb, Alabama (United States)']
4 ['United Kingdom', 'Derbyshire, England (United Kingdom)', 'England', 'Hertfordshire, England (United Kingdom)']
5  ['Forks, Washington (United States)', 'Phoenix, Arizona (United States)', 'Washington (state) (United States)']
6                                                                                ['Molching (Germany)', 'Germany']
                                                                                                         setting2
1                                               ['District 12, Panem', 'Capitol, Panem', 'Panem (United States)']
2                                                                                                         England
3                                                                            ['Maycomb, Alabama (United States)']
4                                                                                                         England
5 ['Forks, Washington (United States)', 'Phoenix, Arizona (United States)', 'Washington (state) (United States)']
6                                                                               ['Molching (Germany)', 'Germany']
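
The `replace()` solution overwrites the whole setting string. If the goal is only to tag books, one alternative is to keep the original column and add a logical flag. The sketch below is an assumption about what such a tag could look like, on made-up data (`toy_books`), using stringr's `str_detect()` rather than `%like%`. It also illustrates a caveat of substring matching: a setting such as 'New England', if one occurs in the data, would also match "England". The second column excludes that case with a negative lookbehind.

```r
library(dplyr)
library(stringr)

# Hypothetical settings in the same format as the setting column
toy_books <- tibble(
  title = c("Book A", "Book B", "Book C"),
  setting = c(
    "['London, England']",
    "['Maycomb, Alabama (United States)']",
    "['New England (United States)']"
  )
)

flagged <- toy_books %>%
  mutate(
    # TRUE whenever the text "England" appears anywhere in the setting
    england_any = str_detect(setting, fixed("England")),
    # Same check, but skip occurrences preceded by "New " (negative lookbehind)
    england_strict = str_detect(setting, "(?<!New )England")
  )

flagged
```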

Working with review data

Now let’s turn to working with review data. Reviews are a common type of data on the web, yet they are not always ready for analysis. For this example, we will use food reviews drawn from the Amazon Customer Review dataset.

Chatterjee, Ishani, 2021, “Amazon Customer Review”, https://doi.org/10.7910/DVN/W96OFO, Harvard Dataverse, V1

#https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/W96OFO
food <- read.csv("https://maddiebrown.github.io/ANTH630/data/export_food.csv")

str(food)
'data.frame':   5000 obs. of  10 variables:
 $ X           : int  0 1 2 3 4 5 6 7 8 9 ...
 $ asin        : chr  "B07XC6WRJQ" "B07XC6WRJQ" "B07XC6WRJQ" "B07XC6WRJQ" ...
 $ product.name: chr  "Sparkling ICE Sparkling Water, Variety Pack" "Sparkling ICE Sparkling Water, Variety Pack" "Sparkling ICE Sparkling Water, Variety Pack" "Sparkling ICE Sparkling Water, Variety Pack" ...
 $ ratings     : int  4 5 1 1 1 1 5 1 2 5 ...
 $ reviews     : chr  "\n\n  I’ve been ordering this product monthly for over a year. These past few shipments have been off, and the "| __truncated__ "\n\n  I used to drink these Sparkling Ice waters in all flavors every day.  They all taste GREAT, my favorite w"| __truncated__ "\n\n  Update March 2019: I have been a regular subscriber of 3 cases a month of Sparkling Ice for a few years. "| __truncated__ "\n\n  I don't drink this myself, I buy it for my dad who I swear would drink it by the gallon if offered. He is"| __truncated__ ...
 $ helpful     : int  378 331 236 170 125 140 61 106 27 39 ...
 $ date        : chr  "11-Jul-18" "14-Sep-18" "21-May-17" "7-Aug-17" ...
 $ Unnamed..6  : logi  NA NA NA NA NA NA ...
 $ target      : chr  "p" "p" "n" "n" ...
 $ text        : chr  "\n\n  I’ve been ordering this product monthly for over a year. These past few shipments have been off, and the "| __truncated__ "\n\n  I used to drink these Sparkling Ice waters in all flavors every day.  They all taste GREAT, my favorite w"| __truncated__ "\n\n  Update March 2019: I have been a regular subscriber of 3 cases a month of Sparkling Ice for a few years. "| __truncated__ "\n\n  I don't drink this myself, I buy it for my dad who I swear would drink it by the gallon if offered. He is"| __truncated__ ...

Try it

With your group, discuss the kinds of anthropological or other research questions (depending on your field) that review or social media data might be able to help answer.

Subsetting Updated Reviews

Looks like some of these reviews have updates from the original review. Let’s pull out the text from the updates and see what the changes are like. We can subset out only the reviews where the string “update” is detected. Then, we can remove the text before the “update” so that we are only examining the part of the review that was updated. This isn’t a perfect method, but it’s a start.

#make a subset of only the rows with updates
updates <- food %>% filter(str_detect(reviews, "Update"))
#I am only subsetting based on capitalized "Update", since this seems to be the pattern used when a reviewer starts a section of the review with an update. The lowercase "update" seems to denote the phrase, "I'll post an update...". Be sure to examine each dataset to determine its characteristics and the best approach to subsetting based on strings.

#remove text before the update, so this is the only text we analyze
updates$text <- sub(".*Update", "Update", updates$text)

updates$text
[1] "Update: at Christmastime, I used Crisp Apple as a base for zero-carb mulled “cider.” Lacking proper mulling spices, a Constant Comment tea bag with its cinnamon/orange peel/clove flavoring did the trick nicely. A touch of malic acid restored the tanginess lost from heating the fizz (carbonic acid) out of it.\n\n"                                                                                                                                                                                                                                                                                                                         
[2] "Update: two hours into drinking this I actually had a mild food poisoning, my stomach started hurting and I wanted to vomit, then I had a mild diarrhea, I usually eat very healthy and never drink soda so maybe I’m just not used to those processed drinks but I do not recommend this product. The flavor I had was the kiwi strawberry one\n\n"                                                                                                                                                                                                                                                                                               
[3] "Update: I just bought the combo variety that unfortunately doesn't include the Coconut Pineapple in it for just under $10, because they raised this from $12 to over $20!!! (FYI: You can get 18 of the variety at Sam's Club for just under $11) I get so tired of & annoyed with sellers doing raising prices overnight, double and sometimes even triple the price. Ridiculous!!!  As far as the product: I love these Sparkling ICE drinks!  I have tried them all and like them all. (I don't much care for their lemonades.) The Coconut Pineapple is my favorite! A close second fave is Pink Grapefruit. They are both so so so yummy!\n\n"
[4] "Update: my favorite flavors are grapefruit, pineapple and black raspberry. I specifically don’t notice an artificial sugar aftertaste in these flavors.\n\n"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       
[5] "Update5 Stars for the drinks.  Son loves the drink0 Stars for the price.  This was the cheapest price around (and did not have to drive all over looking for it since this flavor is hard to find).  Not it is double.  Have to drive around looking for it; that is the only issue.\n\n"                                                                                                                                                                                                                                                                                                                                                          
[6] "Update:  I tried ordering it again, ~6months later, and what I got was completely flat.  I love this stuff when it's carbonated, but flat it's horrible.  Disappointed again.  Fortunately Amazon's great CS refunded my money, but I would think if it was flat they'd stop selling it until they got a better quality supply.\n\n"                                                                                                                                                                                                                                                                                                               
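
One caveat with `sub(".*Update", ...)`: `.*` is greedy, so in a review that contains "Update" more than once, everything before the last occurrence is dropped. The sketch below shows the difference between greedy and lazy matching on a made-up review string (`review` is hypothetical, not from the dataset). The lazy version uses `perl = TRUE`, where `(?s)` is needed so that `.` also matches the newlines at the start of the text.

```r
# Hypothetical review containing two "Update" markers
review <- "\n\n  Great drink. Update: price doubled. Update 2: now flat.\n\n"

# Greedy .* runs to the LAST "Update", so only the final update survives
greedy <- sub(".*Update", "Update", review)

# Lazy .*? stops at the FIRST "Update", keeping every update that follows
lazy <- sub("(?s).*?Update", "Update", review, perl = TRUE)

greedy
lazy
```

Which behavior is preferable depends on the question: the greedy form isolates the most recent update, while the lazy form keeps the full update history.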

Try it

  1. Analyze the sentiment of the words in the updated text using the Bing sentiment lexicon.
  2. What do you notice about these reviews? Are they trending towards more positive or negative words? Which words are associated with these sentiments?
Click for solution
library(tidytext)

#unnest tokens
updatewords <- updates %>% unnest_tokens(word, text)

#load bing sentiment lexicon
bing <- get_sentiments("bing")

#join datasets on the shared word column
updatesentiment <- updatewords %>% inner_join(bing, by = "word")

#examine results
#updatesentiment

#now let's examine how many words are positive or negative
updatesentiment %>% group_by(sentiment) %>% count()
# A tibble: 2 × 2
# Groups:   sentiment [2]
  sentiment     n
  <chr>     <int>
1 negative     13
2 positive     21

#make a graph of most common positive and negative words
#create word lists
positive_wordcount<-updatesentiment %>% filter(sentiment=="positive") %>% count(word)
negative_wordcount<-updatesentiment %>% filter(sentiment=="negative") %>% count(word)

#make positive and negative plots; reorder() sorts the bars by count (arrange() alone does not change the order in which ggplot draws a character axis)
positiveplot <- positive_wordcount %>% ggplot(aes(x=reorder(word, n), y=n)) + geom_col() + coord_flip() + labs(y="Count", x="Words", title="Positive words \nin updated reviews")
negativeplot <- negative_wordcount %>% ggplot(aes(x=reorder(word, n), y=n)) + geom_col() + coord_flip() + labs(y="Count", x="Words", title="Negative words \nin updated reviews")

#arrange the two plots side by side
ggarrange(positiveplot,negativeplot,ncol=2)
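
The totals above pool all updates together. To compare individual reviews, one option is a net score per review (positive words minus negative words). The following is a minimal sketch under stated assumptions: `toy_reviews` is made up, and `toy_lexicon` is a tiny hand-written stand-in for `get_sentiments("bing")` so the example runs on its own.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical reviews and a hand-made lexicon (stand-in for the Bing lexicon)
toy_reviews <- tibble(
  id = 1:2,
  text = c("Love the flavor, great price", "Flat and horrible, disappointed again")
)
toy_lexicon <- tibble(
  word = c("love", "great", "flat", "horrible", "disappointed"),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)

net_sentiment <- toy_reviews %>%
  unnest_tokens(word, text) %>%             # one lowercase word per row
  inner_join(toy_lexicon, by = "word") %>%  # keep only words found in the lexicon
  count(id, sentiment) %>%                  # word count per review and sentiment
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)         # above 0 leans positive, below 0 negative

net_sentiment
```

With the real data, the `X` column of `updates` could play the role of `id`, since it appears to hold a row index.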

Taste and flavor analysis

How might you analyze which tastes (e.g. sweet, sour) show up the most frequently in the reviews? Are these different tastes associated with particular flavors (e.g. black cherry, orange mango)? What if you wanted to go further and understand the reviewers’ perception of the flavor (e.g. disgusting, ok, delicious)?

Try it

In your group, reason through how you might approach answering the above questions to link taste and flavor in the original review data set. Diagram out a plan on a sheet of paper to help you think through the steps needed to conduct this analysis.

Then, let’s take some time to work on starting this analysis. Work as a team to try different strategies. Some ideas for getting started are below.

Click for solution
str(food)
'data.frame':   5000 obs. of  10 variables:
 $ X           : int  0 1 2 3 4 5 6 7 8 9 ...
 $ asin        : chr  "B07XC6WRJQ" "B07XC6WRJQ" "B07XC6WRJQ" "B07XC6WRJQ" ...
 $ product.name: chr  "Sparkling ICE Sparkling Water, Variety Pack" "Sparkling ICE Sparkling Water, Variety Pack" "Sparkling ICE Sparkling Water, Variety Pack" "Sparkling ICE Sparkling Water, Variety Pack" ...
 $ ratings     : int  4 5 1 1 1 1 5 1 2 5 ...
 $ reviews     : chr  "\n\n  I’ve been ordering this product monthly for over a year. These past few shipments have been off, and the "| __truncated__ "\n\n  I used to drink these Sparkling Ice waters in all flavors every day.  They all taste GREAT, my favorite w"| __truncated__ "\n\n  Update March 2019: I have been a regular subscriber of 3 cases a month of Sparkling Ice for a few years. "| __truncated__ "\n\n  I don't drink this myself, I buy it for my dad who I swear would drink it by the gallon if offered. He is"| __truncated__ ...
 $ helpful     : int  378 331 236 170 125 140 61 106 27 39 ...
 $ date        : chr  "11-Jul-18" "14-Sep-18" "21-May-17" "7-Aug-17" ...
 $ Unnamed..6  : logi  NA NA NA NA NA NA ...
 $ target      : chr  "p" "p" "n" "n" ...
 $ text        : chr  "\n\n  I’ve been ordering this product monthly for over a year. These past few shipments have been off, and the "| __truncated__ "\n\n  I used to drink these Sparkling Ice waters in all flavors every day.  They all taste GREAT, my favorite w"| __truncated__ "\n\n  Update March 2019: I have been a regular subscriber of 3 cases a month of Sparkling Ice for a few years. "| __truncated__ "\n\n  I don't drink this myself, I buy it for my dad who I swear would drink it by the gallon if offered. He is"| __truncated__ ...

#food %>% select(reviews) %>% top_n(10)

food %>% filter(reviews %like any%  c("%Yummy%", "%yummy%", "%taste great%", "%Great flavor%", "%tasty%", "%delicious%")) %>% sample_n(10) %>% select(reviews)
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              reviews
1                                                                                                                                                                                                                                                                                                                                                                                                                                                           \n\n  very tasty and very good price.\n\n
2                                                                                                                                                                                                                                                                                                                                                                                                                                                                         \n\n  It’s so delicious\n\n
3                                                                                                                                                                                                                                                                                                                                                                                                                            \n\n  Price could be a lot better but I love this flavor!!! Yummy!!!\n\n
4                                                                                                                                                                                                                        \n\n  This is my least favorite flavor from this line. It doesn't really taste like strawberry at all and the lemonade flavor is off as well. While I regularly purchase this brand and others taste great, I would recommend other flavors such as the Black Raspberry.\n\n
5  \n\n  Let me say I love Sparkling Ice flavored water. I am enjoying trying all the flavors, and they are all a refreshing and tasty way to drink your water. The pomegranate is not as flavorful as some of the others, its a little weak tasting to my family and I. Its not bad, but if you closed your eyes, you'd never guess what this flavor was. My daughter said it tastes "pink" and that about describes it. I recommend the orange mango, the blackberry, for a little more flavor.\n\n
6                                                                                                                                                                                                                                                                                                                                                                                                                                                 \n\n  The black cherry is delicious. Addictive.\n\n
7                                                                                                                                                                                                                                                                                                                                                                    \n\n  Asking as it is cold, this drink is delicious...if it's less than refrigerator temp cold, it's got a sour taste to it.\n\n
8                                                                                                                                                                                                                                                                                                                                \n\n  These are very tasty and have vitamins in them and no calories. Mixes well with bourbon. However it is double the price that you pay at the grocery store.\n\n
9                                                                                                                                                                                                                                                                                                                                                                         \n\n  Great flavor sampler, but the price varies wildly from month to month - from as little as $5.32 to more than $20!\n\n
10                                                                                                                                                                                                                                                                                      \n\n  Well, obviously not _really_ addicted, but I've now got it on automatic shipment, and have to discipline myself to limit my consumption to no more than one a day. It's delicious _and_ refreshing.\n\n

tastewords <- c("%Yummy%", "%yummy%", "%taste great%", "%Great flavor%", "%tasty%", "%delicious%")


#food %>% filter(reviews %like any% c("%Ginger Lime%","%ginger lime%") & reviews %like any% tastewords)
#food %>% filter(reviews %like any% c("%cherry limeade%","%Cherry Limeade%") & reviews %like any% tastewords)
#food %>% filter(reviews %like any% c("%coconut-pineapple%", "%coconut pineapple%", "%Coconut Pineapple%") & reviews %like any% tastewords)
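
The commented-out filters above check one flavor at a time. One way to extend the idea is to count, for every flavor and taste pair, how many reviews mention both. The sketch below assumes made-up reviews (`toy_food`) and illustrative `flavors` and `tastes` vectors. It uses stringr's `regex(ignore_case = TRUE)` instead of `%like any%`, which removes the need to list both "Yummy" and "yummy".

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical reviews; in the lesson these would come from food$reviews
toy_food <- tibble(
  reviews = c(
    "The Black Cherry is delicious.",
    "Orange mango tastes sour when warm.",
    "black cherry is too sweet for me",
    "Yummy! Love the orange mango."
  )
)

flavors <- c("black cherry", "orange mango")
tastes  <- c("sweet", "sour", "delicious", "yummy")

# Every flavor/taste pair, with the number of reviews that mention both
cooccurrence <- crossing(flavor = flavors, taste = tastes) %>%
  rowwise() %>%
  mutate(n_reviews = sum(
    str_detect(toy_food$reviews, regex(flavor, ignore_case = TRUE)) &
      str_detect(toy_food$reviews, regex(taste, ignore_case = TRUE))
  )) %>%
  ungroup() %>%
  filter(n_reviews > 0)

cooccurrence
```

Co-occurrence within a review does not guarantee that the taste word describes that flavor (the third toy review is a complaint, for instance), so reading a sample of the matched reviews remains an important check.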