Introducing tidyverse

Tidyverse is a suite of packages for R that follow the principles of tidy data. Today's tutorial introduces key functions from various packages in the tidyverse that are essential for advanced data wrangling and analysis.

First, load the library.

library(tidyverse)

Usually, when we first open a new data file we might use head() or str() to get a sense of the data structure and values. In tidyverse, we can use glimpse() to view similar information.

Let's look at the otter data from previous weeks using glimpse().

otter <- read.csv("https://maddiebrown.github.io/ANTH630/data/sea_otter_counts_2017&2018_CLEANDATA.csv")
glimpse(otter)
Rows: 1,337
Columns: 8
$ region      <fct> "west prince of wales island, alaska", "west prince of wa…
$ site_name   <fct> Big Clam Bay, Big Clam Bay, Big Tree Bay, Big Tree Bay, B…
$ latitude_N  <dbl> 55.19457, 55.22769, 55.56796, 55.57610, 55.57079, 55.5727…
$ longitude_E <dbl> -132.9670, -132.9739, -133.1710, -133.2199, -133.2262, -1…
$ date_DDMMYY <fct> 18/7/18, 26/7/18, 18/7/18, 18/7/18, 18/7/18, 18/7/18, 18/…
$ year        <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201…
$ replicate   <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ n_otter     <int> 1, 0, 1, 2, 1, 1, 3, 1, 13, 1, 1, 12, 2, 2, 1, 1, 1, 1, 1…

From the output, we can see the number of rows and columns, the type of each variable, and a sample of values from each column.

Pipes

A key component of tidyverse is the ability to pipe together multiple functions with %>%. The pipe passes the result on its left to the function on its right as that function's first argument, which allows multiple data transformations or analyses to be chained in a single expression.

Let's try out using a pipe to calculate the mean number of otters observed in our dataset.

otter %>% summarise(mean = mean(n_otter, na.rm = TRUE))
      mean
1 2.384443

Notice that because the pipe is operating on the otter dataset, there is no need to reference the dataset again when we refer to the column n_otter. In addition, the summarise() function is handy for any type of summary procedure. In this case, we summarize the n_otter column by applying the function mean() to it.
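
To make the role of the pipe concrete, here is a minimal sketch comparing a nested call with its piped equivalent. The toy data frame and its values are hypothetical stand-ins for the otter data, not part of the original dataset.

```r
library(dplyr)

# Hypothetical toy data standing in for the otter counts
toy <- data.frame(site_name = c("A", "A", "B"), n_otter = c(1, 3, NA))

# Nested call: toy is passed explicitly as the first argument
nested <- summarise(toy, mean = mean(n_otter, na.rm = TRUE))

# Piped call: toy on the left of %>% becomes the first argument of summarise()
piped <- toy %>% summarise(mean = mean(n_otter, na.rm = TRUE))

# Both approaches give the same one-row data frame (mean = 2)
identical(nested, piped)
```

The piped form reads left to right, which tends to matter more as additional steps are chained on.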

Filtering and selecting data

Tidyverse contains powerful data selection and filtering tools. filter() subsets rows according to a condition while select() subsets dataframes by variables.

Try it

Use filter() to subset any rows where n_otter is greater than 15.
Use select() to choose only the n_otter column.

Click for solution

otter %>% filter(n_otter > 15)

otter %>% dplyr::select(n_otter)
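
The two verbs are often chained in one pipeline. The sketch below uses a small hypothetical data frame (the values are invented for illustration) to filter rows on two conditions and then keep two columns.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the otter data
toy <- data.frame(
  site_name = c("A", "B", "C"),
  year      = c(2017, 2018, 2018),
  n_otter   = c(2, 20, 16)
)

# Conditions separated by commas in filter() are combined with AND:
# keep 2018 rows with more than 15 otters, then keep only two columns
busy_2018 <- toy %>%
  filter(n_otter > 15, year == 2018) %>%
  dplyr::select(site_name, n_otter)

busy_2018
```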

Columns can also be selected based on the strings in their names. This can be helpful when you have a large number of variables that can be intuitively subset.

# pull out the latitude and longitude based on their shared first letter
otter %>% dplyr::select(starts_with('l'))
# or based on a common string
otter %>% dplyr::select(contains('tude'))

We can also match variable names based on strings in another vector. This can be useful if you need to subset based on another set of criteria or a certain theme within your broader dataset. In this example, we have a vector called timetopics that contains a series of words related to time. Using select() and any_of() we can pull out all the variables from the otter dataset whose names appear in this vector. Names in the vector that do not match any column (here, time and minutes) are silently ignored.

timetopics <- c("year", "time", "date_DDMMYY", "minutes")

head(otter %>% dplyr::select(any_of(timetopics)))  # head is used to prevent a long output in the tutorial
  year date_DDMMYY
1 2018     18/7/18
2 2018     26/7/18
3 2018     18/7/18
4 2018     18/7/18
5 2018     18/7/18
6 2018     18/7/18

There are additional selection functions, such as matches(), all_of(), and ends_with(), among others. You are encouraged to try out these functions to learn their use cases.
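
As a starting point for that exploration, here is a brief sketch of three of these helpers on a hypothetical one-row data frame that borrows the otter column names. The values and the variable names (suffix_cols, regex_cols, and so on) are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical one-row data frame reusing the otter column names
toy <- data.frame(latitude_N = 55.2, longitude_E = -133.0, year = 2018, n_otter = 1)

# ends_with(): columns whose names end in a given suffix
suffix_cols <- names(toy %>% dplyr::select(ends_with("_E")))

# matches(): columns whose names match a regular expression
regex_cols <- names(toy %>% dplyr::select(matches("^l.*tude")))

# all_of(): like any_of(), but every name in the vector must exist
wanted <- c("year", "n_otter")
strict_cols <- names(toy %>% dplyr::select(all_of(wanted)))

# With a missing name ("minutes"), all_of() raises an error instead of skipping it
strict_fails <- tryCatch(
  { toy %>% dplyr::select(all_of(c("year", "minutes"))); FALSE },
  error = function(e) TRUE
)
```

Choosing between any_of() and all_of() is mostly a question of whether a missing column should be tolerated or treated as a mistake.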

Grouping data

Previously we grouped data using aggregate(). In tidyverse, we have more control over selecting and linking multiple variables using group_by(). The code below groups the otter dataset by site name and counts how many rows there are for each site.

otter %>% group_by(site_name) %>% count()
# A tibble: 43 x 2
# Groups:   site_name [43]
   site_name            n
   <fct>            <int>
 1 Big Clam Bay         2
 2 Big Tree Bay        81
 3 Blanquizal Bay      84
 4 Chusini Cove 1      61
 5 Chusini Cove 2      48
 6 Dunbar Inlet        22
 7 Farallon Bay         2
 8 Garcia Cove         61
 9 Goat Mouth Inlet     4
10 Guktu Cove 1        44
# … with 33 more rows

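tally() can also count the number of observations per group. When given a column name, as below, it instead sums that column, so here it returns the total number of otters counted across all observations rather than the number of rows.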

otter %>% tally(n_otter)
     n
1 3188
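
To make the difference between counting rows and summing a column concrete, the following sketch runs tally() both ways on a small hypothetical data frame (the values are invented for illustration).

```r
library(dplyr)

# Hypothetical toy data: two observations at site A, one at site B
toy <- data.frame(site_name = c("A", "A", "B"), n_otter = c(1, 3, 5))

# Without arguments, tally() counts the rows in each group
rows_per_site <- toy %>% group_by(site_name) %>% tally()

# With a column supplied as the weight, tally() sums that column per group
otters_per_site <- toy %>% group_by(site_name) %>% tally(n_otter)

rows_per_site     # n is 2 for A and 1 for B
otters_per_site   # n is 4 for A and 5 for B
```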

Summarizing data

A few weeks ago we used aggregate() to group together and summarize variables. Let's return to the otter dataset and reanalyze it using tidyverse. Below, summarise() computes both the mean number of otters per observation and the total number of otters at each site, followed by the aggregate() equivalent for the site totals.

otter %>% group_by(site_name) %>% summarise(mean = mean(n_otter), sum = sum(n_otter))
# A tibble: 43 x 3
   site_name         mean   sum
   <fct>            <dbl> <int>
 1 Big Clam Bay      0.5      1
 2 Big Tree Bay      1.99   161
 3 Blanquizal Bay    2.64   222
 4 Chusini Cove 1    1.77   108
 5 Chusini Cove 2    2.92   140
 6 Dunbar Inlet      1.68    37
 7 Farallon Bay      0        0
 8 Garcia Cove       2.21   135
 9 Goat Mouth Inlet  0        0
10 Guktu Cove 1      3.05   134
# … with 33 more rows

notterpersite <- aggregate(n_otter ~ site_name, data = otter, FUN = sum)

Here are two ways to select the top 5 sites with the most otter sightings overall, first with base R and then with tidyverse.

notterpersite <- notterpersite[order(notterpersite$n_otter, decreasing = TRUE), ]
top5 <- notterpersite[1:5, ]

otter %>% group_by(site_name) %>% summarise(sum = sum(n_otter)) %>% arrange(desc(sum)) %>% 
    top_n(5, sum)
# A tibble: 5 x 2
  site_name         sum
  <fct>           <int>
1 Kaguk Cove        283
2 Shinaku Inlet     246
3 Blanquizal Bay    222
4 Salt Lake Bay 1   208
5 S16               197
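
Since dplyr 1.0.0, top_n() has been superseded by slice_max(), which makes the handling of ties explicit. The sketch below is an alternative on hypothetical site totals, not the method used in this tutorial. The column name total and the values are assumptions.

```r
library(dplyr)

# Hypothetical site totals; sites B and D tie for the highest value
totals <- data.frame(site_name = c("A", "B", "C", "D"), total = c(10, 40, 30, 40))

# By default ties are kept, so asking for the top 1 returns both tied rows
top_with_ties <- totals %>% slice_max(total, n = 1)

# with_ties = FALSE returns exactly n rows
top_exact <- totals %>% slice_max(total, n = 1, with_ties = FALSE)

nrow(top_with_ties)  # 2
nrow(top_exact)      # 1
```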

How many observations are in the dataset?

otter %>% summarise(n = n())
     n
1 1337

Identifying unique values

Tidyverse also has functions for identifying each distinct or unique value in a data table. Using distinct() and n_distinct() we observe that in the otter data, there is only one region.

otter %>% distinct(region)
                               region
1 west prince of wales island, alaska

n_distinct(otter$region)
[1] 1

When applied to a whole dataframe, distinct() can also be used to remove any duplicate rows and retain only unique ones.

str(distinct(otter))  # 1326 of the 1337 rows remain, so 11 duplicate rows were removed
'data.frame':   1326 obs. of  8 variables:
 $ region     : Factor w/ 1 level "west prince of wales island, alaska": 1 1 1 1 1 1 1 1 1 1 ...
 $ site_name  : Factor w/ 43 levels "Big Clam Bay",..: 1 1 2 2 2 2 2 2 2 2 ...
 $ latitude_N : num  55.2 55.2 55.6 55.6 55.6 ...
 $ longitude_E: num  -133 -133 -133 -133 -133 ...
 $ date_DDMMYY: Factor w/ 48 levels "1/8/18","10/8/18",..: 11 30 11 11 11 11 11 11 11 11 ...
 $ year       : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
 $ replicate  : int  1 2 1 1 1 1 1 1 1 1 ...
 $ n_otter    : int  1 0 1 2 1 1 3 1 13 1 ...
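
One way to count duplicates directly, rather than reading the number off the str() output, is to compare row counts before and after distinct(). This is a minimal sketch on hypothetical data. The same subtraction should work on the otter data frame.

```r
library(dplyr)

# Hypothetical toy data in which the first two rows are exact duplicates
toy <- data.frame(site_name = c("A", "A", "B", "B"), n_otter = c(1, 1, 2, 3))

# Number of duplicate rows = rows before minus rows after distinct()
n_duplicates <- nrow(toy) - nrow(distinct(toy))

# Supplying a column keeps one row per unique value of that column
unique_sites <- toy %>% distinct(site_name)

n_duplicates        # 1
nrow(unique_sites)  # 2
```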

Try it

Using tidyverse, let's reanswer some of the questions from a few weeks ago and add in a few more.

Select the latitude and longitude of the site with the highest number of otter sightings on any single day.
Which site had the most days with observations in 2017?
On which date are there observations from the greatest number of sites?
Which site has the greatest number of observations from any single day? Hint: check the group_by() help file if you are stuck.

Click for solution

## compare methods for selecting the latitude and longitude of the site with the
## highest number of observed otters on a single day
otter[otter$n_otter == max(otter$n_otter), c("latitude_N", "longitude_E")]
    latitude_N longitude_E
899   54.88742    -132.836
# depending on what other packages are loaded, you sometimes have to specify
# which package a function should be drawn from, as with dplyr::select() here
otter %>% filter(n_otter == max(n_otter)) %>% dplyr::select(latitude_N, longitude_E)
  latitude_N longitude_E
1   54.88742    -132.836


## Select the name of the site with the most days of observations in 2017.
otter %>% filter(year == 2017) %>% group_by(site_name) %>% dplyr::summarise(nday = n_distinct(date_DDMMYY)) %>% 
    arrange(desc(nday))
# A tibble: 28 x 2
   site_name          nday
   <fct>             <int>
 1 N Fish Egg Island     3
 2 Blanquizal Bay        2
 3 Chusini Cove 1        2
 4 Dunbar Inlet          2
 5 Farallon Bay          2
 6 Garcia Cove           2
 7 Goat Mouth Inlet      2
 8 Guktu Cove 1          2
 9 Hetta Cove            2
10 Kaguk Cove            2
# … with 18 more rows

# which date has observations from the greatest number of sites?
otter %>% group_by(date_DDMMYY) %>% summarise(nsites = n_distinct(site_name)) %>% 
    arrange(desc(nsites))
# A tibble: 48 x 2
   date_DDMMYY nsites
   <fct>        <int>
 1 26/7/18          8
 2 12/6/18          6
 3 21/7/17          6
 4 5/8/17           5
 5 20/8/17          4
 6 25/7/17          4
 7 18/7/18          3
 8 18/8/17          3
 9 22/7/17          3
10 23/7/17          3
# … with 38 more rows

## which site has the greatest number of observation points on any single
## day?
otter %>% group_by(site_name, date_DDMMYY) %>% count() %>% arrange(desc(n))
# A tibble: 101 x 3
# Groups:   site_name, date_DDMMYY [101]
   site_name       date_DDMMYY     n
   <fct>           <fct>       <int>
 1 Big Tree Bay    1/8/18         51
 2 Salt Lake Bay 2 12/7/18        51
 3 Salt Lake Bay 1 13/6/17        50
 4 S33             7/8/17         48
 5 Nossuk Bay 1    21/7/17        42
 6 Blanquizal Bay  6/8/17         41
 7 Guktu Cove 1    18/8/17        41
 8 Kaguk Cove      25/6/17        38
 9 Kaguk Cove      21/7/17        36
10 S32             26/7/17        34
# … with 91 more rows

Mutating data

Tidyverse can also create new variables with mutate(). The first line of code below creates a new name column by pasting together the region and site names. The second line creates a new firstyear column equal to Y when the year is 2017 and N otherwise. Inside mutate(), columns are referenced by name, so there is no need to write otter$year.

temp <- otter %>% mutate(name = paste(region, site_name, sep = "_"))

temp <- otter %>% mutate(firstyear = ifelse(year == 2017, "Y", "N"))
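
Because each assignment to temp above starts again from otter, the second one overwrites the first. If both new columns are wanted at once, they can be created in a single mutate() call. The sketch below assumes a small hypothetical data frame and uses dplyr's if_else(), a stricter variant of ifelse(), in place of the base function.

```r
library(dplyr)

# Hypothetical toy data reusing the otter column names
toy <- data.frame(
  region    = "r1",
  site_name = c("A", "B"),
  year      = c(2017L, 2018L)
)

# Several new columns can be created in one mutate() call
temp <- toy %>%
  mutate(
    name      = paste(region, site_name, sep = "_"),
    firstyear = if_else(year == 2017, "Y", "N")
  )

temp
```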

Slicing and subsetting data

Data can be subset with tidyverse using slice() and top_n(). With slice() you can select any subset of rows by position from throughout the dataframe, while top_n() returns the rows with the highest values of a chosen variable.

The code below selects two different ranges of row numbers from the otter data.

slice(otter, 1:6)

otter %>% slice(5:10)

Try it

Using top_n() select the top 7 cases with the highest number of otters.

Click for solution

top_n(otter, 7, n_otter)
                               region      site_name latitude_N longitude_E
1 west prince of wales island, alaska Blanquizal Bay   55.63223   -133.4351
2 west prince of wales island, alaska   Guktu Cove 2   55.76019   -133.2917
3 west prince of wales island, alaska     Kaguk Cove   55.76019   -133.2917
4 west prince of wales island, alaska   Kinani Point   55.88026   -133.2852
5 west prince of wales island, alaska            S16   54.88742   -132.8360
6 west prince of wales island, alaska            S23   55.25139   -133.2133
7 west prince of wales island, alaska            S33   55.42019   -133.5438
  date_DDMMYY year replicate n_otter
1      5/8/17 2018         2      52
2     31/7/18 2018         2      51
3     31/7/18 2018         2      51
4      5/8/17 2018         2      75
5     23/7/17 2017        NA     150
6     25/7/17 2017        NA      45
7      7/8/17 2017        NA      50

Additional operators for subsetting data

Two further arithmetic operators are modulo (%%) and integer division (%/%). Modulo returns only the remainder of a division, while integer division discards the remainder and returns the whole-number quotient. You can see these operators in action below.

10/3
[1] 3.333333

10%%3
[1] 1

10%/%3
[1] 3
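
As one possible application to subsetting, these operators can be combined with row numbers to pick every nth row or to assign rows to fixed-size blocks. The following is a sketch on hypothetical data. The id and block columns are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data with six rows
toy <- data.frame(id = 1:6)

# Modulo: keep every second row (row numbers with no remainder when divided by 2)
even_rows <- toy %>% filter(row_number() %% 2 == 0)

# Integer division: assign rows to consecutive blocks of three
blocks <- toy %>% mutate(block = (id - 1) %/% 3)

even_rows$id   # 2 4 6
blocks$block   # 0 0 0 1 1 1
```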

Sometimes it can be helpful to subset a random sample of rows from a data table. Let's select 4 random rows from the otter data. Because the sample is random, the rows you get will differ from those shown here.

sample_n(otter, 4)
                               region         site_name latitude_N longitude_E
1 west prince of wales island, alaska               S32   55.63253   -133.4317
2 west prince of wales island, alaska        Kaguk Cove   55.74723   -133.2654
3 west prince of wales island, alaska S Fish Egg Island   55.46268   -133.1769
4 west prince of wales island, alaska   Salt Lake Bay 1   55.68955   -133.3901
  date_DDMMYY year replicate n_otter
1     26/7/17 2017        NA       1
2     31/7/18 2018         2       2
3     21/6/17 2017         1       1
4     13/6/17 2017         1       7
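
If a random sample needs to be reproducible, one common approach is to call set.seed() before sampling. The sketch below uses hypothetical data and an arbitrary seed value (630 is an assumption, any integer works). It also shows slice_sample(), which has superseded sample_n() since dplyr 1.0.0.

```r
library(dplyr)

# Hypothetical toy data with ten rows
toy <- data.frame(id = 1:10)

# Setting the same seed before each draw reproduces the same sample
set.seed(630)  # arbitrary seed chosen for illustration
first_draw <- sample_n(toy, 4)

set.seed(630)
second_draw <- sample_n(toy, 4)

identical(first_draw, second_draw)  # TRUE

# slice_sample() is the newer equivalent of sample_n()
newer_draw <- toy %>% slice_sample(n = 4)
```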

ANTH630: R Tutorial 5 - Data wrangling with tidyverse

Madeline Brown

12/31/2020
