Introducing tidyverse

Tidyverse is a suite of packages for R that follow the principles of tidy data. Today’s tutorial introduces key functions from various packages in the tidyverse that are essential for advanced data wrangling and analysis.

First load the library

library(tidyverse)

Usually, when we first open a new data file we might us head() or str() to get a sense of the data structure and values. In tidyverse, we can use glimpse() to view similar data.

Let’s look at the otter data from previous weeks using glimpse().

otter <- read.csv("https://maddiebrown.github.io/ANTH630/data/sea_otter_counts_2017&2018_CLEANDATA.csv")
glimpse(otter)
Rows: 1,337
Columns: 8
$ region      <fct> "west prince of wales island, alaska", "west prince of wal…
$ site_name   <fct> Big Clam Bay, Big Clam Bay, Big Tree Bay, Big Tree Bay, Bi…
$ latitude_N  <dbl> 55.19457, 55.22769, 55.56796, 55.57610, 55.57079, 55.57278…
$ longitude_E <dbl> -132.9670, -132.9739, -133.1710, -133.2199, -133.2262, -13…
$ date_DDMMYY <fct> 18/7/18, 26/7/18, 18/7/18, 18/7/18, 18/7/18, 18/7/18, 18/7…
$ year        <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018…
$ replicate   <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ n_otter     <int> 1, 0, 1, 2, 1, 1, 3, 1, 13, 1, 1, 12, 2, 2, 1, 1, 1, 1, 1,…

From the output, we can see the different variable type and a sample of values from each column.

Pipes

A key component of tidyverse is the ability to pipe together multiple functions with %>%. This allows the completion of multiple data transformations or analyses within the same line of code.

Let’s try out using a pipe to calculate the mean number of otters observed in our dataset.

otter %>% summarise(mean = mean(n_otter, na.rm = T))
      mean
1 2.384443

Notice that because the pipe is operating on the otter dataset, there is no need to reference the dataset again when we refer to the column n_otter. In addition, the summarise() function is handy for any type of summary procedure. In this case, we summarize the n_otter column by applying the function mean() to it.

Filtering and selecting data

Tidyverse contains powerful data selection and filtering tools. filter() subsets rows according to a condition while select() subsets dataframes by variables.

Try it

  1. Use filter() to subset any rows where n_otter is greater than 15.

  2. Use select() to choose only the n_otter column.

Click for solution

otter %>% filter(n_otter > 15)

otter %>% dplyr::select(n_otter)

Columns can also be selected based on the strings in their names. This can be helpful when you have a large number of variables that can be intuitively subset.

# pull out the latitude and longitude based on their shared first letter otter
# %>% dplyr::select(starts_with('l')) or based on a common string otter %>%
# dplyr::select(contains('tude'))

We can also match variable names based on strings in another vector. This can be useful if you need to subset based on another set of criteria or a certain theme within your broader dataset. In this example, we have a vector called timetopics that contains a series of words related to time. Using select() and any_of() we can pull out all the variables from the otter dataset that match values in this vector. Any values that are not matched will be excluded.

timetopics <- c("year", "time", "date_DDMMYY", "minutes")

head(otter %>% dplyr::select(any_of(timetopics)))  # head is used to prevent a long output in the tutorial
  year date_DDMMYY
1 2018     18/7/18
2 2018     26/7/18
3 2018     18/7/18
4 2018     18/7/18
5 2018     18/7/18
6 2018     18/7/18

There are additional selection functions such as: matches(), all_of(), and ends_with() among others. You are encouraged to try out these various functions to learn their usecases.

Grouping data

Previously we grouped data using aggregate. In tidyverse, we have more control over selecting and linking multiple variables using group_by(). The code below groups the otter dataset by site names and counts how many rows there are for each site.

otter %>% group_by(site_name) %>% count()
# A tibble: 43 × 2
# Groups:   site_name [43]
   site_name            n
   <fct>            <int>
 1 Big Clam Bay         2
 2 Big Tree Bay        81
 3 Blanquizal Bay      84
 4 Chusini Cove 1      61
 5 Chusini Cove 2      48
 6 Dunbar Inlet        22
 7 Farallon Bay         2
 8 Garcia Cove         61
 9 Goat Mouth Inlet     4
10 Guktu Cove 1        44
# … with 33 more rows

You can also count the number of observations per group using tally().

otter %>% tally(n_otter)
     n
1 3188

Summarizing data

A few weeks ago we used aggregate() to group together and summarize variables. Let’s return to the otter dataset and reanalyze it using tidyverse. Below are two ways of summarising the mean number of otters per observation at each site.

otter %>% group_by(site_name) %>% summarise(mean = mean(n_otter), sum = sum(n_otter))
# A tibble: 43 × 3
   site_name         mean   sum
   <fct>            <dbl> <int>
 1 Big Clam Bay      0.5      1
 2 Big Tree Bay      1.99   161
 3 Blanquizal Bay    2.64   222
 4 Chusini Cove 1    1.77   108
 5 Chusini Cove 2    2.92   140
 6 Dunbar Inlet      1.68    37
 7 Farallon Bay      0        0
 8 Garcia Cove       2.21   135
 9 Goat Mouth Inlet  0        0
10 Guktu Cove 1      3.05   134
# … with 33 more rows
notterpersite <- aggregate(formula = n_otter ~ site_name, FUN = sum, data = otter)

And how to select the top 5 sites with the most otter sightings overall.

notterpersite <- notterpersite[order(notterpersite$n_otter, decreasing = T), ]
top5 <- notterpersite[1:5, ]

otter %>% group_by(site_name) %>% summarise(sum = sum(n_otter)) %>% arrange(desc(sum)) %>% 
    top_n(5)
# A tibble: 5 × 2
  site_name         sum
  <fct>           <int>
1 Kaguk Cove        283
2 Shinaku Inlet     246
3 Blanquizal Bay    222
4 Salt Lake Bay 1   208
5 S16               197

How many observations are in the dataset?

otter %>% summarise(n = n())
     n
1 1337

Identifying unique values

Tidyverse also has functions for identifying each distinct or unique value in a data table. Using distinct() and n_distinct() we observe that in the otter data, there is only one region.

otter %>% distinct(region)
                               region
1 west prince of wales island, alaska

n_distinct(otter$region)
[1] 1

When applied to a whole dataframe, distinct() can also be used to remove any duplicate rows and retain only unique ones.

str(distinct(otter))  #note how many duplicate rows are removed
'data.frame':   1326 obs. of  8 variables:
 $ region     : Factor w/ 1 level "west prince of wales island, alaska": 1 1 1 1 1 1 1 1 1 1 ...
 $ site_name  : Factor w/ 43 levels "Big Clam Bay",..: 1 1 2 2 2 2 2 2 2 2 ...
 $ latitude_N : num  55.2 55.2 55.6 55.6 55.6 ...
 $ longitude_E: num  -133 -133 -133 -133 -133 ...
 $ date_DDMMYY: Factor w/ 48 levels "1/8/18","10/8/18",..: 11 30 11 11 11 11 11 11 11 11 ...
 $ year       : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
 $ replicate  : int  1 2 1 1 1 1 1 1 1 1 ...
 $ n_otter    : int  1 0 1 2 1 1 3 1 13 1 ...

Try it

Using tidyverse, let’s reanswer some of the questions from a few weeks ago and add in a few more.

  1. Select the latitude and longitude of the site with the highest number of otter sightings on any single day.

  2. Which site had the most days with observations in 2017?

  3. On which date are there observations from the greatest number of sites?

  4. Which site has the greatest number of observations from any single day? Hint: check the group_by() help file if you are stuck.

Click for solution

## compare methods for selecting the latitude and longitude of the site with the
## highest number of observed otters on a single day
otter[otter$n_otter == max(otter$n_otter), c("latitude_N", "longitude_E")]
    latitude_N longitude_E
899   54.88742    -132.836
otter %>% filter(n_otter == max(n_otter)) %>% dplyr::select(latitude_N, longitude_E)  #depending on what other packages are loaded, sometimes you have to directly specify which package a function should be drawn from. 
  latitude_N longitude_E
1   54.88742    -132.836


## Select the name of the site with the most days of observations in 2017.
otter %>% filter(year == "2017") %>% group_by(site_name) %>% dplyr::summarise(nday = n_distinct(date_DDMMYY)) %>% 
    arrange(desc(nday))
# A tibble: 28 × 2
   site_name          nday
   <fct>             <int>
 1 N Fish Egg Island     3
 2 Blanquizal Bay        2
 3 Chusini Cove 1        2
 4 Dunbar Inlet          2
 5 Farallon Bay          2
 6 Garcia Cove           2
 7 Goat Mouth Inlet      2
 8 Guktu Cove 1          2
 9 Hetta Cove            2
10 Kaguk Cove            2
# … with 18 more rows

# which date has observations from the greatest number of sites?
otter %>% group_by(date_DDMMYY) %>% summarise(nsites = n_distinct(site_name)) %>% 
    arrange(desc(nsites))
# A tibble: 48 × 2
   date_DDMMYY nsites
   <fct>        <int>
 1 26/7/18          8
 2 12/6/18          6
 3 21/7/17          6
 4 5/8/17           5
 5 20/8/17          4
 6 25/7/17          4
 7 18/7/18          3
 8 18/8/17          3
 9 22/7/17          3
10 23/7/17          3
# … with 38 more rows

## which which site has the greatest number of observation points on any single
## day?
otter %>% group_by(site_name, date_DDMMYY) %>% count() %>% arrange(desc(n))
# A tibble: 101 × 3
# Groups:   site_name, date_DDMMYY [101]
   site_name       date_DDMMYY     n
   <fct>           <fct>       <int>
 1 Big Tree Bay    1/8/18         51
 2 Salt Lake Bay 2 12/7/18        51
 3 Salt Lake Bay 1 13/6/17        50
 4 S33             7/8/17         48
 5 Nossuk Bay 1    21/7/17        42
 6 Blanquizal Bay  6/8/17         41
 7 Guktu Cove 1    18/8/17        41
 8 Kaguk Cove      25/6/17        38
 9 Kaguk Cove      21/7/17        36
10 S32             26/7/17        34
# … with 91 more rows

Mutating data

Tidyverse also has the ability to create new variable with mutate(). The code below creates a new name column by pasting together the region and site names. The second line of code creates a new firstyear column equal to Y when the year is 2017 and N when the year is not 2017.


temp <- otter %>% mutate(name = paste(region, site_name, sep = "_"))



temp <- otter %>% mutate(firstyear = ifelse(otter$year == "2017", "Y", "N"))

Slicing and subsetting data

Data can be subset with tidyverse using slice() and top_n(). With slice() you can select any subset of rows from throughout the dataframe, while top_n() focuses on the highest values according to specific conditions.

The code below selects two different ranges of row numbers from the otter data.

slice(otter, 1:6)

otter %>% slice(5:10)

Try it

Using top_n() select the top 7 cases with the highest number of otters.

Click for solution

top_n(otter, 7, n_otter)
                               region      site_name latitude_N longitude_E
1 west prince of wales island, alaska Blanquizal Bay   55.63223   -133.4351
2 west prince of wales island, alaska   Guktu Cove 2   55.76019   -133.2917
3 west prince of wales island, alaska     Kaguk Cove   55.76019   -133.2917
4 west prince of wales island, alaska   Kinani Point   55.88026   -133.2852
5 west prince of wales island, alaska            S16   54.88742   -132.8360
6 west prince of wales island, alaska            S23   55.25139   -133.2133
7 west prince of wales island, alaska            S33   55.42019   -133.5438
  date_DDMMYY year replicate n_otter
1      5/8/17 2018         2      52
2     31/7/18 2018         2      51
3     31/7/18 2018         2      51
4      5/8/17 2018         2      75
5     23/7/17 2017        NA     150
6     25/7/17 2017        NA      45
7      7/8/17 2017        NA      50

Additional operators for subsetting data

Modulo (%%) and integer division (%/%), in modulo, only the remainder is left while in integer division, the remainder is ignored. You can see these functions in action below.

10/3
[1] 3.333333

10%%3
[1] 1

10%/%3
[1] 3

Sometimes it can be helpful to subset a random sample of rows from a datatable. Let’s select 4 random rows from the otter data.

sample_n(otter, 4)
                               region      site_name latitude_N longitude_E
1 west prince of wales island, alaska    Garcia Cove   55.55433   -133.4661
2 west prince of wales island, alaska       Soda Bay   55.27065   -132.9931
3 west prince of wales island, alaska Blanquizal Bay   55.62299   -133.3997
4 west prince of wales island, alaska       Soda Bay   55.26934   -133.0101
  date_DDMMYY year replicate n_otter
1     26/7/17 2017         1       1
2     20/8/17 2017         2       1
3      6/8/17 2017         1       1
4     20/8/17 2017         2       1

Advanced tidyverse

Transforming data formats

Data are often structured in long or wide formats. Using pivot_longer() and pivot_wider(). Let’s look at an example with some household livestock data.

First read in the data and load the tidyverse library.

library(tidyverse)
livestock <- read.csv("https://maddiebrown.github.io/ANTH630/data/livestock.csv")
livestock
  HH Cow Sheep Pig
1  A   3     2   2
2  B   1     4   2
3  C   2     3   1
4  D   2     3   3
5  E   2     2   1
6  F   1     1   1

These data are in a long format. The columns “Sheep” “Pig” and “Cow” are all types of animals and their values are measured in the same units (counts). Because of this, we can convert the data into a long format where the animal columns are collapsed into a single column “animals” and the values are put into a new “count” column.

livestock_long <- livestock %>% pivot_longer(!HH, names_to = "animal", values_to = "count")
livestock_long
# A tibble: 18 × 3
   HH    animal count
   <fct> <chr>  <int>
 1 A     Cow        3
 2 A     Sheep      2
 3 A     Pig        2
 4 B     Cow        1
 5 B     Sheep      4
 6 B     Pig        2
 7 C     Cow        2
 8 C     Sheep      3
 9 C     Pig        1
10 D     Cow        2
11 D     Sheep      3
12 D     Pig        3
13 E     Cow        2
14 E     Sheep      2
15 E     Pig        1
16 F     Cow        1
17 F     Sheep      1
18 F     Pig        1

Suppose we wanted to convert the data back into a wide format? We can use pivot_wider() to do just this.

livestock_long %>% pivot_wider(names_from = "animal", values_from = "count")
# A tibble: 6 × 4
  HH      Cow Sheep   Pig
  <fct> <int> <int> <int>
1 A         3     2     2
2 B         1     4     2
3 C         2     3     1
4 D         2     3     3
5 E         2     2     1
6 F         1     1     1

Transforming columns

Sometimes values in a table are not in a tidy format or you need to separate out one part of the value to use in a different analysis.

For this next part of the lesson, we will use real survey data on Native foods in Yukon River communities. Full citation and additional information: Philip A. Loring, Anne Beaudreau, and Cecile Tang. 2020. Alaska Native Service Survey of Native Foods, Yukon River communities, 1940s-1970s. Arctic Data Center. doi:10.18739/A2GX44V7K. Link

Let’s load the data and take a look at the Func.Grp column.

reindeer <- read.csv("https://maddiebrown.github.io/ANTH630/data/Loring_et_al_reindeer_records.csv", 
    stringsAsFactors = F)
head(reindeer$Func.Grp)
[1] "Freshwater Fish"   "Freshwater Fish"   "Freshwater Fish"  
[4] "Freshwater Fish"   "Large Land Mammal" "Large Land Mammal"

What if we wanted to separate out the two words in the functional groups. Perhaps we know that we are interested in fish, so making a new column that divides the type of fish and the even more general category of “fish” might be useful.

First, let’s subset out only the rows that involve fish. Then, we can separate the Func.Grp column in the new datatable.

fish <- reindeer %>% filter(str_detect(Func.Grp, "Fish"))
head(fish %>% separate(Func.Grp, c("Type", "Animal")))
  Record.ID      Community Region Year Decade Statehood N.dogs Food.type
1         1 Arctic Village  Upper 1960   1960      post     75     Trout
2         2 Arctic Village  Upper 1960   1960      post     75 Whitefish
3         3 Arctic Village  Upper 1960   1960      post     75      Pike
4         4 Arctic Village  Upper 1960   1960      post     75  Grayling
5         7 Arctic Village  Upper 1961   1960      post     75     Trout
6         8 Arctic Village  Upper 1961   1960      post     75 Whitefish
  Common.name       Type Animal
1       Trout Freshwater   Fish
2   Whitefish Freshwater   Fish
3        Pike Freshwater   Fish
4    Grayling Freshwater   Fish
5       Trout Freshwater   Fish
6   Whitefish Freshwater   Fish

Try it

What if we wanted to unite two columns together into a new column? Perhaps we want to join the region and community. Try using unite() to achieve this outcome.

Click for solution

head(reindeer %>% unite(New, Community, Region, sep = ":"))
  Record.ID                  New Year Decade Statehood N.dogs Food.type
1         1 Arctic Village:Upper 1960   1960      post     75     Trout
2         2 Arctic Village:Upper 1960   1960      post     75 Whitefish
3         3 Arctic Village:Upper 1960   1960      post     75      Pike
4         4 Arctic Village:Upper 1960   1960      post     75  Grayling
5         5 Arctic Village:Upper 1960   1960      post     75   Caribou
6         6 Arctic Village:Upper 1960   1960      post     75     Moose
  Common.name          Func.Grp
1       Trout   Freshwater Fish
2   Whitefish   Freshwater Fish
3        Pike   Freshwater Fish
4    Grayling   Freshwater Fish
5     Caribou Large Land Mammal
6       Moose Large Land Mammal

Renaming and relocating columns

Renaming columns with tidyverse is accomplished using rename(). The code below renames the Func.Grp into FunctionalGroup.

reindeer <- rename(reindeer, FunctionalGroup = Func.Grp)

Try it

Now we have a new FunctionalGroup column but it is way at the back of the dataframe. Move the column to just after the Record.ID using relocate().

Click for solution

head(reindeer %>% relocate(FunctionalGroup, .after = Record.ID))
  Record.ID   FunctionalGroup      Community Region Year Decade Statehood
1         1   Freshwater Fish Arctic Village  Upper 1960   1960      post
2         2   Freshwater Fish Arctic Village  Upper 1960   1960      post
3         3   Freshwater Fish Arctic Village  Upper 1960   1960      post
4         4   Freshwater Fish Arctic Village  Upper 1960   1960      post
5         5 Large Land Mammal Arctic Village  Upper 1960   1960      post
6         6 Large Land Mammal Arctic Village  Upper 1960   1960      post
  N.dogs Food.type Common.name
1     75     Trout       Trout
2     75 Whitefish   Whitefish
3     75      Pike        Pike
4     75  Grayling    Grayling
5     75   Caribou     Caribou
6     75     Moose       Moose