Tidyverse is a suite of packages for R that follow the principles of tidy data. Today’s tutorial introduces key functions from various packages in the tidyverse that are essential for advanced data wrangling and analysis.
First load the library
library(tidyverse)
Usually, when we first open a new data file we might us head()
or str()
to get a sense of the data structure and values. In tidyverse, we can use glimpse()
to view similar data.
Let’s look at the otter data from previous weeks using glimpse()
.
otter <- read.csv("https://maddiebrown.github.io/ANTH630/data/sea_otter_counts_2017&2018_CLEANDATA.csv")
glimpse(otter)
Rows: 1,337
Columns: 8
$ region <fct> "west prince of wales island, alaska", "west prince of wal…
$ site_name <fct> Big Clam Bay, Big Clam Bay, Big Tree Bay, Big Tree Bay, Bi…
$ latitude_N <dbl> 55.19457, 55.22769, 55.56796, 55.57610, 55.57079, 55.57278…
$ longitude_E <dbl> -132.9670, -132.9739, -133.1710, -133.2199, -133.2262, -13…
$ date_DDMMYY <fct> 18/7/18, 26/7/18, 18/7/18, 18/7/18, 18/7/18, 18/7/18, 18/7…
$ year <int> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018…
$ replicate <int> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ n_otter <int> 1, 0, 1, 2, 1, 1, 3, 1, 13, 1, 1, 12, 2, 2, 1, 1, 1, 1, 1,…
From the output, we can see the different variable type and a sample of values from each column.
A key component of tidyverse is the ability to pipe together multiple functions with %>%
. This allows the completion of multiple data transformations or analyses within the same line of code.
Let’s try out using a pipe to calculate the mean number of otters observed in our dataset.
otter %>% summarise(mean = mean(n_otter, na.rm = T))
mean
1 2.384443
Notice that because the pipe is operating on the otter
dataset, there is no need to reference the dataset again when we refer to the column n_otter
. In addition, the summarise()
function is handy for any type of summary procedure. In this case, we summarize the n_otter
column by applying the function mean()
to it.
Tidyverse contains powerful data selection and filtering tools. filter()
subsets rows according to a condition while select()
subsets dataframes by variables.
Use filter()
to subset any rows where n_otter
is greater than 15.
Use select()
to choose only the n_otter column.
Click for solution
otter %>% filter(n_otter > 15)
otter %>% dplyr::select(n_otter)
Columns can also be selected based on the strings in their names. This can be helpful when you have a large number of variables that can be intuitively subset.
# pull out the latitude and longitude based on their shared first letter otter
# %>% dplyr::select(starts_with('l')) or based on a common string otter %>%
# dplyr::select(contains('tude'))
We can also match variable names based on strings in another vector. This can be useful if you need to subset based on another set of criteria or a certain theme within your broader dataset. In this example, we have a vector called timetopics
that contains a series of words related to time. Using select()
and any_of()
we can pull out all the variables from the otter dataset that match values in this vector. Any values that are not matched will be excluded.
timetopics <- c("year", "time", "date_DDMMYY", "minutes")
head(otter %>% dplyr::select(any_of(timetopics))) # head is used to prevent a long output in the tutorial
year date_DDMMYY
1 2018 18/7/18
2 2018 26/7/18
3 2018 18/7/18
4 2018 18/7/18
5 2018 18/7/18
6 2018 18/7/18
There are additional selection functions such as: matches()
, all_of()
, and ends_with()
among others. You are encouraged to try out these various functions to learn their usecases.
Previously we grouped data using aggregate. In tidyverse, we have more control over selecting and linking multiple variables using group_by()
. The code below groups the otter dataset by site names and counts how many rows there are for each site.
otter %>% group_by(site_name) %>% count()
# A tibble: 43 × 2
# Groups: site_name [43]
site_name n
<fct> <int>
1 Big Clam Bay 2
2 Big Tree Bay 81
3 Blanquizal Bay 84
4 Chusini Cove 1 61
5 Chusini Cove 2 48
6 Dunbar Inlet 22
7 Farallon Bay 2
8 Garcia Cove 61
9 Goat Mouth Inlet 4
10 Guktu Cove 1 44
# … with 33 more rows
You can also count the number of observations per group using tally()
.
otter %>% tally(n_otter)
n
1 3188
A few weeks ago we used aggregate()
to group together and summarize variables. Let’s return to the otter dataset and reanalyze it using tidyverse. Below are two ways of summarising the mean number of otters per observation at each site.
otter %>% group_by(site_name) %>% summarise(mean = mean(n_otter), sum = sum(n_otter))
# A tibble: 43 × 3
site_name mean sum
<fct> <dbl> <int>
1 Big Clam Bay 0.5 1
2 Big Tree Bay 1.99 161
3 Blanquizal Bay 2.64 222
4 Chusini Cove 1 1.77 108
5 Chusini Cove 2 2.92 140
6 Dunbar Inlet 1.68 37
7 Farallon Bay 0 0
8 Garcia Cove 2.21 135
9 Goat Mouth Inlet 0 0
10 Guktu Cove 1 3.05 134
# … with 33 more rows
notterpersite <- aggregate(formula = n_otter ~ site_name, FUN = sum, data = otter)
And how to select the top 5 sites with the most otter sightings overall.
notterpersite <- notterpersite[order(notterpersite$n_otter, decreasing = T), ]
top5 <- notterpersite[1:5, ]
otter %>% group_by(site_name) %>% summarise(sum = sum(n_otter)) %>% arrange(desc(sum)) %>%
top_n(5)
# A tibble: 5 × 2
site_name sum
<fct> <int>
1 Kaguk Cove 283
2 Shinaku Inlet 246
3 Blanquizal Bay 222
4 Salt Lake Bay 1 208
5 S16 197
How many observations are in the dataset?
otter %>% summarise(n = n())
n
1 1337
Tidyverse also has functions for identifying each distinct or unique value in a data table. Using distinct()
and n_distinct()
we observe that in the otter data, there is only one region.
otter %>% distinct(region)
region
1 west prince of wales island, alaska
n_distinct(otter$region)
[1] 1
When applied to a whole dataframe, distinct()
can also be used to remove any duplicate rows and retain only unique ones.
str(distinct(otter)) #note how many duplicate rows are removed
'data.frame': 1326 obs. of 8 variables:
$ region : Factor w/ 1 level "west prince of wales island, alaska": 1 1 1 1 1 1 1 1 1 1 ...
$ site_name : Factor w/ 43 levels "Big Clam Bay",..: 1 1 2 2 2 2 2 2 2 2 ...
$ latitude_N : num 55.2 55.2 55.6 55.6 55.6 ...
$ longitude_E: num -133 -133 -133 -133 -133 ...
$ date_DDMMYY: Factor w/ 48 levels "1/8/18","10/8/18",..: 11 30 11 11 11 11 11 11 11 11 ...
$ year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
$ replicate : int 1 2 1 1 1 1 1 1 1 1 ...
$ n_otter : int 1 0 1 2 1 1 3 1 13 1 ...
Using tidyverse, let’s reanswer some of the questions from a few weeks ago and add in a few more.
Select the latitude and longitude of the site with the highest number of otter sightings on any single day.
Which site had the most days with observations in 2017?
On which date are there observations from the greatest number of sites?
Which site has the greatest number of observations from any single day? Hint: check the group_by()
help file if you are stuck.
Click for solution
## compare methods for selecting the latitude and longitude of the site with the
## highest number of observed otters on a single day
otter[otter$n_otter == max(otter$n_otter), c("latitude_N", "longitude_E")]
latitude_N longitude_E
899 54.88742 -132.836
otter %>% filter(n_otter == max(n_otter)) %>% dplyr::select(latitude_N, longitude_E) #depending on what other packages are loaded, sometimes you have to directly specify which package a function should be drawn from.
latitude_N longitude_E
1 54.88742 -132.836
## Select the name of the site with the most days of observations in 2017.
otter %>% filter(year == "2017") %>% group_by(site_name) %>% dplyr::summarise(nday = n_distinct(date_DDMMYY)) %>%
arrange(desc(nday))
# A tibble: 28 × 2
site_name nday
<fct> <int>
1 N Fish Egg Island 3
2 Blanquizal Bay 2
3 Chusini Cove 1 2
4 Dunbar Inlet 2
5 Farallon Bay 2
6 Garcia Cove 2
7 Goat Mouth Inlet 2
8 Guktu Cove 1 2
9 Hetta Cove 2
10 Kaguk Cove 2
# … with 18 more rows
# which date has observations from the greatest number of sites?
otter %>% group_by(date_DDMMYY) %>% summarise(nsites = n_distinct(site_name)) %>%
arrange(desc(nsites))
# A tibble: 48 × 2
date_DDMMYY nsites
<fct> <int>
1 26/7/18 8
2 12/6/18 6
3 21/7/17 6
4 5/8/17 5
5 20/8/17 4
6 25/7/17 4
7 18/7/18 3
8 18/8/17 3
9 22/7/17 3
10 23/7/17 3
# … with 38 more rows
## which which site has the greatest number of observation points on any single
## day?
otter %>% group_by(site_name, date_DDMMYY) %>% count() %>% arrange(desc(n))
# A tibble: 101 × 3
# Groups: site_name, date_DDMMYY [101]
site_name date_DDMMYY n
<fct> <fct> <int>
1 Big Tree Bay 1/8/18 51
2 Salt Lake Bay 2 12/7/18 51
3 Salt Lake Bay 1 13/6/17 50
4 S33 7/8/17 48
5 Nossuk Bay 1 21/7/17 42
6 Blanquizal Bay 6/8/17 41
7 Guktu Cove 1 18/8/17 41
8 Kaguk Cove 25/6/17 38
9 Kaguk Cove 21/7/17 36
10 S32 26/7/17 34
# … with 91 more rows
Tidyverse also has the ability to create new variable with mutate()
. The code below creates a new name
column by pasting together the region and site names. The second line of code creates a new firstyear
column equal to Y
when the year is 2017 and N
when the year is not 2017.
temp <- otter %>% mutate(name = paste(region, site_name, sep = "_"))
temp <- otter %>% mutate(firstyear = ifelse(otter$year == "2017", "Y", "N"))
Data can be subset with tidyverse using slice()
and top_n()
. With slice()
you can select any subset of rows from throughout the dataframe, while top_n()
focuses on the highest values according to specific conditions.
The code below selects two different ranges of row numbers from the otter data.
slice(otter, 1:6)
otter %>% slice(5:10)
Using top_n()
select the top 7 cases with the highest number of otters.
Click for solution
top_n(otter, 7, n_otter)
region site_name latitude_N longitude_E
1 west prince of wales island, alaska Blanquizal Bay 55.63223 -133.4351
2 west prince of wales island, alaska Guktu Cove 2 55.76019 -133.2917
3 west prince of wales island, alaska Kaguk Cove 55.76019 -133.2917
4 west prince of wales island, alaska Kinani Point 55.88026 -133.2852
5 west prince of wales island, alaska S16 54.88742 -132.8360
6 west prince of wales island, alaska S23 55.25139 -133.2133
7 west prince of wales island, alaska S33 55.42019 -133.5438
date_DDMMYY year replicate n_otter
1 5/8/17 2018 2 52
2 31/7/18 2018 2 51
3 31/7/18 2018 2 51
4 5/8/17 2018 2 75
5 23/7/17 2017 NA 150
6 25/7/17 2017 NA 45
7 7/8/17 2017 NA 50
Modulo (%%) and integer division (%/%), in modulo, only the remainder is left while in integer division, the remainder is ignored. You can see these functions in action below.
10/3
[1] 3.333333
10%%3
[1] 1
10%/%3
[1] 3
Sometimes it can be helpful to subset a random sample of rows from a datatable. Let’s select 4 random rows from the otter
data.
sample_n(otter, 4)
region site_name latitude_N longitude_E
1 west prince of wales island, alaska Garcia Cove 55.55433 -133.4661
2 west prince of wales island, alaska Soda Bay 55.27065 -132.9931
3 west prince of wales island, alaska Blanquizal Bay 55.62299 -133.3997
4 west prince of wales island, alaska Soda Bay 55.26934 -133.0101
date_DDMMYY year replicate n_otter
1 26/7/17 2017 1 1
2 20/8/17 2017 2 1
3 6/8/17 2017 1 1
4 20/8/17 2017 2 1
Data are often structured in long or wide formats. Using pivot_longer()
and pivot_wider()
. Let’s look at an example with some household livestock data.
First read in the data and load the tidyverse library.
library(tidyverse)
livestock <- read.csv("https://maddiebrown.github.io/ANTH630/data/livestock.csv")
livestock
HH Cow Sheep Pig
1 A 3 2 2
2 B 1 4 2
3 C 2 3 1
4 D 2 3 3
5 E 2 2 1
6 F 1 1 1
These data are in a long format. The columns “Sheep” “Pig” and “Cow” are all types of animals and their values are measured in the same units (counts). Because of this, we can convert the data into a long format where the animal columns are collapsed into a single column “animals” and the values are put into a new “count” column.
livestock_long <- livestock %>% pivot_longer(!HH, names_to = "animal", values_to = "count")
livestock_long
# A tibble: 18 × 3
HH animal count
<fct> <chr> <int>
1 A Cow 3
2 A Sheep 2
3 A Pig 2
4 B Cow 1
5 B Sheep 4
6 B Pig 2
7 C Cow 2
8 C Sheep 3
9 C Pig 1
10 D Cow 2
11 D Sheep 3
12 D Pig 3
13 E Cow 2
14 E Sheep 2
15 E Pig 1
16 F Cow 1
17 F Sheep 1
18 F Pig 1
Suppose we wanted to convert the data back into a wide format? We can use pivot_wider()
to do just this.
livestock_long %>% pivot_wider(names_from = "animal", values_from = "count")
# A tibble: 6 × 4
HH Cow Sheep Pig
<fct> <int> <int> <int>
1 A 3 2 2
2 B 1 4 2
3 C 2 3 1
4 D 2 3 3
5 E 2 2 1
6 F 1 1 1
Sometimes values in a table are not in a tidy format or you need to separate out one part of the value to use in a different analysis.
For this next part of the lesson, we will use real survey data on Native foods in Yukon River communities. Full citation and additional information: Philip A. Loring, Anne Beaudreau, and Cecile Tang. 2020. Alaska Native Service Survey of Native Foods, Yukon River communities, 1940s-1970s. Arctic Data Center. doi:10.18739/A2GX44V7K. Link
Let’s load the data and take a look at the Func.Grp
column.
reindeer <- read.csv("https://maddiebrown.github.io/ANTH630/data/Loring_et_al_reindeer_records.csv",
stringsAsFactors = F)
head(reindeer$Func.Grp)
[1] "Freshwater Fish" "Freshwater Fish" "Freshwater Fish"
[4] "Freshwater Fish" "Large Land Mammal" "Large Land Mammal"
What if we wanted to separate out the two words in the functional groups. Perhaps we know that we are interested in fish, so making a new column that divides the type of fish and the even more general category of “fish” might be useful.
First, let’s subset out only the rows that involve fish. Then, we can separate the Func.Grp
column in the new datatable.
fish <- reindeer %>% filter(str_detect(Func.Grp, "Fish"))
head(fish %>% separate(Func.Grp, c("Type", "Animal")))
Record.ID Community Region Year Decade Statehood N.dogs Food.type
1 1 Arctic Village Upper 1960 1960 post 75 Trout
2 2 Arctic Village Upper 1960 1960 post 75 Whitefish
3 3 Arctic Village Upper 1960 1960 post 75 Pike
4 4 Arctic Village Upper 1960 1960 post 75 Grayling
5 7 Arctic Village Upper 1961 1960 post 75 Trout
6 8 Arctic Village Upper 1961 1960 post 75 Whitefish
Common.name Type Animal
1 Trout Freshwater Fish
2 Whitefish Freshwater Fish
3 Pike Freshwater Fish
4 Grayling Freshwater Fish
5 Trout Freshwater Fish
6 Whitefish Freshwater Fish
What if we wanted to unite two columns together into a new column? Perhaps we want to join the region and community. Try using unite()
to achieve this outcome.
Click for solution
head(reindeer %>% unite(New, Community, Region, sep = ":"))
Record.ID New Year Decade Statehood N.dogs Food.type
1 1 Arctic Village:Upper 1960 1960 post 75 Trout
2 2 Arctic Village:Upper 1960 1960 post 75 Whitefish
3 3 Arctic Village:Upper 1960 1960 post 75 Pike
4 4 Arctic Village:Upper 1960 1960 post 75 Grayling
5 5 Arctic Village:Upper 1960 1960 post 75 Caribou
6 6 Arctic Village:Upper 1960 1960 post 75 Moose
Common.name Func.Grp
1 Trout Freshwater Fish
2 Whitefish Freshwater Fish
3 Pike Freshwater Fish
4 Grayling Freshwater Fish
5 Caribou Large Land Mammal
6 Moose Large Land Mammal
Renaming columns with tidyverse is accomplished using rename()
. The code below renames the Func.Grp
into FunctionalGroup
.
reindeer <- rename(reindeer, FunctionalGroup = Func.Grp)
Now we have a new FunctionalGroup
column but it is way at the back of the dataframe. Move the column to just after the Record.ID
using relocate()
.
Click for solution
head(reindeer %>% relocate(FunctionalGroup, .after = Record.ID))
Record.ID FunctionalGroup Community Region Year Decade Statehood
1 1 Freshwater Fish Arctic Village Upper 1960 1960 post
2 2 Freshwater Fish Arctic Village Upper 1960 1960 post
3 3 Freshwater Fish Arctic Village Upper 1960 1960 post
4 4 Freshwater Fish Arctic Village Upper 1960 1960 post
5 5 Large Land Mammal Arctic Village Upper 1960 1960 post
6 6 Large Land Mammal Arctic Village Upper 1960 1960 post
N.dogs Food.type Common.name
1 75 Trout Trout
2 75 Whitefish Whitefish
3 75 Pike Pike
4 75 Grayling Grayling
5 75 Caribou Caribou
6 75 Moose Moose