T-tests allow us to test whether or not two sample means are significantly different. R has several tools for conducting t-tests. We will cover some today using example code and data from Drennan and the R Companion for Drennan.
The first comparisons we will make are between the floor area of houses from both the formative and classic periods. Load in the data and create separate vectors for each time period.
HouseFloors <- read.csv("https://maddiebrown.github.io/ANTH630/data/drennan_datasets/HouseFloors.csv")
#
Formative <- HouseFloors %>% filter(HouseFloors$Period == "Formative") %>% pull(Area)
Classic <- HouseFloors %>% filter(HouseFloors$Period == "Classic") %>% pull(Area) #pull creates a vector, while select creates a dataframe
A quick way to compare sample means is through creating a boxplot. This displays the range of value in both samples.
boxplot(list(Formative = Formative, Classic = Classic))
We can also calculate descriptive stats for each group and run normality tests using stat.desc()
from the pastecs
package.
library(pastecs)
by(HouseFloors$Area, HouseFloors$Period, stat.desc, basic = F, norm = T) # this last argument runs normality tests and if the results are non-significant, this means the errors in the model are likely normally distributed
HouseFloors$Period: Classic
median mean SE.mean CI.mean.0.95 var std.dev
26.3000000 26.2500000 0.6252752 1.2552924 20.3303922 4.5089236
coef.var skewness skew.2SE kurtosis kurt.2SE normtest.W
0.1717685 -0.2442232 -0.3695718 -0.5570531 -0.4284409 0.9845577
normtest.p
0.7315843
------------------------------------------------------------
HouseFloors$Period: Formative
median mean SE.mean CI.mean.0.95 var std.dev
24.3000000 23.7750000 0.5978119 1.2192453 11.4361290 3.3817346
coef.var skewness skew.2SE kurtosis kurt.2SE normtest.W
0.1422391 -0.3110996 -0.3753096 -0.3901647 -0.2410295 0.9816014
normtest.p
0.8445857
Because the normality tests are not significant, we can proceed with conducting t-tests to compare the means.
Carlson has created some useful code in the Drennan companion for creating descriptive statistics tables and bullet graphs, which are not easily created in R. We will define his functions with the code below.
Dstats <- function(x) {
n <- length(x)
Md <- median(x)
Xbar <- mean(x)
IQR <- IQR(x, type = 2)
s <- sd(x)
SE <- sd(x)/sqrt(length(x))
return(c(n = n, Md = Md, Xbar = Xbar, IQR = IQR, s = s, SE = SE))
}
BPlot <- function(x, mwid = 0.04, ...) {
val <- max(sapply(x, function(s) sd(s)/sqrt(length(s)))) * 3
yr <- range(sapply(x, mean)) + c(-val, val)
plot(NA, xlim = c(0.5, length(x) + 0.5), ylim = yr, xaxt = "n", xlab = "", las = 1,
...)
axis(1, seq_along(x), names(x))
for (i in seq_along(x)) {
tval <- qt(c(0.005, 0.025, 0.1, 0.9, 0.975, 0.995), length(x[[i]]) - 1)
mn <- mean(x[[i]])
se <- sd(x[[i]])/sqrt(length(x[[i]]))
y <- mn + se * tval
segments(i, y[1], i, y[6], lwd = 1, lend = "square")
segments(i, y[2], i, y[5], lwd = 5, lend = "square")
segments(i, y[3], i, y[4], lwd = 9, lend = "square")
segments(i - mwid, mn, i + mwid, mn, lend = "square")
}
}
Now we can use this function to create a bullet plot. Each level on the plot denotes a different confidence level. The smallest level is 90%, then 95% and then 99%. From these plots, we can see that there is less overlap in the confidence intervals of the two distributions compared to what was seen in distributions of the boxplot alone.
Areas <- list(Formative = Formative, Classic = Classic)
BPlot(Areas, mwid = 0.06, ylab = bquote(x^2))
You can also manually calculate the confidence ranges at different confidence levels. Carlson’s code for doing so can be found below:
n <- length(Formative)
mn <- mean(Formative)
tval <- qt(0.9, n - 1)
ER <- sd(Formative)/sqrt(n) * tval
cat("80%-Confidence level =", round(mn, 1), "±", round(ER, 2), "m2\n")
80%-Confidence level = 23.8 ± 0.78 m2
# 80%-Confidence level = 23.8 ± 0.78 m2
Run a t-test to compare whether or not the mean floor area is significantly different between the formative and classic periods. This can be completed using the two vectors we created previously. Looking at the 95% confidence interval, the true difference between the means is likely to be between -4.19 and -0.75 in 95% of samples. This means that it is unlikely that the true difference between the two sample means is zero and that they are from different populations.
t.test(Formative, Classic)
Welch Two Sample t-test
data: Formative and Classic
t = -2.861, df = 78.686, p-value = 0.005406
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.1969878 -0.7530122
sample estimates:
mean of x mean of y
23.775 26.250
You can also run a t-test using formula notation.
t.test(Area ~ Period, data = HouseFloors)
Welch Two Sample t-test
data: Area by Period
t = 2.861, df = 78.686, p-value = 0.005406
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.7530122 4.1969878
sample estimates:
mean in group Classic mean in group Formative
26.250 23.775
The version of the t-test we ran above is the more conservative version of a t-test, where it is not assumed that the variances between samples are equal. This will generally create a lower threshold for significance than when the variances are equal.
There is a function var.test()
in R that allows us to compare variances across samples. See Carlson for additional details and explanation of the results.
var.test(Formative, Classic)
F test to compare two variances
data: Formative and Classic
F = 0.56251, num df = 31, denom df = 51, p-value = 0.08948
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.3038643 1.0952839
sample estimates:
ratio of variances
0.5625139
The output of this F test comparing variance is not significant at the \(p<0.5\) level, so there is no significant difference in variances between the two samples. This suggests that the adjusted t-test we used above was not necessary. We can re-run the t-test, this time adding in the argument var.equal=T
, which adjusts the t-test by assuming that the variances are equal.
t.test(Area ~ Period, data = HouseFloors, var.equal = T)
Two Sample t-test
data: Area by Period
t = 2.6742, df = 82, p-value = 0.009038
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.6338816 4.3161184
sample estimates:
mean in group Classic mean in group Formative
26.250 23.775
The results from this t-test are the same. In some cases, this can make a difference, so pay attention to the variance in choosing which version of the t-test to use.
Data are often structured in long or wide formats. Using pivot_longer()
and pivot_wider()
. Let’s look at an example with some household livestock data.
First read in the data and load the tidyverse library.
library(tidyverse)
livestock <- read.csv("https://maddiebrown.github.io/ANTH630/data/livestock.csv")
livestock
HH Cow Sheep Pig
1 A 3 2 2
2 B 1 4 2
3 C 2 3 1
4 D 2 3 3
5 E 2 2 1
6 F 1 1 1
These data are in a long format. The columns “Sheep” “Pig” and “Cow” are all types of animals and their values are measured in the same units (counts). Because of this, we can convert the data into a long format where the animal columns are collapsed into a single column “animals” and the values are put into a new “count” column.
livestock_long <- livestock %>% pivot_longer(!HH, names_to = "animal", values_to = "count")
livestock_long
# A tibble: 18 x 3
HH animal count
<fct> <chr> <int>
1 A Cow 3
2 A Sheep 2
3 A Pig 2
4 B Cow 1
5 B Sheep 4
6 B Pig 2
7 C Cow 2
8 C Sheep 3
9 C Pig 1
10 D Cow 2
11 D Sheep 3
12 D Pig 3
13 E Cow 2
14 E Sheep 2
15 E Pig 1
16 F Cow 1
17 F Sheep 1
18 F Pig 1
Suppose we wanted to convert the data back into a wide format? We can use pivot_wider()
to do just this.
livestock_long %>% pivot_wider(names_from = "animal", values_from = "count")
# A tibble: 6 x 4
HH Cow Sheep Pig
<fct> <int> <int> <int>
1 A 3 2 2
2 B 1 4 2
3 C 2 3 1
4 D 2 3 3
5 E 2 2 1
6 F 1 1 1
Sometimes values in a table are not in a tidy format or you need to separate out one part of the value to use in a different analysis.
For this next part of the lesson, we will use real survey data on Native foods in Yukon River communities. Full citation and additional information: Philip A. Loring, Anne Beaudreau, and Cecile Tang. 2020. Alaska Native Service Survey of Native Foods, Yukon River communities, 1940s-1970s. Arctic Data Center. doi:10.18739/A2GX44V7K. Link
Let’s load the data and take a look at the Func.Grp
column.
reindeer <- read.csv("https://maddiebrown.github.io/ANTH630/data/Loring_et_al_reindeer_records.csv",
stringsAsFactors = F)
head(reindeer$Func.Grp)
[1] "Freshwater Fish" "Freshwater Fish" "Freshwater Fish"
[4] "Freshwater Fish" "Large Land Mammal" "Large Land Mammal"
What if we wanted to separate out the two words in the functional groups. Perhaps we know that we are interested in fish, so making a new column that divides the type of fish and the even more general category of “fish” might be useful.
First, let’s subset out only the rows that involve fish. Then, we can separate the Func.Grp
column in the new datatable.
fish <- reindeer %>% filter(str_detect(Func.Grp, "Fish"))
head(fish %>% separate(Func.Grp, c("Type", "Animal")))
Record.ID Community Region Year Decade Statehood N.dogs Food.type
1 1 Arctic Village Upper 1960 1960 post 75 Trout
2 2 Arctic Village Upper 1960 1960 post 75 Whitefish
3 3 Arctic Village Upper 1960 1960 post 75 Pike
4 4 Arctic Village Upper 1960 1960 post 75 Grayling
5 7 Arctic Village Upper 1961 1960 post 75 Trout
6 8 Arctic Village Upper 1961 1960 post 75 Whitefish
Common.name Type Animal
1 Trout Freshwater Fish
2 Whitefish Freshwater Fish
3 Pike Freshwater Fish
4 Grayling Freshwater Fish
5 Trout Freshwater Fish
6 Whitefish Freshwater Fish
What if we wanted to unite two columns together into a new column? Perhaps we want to join the region and community. Try using unite()
to achieve this outcome.
Click for solution
head(reindeer %>% unite(New, Community, Region, sep = ":"))
Record.ID New Year Decade Statehood N.dogs Food.type
1 1 Arctic Village:Upper 1960 1960 post 75 Trout
2 2 Arctic Village:Upper 1960 1960 post 75 Whitefish
3 3 Arctic Village:Upper 1960 1960 post 75 Pike
4 4 Arctic Village:Upper 1960 1960 post 75 Grayling
5 5 Arctic Village:Upper 1960 1960 post 75 Caribou
6 6 Arctic Village:Upper 1960 1960 post 75 Moose
Common.name Func.Grp
1 Trout Freshwater Fish
2 Whitefish Freshwater Fish
3 Pike Freshwater Fish
4 Grayling Freshwater Fish
5 Caribou Large Land Mammal
6 Moose Large Land Mammal
Renaming columns with tidyverse is accomplished using rename()
. The code below renames the Func.Grp
into FunctionalGroup
.
reindeer <- rename(reindeer, FunctionalGroup = Func.Grp)
Now we have a new FunctionalGroup
column but it is way at the back of the dataframe. Move the column to just after the Record.ID
using relocate()
.
Click for solution
head(reindeer %>% relocate(FunctionalGroup, .after = Record.ID))
Record.ID FunctionalGroup Community Region Year Decade Statehood
1 1 Freshwater Fish Arctic Village Upper 1960 1960 post
2 2 Freshwater Fish Arctic Village Upper 1960 1960 post
3 3 Freshwater Fish Arctic Village Upper 1960 1960 post
4 4 Freshwater Fish Arctic Village Upper 1960 1960 post
5 5 Large Land Mammal Arctic Village Upper 1960 1960 post
6 6 Large Land Mammal Arctic Village Upper 1960 1960 post
N.dogs Food.type Common.name
1 75 Trout Trout
2 75 Whitefish Whitefish
3 75 Pike Pike
4 75 Grayling Grayling
5 75 Caribou Caribou
6 75 Moose Moose