T-tests allow us to test whether or not two sample means are significantly different. R has several tools for conducting t-tests. We will cover some today using example code and data from Drennan and the R Companion for Drennan.
The first comparisons we will make are between the floor area of houses from both the formative and classic periods. Load in the data and create separate vectors for each time period.
HouseFloors <- read.csv("https://maddiebrown.github.io/ANTH630/data/drennan_datasets/HouseFloors.csv")
#
Formative <- HouseFloors %>% filter(HouseFloors$Period == "Formative") %>% pull(Area)
Classic <- HouseFloors %>% filter(HouseFloors$Period == "Classic") %>% pull(Area) #pull creates a vector, while select creates a dataframe
A quick way to compare sample means is through creating a boxplot. This displays the range of value in both samples.
boxplot(list(Formative = Formative, Classic = Classic))
We can also calculate descriptive stats for each group and run normality tests using stat.desc()
from the pastecs
package.
library(pastecs)
by(HouseFloors$Area, HouseFloors$Period, stat.desc, basic = F, norm = T) # this last argument runs normality tests and if the results are non-significant, this means the errors in the model are likely normally distributed
HouseFloors$Period: Classic
median mean SE.mean CI.mean.0.95 var std.dev
26.3000000 26.2500000 0.6252752 1.2552924 20.3303922 4.5089236
coef.var skewness skew.2SE kurtosis kurt.2SE normtest.W
0.1717685 -0.2442232 -0.3695718 -0.5570531 -0.4284409 0.9845577
normtest.p
0.7315843
------------------------------------------------------------
HouseFloors$Period: Formative
median mean SE.mean CI.mean.0.95 var std.dev
24.3000000 23.7750000 0.5978119 1.2192453 11.4361290 3.3817346
coef.var skewness skew.2SE kurtosis kurt.2SE normtest.W
0.1422391 -0.3110996 -0.3753096 -0.3901647 -0.2410295 0.9816014
normtest.p
0.8445857
Because the normality tests are not significant, we can proceed with conducting t-tests to compare the means.
Carlson has created some useful code in the Drennan companion for creating descriptive statistics tables and bullet graphs, which are not easily created in R. We will define his functions with the code below.
Dstats <- function(x) {
n <- length(x)
Md <- median(x)
Xbar <- mean(x)
IQR <- IQR(x, type = 2)
s <- sd(x)
SE <- sd(x)/sqrt(length(x))
return(c(n = n, Md = Md, Xbar = Xbar, IQR = IQR, s = s, SE = SE))
}
BPlot <- function(x, mwid = 0.04, ...) {
val <- max(sapply(x, function(s) sd(s)/sqrt(length(s)))) * 3
yr <- range(sapply(x, mean)) + c(-val, val)
plot(NA, xlim = c(0.5, length(x) + 0.5), ylim = yr, xaxt = "n", xlab = "", las = 1,
...)
axis(1, seq_along(x), names(x))
for (i in seq_along(x)) {
tval <- qt(c(0.005, 0.025, 0.1, 0.9, 0.975, 0.995), length(x[[i]]) - 1)
mn <- mean(x[[i]])
se <- sd(x[[i]])/sqrt(length(x[[i]]))
y <- mn + se * tval
segments(i, y[1], i, y[6], lwd = 1, lend = "square")
segments(i, y[2], i, y[5], lwd = 5, lend = "square")
segments(i, y[3], i, y[4], lwd = 9, lend = "square")
segments(i - mwid, mn, i + mwid, mn, lend = "square")
}
}
Now we can use this function to create a bullet plot. Each level on the plot denotes a different confidence level. The smallest level is 90%, then 95% and then 99%. From these plots, we can see that there is less overlap in the confidence intervals of the two distributions compared to what was seen in distributions of the boxplot alone.
Areas <- list(Formative = Formative, Classic = Classic)
BPlot(Areas, mwid = 0.06, ylab = bquote(x^2))
You can also manually calculate the confidence ranges at different confidence levels. Carlson’s code for doing so can be found below:
n <- length(Formative)
mn <- mean(Formative)
tval <- qt(0.9, n - 1)
ER <- sd(Formative)/sqrt(n) * tval
cat("80%-Confidence level =", round(mn, 1), "±", round(ER, 2), "m2\n")
80%-Confidence level = 23.8 ± 0.78 m2
# 80%-Confidence level = 23.8 ± 0.78 m2
Run a t-test to compare whether or not the mean floor area is significantly different between the formative and classic periods. This can be completed using the two vectors we created previously. Looking at the 95% confidence interval, the true difference between the means is likely to be between -4.19 and -0.75 in 95% of samples. This means that it is unlikely that the true difference between the two sample means is zero and that they are from different populations.
t.test(Formative, Classic)
Welch Two Sample t-test
data: Formative and Classic
t = -2.861, df = 78.686, p-value = 0.005406
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.1969878 -0.7530122
sample estimates:
mean of x mean of y
23.775 26.250
You can also run a t-test using formula notation.
t.test(Area ~ Period, data = HouseFloors)
Welch Two Sample t-test
data: Area by Period
t = 2.861, df = 78.686, p-value = 0.005406
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.7530122 4.1969878
sample estimates:
mean in group Classic mean in group Formative
26.250 23.775
The version of the t-test we ran above is the more conservative version of a t-test, where it is not assumed that the variances between samples are equal. This will generally create a lower threshold for significance than when the variances are equal.
There is a function var.test()
in R that allows us to compare variances across samples. See Carlson for additional details and explanation of the results.
var.test(Formative, Classic)
F test to compare two variances
data: Formative and Classic
F = 0.56251, num df = 31, denom df = 51, p-value = 0.08948
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.3038643 1.0952839
sample estimates:
ratio of variances
0.5625139
The output of this F test comparing variance is not significant at the \(p<0.5\) level, so there is no significant difference in variances between the two samples. This suggests that the adjusted t-test we used above was not necessary. We can re-run the t-test, this time adding in the argument var.equal=T
, which adjusts the t-test by assuming that the variances are equal.
t.test(Area ~ Period, data = HouseFloors, var.equal = T)
Two Sample t-test
data: Area by Period
t = 2.6742, df = 82, p-value = 0.009038
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.6338816 4.3161184
sample estimates:
mean in group Classic mean in group Formative
26.250 23.775
The results from this t-test are the same. In some cases, this can make a difference, so pay attention to the variance in choosing which version of the t-test to use.
Suppose we want to compare the means of several groups to see whether or not they are different? This is where ANOVA comes in handy. Comparing more than two sample means usually starts with examining the basic descriptive statistics and plotting the frequency distributions of each group.
apple <- read.csv("https://maddiebrown.github.io/ANTH630/data/owan02.csv")
# look at the structure of the data
str(apple)
'data.frame': 8 obs. of 5 variables:
$ X1: int 2569 2928 2865 3844 3027 2336 3211 3037
$ X2: int 2074 2885 3378 3906 2782 3018 3383 3447
$ X3: int 2505 2315 2667 2390 3021 3085 3308 3231
$ X4: int 2838 2351 3001 2439 2199 3318 3601 3291
$ X5: int 1532 2552 3083 2330 2079 3366 2416 3100
# add in a temporary rowID to help with pivoting
apple$ID <- c(1:nrow(apple))
# pivot the data from wide to long
apple <- apple %>% pivot_longer(!ID, names_to = "Type", values_to = "Growth")
library(pastecs)
by(apple$Growth, apple$Type, stat.desc)
apple$Type: X1
nbr.val nbr.null nbr.na min max range
8.000000e+00 0.000000e+00 0.000000e+00 2.336000e+03 3.844000e+03 1.508000e+03
sum median mean SE.mean CI.mean.0.95 var
2.381700e+04 2.977500e+03 2.977125e+03 1.583867e+02 3.745250e+02 2.006907e+05
std.dev coef.var
4.479852e+02 1.504758e-01
------------------------------------------------------------
apple$Type: X2
nbr.val nbr.null nbr.na min max range
8.000000e+00 0.000000e+00 0.000000e+00 2.074000e+03 3.906000e+03 1.832000e+03
sum median mean SE.mean CI.mean.0.95 var
2.487300e+04 3.198000e+03 3.109125e+03 1.951877e+02 4.615455e+02 3.047858e+05
std.dev coef.var
5.520741e+02 1.775658e-01
------------------------------------------------------------
apple$Type: X3
nbr.val nbr.null nbr.na min max range
8.000000e+00 0.000000e+00 0.000000e+00 2.315000e+03 3.308000e+03 9.930000e+02
sum median mean SE.mean CI.mean.0.95 var
2.252200e+04 2.844000e+03 2.815250e+03 1.388775e+02 3.283931e+02 1.542956e+05
std.dev coef.var
3.928048e+02 1.395275e-01
------------------------------------------------------------
apple$Type: X4
nbr.val nbr.null nbr.na min max range
8.000000e+00 0.000000e+00 0.000000e+00 2.199000e+03 3.601000e+03 1.402000e+03
sum median mean SE.mean CI.mean.0.95 var
2.303800e+04 2.919500e+03 2.879750e+03 1.810236e+02 4.280527e+02 2.621562e+05
std.dev coef.var
5.120119e+02 1.777974e-01
------------------------------------------------------------
apple$Type: X5
nbr.val nbr.null nbr.na min max range
8.000000e+00 0.000000e+00 0.000000e+00 1.532000e+03 3.366000e+03 1.834000e+03
sum median mean SE.mean CI.mean.0.95 var
2.045800e+04 2.484000e+03 2.557250e+03 2.144990e+02 5.072095e+02 3.680785e+05
std.dev coef.var
6.066947e+02 2.372450e-01
Next we can plot the distribution of observations for each of the cultivar types.
boxplot(Growth ~ Type, apple)
Plot the means with error bars denoting the confidence intervals for each mean.
# adapted from Field et al. 2012: 157
ggplot(apple, aes(Type, Growth)) + stat_summary(fun.y = mean, geom = "point") + stat_summary(fun.y = mean,
geom = "line", aes(group = 1), colour = "purple", linetype = 2) + stat_summary(fun.data = mean_cl_boot,
geom = "errorbar", width = 0.3)
ANOVA assumes that the variances in the groups are homogeneous and that the data are normally distributed. These assumptions can be checked using leveneTest()
. If the results are non-significant, we assume that the variances are homogeneous and the data are normal and can proceed with an ANOVA.
# leveneTest(outcome variable, group, center=median/mean) # from Field et al
# 2012: 438
library(car)
leveneTest(apple$Growth, apple$Type)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 4 0.4284 0.7871
35
ANOVA models are created with the basic formula outlined below and viewed using summary()
.
# model <- aov(outcome~predictor, data=dataframe) summary(model)
Let’s run an ANOVA on a real data set.
applemodel <- aov(Growth ~ Type, data = apple)
summary(applemodel)
The output from the ANOVA model tells us the effect of type on growth as explained by the model and that explained by random variation (residuals) (see Field et al. 2012: 439). In this case, we see that there is a 28.4% chance of getting an F-value this size due to chance, which means that the type of cultivar did not have a significant effect on growth.
We can plot the model results. This will create 4 plots, the first two of which Field et al. 2012 state are useful for understanding ANOVA models. In the first, there should be clear lines for each group in the sample. If instead the data are spread out or shaped oddly this suggests the data are not normal. In the second QQ plot, the residuals should follow the line.
par(mfrow = c(2, 2))
applemodel <- aov(Growth ~ Type, data = apple)
plot(applemodel)
Sometimes, the data will not have homogenous variance. In this case, you can adjust the ANOVA based on the heterogeneity in group variance using Welch’s F (Field et al. 2012: 441). This is a one-way ANOVA that also produces an F-ratio.
oneway.test(outcome ~ predictor, data = dataframe)
ANOVAs can only tell us whether or not there is a significant difference between the means of several groups. It cannot tell us which groups are different nor if the differences are significant. We can use several different types of post-hoc tests to answer these questions.
A pairwise t-test conducts a t-test for each set of pairs in the data. The basic model of a pairwise t-test can be found below. This corrects for the fact that we are conducting multiple tests. There are multiple methods for adjusting the p-value available. See the help file for details. Available methods include Bonferroni, where “p-values are multiplied by the number of comparisons” (p.adjust help file) and is conservative.
# pairwise.t.test(outcome, predictor, paired=F, p.adjust.method='method') # from
# Field et al. 2012: 447)
pairwise.t.test(apple$Growth, apple$Type, paired = F, p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: apple$Growth and apple$Type
X1 X2 X3 X4
X2 1.00 - - -
X3 1.00 1.00 - -
X4 1.00 1.00 1.00 -
X5 1.00 0.37 1.00 1.00
P value adjustment method: bonferroni
The pairwise comparisons allow us to examine which groups within the data are significantly different from one another. In this case, we see that none of them are significantly different.
Tukey’s honest significant difference test can tell us whether or not the observed differences from the ANOVA are statistically different.
applemodel <- aov(Growth ~ Type, data = apple)
TukeyHSD(applemodel)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Growth ~ Type, data = apple)
$Type
diff lwr upr p adj
X2-X1 132.000 -598.1767 862.1767 0.9847805
X3-X1 -161.875 -892.0517 568.3017 0.9678464
X4-X1 -97.375 -827.5517 632.8017 0.9952020
X5-X1 -419.875 -1150.0517 310.3017 0.4749275
X3-X2 -293.875 -1024.0517 436.3017 0.7750989
X4-X2 -229.375 -959.5517 500.8017 0.8937492
X5-X2 -551.875 -1282.0517 178.3017 0.2136349
X4-X3 64.500 -665.6767 794.6767 0.9990363
X5-X3 -258.000 -988.1767 472.1767 0.8463381
X5-X4 -322.500 -1052.6767 407.6767 0.7109124
We can plot the 95% confidence levels. We see that there is a lot of overlap across the cultivars. There is likely no difference across the group means.
applemodel <- aov(Growth ~ Type, data = apple)
plot(TukeyHSD(applemodel))
# see
# https://bioinformatics-core-shared-training.github.io/linear-models-r/anova.html#section_2:_anova
# for additional description of these functions.
You can also use the glht()
function from the multcomp
package to conduct multiple types of post-hoc tests with different p-value corrections and confidence intervals.
Using what we’ve learned about running ANOVAs in R, analyze the excavation data from a site in New Mexico. Details about the data are included below.
In your analysis, complete the following steps:
Data: Excavation Depth and Archaeology “Four different excavation sites at an archeological area in New Mexico gave the following depths (cm) for significant archaeological discoveries.” X1 = depths at Site I X2 = depths at Site II X3 = depths at Site III X4 = depths at Site IV
Reference: Mimbres Mogollon Archaeology by Woosley and McIntyre, Univ. of New Mexico Press Link
Click for solution
excavation <- read.csv("https://maddiebrown.github.io/ANTH630/data/owan01.csv")
excavation
X1 X2 X3 X4
1 93 85 100 96
2 120 45 75 58
3 65 80 65 95
4 105 28 40 90
5 115 75 73 65
6 82 70 65 80
7 99 65 50 85
8 87 55 30 95
9 100 50 45 82
10 90 40 50 NA
11 78 NA 45 NA
12 95 NA 55 NA
13 93 NA NA NA
14 88 NA NA NA
15 110 NA NA NA
# add in a temporary rowID to help with pivoting
excavation$ID <- c(1:nrow(excavation))
# pivot the data from wide to long
excavation <- excavation %>% pivot_longer(!ID, names_to = "Site", values_to = "Depth")
# examine the descriptive stats
by(excavation$Depth, excavation$Site, stat.desc)
excavation$Site: X1
nbr.val nbr.null nbr.na min max range
15.0000000 0.0000000 0.0000000 65.0000000 120.0000000 55.0000000
sum median mean SE.mean CI.mean.0.95 var
1420.0000000 93.0000000 94.6666667 3.7118429 7.9611113 206.6666667
std.dev coef.var
14.3759058 0.1518582
------------------------------------------------------------
excavation$Site: X2
nbr.val nbr.null nbr.na min max range
10.0000000 0.0000000 5.0000000 28.0000000 85.0000000 57.0000000
sum median mean SE.mean CI.mean.0.95 var
593.0000000 60.0000000 59.3000000 5.9105367 13.3705630 349.3444444
std.dev coef.var
18.6907583 0.3151899
------------------------------------------------------------
excavation$Site: X3
nbr.val nbr.null nbr.na min max range
12.0000000 0.0000000 3.0000000 30.0000000 100.0000000 70.0000000
sum median mean SE.mean CI.mean.0.95 var
693.0000000 52.5000000 57.7500000 5.4760152 12.0526282 359.8409091
std.dev coef.var
18.9694731 0.3284757
------------------------------------------------------------
excavation$Site: X4
nbr.val nbr.null nbr.na min max range
9.0000000 0.0000000 6.0000000 58.0000000 96.0000000 38.0000000
sum median mean SE.mean CI.mean.0.95 var
746.0000000 85.0000000 82.8888889 4.5167726 10.4156963 183.6111111
std.dev coef.var
13.5503177 0.1634757
# plot the data by site
boxplot(Depth ~ Site, data = excavation)
# plot error bars. adapted from Field et al. 2012: 157
ggplot(excavation, aes(Site, Depth)) + stat_summary(fun.y = mean, geom = "point") +
stat_summary(fun.y = mean, geom = "line", aes(group = 1), colour = "purple",
linetype = 2) + stat_summary(fun.data = mean_cl_boot, geom = "errorbar",
width = 0.3)
# check for normality
leveneTest(excavation$Depth, excavation$Site)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 0.7978 0.5021
42
# create a model
excavationmodel <- aov(Depth ~ Site, data = excavation)
summary(excavationmodel)
Df Sum Sq Mean Sq F value Pr(>F)
Site 3 12397 4132 15.14 7.99e-07 ***
Residuals 42 11465 273
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
14 observations deleted due to missingness
# plot the model results, look at residuals
par(mfrow = c(2, 2))
plot(excavationmodel)
par(mfrow = c(1, 1))
# pairwise comparisons
pairwise.t.test(excavation$Depth, excavation$Site, paired = F, p.adjust.method = "bonferroni")
Pairwise comparisons using t tests with pooled SD
data: excavation$Depth and excavation$Site
X1 X2 X3
X2 2.9e-05 - -
X3 5.1e-06 1.0000 -
X4 0.5898 0.0203 0.0077
P value adjustment method: bonferroni
# tukey's hsd
TukeyHSD(excavationmodel)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Depth ~ Site, data = excavation)
$Site
diff lwr upr p adj
X2-X1 -35.36667 -53.409116 -17.324217 0.0000279
X3-X1 -36.91667 -54.033237 -19.800096 0.0000050
X4-X1 -11.77778 -30.411940 6.856384 0.3412430
X3-X2 -1.55000 -20.473081 17.373081 0.9962282
X4-X2 23.58889 3.282782 43.894996 0.0171237
X4-X3 25.13889 5.650816 44.626962 0.0067914
# can plot the 95% confidence levels.
plot(TukeyHSD(excavationmodel))
Data are often scattered across several different table. As we saw with relational databases, these tables can be joined based on common keys or IDs. There are several different join functions in tidyverse that we will cover today.
A left join simply matches the values from the column on the left side of the equation to the values in a column in the right side of the column. Let’s practice this using the livestock
and hhdata
datasets.
Load libraries and datasets.
library(tidyverse)
livestock <- read.csv("https://maddiebrown.github.io/ANTH630/data/livestock.csv")
hhdata <- read.csv("https://maddiebrown.github.io/ANTH630/data/hhdata.csv")
Use left_join()
to join the tables. In a left_join, all the columns from both datasets are included as are all the rows from the table on the left.
left_join(hhdata, livestock, by = c(HH = "HH"))
HH Location Members HasVehicle Cow Sheep Pig
1 A College Park 3 Y 3 2 2
2 B College Park 4 N 1 4 2
3 C College Park 3 Y 2 3 1
4 D University Park 4 Y 2 3 3
5 E University Park 6 Y 2 2 1
6 F University Park 5 N 1 1 1
What happens if there are not matches for every value of the table on the lefthand side?
livestock_sub <- livestock[1:4, ]
left_join(hhdata, livestock_sub, by = c(HH = "HH"))
HH Location Members HasVehicle Cow Sheep Pig
1 A College Park 3 Y 3 2 2
2 B College Park 4 N 1 4 2
3 C College Park 3 Y 2 3 1
4 D University Park 4 Y 2 3 3
5 E University Park 6 Y NA NA NA
6 F University Park 5 N NA NA NA
Any rows without a match are simply left blank.
In a right join, we keep all columns from both tables and all the rows from the table on the right.
right_join(hhdata, livestock_sub, by = c(HH = "HH"))
HH Location Members HasVehicle Cow Sheep Pig
1 A College Park 3 Y 3 2 2
2 B College Park 4 N 1 4 2
3 C College Park 3 Y 2 3 1
4 D University Park 4 Y 2 3 3
In an inner join, we are transforming which rows remain in the final dataset. All columns are returned from both datasets, but only the rows in the table on the left (x) that match rows in the table on the righr (y) are retained.
inner_join(hhdata, livestock_sub, by = c(HH = "HH"))
HH Location Members HasVehicle Cow Sheep Pig
1 A College Park 3 Y 3 2 2
2 B College Park 4 N 1 4 2
3 C College Park 3 Y 2 3 1
4 D University Park 4 Y 2 3 3
You’ll notice only the four rows from the hhdata table that match the livestock_sub data are kept.
In a full join, all rows and columns from both tables are kept. Any missing values are converted to NA.
full_join(hhdata, livestock_sub, by = c(HH = "HH"))
HH Location Members HasVehicle Cow Sheep Pig
1 A College Park 3 Y 3 2 2
2 B College Park 4 N 1 4 2
3 C College Park 3 Y 2 3 1
4 D University Park 4 Y 2 3 3
5 E University Park 6 Y NA NA NA
6 F University Park 5 N NA NA NA
Unlike the previous join functions, semi_join()
and anti_join()
are considered filtering rather than mutating joins. These joins do not combine the data columns from multiple tables but rather, filter out the rows in one table based on the values in another table.
semi_join(hhdata, livestock_sub, by = c(HH = "HH"))
HH Location Members HasVehicle
1 A College Park 3 Y
2 B College Park 4 N
3 C College Park 3 Y
4 D University Park 4 Y
In the output above, we see that the hhdata
dataframe has been subset to include only the rows that are also found in the livestock_sub
dataframe. No new columns from the second table are added to the first.
anti_join()
works in the same way as semi_join()
except it only keeps the rows that are not found in the second table. We will use anti_join()
when we learn about text mining in a couple of weeks.
anti_join(hhdata, livestock_sub, by = c(HH = "HH"))
HH Location Members HasVehicle
1 E University Park 6 Y
2 F University Park 5 N
Now only the households for which there are no observations in the livestock_sub
dataframe remain.