Useful links

T-tests

T-tests allow us to test whether or not two sample means are significantly different. R has several tools for conducting t-tests. We will cover some today using example code and data from Drennan and the R Companion for Drennan.

The first comparisons we will make are between the floor area of houses from both the formative and classic periods. Load in the data and create separate vectors for each time period.

HouseFloors <- read.csv("https://maddiebrown.github.io/ANTH630/data/drennan_datasets/HouseFloors.csv")
# 
Formative <- HouseFloors %>% filter(HouseFloors$Period == "Formative") %>% pull(Area)
Classic <- HouseFloors %>% filter(HouseFloors$Period == "Classic") %>% pull(Area)  #pull creates a vector, while select creates a dataframe

Comparing sample means

A quick way to compare sample means is through creating a boxplot. This displays the range of value in both samples.

boxplot(list(Formative = Formative, Classic = Classic))

We can also calculate descriptive stats for each group and run normality tests using stat.desc() from the pastecs package.

library(pastecs)
by(HouseFloors$Area, HouseFloors$Period, stat.desc, basic = F, norm = T)  # this last argument runs normality tests and if the results are non-significant, this means the errors in the model are likely normally distributed
HouseFloors$Period: Classic
      median         mean      SE.mean CI.mean.0.95          var      std.dev 
  26.3000000   26.2500000    0.6252752    1.2552924   20.3303922    4.5089236 
    coef.var     skewness     skew.2SE     kurtosis     kurt.2SE   normtest.W 
   0.1717685   -0.2442232   -0.3695718   -0.5570531   -0.4284409    0.9845577 
  normtest.p 
   0.7315843 
------------------------------------------------------------ 
HouseFloors$Period: Formative
      median         mean      SE.mean CI.mean.0.95          var      std.dev 
  24.3000000   23.7750000    0.5978119    1.2192453   11.4361290    3.3817346 
    coef.var     skewness     skew.2SE     kurtosis     kurt.2SE   normtest.W 
   0.1422391   -0.3110996   -0.3753096   -0.3901647   -0.2410295    0.9816014 
  normtest.p 
   0.8445857

Because the normality tests are not significant, we can proceed with conducting t-tests to compare the means.

Bullet graphs

Carlson has created some useful code in the Drennan companion for creating descriptive statistics tables and bullet graphs, which are not easily created in R. We will define his functions with the code below.

Dstats <- function(x) {
    n <- length(x)
    Md <- median(x)
    Xbar <- mean(x)
    IQR <- IQR(x, type = 2)
    s <- sd(x)
    SE <- sd(x)/sqrt(length(x))
    return(c(n = n, Md = Md, Xbar = Xbar, IQR = IQR, s = s, SE = SE))
}


BPlot <- function(x, mwid = 0.04, ...) {
    val <- max(sapply(x, function(s) sd(s)/sqrt(length(s)))) * 3
    yr <- range(sapply(x, mean)) + c(-val, val)
    plot(NA, xlim = c(0.5, length(x) + 0.5), ylim = yr, xaxt = "n", xlab = "", las = 1, 
        ...)
    axis(1, seq_along(x), names(x))
    for (i in seq_along(x)) {
        tval <- qt(c(0.005, 0.025, 0.1, 0.9, 0.975, 0.995), length(x[[i]]) - 1)
        mn <- mean(x[[i]])
        se <- sd(x[[i]])/sqrt(length(x[[i]]))
        y <- mn + se * tval
        segments(i, y[1], i, y[6], lwd = 1, lend = "square")
        segments(i, y[2], i, y[5], lwd = 5, lend = "square")
        segments(i, y[3], i, y[4], lwd = 9, lend = "square")
        segments(i - mwid, mn, i + mwid, mn, lend = "square")
    }
}

Now we can use this function to create a bullet plot. Each level on the plot denotes a different confidence level. The smallest level is 90%, then 95% and then 99%. From these plots, we can see that there is less overlap in the confidence intervals of the two distributions compared to what was seen in distributions of the boxplot alone.

Areas <- list(Formative = Formative, Classic = Classic)
BPlot(Areas, mwid = 0.06, ylab = bquote(x^2))

You can also manually calculate the confidence ranges at different confidence levels. Carlson’s code for doing so can be found below:

n <- length(Formative)
mn <- mean(Formative)
tval <- qt(0.9, n - 1)
ER <- sd(Formative)/sqrt(n) * tval
cat("80%-Confidence level =", round(mn, 1), "±", round(ER, 2), "m2\n")
80%-Confidence level = 23.8 ± 0.78 m2
# 80%-Confidence level = 23.8 ± 0.78 m2

t-tests

Run a t-test to compare whether or not the mean floor area is significantly different between the formative and classic periods. This can be completed using the two vectors we created previously. Looking at the 95% confidence interval, the true difference between the means is likely to be between -4.19 and -0.75 in 95% of samples. This means that it is unlikely that the true difference between the two sample means is zero and that they are from different populations.

t.test(Formative, Classic)

    Welch Two Sample t-test

data:  Formative and Classic
t = -2.861, df = 78.686, p-value = 0.005406
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.1969878 -0.7530122
sample estimates:
mean of x mean of y 
   23.775    26.250

You can also run a t-test using formula notation.

t.test(Area ~ Period, data = HouseFloors)

    Welch Two Sample t-test

data:  Area by Period
t = 2.861, df = 78.686, p-value = 0.005406
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.7530122 4.1969878
sample estimates:
  mean in group Classic mean in group Formative 
                 26.250                  23.775

Checking variance and t-test assumptions

The version of the t-test we ran above is the more conservative version of a t-test, where it is not assumed that the variances between samples are equal. This will generally create a lower threshold for significance than when the variances are equal.

There is a function var.test() in R that allows us to compare variances across samples. See Carlson for additional details and explanation of the results.

var.test(Formative, Classic)

    F test to compare two variances

data:  Formative and Classic
F = 0.56251, num df = 31, denom df = 51, p-value = 0.08948
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3038643 1.0952839
sample estimates:
ratio of variances 
         0.5625139

The output of this F test comparing variance is not significant at the \(p<0.5\) level, so there is no significant difference in variances between the two samples. This suggests that the adjusted t-test we used above was not necessary. We can re-run the t-test, this time adding in the argument var.equal=T, which adjusts the t-test by assuming that the variances are equal.

t.test(Area ~ Period, data = HouseFloors, var.equal = T)

    Two Sample t-test

data:  Area by Period
t = 2.6742, df = 82, p-value = 0.009038
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6338816 4.3161184
sample estimates:
  mean in group Classic mean in group Formative 
                 26.250                  23.775

The results from this t-test are the same. In some cases, this can make a difference, so pay attention to the variance in choosing which version of the t-test to use.

Advanced tidyverse

Transforming data formats

Data are often structured in long or wide formats. Using pivot_longer() and pivot_wider(). Let’s look at an example with some household livestock data.

First read in the data and load the tidyverse library.

library(tidyverse)
livestock <- read.csv("https://maddiebrown.github.io/ANTH630/data/livestock.csv")
livestock
  HH Cow Sheep Pig
1  A   3     2   2
2  B   1     4   2
3  C   2     3   1
4  D   2     3   3
5  E   2     2   1
6  F   1     1   1

These data are in a long format. The columns “Sheep” “Pig” and “Cow” are all types of animals and their values are measured in the same units (counts). Because of this, we can convert the data into a long format where the animal columns are collapsed into a single column “animals” and the values are put into a new “count” column.

livestock_long <- livestock %>% pivot_longer(!HH, names_to = "animal", values_to = "count")
livestock_long
# A tibble: 18 x 3
   HH    animal count
   <fct> <chr>  <int>
 1 A     Cow        3
 2 A     Sheep      2
 3 A     Pig        2
 4 B     Cow        1
 5 B     Sheep      4
 6 B     Pig        2
 7 C     Cow        2
 8 C     Sheep      3
 9 C     Pig        1
10 D     Cow        2
11 D     Sheep      3
12 D     Pig        3
13 E     Cow        2
14 E     Sheep      2
15 E     Pig        1
16 F     Cow        1
17 F     Sheep      1
18 F     Pig        1

Suppose we wanted to convert the data back into a wide format? We can use pivot_wider() to do just this.

livestock_long %>% pivot_wider(names_from = "animal", values_from = "count")
# A tibble: 6 x 4
  HH      Cow Sheep   Pig
  <fct> <int> <int> <int>
1 A         3     2     2
2 B         1     4     2
3 C         2     3     1
4 D         2     3     3
5 E         2     2     1
6 F         1     1     1

Transforming columns

Sometimes values in a table are not in a tidy format or you need to separate out one part of the value to use in a different analysis.

For this next part of the lesson, we will use real survey data on Native foods in Yukon River communities. Full citation and additional information: Philip A. Loring, Anne Beaudreau, and Cecile Tang. 2020. Alaska Native Service Survey of Native Foods, Yukon River communities, 1940s-1970s. Arctic Data Center. doi:10.18739/A2GX44V7K. Link

Let’s load the data and take a look at the Func.Grp column.

reindeer <- read.csv("https://maddiebrown.github.io/ANTH630/data/Loring_et_al_reindeer_records.csv", 
    stringsAsFactors = F)
head(reindeer$Func.Grp)
[1] "Freshwater Fish"   "Freshwater Fish"   "Freshwater Fish"  
[4] "Freshwater Fish"   "Large Land Mammal" "Large Land Mammal"

What if we wanted to separate out the two words in the functional groups. Perhaps we know that we are interested in fish, so making a new column that divides the type of fish and the even more general category of “fish” might be useful.

First, let’s subset out only the rows that involve fish. Then, we can separate the Func.Grp column in the new datatable.

fish <- reindeer %>% filter(str_detect(Func.Grp, "Fish"))
head(fish %>% separate(Func.Grp, c("Type", "Animal")))
  Record.ID      Community Region Year Decade Statehood N.dogs Food.type
1         1 Arctic Village  Upper 1960   1960      post     75     Trout
2         2 Arctic Village  Upper 1960   1960      post     75 Whitefish
3         3 Arctic Village  Upper 1960   1960      post     75      Pike
4         4 Arctic Village  Upper 1960   1960      post     75  Grayling
5         7 Arctic Village  Upper 1961   1960      post     75     Trout
6         8 Arctic Village  Upper 1961   1960      post     75 Whitefish
  Common.name       Type Animal
1       Trout Freshwater   Fish
2   Whitefish Freshwater   Fish
3        Pike Freshwater   Fish
4    Grayling Freshwater   Fish
5       Trout Freshwater   Fish
6   Whitefish Freshwater   Fish

Try it

What if we wanted to unite two columns together into a new column? Perhaps we want to join the region and community. Try using unite() to achieve this outcome.

Click for solution

head(reindeer %>% unite(New, Community, Region, sep = ":"))
  Record.ID                  New Year Decade Statehood N.dogs Food.type
1         1 Arctic Village:Upper 1960   1960      post     75     Trout
2         2 Arctic Village:Upper 1960   1960      post     75 Whitefish
3         3 Arctic Village:Upper 1960   1960      post     75      Pike
4         4 Arctic Village:Upper 1960   1960      post     75  Grayling
5         5 Arctic Village:Upper 1960   1960      post     75   Caribou
6         6 Arctic Village:Upper 1960   1960      post     75     Moose
  Common.name          Func.Grp
1       Trout   Freshwater Fish
2   Whitefish   Freshwater Fish
3        Pike   Freshwater Fish
4    Grayling   Freshwater Fish
5     Caribou Large Land Mammal
6       Moose Large Land Mammal

Renaming and relocating columns

Renaming columns with tidyverse is accomplished using rename(). The code below renames the Func.Grp into FunctionalGroup.

reindeer <- rename(reindeer, FunctionalGroup = Func.Grp)

Try it

Now we have a new FunctionalGroup column but it is way at the back of the dataframe. Move the column to just after the Record.ID using relocate().

Click for solution

head(reindeer %>% relocate(FunctionalGroup, .after = Record.ID))
  Record.ID   FunctionalGroup      Community Region Year Decade Statehood
1         1   Freshwater Fish Arctic Village  Upper 1960   1960      post
2         2   Freshwater Fish Arctic Village  Upper 1960   1960      post
3         3   Freshwater Fish Arctic Village  Upper 1960   1960      post
4         4   Freshwater Fish Arctic Village  Upper 1960   1960      post
5         5 Large Land Mammal Arctic Village  Upper 1960   1960      post
6         6 Large Land Mammal Arctic Village  Upper 1960   1960      post
  N.dogs Food.type Common.name
1     75     Trout       Trout
2     75 Whitefish   Whitefish
3     75      Pike        Pike
4     75  Grayling    Grayling
5     75   Caribou     Caribou
6     75     Moose       Moose

ANTH630: t-tests and advanced tidyverse

Madeline Brown

Useful links

T-tests

Comparing sample means

Bullet graphs

t-tests

Checking variance and t-test assumptions

Advanced tidyverse

Transforming data formats

Transforming columns

Try it

Renaming and relocating columns

Try it