T-tests

T-tests allow us to test whether or not two sample means are significantly different. R has several tools for conducting t-tests. We will cover some today using example code and data from Drennan and the R Companion for Drennan.

The first comparison we will make is between the floor areas of houses from the Formative and Classic periods. Load in the data and create a separate vector for each time period.

library(tidyverse)
HouseFloors <- read.csv("https://maddiebrown.github.io/ANTH630/data/drennan_datasets/HouseFloors.csv")
Formative <- HouseFloors %>% filter(Period == "Formative") %>% pull(Area)
Classic <- HouseFloors %>% filter(Period == "Classic") %>% pull(Area)  # pull() creates a vector, while select() creates a dataframe

Comparing sample means

A quick way to compare sample means is by creating a boxplot, which displays the range of values in both samples.

boxplot(list(Formative = Formative, Classic = Classic))

We can also calculate descriptive stats for each group and run normality tests using stat.desc() from the pastecs package.

library(pastecs)
by(HouseFloors$Area, HouseFloors$Period, stat.desc, basic = F, norm = T)  # norm = T runs Shapiro-Wilk normality tests; non-significant results mean the data are consistent with a normal distribution
HouseFloors$Period: Classic
      median         mean      SE.mean CI.mean.0.95          var      std.dev 
  26.3000000   26.2500000    0.6252752    1.2552924   20.3303922    4.5089236 
    coef.var     skewness     skew.2SE     kurtosis     kurt.2SE   normtest.W 
   0.1717685   -0.2442232   -0.3695718   -0.5570531   -0.4284409    0.9845577 
  normtest.p 
   0.7315843 
------------------------------------------------------------ 
HouseFloors$Period: Formative
      median         mean      SE.mean CI.mean.0.95          var      std.dev 
  24.3000000   23.7750000    0.5978119    1.2192453   11.4361290    3.3817346 
    coef.var     skewness     skew.2SE     kurtosis     kurt.2SE   normtest.W 
   0.1422391   -0.3110996   -0.3753096   -0.3901647   -0.2410295    0.9816014 
  normtest.p 
   0.8445857 

Because the normality tests are not significant, we can proceed with conducting t-tests to compare the means.
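
If you prefer, you can also run the Shapiro-Wilk test directly on each vector with base R's shapiro.test(); this repeats the check reported in the normtest.W and normtest.p rows above.

shapiro.test(Formative)
shapiro.test(Classic)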

Bullet graphs

Carlson's companion to Drennan includes some useful code for creating descriptive statistics tables and bullet graphs, which R does not provide out of the box. We will define his functions with the code below.

# Dstats(): descriptive statistics table (n, median, mean, IQR, sd, and standard error)
Dstats <- function(x) {
    n <- length(x)
    Md <- median(x)
    Xbar <- mean(x)
    IQR <- IQR(x, type = 2)
    s <- sd(x)
    SE <- sd(x)/sqrt(length(x))
    return(c(n = n, Md = Md, Xbar = Xbar, IQR = IQR, s = s, SE = SE))
}
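
Dstats() is not called again in this section, but as a quick check you could apply it to each period and bind the results into a small table (a usage sketch, not from the companion):

round(rbind(Formative = Dstats(Formative), Classic = Dstats(Classic)), 2)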


# BPlot(): bullet graph; x is a named list of numeric vectors
BPlot <- function(x, mwid = 0.04, ...) {
    val <- max(sapply(x, function(s) sd(s)/sqrt(length(s)))) * 3
    yr <- range(sapply(x, mean)) + c(-val, val)
    plot(NA, xlim = c(0.5, length(x) + 0.5), ylim = yr, xaxt = "n", xlab = "", las = 1, 
        ...)
    axis(1, seq_along(x), names(x))
    for (i in seq_along(x)) {
        # t quantiles for the 99%, 95%, and 80% central confidence intervals
        tval <- qt(c(0.005, 0.025, 0.1, 0.9, 0.975, 0.995), length(x[[i]]) - 1)
        mn <- mean(x[[i]])
        se <- sd(x[[i]])/sqrt(length(x[[i]]))
        y <- mn + se * tval
        segments(i, y[1], i, y[6], lwd = 1, lend = "square")   # 99% interval (thin line)
        segments(i, y[2], i, y[5], lwd = 5, lend = "square")   # 95% interval
        segments(i, y[3], i, y[4], lwd = 9, lend = "square")   # 80% interval (thick bar)
        segments(i - mwid, mn, i + mwid, mn, lend = "square")  # mean marker
    }
}

Now we can use this function to create a bullet plot. Each segment width denotes a different confidence level: the thickest, shortest segment marks the 80% confidence interval for the mean, the medium segment the 95% interval, and the thin line the 99% interval. These plots show less overlap between the two periods' confidence intervals for the mean than the boxplots of the raw distributions did.

Areas <- list(Formative = Formative, Classic = Classic)
BPlot(Areas, mwid = 0.06, ylab = bquote(m^2))

You can also manually calculate the confidence ranges at different confidence levels. Carlson’s code for doing so can be found below:

n <- length(Formative)
mn <- mean(Formative)
tval <- qt(0.9, n - 1)
ER <- sd(Formative)/sqrt(n) * tval
cat("80%-Confidence level =", round(mn, 1), "±", round(ER, 2), "m2\n")
80%-Confidence level = 23.8 ± 0.78 m2
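
Note that qt(0.9, n - 1) yields the 80% central interval because 10% of the distribution sits in each tail. For a 95% interval you would use 0.975; a sketch following the same pattern:

tval95 <- qt(0.975, n - 1)
ER95 <- sd(Formative)/sqrt(n) * tval95
cat("95%-Confidence level =", round(mn, 1), "±", round(ER95, 2), "m2\n")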

t-tests

Run a t-test to compare whether the mean floor area is significantly different between the Formative and Classic periods. This can be done using the two vectors we created previously. The 95% confidence interval tells us that the true difference between the means likely lies between -4.20 and -0.75. Because this interval does not include zero, it is unlikely that the true difference between the population means is zero, suggesting the two samples come from populations with different mean floor areas.

t.test(Formative, Classic)

    Welch Two Sample t-test

data:  Formative and Classic
t = -2.861, df = 78.686, p-value = 0.005406
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.1969878 -0.7530122
sample estimates:
mean of x mean of y 
   23.775    26.250 
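
t.test() returns an "htest" object, so the pieces of the output can also be pulled out programmatically. A usage sketch:

res <- t.test(Formative, Classic)
res$p.value   # 0.005406
res$conf.int  # -4.1969878 -0.7530122
res$estimate  # means of x (Formative) and y (Classic)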

You can also run a t-test using formula notation. Note that the sign of t flips relative to the previous call because the groups are now taken in alphabetical order (Classic minus Formative); the conclusion is identical.

t.test(Area ~ Period, data = HouseFloors)

    Welch Two Sample t-test

data:  Area by Period
t = 2.861, df = 78.686, p-value = 0.005406
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.7530122 4.1969878
sample estimates:
  mean in group Classic mean in group Formative 
                 26.250                  23.775 

Checking variance and t-test assumptions

The version of the t-test we ran above (Welch's) is the more conservative version, in which the variances of the two samples are not assumed to be equal. It adjusts the degrees of freedom downward, which generally makes it somewhat harder to reach significance than the equal-variance version.

There is a function var.test() in R that allows us to compare variances across samples. See Carlson for additional details and explanation of the results.

var.test(Formative, Classic)

    F test to compare two variances

data:  Formative and Classic
F = 0.56251, num df = 31, denom df = 51, p-value = 0.08948
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.3038643 1.0952839
sample estimates:
ratio of variances 
         0.5625139 

The output of this F test comparing variances is not significant at the \(p<0.05\) level, so there is no significant difference in variances between the two samples. This suggests that the Welch adjustment we used above was not strictly necessary. We can re-run the t-test, this time adding the argument var.equal = T, which runs the t-test under the assumption that the variances are equal.

t.test(Area ~ Period, data = HouseFloors, var.equal = T)

    Two Sample t-test

data:  Area by Period
t = 2.6742, df = 82, p-value = 0.009038
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6338816 4.3161184
sample estimates:
  mean in group Classic mean in group Formative 
                 26.250                  23.775 

The conclusion from this t-test is the same, although the test statistic and p-value shift slightly. In some cases the choice can make a difference, so pay attention to the variances when deciding which version of the t-test to use.

ANOVA in R

Suppose we want to compare the means of more than two groups to see whether any of them differ. This is where ANOVA comes in handy. Comparing several sample means usually starts with examining the basic descriptive statistics and plotting the frequency distributions of each group.


apple <- read.csv("https://maddiebrown.github.io/ANTH630/data/owan02.csv")

# look at the structure of the data
str(apple)
'data.frame':   8 obs. of  5 variables:
 $ X1: int  2569 2928 2865 3844 3027 2336 3211 3037
 $ X2: int  2074 2885 3378 3906 2782 3018 3383 3447
 $ X3: int  2505 2315 2667 2390 3021 3085 3308 3231
 $ X4: int  2838 2351 3001 2439 2199 3318 3601 3291
 $ X5: int  1532 2552 3083 2330 2079 3366 2416 3100
# add in a temporary rowID to help with pivoting
apple$ID <- c(1:nrow(apple))
# pivot the data from wide to long
apple <- apple %>% pivot_longer(!ID, names_to = "Type", values_to = "Growth")
library(pastecs)
by(apple$Growth, apple$Type, stat.desc)
apple$Type: X1
     nbr.val     nbr.null       nbr.na          min          max        range 
8.000000e+00 0.000000e+00 0.000000e+00 2.336000e+03 3.844000e+03 1.508000e+03 
         sum       median         mean      SE.mean CI.mean.0.95          var 
2.381700e+04 2.977500e+03 2.977125e+03 1.583867e+02 3.745250e+02 2.006907e+05 
     std.dev     coef.var 
4.479852e+02 1.504758e-01 
------------------------------------------------------------ 
apple$Type: X2
     nbr.val     nbr.null       nbr.na          min          max        range 
8.000000e+00 0.000000e+00 0.000000e+00 2.074000e+03 3.906000e+03 1.832000e+03 
         sum       median         mean      SE.mean CI.mean.0.95          var 
2.487300e+04 3.198000e+03 3.109125e+03 1.951877e+02 4.615455e+02 3.047858e+05 
     std.dev     coef.var 
5.520741e+02 1.775658e-01 
------------------------------------------------------------ 
apple$Type: X3
     nbr.val     nbr.null       nbr.na          min          max        range 
8.000000e+00 0.000000e+00 0.000000e+00 2.315000e+03 3.308000e+03 9.930000e+02 
         sum       median         mean      SE.mean CI.mean.0.95          var 
2.252200e+04 2.844000e+03 2.815250e+03 1.388775e+02 3.283931e+02 1.542956e+05 
     std.dev     coef.var 
3.928048e+02 1.395275e-01 
------------------------------------------------------------ 
apple$Type: X4
     nbr.val     nbr.null       nbr.na          min          max        range 
8.000000e+00 0.000000e+00 0.000000e+00 2.199000e+03 3.601000e+03 1.402000e+03 
         sum       median         mean      SE.mean CI.mean.0.95          var 
2.303800e+04 2.919500e+03 2.879750e+03 1.810236e+02 4.280527e+02 2.621562e+05 
     std.dev     coef.var 
5.120119e+02 1.777974e-01 
------------------------------------------------------------ 
apple$Type: X5
     nbr.val     nbr.null       nbr.na          min          max        range 
8.000000e+00 0.000000e+00 0.000000e+00 1.532000e+03 3.366000e+03 1.834000e+03 
         sum       median         mean      SE.mean CI.mean.0.95          var 
2.045800e+04 2.484000e+03 2.557250e+03 2.144990e+02 5.072095e+02 3.680785e+05 
     std.dev     coef.var 
6.066947e+02 2.372450e-01 

Next we can plot the distribution of observations for each of the cultivar types.

boxplot(Growth ~ Type, apple)

Plot the means with error bars denoting the confidence intervals for each mean.

# adapted from Field et al. 2012: 157; mean_cl_boot requires the Hmisc package
ggplot(apple, aes(Type, Growth)) + stat_summary(fun = mean, geom = "point") + stat_summary(fun = mean, 
    geom = "line", aes(group = 1), colour = "purple", linetype = 2) + stat_summary(fun.data = mean_cl_boot, 
    geom = "errorbar", width = 0.3)

Checking ANOVA assumptions

ANOVA assumes that the variances in the groups are homogeneous and that the data are normally distributed. The homogeneity-of-variance assumption can be checked with leveneTest() from the car package. If the result is non-significant, we assume the variances are homogeneous and can proceed with the ANOVA; normality can be checked separately, for example with the Shapiro-Wilk tests we ran earlier.

# leveneTest(outcome variable, group, center=median/mean) # from Field et al
# 2012: 438
library(car)
leveneTest(apple$Growth, apple$Type)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  4  0.4284 0.7871
      35               
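
Base R's bartlett.test() offers an alternative homogeneity-of-variance check; a sketch (note that Bartlett's test is more sensitive to non-normality than Levene's):

bartlett.test(Growth ~ Type, data = apple)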

Running an ANOVA

ANOVA models are created with the basic formula outlined below and viewed using summary().

# model <- aov(outcome ~ predictor, data = dataframe)
# summary(model)

Let’s run an ANOVA on a real data set.

applemodel <- aov(Growth ~ Type, data = apple)
summary(applemodel)

The output from the ANOVA model partitions the variation in growth into the part explained by cultivar type and the part attributed to random variation (the residuals) (see Field et al. 2012: 439). In this case, the p-value of 0.284 means there is a 28.4% chance of getting an F-value this large by chance alone, so the type of cultivar did not have a significant effect on growth.

We can plot the model results. This creates four plots, the first two of which Field et al. 2012 identify as the most useful for evaluating ANOVA models. In the first (residuals versus fitted values), the points should form tight vertical lines for each group; if instead they spread out or funnel, this suggests heteroscedasticity. In the second, the Q-Q plot, the residuals should follow the line if they are normally distributed.

par(mfrow = c(2, 2))
applemodel <- aov(Growth ~ Type, data = apple)
plot(applemodel)

One-way ANOVA

Sometimes the data will not have homogeneous variances. In this case, you can adjust the ANOVA for the heterogeneity in group variances using Welch's F (Field et al. 2012: 441). oneway.test() runs this adjusted one-way ANOVA and also produces an F-ratio.

# oneway.test(outcome ~ predictor, data = dataframe)
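
Applied to the apple data, a usage sketch (by default oneway.test() does not assume equal variances):

oneway.test(Growth ~ Type, data = apple)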

Post-hoc tests

An ANOVA can only tell us whether there is a significant difference somewhere among the means of several groups; it cannot tell us which particular groups differ from one another. We can use several different types of post-hoc tests to answer that question.

A pairwise t-test conducts a t-test for each pair of groups in the data; the basic form is shown below. Because we are running many tests at once, the p-values must be adjusted, and several adjustment methods are available (see the help file for details). One is Bonferroni, in which “p-values are multiplied by the number of comparisons” (p.adjust help file); this is a conservative correction.

# pairwise.t.test(outcome, predictor, paired=F, p.adjust.method='method') # from
# Field et al. 2012: 447)
pairwise.t.test(apple$Growth, apple$Type, paired = F, p.adjust.method = "bonferroni")

    Pairwise comparisons using t tests with pooled SD 

data:  apple$Growth and apple$Type 

   X1   X2   X3   X4  
X2 1.00 -    -    -   
X3 1.00 1.00 -    -   
X4 1.00 1.00 1.00 -   
X5 1.00 0.37 1.00 1.00

P value adjustment method: bonferroni 

The pairwise comparisons allow us to examine which groups within the data are significantly different from one another. In this case, we see that none of them are significantly different.
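
For comparison, a less conservative correction such as Benjamini-Hochberg ("BH") can be swapped in; a sketch (with these data the conclusion does not change):

pairwise.t.test(apple$Growth, apple$Type, paired = F, p.adjust.method = "BH")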

Tukey’s HSD

Tukey’s honest significant difference test tells us which of the pairwise differences between group means observed in the ANOVA are statistically significant, along with a confidence interval for each difference.

applemodel <- aov(Growth ~ Type, data = apple)
TukeyHSD(applemodel)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Growth ~ Type, data = apple)

$Type
          diff        lwr      upr     p adj
X2-X1  132.000  -598.1767 862.1767 0.9847805
X3-X1 -161.875  -892.0517 568.3017 0.9678464
X4-X1  -97.375  -827.5517 632.8017 0.9952020
X5-X1 -419.875 -1150.0517 310.3017 0.4749275
X3-X2 -293.875 -1024.0517 436.3017 0.7750989
X4-X2 -229.375  -959.5517 500.8017 0.8937492
X5-X2 -551.875 -1282.0517 178.3017 0.2136349
X4-X3   64.500  -665.6767 794.6767 0.9990363
X5-X3 -258.000  -988.1767 472.1767 0.8463381
X5-X4 -322.500 -1052.6767 407.6767 0.7109124

We can plot the 95% confidence intervals. There is a lot of overlap across the cultivars and every interval crosses zero, so there is likely no difference among the group means.

applemodel <- aov(Growth ~ Type, data = apple)
plot(TukeyHSD(applemodel))

# see
# https://bioinformatics-core-shared-training.github.io/linear-models-r/anova.html#section_2:_anova
# for additional description of these functions.

You can also use the glht() function from the multcomp package to conduct multiple types of post-hoc tests with different p-value corrections and confidence intervals.
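
A minimal sketch of what that might look like, assuming the multcomp package is installed:

library(multcomp)
applePostHoc <- glht(applemodel, linfct = mcp(Type = "Tukey"))  # Tukey contrasts
summary(applePostHoc)  # adjusted p-values for each pairwise comparison
confint(applePostHoc)  # simultaneous confidence intervals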

Try it

Using what we’ve learned about running ANOVAs in R, analyze the excavation data from a site in New Mexico. Details about the data are included below.

In your analysis, complete the following steps:

  1. Plot data distributions and compare means descriptively
  2. Test for homogeneity of variance
  3. Create an ANOVA model and summarize output
  4. Plot and examine residuals
  5. Conduct and plot post-hoc tests to determine which sites are different from one another.
  6. What insights does the ANOVA tell you about these sites?

Data: Excavation Depth and Archaeology. “Four different excavation sites at an archeological area in New Mexico gave the following depths (cm) for significant archaeological discoveries.”

  X1 = depths at Site I
  X2 = depths at Site II
  X3 = depths at Site III
  X4 = depths at Site IV

Reference: Mimbres Mogollon Archaeology by Woosley and McIntyre, Univ. of New Mexico Press

Solution

excavation <- read.csv("https://maddiebrown.github.io/ANTH630/data/owan01.csv")
excavation
    X1 X2  X3 X4
1   93 85 100 96
2  120 45  75 58
3   65 80  65 95
4  105 28  40 90
5  115 75  73 65
6   82 70  65 80
7   99 65  50 85
8   87 55  30 95
9  100 50  45 82
10  90 40  50 NA
11  78 NA  45 NA
12  95 NA  55 NA
13  93 NA  NA NA
14  88 NA  NA NA
15 110 NA  NA NA
# add in a temporary rowID to help with pivoting
excavation$ID <- c(1:nrow(excavation))
# pivot the data from wide to long
excavation <- excavation %>% pivot_longer(!ID, names_to = "Site", values_to = "Depth")
# examine the descriptive stats
by(excavation$Depth, excavation$Site, stat.desc)
excavation$Site: X1
     nbr.val     nbr.null       nbr.na          min          max        range 
  15.0000000    0.0000000    0.0000000   65.0000000  120.0000000   55.0000000 
         sum       median         mean      SE.mean CI.mean.0.95          var 
1420.0000000   93.0000000   94.6666667    3.7118429    7.9611113  206.6666667 
     std.dev     coef.var 
  14.3759058    0.1518582 
------------------------------------------------------------ 
excavation$Site: X2
     nbr.val     nbr.null       nbr.na          min          max        range 
  10.0000000    0.0000000    5.0000000   28.0000000   85.0000000   57.0000000 
         sum       median         mean      SE.mean CI.mean.0.95          var 
 593.0000000   60.0000000   59.3000000    5.9105367   13.3705630  349.3444444 
     std.dev     coef.var 
  18.6907583    0.3151899 
------------------------------------------------------------ 
excavation$Site: X3
     nbr.val     nbr.null       nbr.na          min          max        range 
  12.0000000    0.0000000    3.0000000   30.0000000  100.0000000   70.0000000 
         sum       median         mean      SE.mean CI.mean.0.95          var 
 693.0000000   52.5000000   57.7500000    5.4760152   12.0526282  359.8409091 
     std.dev     coef.var 
  18.9694731    0.3284757 
------------------------------------------------------------ 
excavation$Site: X4
     nbr.val     nbr.null       nbr.na          min          max        range 
   9.0000000    0.0000000    6.0000000   58.0000000   96.0000000   38.0000000 
         sum       median         mean      SE.mean CI.mean.0.95          var 
 746.0000000   85.0000000   82.8888889    4.5167726   10.4156963  183.6111111 
     std.dev     coef.var 
  13.5503177    0.1634757 
# plot the data by site
boxplot(Depth ~ Site, data = excavation)

# plot error bars. adapted from Field et al. 2012: 157
ggplot(excavation, aes(Site, Depth)) + stat_summary(fun = mean, geom = "point") + 
    stat_summary(fun = mean, geom = "line", aes(group = 1), colour = "purple", 
        linetype = 2) + stat_summary(fun.data = mean_cl_boot, geom = "errorbar", 
    width = 0.3)

# check for homogeneity of variance
leveneTest(excavation$Depth, excavation$Site)
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  3  0.7978 0.5021
      42               
# create a model
excavationmodel <- aov(Depth ~ Site, data = excavation)
summary(excavationmodel)
            Df Sum Sq Mean Sq F value   Pr(>F)    
Site         3  12397    4132   15.14 7.99e-07 ***
Residuals   42  11465     273                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
14 observations deleted due to missingness
# plot the model results, look at residuals
par(mfrow = c(2, 2))
plot(excavationmodel)

par(mfrow = c(1, 1))
# pairwise comparisons
pairwise.t.test(excavation$Depth, excavation$Site, paired = F, p.adjust.method = "bonferroni")

    Pairwise comparisons using t tests with pooled SD 

data:  excavation$Depth and excavation$Site 

   X1      X2     X3    
X2 2.9e-05 -      -     
X3 5.1e-06 1.0000 -     
X4 0.5898  0.0203 0.0077

P value adjustment method: bonferroni 
# tukey's hsd
TukeyHSD(excavationmodel)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = Depth ~ Site, data = excavation)

$Site
           diff        lwr        upr     p adj
X2-X1 -35.36667 -53.409116 -17.324217 0.0000279
X3-X1 -36.91667 -54.033237 -19.800096 0.0000050
X4-X1 -11.77778 -30.411940   6.856384 0.3412430
X3-X2  -1.55000 -20.473081  17.373081 0.9962282
X4-X2  23.58889   3.282782  43.894996 0.0171237
X4-X3  25.13889   5.650816  44.626962 0.0067914
# plot the 95% confidence intervals
plot(TukeyHSD(excavationmodel))

Joining data tables

Data are often scattered across several different tables. As we saw with relational databases, these tables can be joined based on common keys or IDs. There are several different join functions in the tidyverse that we will cover today.

left_join()

A left join matches rows in the table on the right to rows in the table on the left based on a shared key column. Let’s practice this using the livestock and hhdata datasets.

Load libraries and datasets.

library(tidyverse)
livestock <- read.csv("https://maddiebrown.github.io/ANTH630/data/livestock.csv")
hhdata <- read.csv("https://maddiebrown.github.io/ANTH630/data/hhdata.csv")

Use left_join() to join the tables. In a left join, all the columns from both datasets are included, as are all the rows from the table on the left.

left_join(hhdata, livestock, by = c(HH = "HH"))
  HH        Location Members HasVehicle Cow Sheep Pig
1  A    College Park       3          Y   3     2   2
2  B    College Park       4          N   1     4   2
3  C    College Park       3          Y   2     3   1
4  D University Park       4          Y   2     3   3
5  E University Park       6          Y   2     2   1
6  F University Park       5          N   1     1   1

What happens if there is not a match for every row of the table on the lefthand side?

livestock_sub <- livestock[1:4, ]
left_join(hhdata, livestock_sub, by = c(HH = "HH"))
  HH        Location Members HasVehicle Cow Sheep Pig
1  A    College Park       3          Y   3     2   2
2  B    College Park       4          N   1     4   2
3  C    College Park       3          Y   2     3   1
4  D University Park       4          Y   2     3   3
5  E University Park       6          Y  NA    NA  NA
6  F University Park       5          N  NA    NA  NA

Any rows without a match are filled with NA.
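
The by = c(HH = "HH") syntax is redundant here, since the key has the same name in both tables (by = "HH" would suffice), but it matters when the names differ. A hypothetical sketch, renaming the key column in livestock to HouseholdID:

livestock2 <- rename(livestock, HouseholdID = HH)
left_join(hhdata, livestock2, by = c(HH = "HouseholdID"))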

right_join()

In a right join, we keep all columns from both tables and all the rows from the table on the right.

right_join(hhdata, livestock_sub, by = c(HH = "HH"))
  HH        Location Members HasVehicle Cow Sheep Pig
1  A    College Park       3          Y   3     2   2
2  B    College Park       4          N   1     4   2
3  C    College Park       3          Y   2     3   1
4  D University Park       4          Y   2     3   3

inner_join()

In an inner join, we are changing which rows remain in the final dataset. All columns are returned from both datasets, but only the rows in the table on the left (x) that match rows in the table on the right (y) are retained.

inner_join(hhdata, livestock_sub, by = c(HH = "HH"))
  HH        Location Members HasVehicle Cow Sheep Pig
1  A    College Park       3          Y   3     2   2
2  B    College Park       4          N   1     4   2
3  C    College Park       3          Y   2     3   1
4  D University Park       4          Y   2     3   3

You’ll notice only the four rows from the hhdata table that match the livestock_sub data are kept.

full_join()

In a full join, all rows and columns from both tables are kept. Any missing values are converted to NA.

full_join(hhdata, livestock_sub, by = c(HH = "HH"))
  HH        Location Members HasVehicle Cow Sheep Pig
1  A    College Park       3          Y   3     2   2
2  B    College Park       4          N   1     4   2
3  C    College Park       3          Y   2     3   1
4  D University Park       4          Y   2     3   3
5  E University Park       6          Y  NA    NA  NA
6  F University Park       5          N  NA    NA  NA

semi_join()

Unlike the previous join functions, semi_join() and anti_join() are considered filtering rather than mutating joins. These joins do not combine the data columns from multiple tables; rather, they filter the rows of one table based on the presence or absence of matches in another table.

semi_join(hhdata, livestock_sub, by = c(HH = "HH"))
  HH        Location Members HasVehicle
1  A    College Park       3          Y
2  B    College Park       4          N
3  C    College Park       3          Y
4  D University Park       4          Y

In the output above, we see that the hhdata dataframe has been subset to include only the rows that are also found in the livestock_sub dataframe. No new columns from the second table are added to the first.

anti_join()

anti_join() works in the same way as semi_join(), except it keeps only the rows that are not found in the second table. We will use anti_join() when we learn about text mining in a couple of weeks.

anti_join(hhdata, livestock_sub, by = c(HH = "HH"))
  HH        Location Members HasVehicle
1  E University Park       6          Y
2  F University Park       5          N

Now only the households for which there are no observations in the livestock_sub dataframe remain.