This lesson will introduce you to the R programming language, including how to conduct basic statistical analyses and create beautiful visualizations. We’ll focus on the basics of how to use R as a language and then showcase some of the powerful features of R that may be of interest to anthropologists and interdisciplinary researchers.

Installing RStudio and R

To install RStudio, click here and choose the Open Source License.

To install R, visit CRAN and choose the current version of R for your operating system.

Syntax in R

R works by typing out a function, putting something into the function, and then running the code to see the result. For example, if you want to make a simple plot, you use the function plot() with the data that you wish to plot listed inside the function’s parentheses. Here, let’s use a dataset built into R to quickly make a scatterplot.

plot(cars)

R code tries to be as human-readable as possible. For example, if we want the dots in our plot to be red, we can add an argument to our function that tells R to make the points red.


plot(cars, col="red")
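
Other arguments work the same way. As a minimal sketch, the call below also sets a title, axis labels, and the point symbol. main, xlab, ylab, and pch are standard arguments to plot(); the label text is only an illustrative choice.

```r
# cars is a built-in dataset with two columns: speed (mph) and dist (ft)
plot(cars,
     col = "red",                            # point colour
     pch = 19,                               # solid circles instead of open ones
     main = "Speed and stopping distance",   # plot title
     xlab = "Speed (mph)",                   # x-axis label
     ylab = "Stopping distance (ft)")        # y-axis label
```

Each argument is separated by a comma, and the order of named arguments does not matter.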

Objects and Logical Tests

While you can directly enter data each time you want to use it, R’s power comes from assigning data to named objects or variables. Objects are assigned using the <- symbol, which means “everything on the right is now referred to by the object name on the left.” We can refer to this symbol as an assignment operator or becomes.

Let’s make an object or variable called treeheight with the value 15. Then run the line of code. What is the output?

treeheight <- 15

The line above only creates the object, but doesn’t show us the result. To see what the object treeheight is equal to, you next have to call the object.

treeheight
[1] 15

You can also create variables using =, but this is generally discouraged because = is too easily confused with ==, which has a different meaning. A single = assigns the value on the right to the name on the left. A double equals sign instead asks R to test whether the value on the left is equal to the value on the right (an equality test). The output is a logical vector.

We will learn more about logical tests later, but for now, let’s look at these examples.

5==5  #note that a double equals sign checks for equivalency in R
[1] TRUE
5==6  # Comments in R start with a hash (#). R ignores everything after the # on that line, so comments are for your reference only.
[1] FALSE
#5=6  # Why doesn't this last line work?
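
The other comparison operators follow the same pattern and also return TRUE or FALSE. A short sketch, with numbers chosen only for illustration:

```r
5 != 6               # "not equal to": TRUE
5 < 6                # "less than": TRUE
5 >= 6               # "greater than or equal to": FALSE
is.logical(5 == 5)   # the result of a test is itself a logical value: TRUE
```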

Try-it

  1. Run a test to evaluate if treeheight is greater than 10
  2. Run the following: treeheight=="treeheight". What is the result?
  3. What is the result of adding 3 to treeheight? What about adding 3 to “treeheight”?
  4. Reassign treeheight to 20.
  5. What happens if you run bread?
Solution:
treeheight > 10
[1] TRUE
treeheight=="treeheight"
[1] FALSE
treeheight+3
[1] 18
#"treeheight"+3  # why doesn't this work? Note: I have these lines of code commented out to keep the document compiling properly
treeheight<-20
#bread

R is case sensitive

Something very important to keep in mind with R is that it is case sensitive, unlike some other languages. This is very important to know for keeping track of different variables and often a cause of many coding errors. For example, we can create three different objects referring to trees by changing the capitalization.

Tree <- "tree"
TREE <- "tree again"

#tree # why doesn't this work?

tree <- "a third tree"

Tree
[1] "tree"
TREE
[1] "tree again"
tree
[1] "a third tree"

Vectors

Above, we made a treeheight object that has one value in it. When working with real data, we often have multiple values that we want to analyze. In this case, we make a vector, or one dimensional data object, to list multiple values. For the treeheight object, we can use the c() function to add multiple tree heights.

treeheight <- c(15, 20, 12, 15, 18)
treeheight
[1] 15 20 12 15 18

Notice that the value of treeheight has been overwritten with the new vector.

Whenever you encounter a new function or want to look up how to use a function, you can refer to the help file. What does c() do?

?c

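Vectors can also be made using nested functions. For example, we can populate a vector of tree heights using the repeat function rep() and the sequence function seq(). The colon operator (:) is a shortcut for a sequence of whole numbers.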

treeheight <- c(10, rep(15, 3), 20, 20)
treeheight
[1] 10 15 15 15 20 20

treeheight <- c(10, seq(10,15), 20, 20)
treeheight
[1] 10 10 11 12 13 14 15 20 20

treeheight <- c(10, 10:15, 20, 20)
treeheight
[1] 10 10 11 12 13 14 15 20 20

You can select individual elements from a vector:

treeheight[2]
[1] 10
treeheight[c(2,3)]
[1] 10 11
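
Indexing accepts more than positive positions. The sketch below uses a separate vector called heights (a name invented here so that treeheight is left unchanged) to show negative and logical indexing.

```r
heights <- c(10, 11, 12, 13, 14)   # hypothetical example vector

heights[-1]             # drop the first element: 11 12 13 14
heights[2:4]            # elements 2 through 4: 11 12 13
heights[heights > 12]   # keep only values that pass a logical test: 13 14
length(heights)         # number of elements: 5
```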

Don’t forget that R is case sensitive. What happens when you enter Treeheight?

Treeheight

You can add new values to an existing vector:

treeheight <- c(treeheight, 50)
treeheight
 [1] 10 10 11 12 13 14 15 20 20 50

Let’s go back to our original treeheight object.

treeheight <- c(15, 20, 12, 15, 18)
treeheight
[1] 15 20 12 15 18

Now that we have a vector of tree heights, we can run some basic statistics on this dataset.

mean(treeheight)
[1] 16
median(treeheight)
[1] 15
max(treeheight)
[1] 20
range(treeheight)
[1] 12 20
summary(treeheight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     12      15      15      16      18      20 
str(treeheight)
 num [1:5] 15 20 12 15 18

Variable classes

In the output of str(), the last function above, notice that R tells us this object belongs to the numeric data class (num). We can also look at the data class with the class() function. In addition, R has built-in checks for different classes, such as is.numeric().

class(treeheight)
[1] "numeric"
is.numeric(treeheight)
[1] TRUE

Primary data types in R include: 1. numeric, 2. string or character, 3. logical, and 4. factor. Note that output data types do not always match the input data. String data are entered with "" or '' surrounding the characters.
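
Factors do not appear in the examples above. As a rough sketch, a factor stores categorical data as a fixed set of levels. The object name product is invented for this example.

```r
product <- factor(c("fruit", "nut", "fruit", "nut", "fruit"))

class(product)    # "factor"
levels(product)   # the distinct categories, sorted alphabetically: "fruit" "nut"
table(product)    # counts per level: fruit 3, nut 2
```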

A vector can hold string, logical, or numeric data, but only one class per vector. To illustrate, let’s reassign our treeheight object to a mix of value types. To enter a string value, we need to put "" around it.

treeheight <- c(10, 14, "twelve", 20)
treeheight
class(treeheight)

Note that R will automatically convert the numeric data into string data. This is called “coercion”.
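
Coercion can be seen directly, and partly reversed, with as.numeric(). In this sketch the object name mixed is invented so that treeheight is not overwritten again.

```r
mixed <- c(10, 14, "twelve", 20)

mixed          # every element is now a string: "10" "14" "twelve" "20"
class(mixed)   # "character"

# Converting back works for "10", "14" and "20", but "twelve" cannot be read
# as a number, so it becomes NA and R prints a warning.
as.numeric(mixed)
```

Checking class() after building a vector is a quick way to catch a stray text value in what should be numeric data.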

Try it

First let’s make a few different objects and then examine them with class(). Try to predict in advance which data class each object will be.

tree <- "a third tree"
x <- TRUE
y <- "5"
z <- 5

Dataframes

Dataframe basics

So far we have been working with vectors, or one-dimensional data. We can also work with dataframes, which are like tables or spreadsheets of multiple connected variables. While data can be stored in lists and matrices, the most common and flexible data format you will use in R is a dataframe. A dataframe can contain multiple classes of data, but only one class per column, because each column is a vector. Dataframes are usually organized with each row representing a single case or observation. Columns denote variables that apply across rows.

For example, let’s make a new trees dataframe that includes the heights, species, and products of different trees.

Try it

  1. Make a treeheight vector with the values: 15, 20, 12, 15, and 18. Keep this order. Hint: Use the c() function.

  2. Make a treetype vector with the values: apple, walnut, apple, hazelnut, and pear. Keep this order.

  3. Make a treeproduct vector with the values fruit and nut that matches the fruits and nuts in the same order as the treetype vector.

  4. Use the data.frame() function to make a new dataframe called trees that includes the three variable vectors you just created.

  5. Look at the structure of your new dataframe using the str() function.

Solution:

treeheight <- c(15, 20, 12, 15, 18)
treetype <- c("apple", "walnut", "apple", "hazelnut", "pear")
treeproduct <- c("fruit", "nut", "fruit", "nut", "fruit") 

trees <- data.frame(treeheight, treetype, treeproduct)
trees
str(trees)
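
Besides str(), a few other base R functions give a quick overview of a dataframe. The sketch below rebuilds trees so that it runs on its own.

```r
treeheight <- c(15, 20, 12, 15, 18)
treetype <- c("apple", "walnut", "apple", "hazelnut", "pear")
treeproduct <- c("fruit", "nut", "fruit", "nut", "fruit")
trees <- data.frame(treeheight, treetype, treeproduct)

dim(trees)       # number of rows and columns: 5 3
names(trees)     # column names: "treeheight" "treetype" "treeproduct"
head(trees, 2)   # the first two rows only
```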

Selecting and subsetting variables

Subsetting with $

In wide format, each row in a dataframe is a case, while the columns are variables that are measures for each case. To select a variable in a dataframe, you use the $ operator.

Call the treetype column using the $ operator. What data class is it?

trees$treetype
class(trees$treetype)

We can also examine the spread of the data by making a histogram and selecting the treeheight variable.

hist(trees$treeheight)

We can also select multiple variables to run operations, such as creating a table of the counts of the number of trees of particular heights that have different tree products.

table(trees$treeheight,trees$treeproduct)
    
     fruit nut
  12     1   0
  15     1   1
  18     1   0
  20     0   1

Subsetting with [,]

Dataframes can be subset using the format dfname[row#,col#], or by calling columns by name.

trees[1,1]
trees[,3]
trees[2,]
trees[,"treetype"]

You can also subset dataframes based on logical tests. Let’s look at all the tree types for the trees over 15ft tall. Then let’s examine all the columns for any rows where the tree product is fruit.

trees[trees$treeheight > 15,c("treetype")]
trees[trees$treeheight > 15,2]


#Trees[trees$product==fruit,]

What’s wrong with this last line of code? (Hint: 3 things)

Try-it

  1. Fix the above code to display all the columns for any rows where the tree product is fruit.
Solution:
trees[trees$treeproduct=="fruit",]
  treeheight treetype treeproduct
1         15    apple       fruit
3         12    apple       fruit
5         18     pear       fruit
  2. Using the != operator, we can also select all rows which are ‘not equal’ to a given value. Select all the rows where the tree product is not fruit.
Solution:
trees[trees$treeproduct!="fruit",]
  treeheight treetype treeproduct
2         20   walnut         nut
4         15 hazelnut         nut
  3. Using the | operator in between logical checks, we can also select all rows which are equal to one condition or another condition. Select all the rows where the tree type is either apple or pear. Hint: Run each check for tree type individually, then connect them with the | operator.
Solution:
trees[trees$treetype=="apple" | trees$treetype=="pear",]
  treeheight treetype treeproduct
1         15    apple       fruit
3         12    apple       fruit
5         18     pear       fruit
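
Two related tools are worth knowing once | makes sense. This sketch, which rebuilds trees so that it runs on its own, uses & to require both conditions at once and %in% as a compact alternative to chaining several == checks with |.

```r
trees <- data.frame(
  treeheight = c(15, 20, 12, 15, 18),
  treetype = c("apple", "walnut", "apple", "hazelnut", "pear"),
  treeproduct = c("fruit", "nut", "fruit", "nut", "fruit")
)

# & keeps rows where both conditions are TRUE: fruit trees taller than 12
trees[trees$treeproduct == "fruit" & trees$treeheight > 12, ]
# returns rows 1 (apple, 15) and 5 (pear, 18)

# %in% tests each tree type against a vector of accepted values
trees[trees$treetype %in% c("apple", "pear"), ]
# returns rows 1, 3 and 5
```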

Analyzing numeric variables

We can also run calculations on vectors/variables as a whole. Keep in mind that when two vectors differ in length, R recycles the shorter one, repeating its values until it matches the length of the longer one. R warns only when the longer length is not a whole multiple of the shorter one, so recycling can happen silently.

trees$treeheight
[1] 15 20 12 15 18
trees$treeheight / 2
[1]  7.5 10.0  6.0  7.5  9.0
trees$treeheight / c(10,1) # how does R treat the two vectors during this operation?
[1]  1.5 20.0  1.2 15.0  1.8
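
To see when the warning does and does not appear, here is a small sketch with a six-element vector. The name heights and its values are invented for illustration.

```r
heights <- c(15, 20, 12, 15, 18, 16)   # six hypothetical heights

# Length 6 is a multiple of 2, so c(10, 1) is recycled three times
# and R gives no warning.
heights / c(10, 1)        # 1.5 20.0 1.2 15.0 1.8 16.0

# Length 6 is not a multiple of 4, so R still recycles but warns that the
# longer object length is not a multiple of the shorter object length.
heights / c(1, 2, 3, 4)
```

The silent case is the one to watch for, since a result of the expected length can still be wrong.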

Summarizing character variables

There are several ways of quickly assessing the basic attributes of a character vector/variable.

unique(trees$treetype) # returns the name of each unique type of tree
[1] "apple"    "walnut"   "hazelnut" "pear"    
length(unique(trees$treetype)) # how many unique tree types are there?
[1] 4
table(trees$treetype) # returns a table of the number of times each tree type appears in the dataframe

   apple hazelnut     pear   walnut 
       2        1        1        1 
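
The table can be taken a step further. As a sketch, the lines below sort the counts and convert them to proportions. The object name counts is invented here, and the vector is rebuilt so that the block runs on its own.

```r
treetype <- c("apple", "walnut", "apple", "hazelnut", "pear")

counts <- table(treetype)
sort(counts, decreasing = TRUE)   # most common type first; apple (2) leads
prop.table(counts)                # share of each type: apple 0.4, the others 0.2
names(which.max(counts))          # the most frequent type: "apple"
```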

Final tips

Missing values

In an ideal world, every data cell would be filled in every data table…but this is rarely the case. Sometimes (ok, frequently) we encounter missing values. But what is a missing value and how does R deal with them? How do you know a missing value when you see it?

R codes missing values as NA (not “NA”, which is a character/string element). Missing values can cause some functions to return NA or to fail outright. Check out the following example.

missingparts <- c(1, 2, 3,  NA)
mean(missingparts) # what is the result?
[1] NA
mean(missingparts, na.rm=TRUE) # we can tell the function to ignore any NA values in the data
[1] 2

Try it

How do you know you have missing values rather than another issue in your code? There are a few functions that allow us to pick out the NAs. Try examining the missingparts vector with str(), summary(), and is.na(). What is the result of each of these functions and how might this output be useful?

Solution:
str(missingparts)
 num [1:4] 1 2 3 NA
summary(missingparts)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0     1.5     2.0     2.0     2.5     3.0       1 
is.na(missingparts) 
[1] FALSE FALSE FALSE  TRUE
missingparts[is.na(missingparts)] #you can also subset out only the values that are equal to NA. This is not so useful here, but can be useful when you want to isolate rows in a dataframe that have missing values in particular columns.
[1] NA
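
The same idea carries over to dataframes. The sketch below uses a small made-up dataframe, trees_na (both the name and the values are hypothetical), to count missing values and isolate the rows that contain them.

```r
# Hypothetical data: the third tree has no recorded height
trees_na <- data.frame(
  treeheight = c(15, 20, NA, 15),
  treetype = c("apple", "walnut", "apple", "pear")
)

sum(is.na(trees_na$treeheight))           # how many heights are missing: 1
trees_na[is.na(trees_na$treeheight), ]    # rows with a missing height (row 3)
trees_na[!is.na(trees_na$treeheight), ]   # rows with a recorded height
complete.cases(trees_na)                  # TRUE for rows with no NA in any column
```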