This lesson will introduce you to the R programming language, including how to conduct basic statistical analyses and create beautiful visualizations. We’ll focus on the basics of how to use R as a language and then showcase some of the powerful features of R that may be of interest to anthropologists and interdisciplinary researchers.
To install RStudio, click here and choose the Open Source License.
To install R, visit CRAN and choose the current version of R for your operating system.
R works by typing out a function, putting something into the
function, and then running the code to see the result. For example, if
you want to make a simple plot, you use the function plot()
with the data that you wish to plot listed inside the function’s
parentheses. Here, let’s use one of R’s built-in datasets, cars, to quickly
make a scatterplot.
plot(cars)
The syntax of a line of R code tries to be as human readable as possible. For example, if we want the dots in our plot to be red, we can add an argument to our function that tells R to make the points red.
plot(cars, col="red")
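Other arguments follow the same pattern. As a minimal sketch, the call below adds a title, axis labels, and a filled point symbol using standard arguments of base R’s plot(). The label text is only illustrative; the units come from the help page for the cars dataset.

```r
# cars is a built-in dataset with two columns: speed and dist
plot(cars,
     col = "red",                            # point color
     pch = 19,                               # filled circles instead of open ones
     main = "Speed and stopping distance",   # plot title (illustrative wording)
     xlab = "Speed (mph)",                   # x-axis label
     ylab = "Stopping distance (ft)")        # y-axis label
```

Each argument is written as name = value and separated from the next by a comma, so you can add or remove arguments one at a time to see what each one does.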
While you can directly enter data each time you want to use it, R’s
power comes from assigning data to named objects or variables. Objects
are assigned using the <- symbol, which means
“everything on the right is now referred to by the object name on the
left.” We can refer to this symbol as an assignment operator or
becomes.
Let’s make an object or variable called treeheight with
the value 15. Then run the line of code. What is the
output?
treeheight <- 15
The line above only creates the object, but doesn’t show us the
result. To see what the object treeheight is equal to, you
next have to call the object.
treeheight
[1] 15
You can also create variables using = but this is
generally discouraged as a practice. This is because = is
too easily confused with ==, which has a different meaning.
A single = assigns the value on the right to the name on the left.
A double equals sign instead asks R to test whether the value on the
left is equal to the value on the right, an equivalency test. The output
is a logical vector.
We will learn more about logical tests later, but for now, let’s look at these examples.
5==5 #note that a double equals sign checks for equivalency in R
[1] TRUE
5==6 #Comments in R are prefaced by a hashtag (#). This tells R not to run this line of code, and that it is for your reference only.
[1] FALSE
#5=6 # Why doesn't this last line work?
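Test whether treeheight is greater than 10.
Test whether treeheight=="treeheight". What is the result?
Add 3 to treeheight. What happens if you add 3 to "treeheight" instead?
Reassign treeheight to 20.
What happens when you call bread?
Solution:
treeheight > 10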
[1] TRUE
treeheight=="treeheight"
[1] FALSE
treeheight+3
[1] 18
#"treeheight"+3 # why doesn't this work? Note: I have these lines of code commented out to keep the document compiling properly
treeheight <- 20 # reassign treeheight to 20
#bread
Something very important to keep in mind is that R is case sensitive, unlike some other languages. This matters for keeping track of different variables and is a common cause of coding errors. For example, we can create three different objects referring to trees by changing the capitalization.
Tree <- "tree"
TREE <- "tree again"
#tree # why doesn't this work?
tree <- "a third tree"
Tree
[1] "tree"
TREE
[1] "tree again"
tree
[1] "a third tree"
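If you are not sure which spelling of a name you used, you can ask R whether an object exists before calling it. This is a small sketch using the base function exists(); the misspelled name tREE is a hypothetical example that is never assigned.

```r
# Create one object, then check which spellings R recognizes
Tree <- "tree"

exists("Tree")   # TRUE: this exact name was assigned above
exists("tREE")   # FALSE, as long as nothing with this exact spelling exists
ls()             # lists the names of all objects in your workspace
```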
Above, we made a treeheight object that has one value in
it. When working with real data, we often have multiple values that we
want to analyze. In this case, we make a vector, or one dimensional data
object, to list multiple values. For the treeheight object,
we can use the c() function to add multiple tree
heights.
treeheight <- c(15, 20, 12, 15, 18)
treeheight
[1] 15 20 12 15 18
Notice that the value of treeheight has been overwritten
with the new vector.
Whenever you encounter a new function or want to look up how to use a
function, you can refer to the help file. What does c()
do?
?c()
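There are a few other ways to reach the same documentation. The calls below are all base R; which one you prefer is mostly a matter of habit.

```r
?c              # the more common spelling, without parentheses
help("rep")     # same as ?rep; the quoted form also works for operators, e.g. help("<-")
args(rep)       # shows the arguments a function accepts
example(seq)    # runs the examples from the help page for seq()
```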
Vectors can also be made using nested functions. For example, we can populate a vector of tree heights using the repeat function rep(), the sequence function seq(), or the : shorthand for a sequence of whole numbers.
treeheight <- c(10, rep(15, 3), 20, 20)
treeheight
[1] 10 15 15 15 20 20
treeheight <- c(10, seq(10,15), 20, 20)
treeheight
[1] 10 10 11 12 13 14 15 20 20
treeheight <- c(10, 10:15, 20, 20)
treeheight
[1] 10 10 11 12 13 14 15 20 20
You can select individual elements from a vector:
treeheight[2]
[1] 10
treeheight[c(2,3)]
[1] 10 11
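Square brackets accept more than positions. As a brief sketch, the lines below use a negative index to drop an element and a logical test to keep only the elements that pass it. The name heights is a hypothetical copy, used so the treeheight object stays untouched.

```r
heights <- c(10, 10:15, 20, 20)   # hypothetical copy of the vector above

heights[-1]                # negative index: everything except the first element
heights[heights > 12]      # logical test: only the elements greater than 12
heights[length(heights)]   # the last element, whatever the vector's length
```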
Don’t forget that R is case sensitive. What happens when you enter
Treeheight?
Treeheight
You can add new values to an existing vector:
treeheight <- c(treeheight, 50)
treeheight
[1] 10 10 11 12 13 14 15 20 20 50
Let’s go back to our original treeheight object.
treeheight <- c(15, 20, 12, 15, 18)
treeheight
[1] 15 20 12 15 18
Now that we have a vector of tree heights, we can run some basic statistics on this dataset.
mean(treeheight)
[1] 16
median(treeheight)
[1] 15
max(treeheight)
[1] 20
range(treeheight)
[1] 12 20
summary(treeheight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     12      15      15      16      18      20
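The functions above describe the center and the extremes of the data. If you also want a sense of the spread, base R has functions for that too. A short sketch using the same five heights:

```r
treeheight <- c(15, 20, 12, 15, 18)

sd(treeheight)                        # sample standard deviation
var(treeheight)                       # sample variance (the square of sd)
quantile(treeheight, c(0.25, 0.75))   # first and third quartiles
```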
str(treeheight)
num [1:5] 15 20 12 15 18
In the last function, notice that R tells us this object is a numeric
data class. We can also look at the data class with the class()
function. In addition, R has built in checks for different classes, such
as is.numeric().
class(treeheight)
[1] "numeric"
is.numeric(treeheight)
[1] TRUE
Primary data types in R include: 1. numeric, 2. string or character,
3. logical, and 4. factor. Note that output data types do not always
match the input data. String data are entered with "" or
'' surrounding the characters.
Vectors can have either string, logical, or numeric data, but only
one class of element per vector. To illustrate, let’s re-assign our
treeheight object to a mix of variable types. To
input a string value, we need to use "" around the
value.
treeheight <- c(10, 14, "twelve", 20)
treeheight
class(treeheight)
Note that R will automatically convert the numeric data into string data. This is called “coercion”.
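Coercion also works in the other direction, but only for values that R can read as numbers. The sketch below uses a hypothetical object called mixed so that treeheight is left alone.

```r
mixed <- c(10, 14, "twelve", 20)   # hypothetical name for the mixed vector
class(mixed)                       # "character": every element is now a string

# Converting back to numeric turns anything that is not a number into NA,
# and R prints a warning that NAs were introduced by coercion
as.numeric(mixed)
```

Here "twelve" cannot be read as a number, so it becomes a missing value rather than 12.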
First let’s make a few different objects and then examine them with
class(). Try to predict in advance which data class each
object will be.
tree <- "a third tree"
x <- TRUE
y <- "5"
z <- 5
So far we have been working with vectors, or one-dimensional data. We can also work with dataframes, which are like tables or spreadsheets of multiple connected variables. While data can be stored in lists and matrices, the most common and flexible data format you will use in R is a dataframe. Dataframes can contain multiple classes of data, but only one class of data per column, since each column is a vector. Dataframes are usually organized with each row representing a single case or observation. Columns denote variables which apply across rows.
For example, let’s make a new trees dataframe that
includes the heights, species, and products of different trees.
Make a treeheight vector with the values: 15, 20,
12, 15, and 18. Keep this order. Hint: Use the c()
function.
Make a treetype vector with the values: apple,
walnut, apple, hazelnut, and pear. Keep this order.
Make a treeproduct vector with the values
fruit and nut that matches the fruits and nuts in the
same order as the treetype vector.
Use the data.frame() function to make a new
dataframe called trees that includes the three variable
vectors you just created.
Look at the structure of your new dataframe using the
str() function.
Solution:
treeheight <- c(15, 20, 12, 15, 18)
treetype <- c("apple", "walnut", "apple", "hazelnut", "pear")
treeproduct <- c("fruit", "nut", "fruit", "nut", "fruit")
trees <- data.frame(treeheight, treetype, treeproduct)
trees
str(trees)
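Besides str(), a few other base functions give a quick overview of a dataframe. A minimal sketch, rebuilding the same trees dataframe so the block runs on its own:

```r
treeheight <- c(15, 20, 12, 15, 18)
treetype <- c("apple", "walnut", "apple", "hazelnut", "pear")
treeproduct <- c("fruit", "nut", "fruit", "nut", "fruit")
trees <- data.frame(treeheight, treetype, treeproduct)

nrow(trees)      # number of rows (cases)
ncol(trees)      # number of columns (variables)
names(trees)     # column names, taken from the vector names
head(trees, 3)   # the first three rows
summary(trees)   # a summary of each column
```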
In wide format, each row in a dataframe is a case, while the columns
are variables that are measured for each case. To select a variable in a
dataframe, you use the $ operator.
Call the treetype column using the $
operator. What data class is it?
trees$treetype
class(trees$treetype)
We can also examine the spread of the data by making a histogram and
selecting the treeheight variable.
hist(trees$treeheight)
We can also select multiple variables for a single operation, such as a table that counts how many trees of each height yield each tree product.
table(trees$treeheight,trees$treeproduct)
     fruit nut
  12     1   0
  15     1   1
  18     1   0
  20     0   1
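Counting is one way to combine two variables; summarizing one variable within the groups of another is a second. As a sketch, the base function tapply() applies a function (here mean) to the heights within each tree product. The object name meanheight is hypothetical.

```r
trees <- data.frame(
  treeheight = c(15, 20, 12, 15, 18),
  treetype = c("apple", "walnut", "apple", "hazelnut", "pear"),
  treeproduct = c("fruit", "nut", "fruit", "nut", "fruit")
)

# Mean tree height within each product group
meanheight <- tapply(trees$treeheight, trees$treeproduct, mean)
meanheight
```

The fruit trees average 15 and the nut trees average 17.5, which you can confirm by hand from the five rows.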
Dataframes can be subset using the format
dfname[row#,col#], or by calling columns by name.
trees[1,1]
trees[,3]
trees[2,]
trees[,"treetype"]
You can also subset dataframes based on logical tests. Let’s look at all the tree types for the trees over 15ft tall. Then let’s examine all the columns for any rows where the tree product is fruit.
trees[trees$treeheight > 15,c("treetype")]
trees[trees$treeheight > 15,2]
#Trees[trees$product==fruit,]
What’s wrong with this last line of code? (Hint: 3 things)
trees[trees$treeproduct=="fruit",]
  treeheight treetype treeproduct
1         15    apple       fruit
3         12    apple       fruit
5         18     pear       fruit
Using the != operator, we can also select all rows
which are ‘not equal’ to a given value. Select all the rows where
the tree product is not fruit.
trees[trees$treeproduct!="fruit",]
  treeheight treetype treeproduct
2         20   walnut         nut
4         15 hazelnut         nut
Using the | operator in between logical checks, we can
also select all rows which are equal to one condition or another
condition. Select all the rows where the tree type is either apple
or pear. Hint: Run each check for tree type individually, then connect
them with the | operator.
trees[trees$treetype=="apple" | trees$treetype=="pear",]
  treeheight treetype treeproduct
1         15    apple       fruit
3         12    apple       fruit
5         18     pear       fruit
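Two related tools are worth knowing. The & operator keeps rows where both conditions are true, and %in% is a compact alternative to chaining several == checks with |. A brief sketch, with hypothetical object names tallfruit and appleorpear:

```r
trees <- data.frame(
  treeheight = c(15, 20, 12, 15, 18),
  treetype = c("apple", "walnut", "apple", "hazelnut", "pear"),
  treeproduct = c("fruit", "nut", "fruit", "nut", "fruit")
)

# & : fruit trees that are also taller than 12
tallfruit <- trees[trees$treeproduct == "fruit" & trees$treeheight > 12, ]
tallfruit

# %in% : tree type matches any value in the vector on the right
appleorpear <- trees[trees$treetype %in% c("apple", "pear"), ]
appleorpear
```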
We can also run calculations on vectors/variables as a whole. Keep in mind that when two vectors differ in length, R recycles the shorter one, repeating its values until it matches the length of the longer one. R only warns about this when the longer length is not a multiple of the shorter length, so recycling can happen silently.
trees$treeheight
[1] 15 20 12 15 18
trees$treeheight / 2
[1] 7.5 10.0 6.0 7.5 9.0
trees$treeheight / c(10,1) # how does R treat the two vectors during this operation?
[1] 1.5 20.0 1.2 15.0 1.8
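To see when R stays silent and when it warns, compare a vector whose length is a multiple of the shorter vector with one whose length is not. This is a small sketch; heights is a hypothetical six-value vector.

```r
heights <- c(15, 20, 12, 15, 18, 10)   # six hypothetical values

# Length 6 is a multiple of length 2, so c(10, 1) is recycled silently
heights / c(10, 1)

# Length 5 is not a multiple of length 2, so R recycles and also warns
c(15, 20, 12, 15, 18) / c(10, 1)
```

The silent case is the one to watch for, because nothing tells you that the second vector was shorter than you intended.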
There are several ways of quickly assessing the basic attributes of a character vector/variable.
unique(trees$treetype) # returns the name of each unique type of tree
[1] "apple" "walnut" "hazelnut" "pear"
length(unique(trees$treetype)) # how many unique tree types are there?
[1] 4
table(trees$treetype) # returns a table of the number of times each tree type appears in the dataframe
   apple hazelnut     pear   walnut
       2        1        1        1
In an ideal world, every data cell would be filled in every data table…but this is rarely the case. Sometimes (ok, frequently) we encounter missing values. But what is a missing value, and how does R deal with it? How do you know a missing value when you see it?
R codes missing values as NA (not “NA” which is a
character/string element). Having missing values in a dataframe can
cause some functions to fail. Check out the following example.
missingparts <- c(1, 2, 3, NA)
mean(missingparts) # what is the result?
[1] NA
mean(missingparts, na.rm=TRUE) # we can tell the function to ignore any NA values in the data
[1] 2
How do you know you have missing values rather than another issue in
your code? There are a few functions that allow us to pick out the NAs.
Try examining the missingparts vector with
str(), summary(), and is.na().
What is the result of each of these functions and how might this output
be useful?
str(missingparts)
num [1:4] 1 2 3 NA
summary(missingparts)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    1.0     1.5     2.0     2.0     2.5     3.0       1
is.na(missingparts)
[1] FALSE FALSE FALSE TRUE
missingparts[is.na(missingparts)] #you can also subset out only the values that are equal to NA. This is not so useful here, but can be useful when you want to isolate rows in a dataframe that have missing values in particular columns.
[1] NA
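The same idea carries over to dataframes. As a sketch, the hypothetical orchard dataframe below has one missing height; is.na() picks out the rows where it is missing, and the base function complete.cases() keeps only the rows with no missing values in any column.

```r
# Hypothetical dataframe with one missing height
orchard <- data.frame(
  treetype = c("apple", "walnut", "pear"),
  treeheight = c(15, NA, 18)
)

sum(is.na(orchard$treeheight))         # how many heights are missing?
orchard[is.na(orchard$treeheight), ]   # rows where the height is missing
orchard[complete.cases(orchard), ]     # rows with no missing values at all
```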