
R is an incredibly powerful open-source tool for anthropologists, ecologists, humanities scholars, and others interested in data analysis and visualization. This tutorial will introduce you to the basic components of coding in R.

Installing RStudio and R

To install RStudio, click here and choose the Open Source License.

To install R, visit CRAN and choose the current version of R for your operating system.

Getting Started

When you type a command into R, the output will be printed into the console. R can do basic operations, just like a calculator.

2 + 3
[1] 5
5 * 9
[1] 45
9 + (6/2)
[1] 12
5 > 8
[1] FALSE

Primary data types in R include: 1. numeric, 2. string or character, 3. logical, and 4. factor. Note that output data types do not always match the input data. String data are entered with "" or '' surrounding the characters.

"cheese"
[1] "cheese"
"red"
[1] "red"

Logical data take one of two values, TRUE or FALSE, which can be abbreviated as T and F. We’ll return to factors later.
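As a small illustration of the point that output types do not always match input types, the sketch below checks a few values with class(). The object name comparison is made up for this example.

```r
# Two numbers go in, but a comparison returns a logical value
comparison <- 5 > 8
class(5)           # "numeric"
class(comparison)  # "logical"

# Strings keep their type whether you use double or single quotes
class("cheese")    # "character"
class('red')       # "character"

# T and F are shorthand for TRUE and FALSE
T == TRUE          # TRUE
```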

Variables and Functions in R

While you can directly enter data each time you want to use it, R’s power comes from assigning data to named objects or variables. Objects are assigned using the <- symbol, which means “everything on the right is now referred to by the object name on the left”.

Let’s make an object or variable called x with the value 5.

x <- 5
x
[1] 5

You can also create variables using =, but this is generally discouraged. This is because = is too easily confused with ==, which has a different meaning. A single = assigns the value on the right to the name on the left. A double equals sign instead asks R to test whether the value on the left is equal to the value on the right, an equivalency test. The output is a logical vector.

We will learn more about logical tests in R next week, but for now, let’s look at these examples.

5 == 5  #note that a double equals sign checks for equivalency in R
[1] TRUE
5 == 6  # Comments in R begin with a hash (#). R ignores everything after the # on that line, so comments are for your reference only.
[1] FALSE
# 5=6 # Why doesn't this last line work?
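To make the difference between assignment and testing concrete, here is a minimal sketch. The names y and is_ten are invented for this illustration.

```r
y <- 10             # assignment with the preferred arrow
y = 10              # also assignment, but easy to misread as a test
y == 10             # a test: is y equal to 10? Returns TRUE
y == 11             # returns FALSE

# The result of a test is itself a value that can be stored
is_ten <- y == 10
is_ten
class(is_ten)       # "logical"
```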

Try-it

  1. Run a test to evaluate if x is greater than 2
  2. Run the following: x=="x". What is the result?
  3. What is the result of adding 3 to x? What about adding 3 to “x”?
  4. Reassign x to 4.
  5. What happens if you run e?
Solution:
x > 2
[1] TRUE
x == "x"
[1] FALSE
x + 3
[1] 8
#'x'+3  # why doesn't this work?
x <- 4
# e

R works by running functions on different datasets and variables. Functions allow us to calculate statistics; summarize, transform and visualize data; and so much more. Let’s start with our first function: plot().

plot(faithful)

faithful is a dataset built into base R that records eruptions of the Old Faithful geyser. Here R has automatically taken its two variables, eruptions (the duration of each eruption) and waiting (the waiting time until the next eruption), and created a scatterplot. You don’t need to worry about the specifics of how and why this works for now, but pat yourself on the back for running your first R function and making a cool (though mysterious) plot.
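If you are curious what plot() was working with, the sketch below peeks at the dataset and then draws the same scatterplot with explicit axes. The axis labels and title are our own wording, not something stored in the dataset.

```r
# What variables does the dataset contain?
names(faithful)      # "eruptions" "waiting"
head(faithful, 3)    # the first three rows

# The same scatterplot, naming each axis ourselves
plot(faithful$eruptions, faithful$waiting,
     xlab = "Eruption duration (minutes)",
     ylab = "Waiting time to next eruption (minutes)",
     main = "Old Faithful eruptions")
```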

R is case sensitive

Something very important to keep in mind with R is that it is case sensitive, unlike some other languages. This matters for keeping track of different variables and is a common cause of coding errors. For example, we can create three different objects referring to trees by changing the capitalization.

Tree <- "tree"
TREE <- "tree again"

# tree # why doesn't this work?

tree <- "a third tree"

Tree
[1] "tree"
TREE
[1] "tree again"
tree
[1] "a third tree"

Variable classes

As previously mentioned, there are several different types or classes of data that can be assigned. We’ll cover a few basics here. Vectors can have either string, logical, or numeric data; but only one class of element per vector. You can check the class of a vector with the class() function. In addition, R has built in checks for different classes, such as is.numeric().

x <- seq(1, 5)
is.numeric(x)
x <- c(1:5)
x <- c(1, 10, "eleven", 27)
is.numeric(x)
x <- c(rep(T, 10), F, F, T)
x

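Note that when a vector mixes numeric and string values, as in c(1, 10, "eleven", 27), R automatically converts the numeric data into string data, so is.numeric() returns FALSE. This is called “coercion”.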
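The sketch below shows coercion in both directions. The object names mixed and numbers are hypothetical; the functions are all base R.

```r
# Mixing numbers and a string: everything becomes a string
mixed <- c(1, 10, "eleven", 27)
mixed
class(mixed)         # "character"

# Converting back: anything that is not a number becomes NA.
# R normally warns about this; the warning is silenced here.
numbers <- suppressWarnings(as.numeric(mixed))
numbers              # 1 10 NA 27

# Mixing logical and numeric values: TRUE becomes 1 and FALSE becomes 0
c(TRUE, FALSE, 5)    # 1 0 5
```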

Factors

Factors are useful for categorical data with explicit levels (which may or may not be ordered), such as cat/dog, T/F, Y/N, or income brackets.

x <- factor(c("Cat", "Dog", "Cat", "Cat"), levels = c("Cat", "Dog"))
x
[1] Cat Dog Cat Cat
Levels: Cat Dog
str(x)
 Factor w/ 2 levels "Cat","Dog": 1 2 1 1
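For levels that do have an order, such as income brackets, a factor can be marked as ordered. This is a minimal sketch with invented data; the income object and its three levels are assumptions made for illustration.

```r
income <- factor(c("Low", "High", "Medium", "Low"),
                 levels = c("Low", "Medium", "High"),
                 ordered = TRUE)
income
levels(income)         # "Low" "Medium" "High"
as.integer(income)     # the underlying codes: 1 3 2 1

# Ordered factors support comparisons between levels
below_high <- income < "High"
below_high             # TRUE FALSE TRUE TRUE

# Counts are reported in level order, not alphabetical order
table(income)
```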

Try it

First let’s make a few different objects and then examine them with class(). Try to predict in advance which data class each object will be.

tree <- "a third tree"
x <- TRUE
y <- "5"
z <- 5

Working with vectors and dataframes

Creating vectors with c()

Vectors are one-dimensional sets of values, which can be created using the concatenate function (among others). Functions in R are denoted as functionname(), in this case c().

Whenever you encounter a new function or want to look up how to use a function, you can refer to the help file. What does c() do?

?c

Let’s try one of the examples from the help file: c(1, 7:9). What is the result?

c(1, 7:9)  #what is the `:` operator doing?
[1] 1 7 8 9
c(1, 7:50)
 [1]  1  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
[26] 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

We can also make vectors using character data. Let’s make a vector called Cities:

Cities <- c("New York", "Los Angeles", "San Francisco", "Chicago", "Minneapolis")

Look at our new object:

Cities
[1] "New York"      "Los Angeles"   "San Francisco" "Chicago"      
[5] "Minneapolis"  

You can select individual elements from a vector using square brackets:

Cities[2]
[1] "Los Angeles"
Cities[c(2, 3)]
[1] "Los Angeles"   "San Francisco"
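Square brackets accept more than single positions. The sketch below uses a separate, hypothetical vector called towns so that it does not interfere with the Cities object used in this tutorial.

```r
towns <- c("New York", "Los Angeles", "San Francisco", "Chicago", "Minneapolis")

towns[2:4]                   # a range of positions
towns[length(towns)]         # the last element: "Minneapolis"
towns[-1]                    # everything except the first element
towns[towns != "Chicago"]    # elements that pass a logical test
towns[10]                    # a position that does not exist returns NA
```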

Don’t forget that R is case sensitive. What happens when you enter cities?

cities

You can add new values to an existing vector:

Cities <- c(Cities, "Portland")

Check out your handiwork. Call the Cities vector:

Cities
[1] "New York"      "Los Angeles"   "San Francisco" "Chicago"      
[5] "Minneapolis"   "Portland"     

Try-it

  1. You can also make an entirely new object with the same name. Make a new vector called Cities with three new city names. Look at your new Cities object. What happened to the old cities?
Solution:
Cities <- c("Chengdu", "Chongqing", "New York")
Cities
  2. Reassign the Cities object to the following cities: “New York”, “Los Angeles”, “San Francisco”, “Chicago”, “Minneapolis”, “Portland”.
Solution:
Cities <- c("New York", "Los Angeles", "San Francisco", "Chicago", "Minneapolis",
    "Portland")

Creating vectors with rep() and seq()

There are many ways to create vectors in R. These two commonly used functions can repeat values or return a sequence of values.

rep(6, times = 10)  #repeating a single number
 [1] 6 6 6 6 6 6 6 6 6 6
rep(c("a", "b"), 10)  #repeating several values
 [1] "a" "b" "a" "b" "a" "b" "a" "b" "a" "b" "a" "b" "a" "b" "a" "b" "a" "b" "a"
[20] "b"
rep(c("a", "b"), c(5, 5))  # repeating each value in a sequence multiple times
 [1] "a" "a" "a" "a" "a" "b" "b" "b" "b" "b"

Try it

  1. Using 3 different methods, create a variable that is a sequence of the numbers 1-3, repeated in order 4 times. Hint: read the seq() help file.
Solution:
test <- rep(seq(1, 3), 4)
test
 [1] 1 2 3 1 2 3 1 2 3 1 2 3
test <- rep(seq_len(3), 4)
test
 [1] 1 2 3 1 2 3 1 2 3 1 2 3
test <- rep(c(1:3), 4)
test
 [1] 1 2 3 1 2 3 1 2 3 1 2 3
rep(seq(from = 1, to = 3), 4)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3
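Both functions take further arguments that are worth knowing. The sketch below shows a few of them (by, length.out, each, and times); the object names are invented for this example.

```r
# seq() can step by a fixed amount or aim for a fixed number of values
evens <- seq(0, 10, by = 2)              # 0 2 4 6 8 10
quarters <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# rep() can repeat each element before moving on to the next
letters_each <- rep(c("a", "b"), each = 3)     # "a" "a" "a" "b" "b" "b"

# each and times can be combined
both <- rep(1:3, each = 2, times = 2)    # 1 1 2 2 3 3 1 1 2 2 3 3

evens
quarters
letters_each
both
```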

Dataframes

While data can be stored in lists and matrices, the most common and flexible data format you will use in R is a dataframe. Dataframes can contain multiple classes of data, but only one class per column, because each column is a vector. Dataframes are usually organized with each row representing a single case. Columns denote variables which apply across cases.

Let’s make a CityInfo dataframe.

Try it

Make two vectors:

  1. State with the values: NY, CA, CA, IL, MN, OR. Hint: Use the c() function.

  2. Sunshine_per with the values: 58, 73, 66, 54, 58, 48. These values denote the average percent of total possible annual sunshine experienced in each city.

Solution:

State <- c("NY", "CA", "CA", "IL", "MN", "OR")
Sunshine_per <- c(58, 73, 66, 54, 58, 48)

# Look at our handiwork:
State
Sunshine_per

We can combine these vectors together using cbind() which binds the columns together.

cbind(Cities, State, Sunshine_per)
     Cities          State Sunshine_per
[1,] "New York"      "NY"  "58"        
[2,] "Los Angeles"   "CA"  "73"        
[3,] "San Francisco" "CA"  "66"        
[4,] "Chicago"       "IL"  "54"        
[5,] "Minneapolis"   "MN"  "58"        
[6,] "Portland"      "OR"  "48"        

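You’ll notice that cbind() coerces the data into characters. This is because cbind() returns a matrix here, and a matrix can hold only one class of data. R has multiple ways to combine vectors into dataframes, which keep each column’s class intact. Here we use the data.frame() function.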

CityInfo <- data.frame(Cities, State, Sunshine_per)
CityInfo
         Cities State Sunshine_per
1      New York    NY           58
2   Los Angeles    CA           73
3 San Francisco    CA           66
4       Chicago    IL           54
5   Minneapolis    MN           58
6      Portland    OR           48
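One way to see the difference between the two approaches is to inspect what each returns. This sketch uses two small made-up vectors, site and depth_cm, and sets stringsAsFactors = FALSE explicitly so that it behaves the same way on older versions of R.

```r
site <- c("A", "B", "C")
depth_cm <- c(10, 25, 40)

# cbind() builds a matrix, so every value is stored as a string
as_matrix <- cbind(site, depth_cm)
is.matrix(as_matrix)     # TRUE
typeof(as_matrix)        # "character"

# data.frame() keeps each column's own class
as_df <- data.frame(site, depth_cm, stringsAsFactors = FALSE)
sapply(as_df, class)     # site: "character", depth_cm: "numeric"

# Arithmetic works on the dataframe column
mean(as_df$depth_cm)     # 25
```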

Now we have a CityInfo dataframe that combines the three vectors about cities we created earlier. R can give us summary and structural information about our new dataframe:

names(CityInfo)
[1] "Cities"       "State"        "Sunshine_per"
str(CityInfo)
'data.frame':   6 obs. of  3 variables:
 $ Cities      : chr  "New York" "Los Angeles" "San Francisco" "Chicago" ...
 $ State       : chr  "NY" "CA" "CA" "IL" ...
 $ Sunshine_per: num  58 73 66 54 58 48
summary(CityInfo)
    Cities             State            Sunshine_per 
 Length:6           Length:6           Min.   :48.0  
 Class :character   Class :character   1st Qu.:55.0  
 Mode  :character   Mode  :character   Median :58.0  
                                       Mean   :59.5  
                                       3rd Qu.:64.0  
                                       Max.   :73.0  
nrow(CityInfo)
[1] 6
ncol(CityInfo)
[1] 3

Selecting and subsetting variables

Subsetting with $

In wide format, each row in a dataframe is a case, while the columns are variables that are measures for each case. To select a variable in a dataframe, you use the $ operator.

Call the Sunshine_per column using the $ operator. What data class is it?

CityInfo$Sunshine_per
class(CityInfo$Sunshine_per)

Subsetting with [,]

Dataframes can be subset using the format dfname[row#,col#], or by calling columns by name.

CityInfo[1, 1]
[1] "New York"
CityInfo[, 3]
[1] 58 73 66 54 58 48
CityInfo[2, ]
       Cities State Sunshine_per
2 Los Angeles    CA           73
CityInfo[, "State"]
[1] "NY" "CA" "CA" "IL" "MN" "OR"
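Notice that selecting a single column with [ , ] returns a plain vector rather than a dataframe. The sketch below, which rebuilds a three-row version of CityInfo so it can run on its own, shows some ways to control this; the result names are invented for illustration.

```r
CityInfo <- data.frame(
  Cities = c("New York", "Los Angeles", "San Francisco"),
  State = c("NY", "CA", "CA"),
  Sunshine_per = c(58, 73, 66)
)

# A single column comes back as a vector by default
one_col_vector <- CityInfo[, "State"]
is.data.frame(one_col_vector)      # FALSE

# drop = FALSE keeps the result as a one-column dataframe
one_col_df <- CityInfo[, "State", drop = FALSE]
is.data.frame(one_col_df)          # TRUE

# Single brackets without a comma also return a dataframe,
# while double brackets return the vector
also_df <- CityInfo["State"]
also_vector <- CityInfo[["State"]]
```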

You can also subset dataframes based on logical tests. Let’s look at all the cities and states that enjoy over 55% sunshine. Then let’s examine all the columns for any rows where the state is equal to California.

CityInfo[CityInfo$Sunshine_per > 55, c("Cities", "State")]
CityInfo[CityInfo$Sunshine_per > 55, 1:2]

# cityInfo[CityInfo$State==CA,]

What’s wrong with this last line of code? (Hint: 2 things)

Try-it

  1. Fix the above code to display all columns for all the rows in which the state is California.
Solution:
CityInfo[CityInfo$State == "CA", ]
         Cities State Sunshine_per
2   Los Angeles    CA           73
3 San Francisco    CA           66
  2. Using the != operator, we can also select all rows which are ‘not equal’ to a given value. Select all the rows for cities outside of California.
Solution:
CityInfo[CityInfo$State != "CA", ]
       Cities State Sunshine_per
1    New York    NY           58
4     Chicago    IL           54
5 Minneapolis    MN           58
6    Portland    OR           48

Using subset()

We can also subset dataframes with a specific function: subset(). Let’s examine the help file to see what this function does.

?subset

Let’s subset all the data for cities with sunshine percentages greater than 55.

subset(CityInfo, Sunshine_per > 55)
         Cities State Sunshine_per
1      New York    NY           58
2   Los Angeles    CA           73
3 San Francisco    CA           66
5   Minneapolis    MN           58

You can subset based on multiple conditions.

subset(CityInfo, State == "CA" & Sunshine_per > 55)
         Cities State Sunshine_per
2   Los Angeles    CA           73
3 San Francisco    CA           66

You can also use the | operator to select cases which have one or the other condition.

subset(CityInfo, State == "CA" | Sunshine_per > 55)
         Cities State Sunshine_per
1      New York    NY           58
2   Los Angeles    CA           73
3 San Francisco    CA           66
5   Minneapolis    MN           58
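subset() can also choose columns through its select argument, and the %in% operator is a compact way to test against several values at once. The sketch below rebuilds CityInfo so that it runs on its own; the names west_coast and mid_range are hypothetical.

```r
CityInfo <- data.frame(
  Cities = c("New York", "Los Angeles", "San Francisco", "Chicago",
             "Minneapolis", "Portland"),
  State = c("NY", "CA", "CA", "IL", "MN", "OR"),
  Sunshine_per = c(58, 73, 66, 54, 58, 48)
)

# Rows where State is CA or OR, keeping only two of the columns
west_coast <- subset(CityInfo, State %in% c("CA", "OR"),
                     select = c(Cities, Sunshine_per))
west_coast

# Rows within a range of sunshine values, dropping the State column
mid_range <- subset(CityInfo, Sunshine_per >= 54 & Sunshine_per <= 60,
                    select = -State)
mid_range
```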

Descriptive statistics

Averages and vector characteristics

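Often when exploring data we are interested in some basic descriptive statistics such as the mean, median, and mode. R has built-in functions for the mean and median. Be careful with mode(), though: in R it reports how an object is stored (here, "numeric"), not the most frequent value.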

mean(CityInfo$Sunshine_per)
[1] 59.5
median(CityInfo$Sunshine_per)
[1] 58
mode(CityInfo$Sunshine_per)  # the storage mode of the vector, not the statistical mode
[1] "numeric"
length(CityInfo$Sunshine_per)
[1] 6
max(CityInfo$Sunshine_per)
[1] 73
sum(CityInfo$Sunshine_per)
[1] 357
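Base R has no dedicated function for the statistical mode, but one way to find the most frequent value is to count values with table(). This is a sketch of one possible approach, using a standalone copy of the sunshine values; the names counts and most_common are invented.

```r
sunshine <- c(58, 73, 66, 54, 58, 48)

# Count how often each value occurs
counts <- table(sunshine)
counts

# Keep the value(s) with the highest count; ties return more than one value
most_common <- names(counts)[counts == max(counts)]
most_common               # "58", returned as a string
as.numeric(most_common)   # 58
```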

We can also examine the spread of the data by making a histogram.

hist(CityInfo$Sunshine_per)

This doesn’t look the best, so we might want to change the number of bins and add a title. We don’t have many observations here, but you can see how, in principle, adjusting the bins can change your interpretation of the data distribution. Note that R treats a single number passed to breaks as a suggestion rather than an exact count.

hist(CityInfo$Sunshine_per, breaks = 3, main = "Histogram of City Sunshine", xlab = "Sunshine percent",
    ylab = "Frequency")
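To see which bins hist() actually chose, you can ask it to return its calculations without drawing anything by setting plot = FALSE. The sketch below uses a standalone copy of the sunshine values; the breakpoints passed in the second call are an arbitrary choice for illustration.

```r
sunshine <- c(58, 73, 66, 54, 58, 48)

# Return the histogram's breakpoints and counts instead of a plot
h <- hist(sunshine, breaks = 3, plot = FALSE)
h$breaks    # the bin edges R settled on
h$counts    # how many values fall in each bin

# For full control, pass the exact breakpoints as a vector
h_exact <- hist(sunshine, breaks = seq(40, 80, by = 5), plot = FALSE)
h_exact$breaks
h_exact$counts
```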

Vector arithmetic

We can also run calculations on vectors as a whole. Keep in mind that when two vectors differ in length, R recycles the shorter one, repeating its values until it matches the length of the longer one. R returns a warning only when the longer length is not a multiple of the shorter, so recycling can happen silently.

CityInfo$Sunshine_per
[1] 58 73 66 54 58 48
CityInfo$Sunshine_per/2
[1] 29.0 36.5 33.0 27.0 29.0 24.0

CityInfo$Sunshine_per/c(10, 1)  # how does R treat the two vectors during this operation?
[1]  5.8 73.0  6.6 54.0  5.8 48.0
CityInfo$Sunshine_per/c(10, 1, 2)
[1]  5.8 73.0 33.0  5.4 58.0 24.0

CityInfo$Sunshine_per + 2
[1] 60 75 68 56 60 50
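The sketch below contrasts a silent case of recycling with one that triggers a warning. It uses a standalone copy of the sunshine values, and tryCatch() is used here only to capture the warning text so that it can be printed.

```r
sunshine <- c(58, 73, 66, 54, 58, 48)

# Lengths 6 and 3: 6 is a multiple of 3, so R recycles without comment
silent <- sunshine / c(10, 1, 2)
silent

# Lengths 6 and 4: 6 is not a multiple of 4, so R issues a warning
msg <- tryCatch(sunshine / c(10, 1, 2, 4),
                warning = function(w) conditionMessage(w))
msg
```

A quick check with length() on both vectors before doing arithmetic is an easy way to catch unintended recycling.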

Summarizing character vectors

Sometimes you might encounter a long list of values that you would like to summarize. There are several ways of quickly assessing the basic attributes of a vector.

# First we make a fruits vector
fruits <- rep(c("apple", "orange", "banana", "pear", "pineapple"), times = 10)
fruits <- c(fruits, rep(c("mango", "blueberry"), 2))

unique(fruits)  # returns the name of each unique named fruit
[1] "apple"     "orange"    "banana"    "pear"      "pineapple" "mango"    
[7] "blueberry"
length(unique(fruits))  # how many unique fruits are there?
[1] 7

table(fruits)  # returns a table of the number of times each fruit appears in the vector
fruits
    apple    banana blueberry     mango    orange      pear pineapple 
       10        10         2         2        10        10        10 
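The output of table() can itself be sorted, converted to proportions, or filtered. The following sketch rebuilds the fruits vector so that it runs on its own; fruit_counts and rare are hypothetical names, and the cutoff of 5 is arbitrary.

```r
fruits <- c(rep(c("apple", "orange", "banana", "pear", "pineapple"), times = 10),
            rep(c("mango", "blueberry"), times = 2))

fruit_counts <- table(fruits)

# Most common fruits first
sort(fruit_counts, decreasing = TRUE)

# Share of the whole vector that each fruit makes up
round(prop.table(fruit_counts), 2)

# Names of the fruits that appear fewer than 5 times
rare <- names(fruit_counts)[fruit_counts < 5]
rare    # "blueberry" "mango"
```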

Missing values

In an ideal world, every data cell would be filled in every data table…but this is rarely the case. Sometimes (ok, frequently) we encounter missing values. But what is a missing value and how does R deal with them? How do you know a missing value when you see it?

R codes missing values as NA (not “NA”, which is a character/string element). Missing values in a dataframe can cause some functions to fail or to return NA instead of a result. Check out the following example.

missingparts <- c(1, 2, 3, NA)
mean(missingparts)  # what is the result?
[1] NA
mean(missingparts, na.rm = T)  # we can tell the function to ignore any NA values in the data
[1] 2

Try it

How do you know you have missing values rather than another issue in your code? There are a few functions that allow us to pick out the NAs. Try examining the missingparts vector with str(), summary(), and is.na(). What is the result of each of these functions and how might this output be useful?

Solution:
str(missingparts)
 num [1:4] 1 2 3 NA
summary(missingparts)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0     1.5     2.0     2.0     2.5     3.0       1 
is.na(missingparts)
[1] FALSE FALSE FALSE  TRUE
missingparts[is.na(missingparts)]  # you can also subset out only the values that are NA. This is not so useful here, but can be useful when you want to isolate rows in a dataframe that have missing values in particular columns.
[1] NA
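The same idea extends to dataframes. The sketch below uses a small invented dataframe called survey (its columns and values are assumptions made for this example) to count missing values per column and to isolate or drop incomplete rows.

```r
# NA cannot be found with ==, because comparing anything to NA gives NA
NA == NA    # NA, not TRUE; use is.na() instead

survey <- data.frame(
  id = 1:4,
  age = c(23, NA, 31, 40),
  height_cm = c(170, 165, NA, 180)
)

# How many missing values are in each column?
colSums(is.na(survey))

# Rows where age is missing
survey[is.na(survey$age), ]

# Rows with no missing values in any column
complete <- survey[complete.cases(survey), ]
complete

# Summaries can skip the missing values
mean(survey$age, na.rm = TRUE)
```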