
This lesson will introduce you to the R programming language, including how to conduct basic statistical analyses and create beautiful visualizations. We’ll focus on the basics of how to use R as a language and then showcase some of the powerful features of R that may be of interest to anthropologists and interdisciplinary researchers.

Installing RStudio and R

To install RStudio, click here and choose the Open Source License.

To install R, visit CRAN and choose the current version of R for your operating system.

Syntax in R

In R, you type a function, pass it some input, and then run the code to see the result. For example, if you want to make a simple plot, you use the function plot() with the data that you wish to plot listed inside the function’s parentheses. Here, let’s use cars, a dataset built into R, to quickly make a scatterplot.

plot(cars)

The syntax of a line of R code tries to be as human-readable as possible. For example, if we want the dots in our plot to be red, we can add an argument to our function that tells R to make the points red.


plot(cars, col = "red")

Objects and Logical Tests

While you can directly enter data each time you want to use it, R’s power comes from assigning data to named objects or variables. Objects are assigned using the <- symbol, which means “everything on the right is now referred to by the object name on the left.” We can refer to this symbol as an assignment operator or becomes.

Let’s make an object or variable called treeheight with the value 15. Then run the line of code. What is the output?

treeheight <- 15

The line above only creates the object, but doesn’t show us the result. To see what the object treeheight is equal to, you next have to call the object.

treeheight
[1] 15

You can also create variables using =, but this is generally discouraged. This is because = is too easily confused with ==, which has a different meaning. A single = assigns the value on the right-hand side to the name on the left. A double equals sign instead asks R to test whether the value on the left is equal to the value on the right, an equality test. The output is a logical vector (TRUE or FALSE).

We will learn more about logical tests later, but for now, let’s look at these examples.

5 == 5  #note that a double equals sign checks for equivalency in R
[1] TRUE
5 == 6  #Comments in R are prefaced by a hash symbol (#). R ignores everything after the # on a line, so comments are for your reference only.
[1] FALSE
# 5=6 # Why doesn't this last line work?

Try-it

  1. Run a test to evaluate if treeheight is greater than 10
  2. Run the following: treeheight=="treeheight". What is the result?
  3. What is the result of adding 3 to treeheight? What about adding 3 to “treeheight”?
  4. Reassign treeheight to 20.
  5. What happens if you run bread?
Click for solution
treeheight > 10
[1] TRUE
treeheight == "treeheight"
[1] FALSE
treeheight + 3
[1] 18
#'treeheight'+3  # why doesn't this work? Note: I have these lines of code commented out to keep the document compiling properly
treeheight <- 20
# bread

R is case sensitive

R is case sensitive, unlike some other languages. Keeping this in mind helps you keep track of different variables, and forgetting it is a common cause of coding errors. For example, we can create three different objects referring to trees by changing the capitalization.

Tree <- "tree"
TREE <- "tree again"

# tree # why doesn't this work?

tree <- "a third tree"

# look at results
Tree
[1] "tree"
TREE
[1] "tree again"
tree
[1] "a third tree"

Vectors

Above, we made a treeheight object that has one value in it. When working with real data, we often have multiple values that we want to analyze. In this case, we make a vector, or one-dimensional data object, to hold multiple values. For the treeheight object, we can use the c() function to combine multiple tree heights.

treeheight <- c(15, 20, 12, 15, 18)
treeheight
[1] 15 20 12 15 18

Notice that the value of treeheight has been overwritten with the new vector.

Whenever you encounter a new function or want to look up how to use a function, you can refer to the help file. What does c() do?

?c

Vectors can also be made using nested functions. For example, we can populate a vector of tree heights using the rep() (repeat) and seq() (sequence) functions, or the : shorthand for a sequence of whole numbers.

treeheight <- c(10, rep(15, 3), 20, 20)
treeheight
[1] 10 15 15 15 20 20

treeheight <- c(10, seq(10, 15), 20, 20)
treeheight
[1] 10 10 11 12 13 14 15 20 20

treeheight <- c(10, 10:15, 20, 20)
treeheight
[1] 10 10 11 12 13 14 15 20 20

You can select individual elements from a vector:

treeheight[2]
[1] 10
treeheight[c(2, 3)]
[1] 10 11
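
Square brackets accept more than positions. As a small sketch using the same treeheight vector, a negative index drops an element and a logical test keeps only the elements that pass it.

```r
treeheight <- c(10, 10:15, 20, 20)

# A negative index returns everything except that position
without_first <- treeheight[-1]
without_first
# [1] 10 11 12 13 14 15 20 20

# A logical test inside the brackets keeps only the matching elements
tall_trees <- treeheight[treeheight > 14]
tall_trees
# [1] 15 20 20
```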

Don’t forget that R is case sensitive. What happens when you enter Treeheight?

Treeheight

You can add new values to an existing vector:

treeheight <- c(treeheight, 50)
treeheight
 [1] 10 10 11 12 13 14 15 20 20 50

Let’s go back to our original treeheight object.

treeheight <- c(15, 20, 12, 15, 18)
treeheight
[1] 15 20 12 15 18

Now that we have a vector of tree heights, we can run some basic statistics on this dataset.

mean(treeheight)
[1] 16
median(treeheight)
[1] 15
max(treeheight)
[1] 20
range(treeheight)
[1] 12 20
summary(treeheight)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     12      15      15      16      18      20 
str(treeheight)
 num [1:5] 15 20 12 15 18

Variable classes

In the last function, notice that R tells us this object is a numeric data class. We can also look at the data class with the class() function. In addition, R has built in checks for different classes, such as is.numeric().

class(treeheight)
[1] "numeric"
is.numeric(treeheight)
[1] TRUE

Primary data types in R include: 1. numeric, 2. string or character, 3. logical, and 4. factor. Note that output data types do not always match the input data. String data are entered with "" or '' surrounding the characters.
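
Factors are the one type in this list that the lesson has not shown yet. The sketch below uses a made-up tree_condition vector (a hypothetical example, not part of the lesson data) to show how factor() stores each distinct string as a level with an integer code.

```r
# Hypothetical condition ratings for four trees
tree_condition <- c("Good", "Fair", "Good", "Excellent")

# factor() converts the strings into a categorical variable
tree_condition_factor <- factor(tree_condition)

class(tree_condition_factor)
# [1] "factor"

# Levels are the distinct categories, sorted alphabetically by default
levels(tree_condition_factor)
# [1] "Excellent" "Fair"      "Good"

# Behind the scenes each value is stored as the integer code of its level
as.integer(tree_condition_factor)
# [1] 3 2 3 1
```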

Vectors can hold string, logical, or numeric data, but only one class of element per vector. To illustrate, let’s re-assign our treeheight object with a mix of value types. To input a string value, we need to use "" around the value.

treeheight <- c(10, 14, "twelve", 20)
treeheight
class(treeheight)

Note that R will automatically convert the numeric data into string data. This is called “coercion”.
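
Coercion also runs in other directions. The sketch below uses an illustrative vector named mixed (a name chosen here so that treeheight is not overwritten) to show that converting back to numeric turns any text that is not a number into NA, and that logical values become 1 and 0 when combined with numbers.

```r
mixed <- c(10, 14, "twelve", 20)
class(mixed)
# [1] "character"

# as.numeric() converts what it can; "twelve" becomes NA.
# R would normally print a warning here, which suppressWarnings() hides.
back_to_numbers <- suppressWarnings(as.numeric(mixed))
back_to_numbers
# [1] 10 14 NA 20

# Logical values are coerced to numbers when mixed with numeric data
logical_and_numeric <- c(TRUE, FALSE, 5)
logical_and_numeric
# [1] 1 0 5
```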

Try it

First let’s make a few different objects and then examine them with class(). Try to predict in advance which data class each object will be.

tree <- "a third tree"
x <- TRUE
y <- "5"
z <- 5

Dataframes

Dataframe basics

So far we have been working with vectors, or one-dimensional data. We can also work with dataframes, which are like tables or spreadsheets of multiple connected variables. While data can be stored in lists and matrices, the most common and flexible data format you will use in R is a dataframe. Dataframes can contain multiple classes of data, but only one class of data per column. Dataframes are usually organized with each row representing a single case or observation. Columns denote variables which apply across rows.

For example, let’s make a new trees dataframe that includes the heights, species, and products of different trees.

Try it

  1. Make a treeheight vector with the values: 15, 20, 12, 15, and 18. Keep this order. Hint: Use the c() function.

  2. Make a treetype vector with the values: apple, walnut, apple, hazelnut, and pear. Keep this order.

  3. Make a treeproduct vector with the values fruit and nut that matches the fruits and nuts in the same order as the treetype vector.

  4. Use the data.frame() function to make a new dataframe called trees that includes the three variable vectors you just created.

  5. Look at the structure of your new dataframe using the str() function.

Click for solution

Solution:

treeheight <- c(15, 20, 12, 15, 18)
treetype <- c("apple", "walnut", "apple", "hazelnut", "pear")
treeproduct <- c("fruit", "nut", "fruit", "nut", "fruit")

trees <- data.frame(treeheight, treetype, treeproduct)
trees
str(trees)

Selecting and subsetting variables

Subsetting with $

In wide format, each row in a dataframe is a case, while the columns are variables that are measures for each case. To select a variable in a dataframe, you use the $ operator.

Call the treetype column using the $ operator. What data class is it?

trees$treetype
class(trees$treetype)

We can also examine the spread of the data by making a histogram and selecting the treeheight variable.

hist(trees$treeheight)

We can also pass multiple variables to a single function. For example, table() can count how many trees of each height yield each tree product.

table(trees$treeheight, trees$treeproduct)
    
     fruit nut
  12     1   0
  15     1   1
  18     1   0
  20     0   1

Subsetting with [,]

Dataframes can be subset using the format dfname[row#,col#], or by calling columns by name.

trees[1, 1]
trees[, 3]
trees[2, ]
trees[, "treetype"]
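
It can help to check what kind of object each form of subsetting returns. The sketch below rebuilds the small trees dataframe from the earlier exercise and relies on base R's default behavior, in which a single selected column is simplified to a vector unless you add drop = FALSE.

```r
trees <- data.frame(
    treeheight = c(15, 20, 12, 15, 18),
    treetype = c("apple", "walnut", "apple", "hazelnut", "pear"),
    treeproduct = c("fruit", "nut", "fruit", "nut", "fruit"),
    stringsAsFactors = FALSE
)

# Selecting one column returns a plain vector
class(trees[, 3])
# [1] "character"

# Selecting one row keeps the dataframe structure
class(trees[2, ])
# [1] "data.frame"

# drop = FALSE keeps a single column as a dataframe
class(trees[, 3, drop = FALSE])
# [1] "data.frame"

# Rows and columns can be selected together
two_by_two <- trees[c(1, 3), c("treeheight", "treetype")]
dim(two_by_two)
# [1] 2 2
```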

You can also subset dataframes based on logical tests. Let’s look at all the tree types for the trees over 15ft tall. Then let’s examine all the columns for any rows where the tree product is fruit.

trees[trees$treeheight > 15, c("treetype")]
trees[trees$treeheight > 15, 2]


# Trees[trees$product==fruit,]

What’s wrong with this last line of code? (Hint: 3 things)

Try-it

  1. Fix the above code to display all the columns for any rows where the tree product is fruit.
Click for solution
trees[trees$treeproduct == "fruit", ]
  treeheight treetype treeproduct
1         15    apple       fruit
3         12    apple       fruit
5         18     pear       fruit
  2. Using the != operator, we can also select all rows which are ‘not equal’ to a given value. Select all the rows where the tree product is not fruit.
Click for solution
trees[trees$treeproduct != "fruit", ]
  treeheight treetype treeproduct
2         20   walnut         nut
4         15 hazelnut         nut
  3. Using the | operator in between logical checks, we can also select all rows which are equal to one condition or another condition. Select all the rows where the tree type is either apple or pear. Hint: Run each check for tree type individually, then connect them with the | operator.
Click for solution
trees[trees$treetype == "apple" | trees$treetype == "pear", ]
  treeheight treetype treeproduct
1         15    apple       fruit
3         12    apple       fruit
5         18     pear       fruit

Analyzing numeric variables

We can also run calculations on vectors/variables as a whole. Keep in mind that when two vectors differ in length, R recycles the shorter one, repeating its values until it matches the length of the longer one. R only warns about this when the longer length is not a multiple of the shorter length, so recycling can happen silently.

trees$treeheight
[1] 15 20 12 15 18
trees$treeheight/2
[1]  7.5 10.0  6.0  7.5  9.0
trees$treeheight/c(10, 1)  # how does R treat the two vectors during this operation?
[1]  1.5 20.0  1.2 15.0  1.8
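
To see when recycling is silent, compare two cases. In this sketch (the six-element heights vector is made up for illustration), a length-6 vector divided by a length-2 vector recycles without any message, while the length-5 case from above raises a warning that tryCatch() captures as text.

```r
# Six values divided by two values: 6 is a multiple of 2, so no warning
heights <- c(15, 20, 12, 15, 18, 10)
silent_result <- heights / c(10, 1)
silent_result
# [1]  1.5 20.0  1.2 15.0  1.8 10.0

# Five values divided by two values: R still recycles, but warns.
# tryCatch() returns the warning message instead of the result.
warning_text <- tryCatch(
    c(15, 20, 12, 15, 18) / c(10, 1),
    warning = function(w) conditionMessage(w)
)
is.character(warning_text)
# [1] TRUE
```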

Summarizing character variables

There are several ways of quickly assessing the basic attributes of a character vector/variable.

unique(trees$treetype)  # returns the name of each unique type of tree
[1] "apple"    "walnut"   "hazelnut" "pear"    
length(unique(trees$treetype))  # how many unique tree types are there?
[1] 4
table(trees$treetype)  # returns a table of the number of times each tree type appears in the dataframe

   apple hazelnut     pear   walnut 
       2        1        1        1 

Missing values

In an ideal world, every data cell would be filled in every data table…but this is rarely the case. Sometimes (ok, frequently) we encounter missing values. But what is a missing value and how does R deal with them? How do you know a missing value when you see it?

R codes missing values as NA (not “NA” which is a character/string element). Having missing values in a dataframe can cause some functions to return NA or to fail. Check out the following example.

missingparts <- c(1, 2, 3, NA)
mean(missingparts)  # what is the result?
[1] NA
mean(missingparts, na.rm = TRUE)  # we can tell the function to ignore any NA values in the data
[1] 2

Try it

How do you know you have missing values rather than another issue in your code? There are a few functions that allow us to pick out the NAs. Try examining the missingparts vector with str(), summary(), and is.na(). What is the result of each of these functions and how might this output be useful?

Click for solution
str(missingparts)
 num [1:4] 1 2 3 NA
summary(missingparts)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
    1.0     1.5     2.0     2.0     2.5     3.0       1 
is.na(missingparts)
[1] FALSE FALSE FALSE  TRUE
missingparts[is.na(missingparts)]  #you can also subset out only the values that are equal to NA. This is not so useful here, but can be useful when you want to isolate rows in a dataframe that have missing values in particular columns.
[1] NA
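
The same tools extend to dataframes. Below is a minimal sketch with a made-up trees_with_gaps dataframe (hypothetical data, not from the lesson) that counts missing values per column, isolates the rows with a missing height, and keeps only the complete rows using base R's complete.cases().

```r
trees_with_gaps <- data.frame(
    treeheight = c(15, NA, 12, 15),
    treetype = c("apple", "walnut", NA, "pear"),
    stringsAsFactors = FALSE
)

# TRUE counts as 1, so sum() gives the number of missing values
missing_heights <- sum(is.na(trees_with_gaps$treeheight))
missing_heights
# [1] 1

# Number of missing values in every column at once
missing_per_column <- colSums(is.na(trees_with_gaps))

# Rows where the height is missing
trees_with_gaps[is.na(trees_with_gaps$treeheight), ]

# Rows with no missing values in any column
complete_rows <- trees_with_gaps[complete.cases(trees_with_gaps), ]
nrow(complete_rows)
# [1] 2
```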

Installing packages

Base R has many useful functions, but where R really shines is in the 22,977 (and counting) packages that you can download to extend R’s functionality.

Let’s install and then load the tidyverse suite of packages. You only need to install a package once, but you have to load the library every time you start a new R session.

# install.packages('tidyverse')
library(tidyverse)

Packages can also be installed by using the “Tools” –> “Install Packages” menu in RStudio.

Loading data files

Let’s start working with some real data. Here we will work with the Open Data DC Urban Forestry Street Trees dataset. First, download the dataset and then we will load it into R. For the tutorial, I will be loading an older version of this file that I have uploaded online. This means our output may look a bit different.

Details on how to read data files from a Windows operating system: intro2r link.

urbantrees <- read.csv("https://maddiebrown.github.io/ANTH630/data/Urban_Forestry_Street_Trees_2024.csv")
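
read.csv() also accepts a path to a file on your own computer. The sketch below writes a tiny two-row file to a temporary location so that it runs anywhere; the file contents and the example_path name are invented for illustration, and in practice you would pass the path to the file you downloaded.

```r
# Create a small stand-in CSV file in a temporary location
example_path <- tempfile(fileext = ".csv")
write.csv(
    data.frame(CMMN_NM = c("Red maple", "Pin oak"), WARD = c(6, 1)),
    example_path,
    row.names = FALSE
)

# Read the file from disk, just as you would with a downloaded dataset
example_trees <- read.csv(example_path)

# Check the number of rows and columns, then look at the first rows
dim(example_trees)
# [1] 2 2
head(example_trees)
```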

Examining data structures

Let’s examine the structure of our dataset.

str(urbantrees)
'data.frame':   211117 obs. of  54 variables:
 $ X               : num  -77 -77 -77 -77 -77 ...
 $ Y               : num  38.9 38.9 38.9 38.9 38.9 ...
 $ SCI_NM          : chr  "Quercus montana" "Acer rubrum" "Quercus robur fastigiata" "Tilia americana" ...
 $ CMMN_NM         : chr  "Rock chestnut oak" "Red maple" "Columnar English oak" "American linden" ...
 $ GENUS_NAME      : chr  "Quercus" "Acer" "Quercus" "Tilia" ...
 $ FAM_NAME        : chr  "Fagaceae" "Sapindaceae" "Fagaceae" "Tiliaceae" ...
 $ DATE_PLANT      : chr  "2018/02/01 18:50:34+00" "" "" "" ...
 $ FACILITYID      : chr  "31982-090-3001-0269-000" "31982-100-3005-0155-000" "10150-300-3001-0050-000" "32691-092-3001-0105-000" ...
 $ VICINITY        : chr  "922 C ST SE" "1017 C ST SE" "3029 15TH ST NW" "904 D ST SE" ...
 $ WARD            : int  6 6 1 6 6 1 6 1 6 1 ...
 $ TBOX_L          : num  99 8 6 9 8 99 12 9 9 12 ...
 $ TBOX_W          : num  7 4 3 4 4 4 4 3 4 5 ...
 $ WIRES           : chr  "None" "None" "None" "None" ...
 $ CURB            : chr  "Permanent" "Permanent" "Permanent" "Permanent" ...
 $ SIDEWALK        : chr  "Permanent" "Permanent" "Permanent" "Permanent" ...
 $ TBOX_STAT       : chr  "Plant" "Plant" "Plant" "Plant" ...
 $ RETIREDDT       : chr  "" "" "" "" ...
 $ DBH             : num  5.7 17.7 10.9 13.4 11.9 9.3 1.6 5.5 24.5 21 ...
 $ DISEASE         : chr  "" "" "" "" ...
 $ PESTS           : chr  "" "" "" "" ...
 $ CONDITION       : chr  "Excellent" "Fair" "Fair" "Good" ...
 $ CONDITIODT      : chr  "2024/02/28 23:57:09+00" "2021/02/17 22:21:46+00" "2021/09/13 18:55:03+00" "2020/02/14 01:33:24+00" ...
 $ OWNERSHIP       : chr  "UFA" "UFA" "UFA" "UFA" ...
 $ TREE_NOTES      : chr  "Elevated street side. Feb 2024." "P dead wood only and r small mulberry at base, be careful of roots" "" "" ...
 $ MBG_WIDTH       : num  13.1 39.4 29.5 29.5 39.4 ...
 $ MBG_LENGTH      : num  19.7 45.9 46.5 45.9 45.9 ...
 $ MBG_ORIENTATION : num  90 90 163 0 90 ...
 $ MAX_CROWN_HEIGHT: num  18.9 45.9 37.4 41.5 32.6 ...
 $ MAX_MEAN        : num  14.3 30.7 21.3 22.6 21.2 ...
 $ MIN_CROWN_BASE  : num  0.0533 -0.1557 -0.2178 0.1589 -0.1809 ...
 $ DTM_MEAN        : num  82.3 81.2 202.9 77 81.1 ...
 $ PERIM           : num  65.6 183.7 170.6 164 177.2 ...
 $ CROWN_AREA      : num  215 1259 743 1130 1119 ...
 $ CICADA_SURVEY   : chr  "" "" "" "" ...
 $ ONEYEARPHOTO    : chr  "" "" "" "" ...
 $ SPECIALPHOTO    : chr  "" "" "" "" ...
 $ PHOTOREMARKS    : chr  "" "" "" "" ...
 $ ELEVATION       : chr  "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ SIGN            : chr  "Unknown" "Unknown" "Unknown" "Unknown" ...
 $ TRRS            : int  NA NA NA NA NA NA NA NA NA NA ...
 $ WARRANTY        : chr  "2017-2018" "Unknown" "Unknown" "Unknown" ...
 $ CREATED_USER    : chr  "" "" "" "" ...
 $ CREATED_DATE    : chr  "" "" "" "" ...
 $ EDITEDBY        : chr  "sward" "jchapman" "mmcphee" "sward" ...
 $ LAST_EDITED_USER: chr  "sward" "jchapman" "mmcphee" "sward" ...
 $ LAST_EDITED_DATE: chr  "2024/02/28 23:57:52+00" "2021/02/17 22:21:47+00" "2021/09/13 18:54:32+00" "2020/02/14 01:34:14+00" ...
 $ GIS_ID          : logi  NA NA NA NA NA NA ...
 $ GLOBALID        : chr  "{0B358D52-AAD4-41AC-B1AF-B19740DBC02A}" "{0F7845B3-E5DE-480B-96EC-B595354BCA5C}" "{EA1C7F1D-8FF6-4A3A-BFBD-0147BABCA5F7}" "{ADB853B2-E32F-4BB4-B949-DE7B5656DCD5}" ...
 $ CREATOR         : logi  NA NA NA NA NA NA ...
 $ CREATED         : logi  NA NA NA NA NA NA ...
 $ EDITOR          : logi  NA NA NA NA NA NA ...
 $ EDITED          : logi  NA NA NA NA NA NA ...
 $ SHAPE           : logi  NA NA NA NA NA NA ...
 $ OBJECTID        : int  40100904 40100905 40100906 40100907 40100908 40100909 40100910 40100911 40100912 40101121 ...

Exploratory data analysis with tidyverse

In tidyverse, the basic operator for linking functions is %>%, known as the pipe operator. It passes the result on its left into the function on its right as the first argument, so we can use it to string many functions together.
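
One way to read a pipe is as a rewritten nested call. The sketch below assumes the dplyr package (part of tidyverse) is installed and uses a small stand-in trees dataframe to show that the nested and piped forms give the same answer.

```r
library(dplyr)

trees <- data.frame(
    treeheight = c(15, 20, 12, 15, 18),
    treetype = c("apple", "walnut", "apple", "hazelnut", "pear")
)

# Nested form: read from the inside out
nested_result <- nrow(head(trees, 3))

# Piped form: read from left to right
piped_result <- trees %>%
    head(3) %>%
    nrow()

identical(nested_result, piped_result)
# [1] TRUE
```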

Subsetting data

The basic function for subsetting columns/variables in tidyverse is select().

urbantrees %>%
    select(CMMN_NM)

The basic function for selecting particular rows is filter().

urbantrees %>%
    filter(CMMN_NM == "Red maple" & DISEASE == "Ganoderma Root Rot")

We can also select all the unique observations within a particular variable. For example, we might be interested in knowing what all the unique ward names are.

urbantrees %>%
    distinct(WARD)
   WARD
1     6
2     1
3     2
4     7
5     8
6     4
7     3
8     5
9    NA
10   10
11    0
12    9
13   88
14   99

We can also ask R to tell us how many distinct values there are within a variable.

n_distinct(urbantrees$FAM_NAME)
[1] 104

Try it

Recalling what we learned about subsetting dataframes, try to complete the following tasks using base R and/or tidyverse.

  1. Select the first 6 observations where the tree genus is Quercus. Hint: Use head().
  2. Show all the unique species names within the family Rosaceae.
  3. The previous line of code showed us that bur oaks are listed as being in Rosaceae. This seems odd. Write code to show all the different family names that are associated with bur oaks.
Click for solution
urbantrees %>%
    filter(GENUS_NAME == "Quercus") %>%
    head()
          X        Y                   SCI_NM              CMMN_NM GENUS_NAME
1 -76.99281 38.88609          Quercus montana    Rock chestnut oak    Quercus
2 -77.03567 38.92727 Quercus robur fastigiata Columnar English oak    Quercus
3 -77.03931 38.92800           Quercus lyrata          Overcup oak    Quercus
4 -77.00198 38.88539        Quercus palustris              Pin oak    Quercus
5 -77.04009 38.93254          Quercus phellos           Willow oak    Quercus
6 -77.04090 38.92535        Quercus palustris              Pin oak    Quercus
  FAM_NAME             DATE_PLANT              FACILITYID          VICINITY
1 Fagaceae 2018/02/01 18:50:34+00 31982-090-3001-0269-000       922 C ST SE
2 Fagaceae                        10150-300-3001-0050-000   3029 15TH ST NW
3 Fagaceae 2011/02/17 05:00:00+00 14582-160-3005-0656-000 1653 HOBART ST NW
4 Fagaceae                        30030-030-3001-0237-000 OPP 319 3RD ST SE
5 Fagaceae                        16890-178-3005-0043-000   1737 PARK RD NW
6 Fagaceae                        15408-165-3005-0467-000 1741 LANIER PL NW
  WARD TBOX_L TBOX_W WIRES      CURB  SIDEWALK TBOX_STAT RETIREDDT  DBH
1    6     99      7  None Permanent Permanent     Plant            5.7
2    1      6      3  None Permanent Permanent     Plant           10.9
3    1     99      4  None Permanent Permanent     Plant            9.3
4    6      9      4  None Permanent Permanent     Plant           24.5
5    1     12      5  None Permanent Permanent     Plant           21.0
6    1     99      5  None Permanent Flexipave     Plant           28.1
    DISEASE PESTS CONDITION             CONDITIODT OWNERSHIP
1                 Excellent 2024/02/28 23:57:09+00       UFA
2                      Fair 2021/09/13 18:55:03+00       UFA
3 Hypoxylon            Dead 2023/05/22 19:49:55+00       UFA
4                      Fair 2020/11/16 21:32:38+00       UFA
5                 Excellent 2022/11/18 21:24:48+00       UFA
6                      Fair 2022/08/18 19:26:54+00       UFA
                                                                                                                                                                                                                      TREE_NOTES
1                                                                                                                                                                                                Elevated street side. Feb 2024.
2                                                                                                                                                                                                                               
3                                                                                                                                                                                                                               
4 Bread loaf-sized Inonatus at base. Three.“Black crust” Kretzschmeria conk, fist-sized, on root flare, edge of sidewalk.  Grew one inch DBH since 2017. Another shelf conk at 15’ up.  Dieback sprinkled thru crown, June 2019.
5                                                                                                                                                                                                                               
6                                                                                                                                                                         P. Beginning of bls potentiallyWash gas disrupted soil
  MBG_WIDTH MBG_LENGTH MBG_ORIENTATION MAX_CROWN_HEIGHT MAX_MEAN MIN_CROWN_BASE
1  13.12336   19.68504        90.00000         18.91814 14.26427     0.05331409
2  29.53926   46.50863       163.30076         37.41346 21.32403    -0.21777124
3  29.52756   32.80840         0.00000         37.61407 18.87533    -0.57492221
4  39.37008   65.61680        90.00000         67.73044 56.71571     0.01390713
5  38.23960   78.39118       150.94540         54.32866 35.59290    -0.13329457
6  55.73578   83.25091        53.74616         61.28306 41.10743    -1.44659974
   DTM_MEAN    PERIM CROWN_AREA CICADA_SURVEY ONEYEARPHOTO SPECIALPHOTO
1  82.26296  65.6168   215.2780                                        
2 202.87526 170.6037   742.7091                                        
3 187.02505 144.3570   688.8896                                        
4  72.45985 216.5354  1668.4045                                        
5 198.98500 249.3438  1636.1128                                        
6 186.02665 295.2756  2755.5584                                        
  PHOTOREMARKS ELEVATION    SIGN TRRS  WARRANTY CREATED_USER CREATED_DATE
1                Unknown Unknown   NA 2017-2018                          
2                Unknown Unknown   NA   Unknown                          
3                Unknown Unknown   NA 2010-2011                          
4                Unknown Unknown   NA   Unknown                          
5                Unknown Unknown   NA   Unknown                          
6                Unknown Unknown   NA                                    
  EDITEDBY LAST_EDITED_USER       LAST_EDITED_DATE GIS_ID
1    sward            sward 2024/02/28 23:57:52+00     NA
2  mmcphee          mmcphee 2021/09/13 18:54:32+00     NA
3  jmiller          jmiller 2023/05/22 19:50:08+00     NA
4    sward            sward 2020/11/16 21:32:41+00     NA
5  jmiller          jmiller 2022/11/18 21:23:51+00     NA
6  mmcphee          mmcphee 2022/08/18 19:26:19+00     NA
                                GLOBALID CREATOR CREATED EDITOR EDITED SHAPE
1 {0B358D52-AAD4-41AC-B1AF-B19740DBC02A}      NA      NA     NA     NA    NA
2 {EA1C7F1D-8FF6-4A3A-BFBD-0147BABCA5F7}      NA      NA     NA     NA    NA
3 {0BEFB0A1-AAF4-4958-849C-CFBFBA3D4E78}      NA      NA     NA     NA    NA
4 {CFA5BDF4-B306-4D54-A501-FCE33E3C5146}      NA      NA     NA     NA    NA
5 {A09B6E85-1A6C-4A13-8011-E93255AEAF21}      NA      NA     NA     NA    NA
6 {4B386921-E1E9-455D-B878-8488AD418224}      NA      NA     NA     NA    NA
  OBJECTID
1 40100904
2 40100906
3 40100909
4 40100912
5 40101121
6 40101124

urbantrees %>%
    filter(FAM_NAME == "Rosaceae") %>%
    distinct(CMMN_NM)
                                CMMN_NM
1                 Bradford callery pear
2                                Cherry
3                 Shadblow serviceberry
4                    Prunus x yedoensis
5                    Cherry (Snowgoose)
6                      Purple leaf plum
7                             Crabapple
8                Alleghany serviceberry
9                        Yoshino cherry
10                          Chokecherry
11                         Okame cherry
12                       Kwanzan cherry
13                   Downy serviceberry
14                                     
15                     Arnold crabapple
16                     Golden rain tree
17                         Serviceberry
18      Autumn brilliance service berry
19               Donald Wyman Crabapple
20                 Adirondack Crabapple
21              Whitehouse callery pear
22               Crimson Cloud hawthorn
23                          Honeylocust
24             Crabapple (Harvest Gold)
25                         Crape myrtle
26                    Radiant crabapple
27                  Washington hawthorn
28                       Eastern redbud
29                     Japanese Apricot
30                     Lavalle hawthorn
31                               Redbud
32                    Other (See Notes)
33                  Snowdrift crabapple
34                   Prunus x yodoensis
35                    American hornbeam
36               Canada Red Chekecherry
37           Winter King Green hawthorn
38                             Blackgum
39                            Hackberry
40                     Snowgoose cherry
41       Ivory Silk Japanese tree lilac
42                        Trident maple
43                     Chinese pistache
44                   Thunder cloud plum
45                         Higan Cherry
46                      Swamp white oak
47                  Kentucky coffeetree
48                    Flowering Dogwood
49                         Silver maple
50                           Yellowwood
51                    Red horsechestnut
52                    Hardy Rubber Tree
53                          Hedge maple
54                          River birch
55          Moonglow Sweet Bay Magnolia
56                                  Elm
57       Autumn Brilliance serviceberry
58                                Lilac
59                   Chinese flame tree
60                    Sweetbay magnolia
61                         Bald cypress
62                         Deodar cedar
63                          Scarlet oak
64 Autumn Brilliance Apple serviceberry
65                       Staghorn sumac
66                     Japanese zelkova
67          Green Vase Japanese zelkova
68                    American sycamore
69                          Chinese elm
70              Shademaster honeylocust
71               Dura heat' river birch
72                  Carolina silverbell
73                     Cornelian Cherry
74                         Black Cherry
75                              Bur oak
76                    Southern magnolia
77                            Tuliptree
78                          Katsuratree
79                            Persimmon
80       Autumn Brilliance Serviceberry
81                 Thunder cloud  plum 
82                             Sweetgum
83                              Red oak
84                           Willow oak
85                    London plane tree

urbantrees %>%
    filter(CMMN_NM == "Bur oak") %>%
    distinct(FAM_NAME)
     FAM_NAME
1    Fagaceae
2   Fagaceae 
3 Sapindaceae
4    Ulmaceae
5            
6    Rosaceae
7        Null

Summarizing data

In order to answer questions about our data, we need to summarize it in various ways. Below are two ways to make a table of the counts of the number of trees that have various diseases.

table(urbantrees$DISEASE)

                    Armillaria Root Rot                 B&B                 BLS 
             209039                  35                   2                 279 
           Butt Rot                 DED  Ganoderma Root Rot           Hypoxylon 
                152                 144                 441                 222 
           jchapman             jconlon             jmiller           mlehtonen 
                  5                   1                  14                   1 
            mmcphee            msampson        None present      Powdery Mildew 
                  1                   4                 191                  31 
           Root Rot               sdoan              smckim               sward 
                 74                   3                   1                   8 
         Trunk Root           Trunk Rot 
                 40                 429 

urbantrees %>%
    group_by(DISEASE) %>%
    count() %>%
    arrange(desc(n))
# A tibble: 22 × 2
# Groups:   DISEASE [22]
   DISEASE                   n
   <chr>                 <int>
 1 ""                   209039
 2 "Ganoderma Root Rot"    441
 3 "Trunk Rot"             429
 4 "BLS"                   279
 5 "Hypoxylon"             222
 6 "None present"          191
 7 "Butt Rot"              152
 8 "DED"                   144
 9 "Root Rot"               74
10 "Trunk Root"             40
# ℹ 12 more rows
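
dplyr's count() can also take the grouping variable and a sort argument directly, which shortens the pipeline above. This sketch assumes dplyr is installed and uses a made-up six-row example_trees dataframe rather than the full dataset.

```r
library(dplyr)

# Hypothetical stand-in for the DISEASE column
example_trees <- data.frame(
    DISEASE = c("", "BLS", "", "Trunk Rot", "BLS", "")
)

# Long form, as in the lesson
long_form <- example_trees %>%
    group_by(DISEASE) %>%
    count() %>%
    arrange(desc(n))

# Short form: count() groups by DISEASE and sorts by n in one step
short_form <- example_trees %>%
    count(DISEASE, sort = TRUE)

short_form$n
# [1] 3 2 1
```

Adding filter(DISEASE != "") before count() would leave out the trees with no disease recorded.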

In tidyverse we can also create new summarized dataframes, such as the one below that tells us the mean crown height of the trees as well as the tallest height. A second summary then looks up the genus of the tallest tree, using which.max() to find the row that holds the maximum height.

urbantrees %>%
    summarise(meanheight = mean(MAX_CROWN_HEIGHT, na.rm = TRUE), maxheight = max(MAX_CROWN_HEIGHT,
        na.rm = TRUE))
  meanheight maxheight
1   36.66681  182.9099

# which.max() returns the position of the tallest tree; indexing rows with
# max() itself would use the height value as a row number
urbantrees %>%
    summarise(tallestgenus = GENUS_NAME[which.max(MAX_CROWN_HEIGHT)])

Try it

  1. Make a table of plant genus counts.
  2. Make a table of the number of trees per ward arranged in descending order. How does this fit into what you know about these wards?
  3. Make a table summarizing the number of hickory and pawpaw trees per ward. Keep both the pawpaw and hickory counts as separate rows. Hint: Use the | operator in your filter() function to keep all rows matching either condition.
Click for solution
urbantrees %>%
    group_by(GENUS_NAME) %>%
    count()

table(urbantrees$GENUS_NAME)

urbantrees %>%
    group_by(WARD) %>%
    count() %>%
    arrange(desc(n))

urbantrees %>%
    group_by(WARD, CMMN_NM) %>%
    filter(CMMN_NM == "Pawpaw" | CMMN_NM == "Hickory") %>%
    count()

ifelse() statements

Another common form of logical testing in R is the ifelse() function. You pass it a logical test followed by two values: wherever the test is TRUE it returns the first value, and wherever the test is FALSE it returns the second. Because ifelse() works element by element on a vector, it can be used to make new variables, subset data, color points on a graph, and much more.
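
As a quick illustration of how ifelse() treats each element separately, here is a minimal sketch on a made-up vector. The heights values and the "tall"/"short" labels are hypothetical.

```r
# Hypothetical tree heights in feet, including one missing value
heights <- c(12, 48, 30, NA, 75)

# ifelse(test, yes, no) checks every element and returns a vector of the same length
sizeclass <- ifelse(heights > 40, "tall", "short")
sizeclass
```

Note that a missing value in the test stays missing in the result, so the fourth element of sizeclass is NA rather than "short".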

Let’s annotate the urban tree data according to whether or not the tree is in fair condition and located in ward 6.

head(ifelse(urbantrees$CONDITION == "Fair" & urbantrees$WARD == "6", "fair tree in ward 6",
    "other"))
[1] "other"               "fair tree in ward 6" "other"              
[4] "other"               "other"               "other"              

# now we can add this to our tree dataset
urbantrees$wardsixfair <- ifelse(urbantrees$CONDITION == "Fair" & urbantrees$WARD ==
    "6", "fair tree in ward 6", "other")

# and take a look at our new variable and double check that it worked as
# intended
urbantrees %>%
    select(CMMN_NM, CONDITION, WARD, wardsixfair) %>%
    head(10)
                   CMMN_NM CONDITION WARD         wardsixfair
1        Rock chestnut oak Excellent    6               other
2                Red maple      Fair    6 fair tree in ward 6
3     Columnar English oak      Fair    1               other
4          American linden      Good    6               other
5             Norway maple      Good    6               other
6              Overcup oak      Dead    1               other
7  Redmond American Linden      Good    6               other
8          New Harmony elm Excellent    1               other
9                  Pin oak      Fair    6 fair tree in ward 6
10              Willow oak Excellent    1               other

Try it

ifelse() statements can also be nested. How might you write code to output the annotation “fair tree in ward 6” for fair trees in ward 6, as well as the annotation “good tree in ward 6” for good trees in ward 6? You can put these ifelse() statements in the same line of code.

Click for solution
ifelse(urbantrees$CONDITION == "Fair" & urbantrees$WARD == "6", "fair tree in ward 6",
    ifelse(urbantrees$CONDITION == "Good" & urbantrees$WARD == "6", "good tree in ward 6",
        "other"))

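Nested ifelse() calls get hard to read as conditions pile up. One alternative, shown here as a sketch rather than as the lesson's own solution, is case_when() from dplyr, which checks conditions in order and uses the first one that matches. The toytrees data frame and the annotation column are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with the same column names as the tree dataset
toytrees <- data.frame(
    CONDITION = c("Fair", "Good", "Good", "Excellent"),
    WARD = c(6, 6, 1, 6)
)

toytrees <- toytrees %>%
    mutate(annotation = case_when(
        CONDITION == "Fair" & WARD == 6 ~ "fair tree in ward 6",
        CONDITION == "Good" & WARD == 6 ~ "good tree in ward 6",
        # TRUE acts as the catch-all for rows that matched nothing above
        TRUE ~ "other"
    ))

toytrees
```
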
Plotting data

For this tutorial, we will use ggplot2 to plot data. In this package, you initialize a ggplot() object and then add aesthetic layers such as color controls, lines, points or text annotations.

First, we will make a basic scatterplot. This shows the perimeter of the crown by the mean crown height. Points are colored according to ward number.

ggplot(urbantrees, aes(PERIM, MAX_MEAN, color = as.factor(WARD))) + geom_point() +
    ggtitle("DC Tree Attributes")

Customizing aesthetics

There are multiple aesthetic parameters that can be customized in ggplots. These include color, fill, linetype, size, shape, font, and more; which ones apply depends on the geom you are working with. We will explore some of these graphical parameters further as this tutorial introduces different geoms. Here is a vignette about aesthetic customization in ggplot2.
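
An aesthetic can either be mapped to a variable inside aes(), so that it changes with the data, or set to one fixed value outside aes(). The sketch below shows both, using R's built-in mtcars dataset as a stand-in; the object names mapped and fixed are just for illustration.

```r
library(ggplot2)

# Mapped aesthetic: color varies with a variable, so it goes inside aes()
mapped <- ggplot(mtcars, aes(x = wt, y = mpg, color = as.factor(cyl))) +
    geom_point()

# Set aesthetic: one fixed value for every point, so it goes outside aes()
fixed <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point(color = "red", size = 3, shape = 17)

mapped
fixed
```

The mapped version gets a legend automatically because ggplot2 needs to explain which color belongs to which group; the fixed version does not.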

Different geoms

There are numerous plot types that can be made with ggplot2. Some examples are included below.

geom_col()
geom_point()
geom_line()
geom_smooth()
geom_histogram()
geom_boxplot()
geom_text()
geom_density()
geom_errorbar()
geom_hline()
geom_abline()
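
Because a geom is just a layer, the same data and aesthetic mappings can be drawn in different ways by swapping or stacking geoms. A minimal sketch, assuming R's built-in iris dataset as a stand-in (the object names are hypothetical):

```r
library(ggplot2)

# Shared starting point: data plus aesthetic mappings, no geom yet
baseplot <- ggplot(iris, aes(x = Species, y = Sepal.Length))

# Same mappings, two different geoms
boxes <- baseplot + geom_boxplot()
dots <- baseplot + geom_point()

# Geoms can be stacked: boxplots with a horizontal reference line on top
layered <- baseplot + geom_boxplot() + geom_hline(yintercept = 6, linetype = "dashed")

# Some geoms need only an x aesthetic
histplot <- ggplot(iris, aes(x = Sepal.Length)) + geom_histogram(bins = 20)

boxes
layered
histplot
```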

Bar plots

Bar plots are great for showing frequencies or proportions across different groups. For instance, we may want to calculate the number of pawpaw trees per ward and then plot this in a bar graph with ggplot2.

npawpawbyward <- urbantrees %>%
    group_by(WARD, CMMN_NM) %>%
    filter(CMMN_NM == "Pawpaw") %>%
    count()

ggplot(npawpawbyward, aes(x = WARD, y = n)) + geom_col()

We can clean up this plot by reordering the wards from lowest to highest number of pawpaw trees. Let’s also add custom x and y axis labels and a title.

ggplot(npawpawbyward, aes(x = reorder(WARD, n), y = n)) +
    geom_col() + labs(x = "Ward", y = "Number of pawpaw trees") + ggtitle("Prevalence of Pawpaw Trees by Ward")
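
To see what reorder() does on its own, it can help to run it outside of a plot. The sketch below uses made-up ward labels and counts (wards and counts are hypothetical); reorder() returns a factor whose levels are sorted by the second argument, and ggplot2 draws the bars in level order.

```r
# Hypothetical ward labels and tree counts
wards <- c("1", "2", "3")
counts <- c(30, 5, 12)

# Levels are sorted from the lowest count to the highest
wards_by_count <- reorder(wards, counts)
levels(wards_by_count)

# Negating the counts sorts from highest to lowest instead
wards_by_count_desc <- reorder(wards, -counts)
levels(wards_by_count_desc)
```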

Try it

  1. Create a new dataframe summarizing the number of red maple trees with each type of disease. Remove all rows with blank cells. Hint: Observations with blank cells are marked by an empty string "".
Click for solution
ndisease <- urbantrees %>%
    group_by(CMMN_NM, DISEASE) %>%
    filter(CMMN_NM == "Red maple" & DISEASE != "") %>%
    count()
  1. Plot the data as a barplot in order from least to most common disease. Hint: Use the reorder() function to control the order of the x axis variable.
  2. Flip the coordinates of the plot using coord_flip().
  3. Finish your plot by adding in x and y axis labels and a title. Hint: Use the labs() and ggtitle() functions.
Click for solution
ggplot(ndisease, aes(x = reorder(DISEASE, n), y = n)) + geom_col() +
    coord_flip() + labs(x = "Disease", y = "Number of trees") + ggtitle("Prevalence of diseases in Red Maples in DC")

Colors

R has many built-in colors. You can view them by using the colors() function.
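
The full list is long, so it is often easier to search it than to read it. A minimal sketch using base R only (the object names allcolors and greens are hypothetical):

```r
# colors() returns a character vector of every named color R recognizes
allcolors <- colors()
length(allcolors)
head(allcolors)

# grep() can narrow the list, here to names containing "green"
greens <- grep("green", allcolors, value = TRUE)
head(greens)

# col2rgb() shows the red, green, and blue components of a named color
col2rgb("red")
```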

Let’s add color to our plot of red maple diseases. You can either set a single fixed color directly in the geom or map color to a variable inside aes().

Try it

  1. In the geom_col() function of your previous plot code, add in colors with both the fill= and color= arguments.
  2. Try setting fill to the DISEASE variable inside the aes() call of your ggplot() function. What happens?
Click for solution
ggplot(ndisease, aes(x = reorder(DISEASE, n), y = n)) + geom_col(fill = "green",
    color = "blue") + coord_flip() + labs(x = "Disease", y = "Number of trees") +
    ggtitle("Prevalence of diseases in Red Maples in DC")


ggplot(ndisease, aes(x = reorder(DISEASE, n), y = n, fill = DISEASE)) +
    geom_col() + coord_flip() + labs(x = "Disease", y = "Number of trees") + ggtitle("Prevalence of diseases in Red Maples in DC")

Interactive mapping with leaflet

As a final introduction to R’s capabilities today, let’s quickly make an interactive map of our urbantrees dataset. If you run names(urbantrees) you will notice that there are X and Y variables that give the spatial locations of the trees. We can use these in combination with Leaflet to make a map.

library(leaflet)

pawpaws <- urbantrees %>%
    filter(CMMN_NM == "Pawpaw")

leaflet() %>%
    addTiles() %>%
    addMarkers(lng = pawpaws$X, lat = pawpaws$Y, popup = paste("<B>Name: </B>", pawpaws$CMMN_NM,
        "<br>", "<B>Condition: </B>", pawpaws$CONDITION, sep = ""))