This lesson will introduce you to the R programming language, including how to conduct basic statistical analyses and create beautiful visualizations. We’ll focus on the basics of how to use R as a language and then showcase some of the powerful features of R that may be of interest to anthropologists and interdisciplinary researchers.
To install RStudio, click here and choose the Open Source License.
To install R, visit CRAN and choose the current version of R for your operating system.
R works by typing out a function, putting something into the
function, and then running the code to see the result. For example, if
you want to make a simple plot, you use the function plot()
with the data that you wish to plot listed inside the function’s
parentheses. Here, let’s use cars, a dataset built into R, to quickly make a
scatterplot.
plot(cars)
The syntax of a line of R code tries to be as human readable as possible. For example, if we want the dots in our plot to be red, we can add an argument to our function that tells R to make the points red.
plot(cars, col = "red")
While you can directly enter data each time you want to use it, R’s
power comes from assigning data to named objects or variables. Objects
are assigned using the <- symbol, which means
“everything on the right is now referred to by the object name on the
left.” We can refer to this symbol as an assignment operator or
becomes.
Let’s make an object or variable called treeheight with
the value 15. Then run the line of code. What is the
output?
treeheight <- 15
The line above only creates the object, but doesn’t show us the
result. To see what the object treeheight is equal to, you
next have to call the object.
treeheight
[1] 15
You can also create variables using = but this is
generally discouraged as a practice. This is because = is
too easily confused with ==, which has a different meaning.
A single = means that whatever is on the left hand side is
now equal to the value on the right. A double equals sign instead asks R
to test whether the value on the left is equal to the
value on the right, an equality test. The output is a logical
vector.
We will learn more about logical tests later, but for now, let’s look at these examples.
5 == 5 #note that a double equals sign checks for equivalency in R
[1] TRUE
5 == 6 #Comments in R are prefaced by a hashtag (#). This tells R not to run this line of code, and that it is for your reference only.
[1] FALSE
# 5=6 # Why doesn't this last line work?
Test whether treeheight is greater than 10.
Test whether treeheight=="treeheight". What is the result?
Set treeheight to 20.
What happens when you call bread?
treeheight > 10
[1] TRUE
treeheight == "treeheight"
[1] FALSE
treeheight + 3
[1] 18
#'treeheight'+3 # why doesn't this work? Note: I have these lines of code commented out to keep the document compiling properly
treeheight <- 10
# bread
R is case sensitive, unlike some other languages. Keeping this in mind helps you keep track of different variables, and forgetting it is a common cause of coding errors. For example, we can create three different objects referring to trees by changing the capitalization.
Tree <- "tree"
TREE <- "tree again"
# tree # why doesn't this work?
tree <- "a third tree"
# look at results
Tree
[1] "tree"
TREE
[1] "tree again"
tree
[1] "a third tree"
Above, we made a treeheight object that has one value in
it. When working with real data, we often have multiple values that we
want to analyze. In this case, we make a vector, or one dimensional data
object, to list multiple values. For the treeheight object,
we can use the c() function to add multiple tree
heights.
treeheight <- c(15, 20, 12, 15, 18)
treeheight
[1] 15 20 12 15 18
Notice that the value of treeheight has been overwritten
with the new vector.
Whenever you encounter a new function or want to look up how to use a
function, you can refer to the help file. What does c()
do?
?c  # opens the help file for c(); equivalent to help("c")
Vectors can also be made using nested functions. For example, we can populate a vector of tree heights using the rep() (repeat) and seq() (sequence) functions, or the : shorthand for a sequence of whole numbers.
treeheight <- c(10, rep(15, 3), 20, 20)
treeheight
[1] 10 15 15 15 20 20
treeheight <- c(10, seq(10, 15), 20, 20)
treeheight
[1] 10 10 11 12 13 14 15 20 20
treeheight <- c(10, 10:15, 20, 20)
treeheight
[1] 10 10 11 12 13 14 15 20 20
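Both seq() and rep() take further arguments that control the pattern they generate. A brief sketch, using arbitrary example values rather than data from this lesson:

```r
# seq() can count in steps with the by argument
seq(10, 20, by = 5)
# [1] 10 15 20

# rep() can repeat a whole vector (times) or each element in turn (each)
rep(c(15, 20), times = 2)
# [1] 15 20 15 20
rep(c(15, 20), each = 2)
# [1] 15 15 20 20
```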
You can select individual elements from a vector:
treeheight[2]
[1] 10
treeheight[c(2, 3)]
[1] 10 11
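Square brackets also accept negative positions and logical tests. A small sketch, using a copy of the vector above under the hypothetical name heights so that treeheight is left untouched:

```r
heights <- c(10, 10, 11, 12, 13, 14, 15, 20, 20)  # same values as treeheight above

heights[-1]            # a negative index drops that position
# [1] 10 11 12 13 14 15 20 20
heights[heights > 14]  # a logical test keeps only the elements where it is TRUE
# [1] 15 20 20
```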
Don’t forget that R is case sensitive. What happens when you enter
Treeheight?
Treeheight
You can add new values to an existing vector:
treeheight <- c(treeheight, 50)
treeheight
[1] 10 10 11 12 13 14 15 20 20 50
Let’s go back to our original treeheight object.
treeheight <- c(15, 20, 12, 15, 18)
treeheight
[1] 15 20 12 15 18
Now that we have a vector of tree heights, we can run some basic statistics on this dataset.
mean(treeheight)
[1] 16
median(treeheight)
[1] 15
max(treeheight)
[1] 20
range(treeheight)
[1] 12 20
summary(treeheight)
Min. 1st Qu. Median Mean 3rd Qu. Max.
12 15 15 16 18 20
str(treeheight)
num [1:5] 15 20 12 15 18
In the last function, notice that R tells us this object is a numeric
data class. We can also look at the data class with the class()
function. In addition, R has built in checks for different classes, such
as is.numeric().
class(treeheight)
[1] "numeric"
is.numeric(treeheight)
[1] TRUE
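Measures of spread work the same way as the mean and median above. A quick sketch, with the five heights re-entered under the hypothetical name heights so the block runs on its own:

```r
heights <- c(15, 20, 12, 15, 18)

var(heights)       # sample variance: sum of squared deviations from the mean / (n - 1)
# [1] 9.5
sd(heights)        # standard deviation: the square root of the variance
quantile(heights)  # minimum, quartiles, and maximum in one call
```

Here the mean is 16, the squared deviations are 1, 16, 16, 1, and 4, and their sum (38) divided by n - 1 = 4 gives 9.5.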
Primary data types in R include: 1. numeric, 2. string or character,
3. logical, and 4. factor. Note that output data types do not always
match the input data. String data are entered with "" or
'' surrounding the characters.
Vectors can have either string, logical, or numeric data; but only
one class of element per vector. To illustrate, let’s re-assign our
treeheight object to a mix of value types. To
input a string value, we need to use "" around the
value.
treeheight <- c(10, 14, "twelve", 20)
treeheight
class(treeheight)
Note that R will automatically convert the numeric data into string data. This is called “coercion”.
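Coercion can be reversed with as.numeric(), with one catch: any element that cannot be read as a number becomes NA (R's marker for a missing value), and R issues a warning. A minimal sketch using a hypothetical vector named mixed:

```r
mixed <- c(10, 14, "twelve", 20)
class(mixed)
# [1] "character"

# "twelve" cannot be parsed as a number, so it becomes NA.
# suppressWarnings() hides the warning R would otherwise print here.
converted <- suppressWarnings(as.numeric(mixed))
converted
# [1] 10 14 NA 20
```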
First let’s make a few different objects and then examine them with
class(). Try to predict in advance which data class each
object will be.
tree <- "a third tree"
x <- TRUE
y <- "5"
z <- 5
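One way to check your predictions is to call class() on each object. The sketch below re-creates the four objects so it runs on its own:

```r
tree <- "a third tree"
x <- TRUE
y <- "5"
z <- 5

class(tree)
# [1] "character"
class(x)
# [1] "logical"
class(y)  # the quotation marks make this a string, not a number
# [1] "character"
class(z)
# [1] "numeric"
```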
So far we have been working with vectors, or one-dimensional data. We can also work with dataframes, which are like tables or spreadsheets of multiple connected variables. While data can also be stored in lists and matrices, the most common and flexible data format you will use in R is the dataframe. A dataframe can contain multiple classes of data, but only one class per column, since each column is a vector. Dataframes are usually organized with each row representing a single case or observation. Columns denote variables that apply across rows.
For example, let’s make a new trees dataframe that
includes the heights, species, and products of different trees.
Make a treeheight vector with the values: 15, 20,
12, 15, and 18. Keep this order. Hint: Use the c()
function.
Make a treetype vector with the values: apple,
walnut, apple, hazelnut, and pear. Keep this order.
Make a treeproduct vector with the values
fruit and nut that matches the fruits and nuts in the
same order as the treetype vector.
Use the data.frame() function to make a new
dataframe called trees that includes the three variable
vectors you just created.
Look at the structure of your new dataframe using the
str() function.
Solution:
treeheight <- c(15, 20, 12, 15, 18)
treetype <- c("apple", "walnut", "apple", "hazelnut", "pear")
treeproduct <- c("fruit", "nut", "fruit", "nut", "fruit")
trees <- data.frame(treeheight, treetype, treeproduct)
trees
str(trees)
In wide format, each row in a dataframe is a case, while the columns
are variables that are measures for each case. To select a variable in a
dataframe, you use the $ operator.
Call the treetype column using the $
operator. What data class is it?
trees$treetype
class(trees$treetype)
We can also examine the spread of the data by making a histogram and
selecting the treeheight variable.
hist(trees$treeheight)
We can also select multiple variables at once, for example to tabulate how many trees of each height yield each tree product.
table(trees$treeheight, trees$treeproduct)
fruit nut
12 1 0
15 1 1
18 1 0
20 0 1
Dataframes can be subset using the format
dfname[row#,col#], or by calling columns by name.
trees[1, 1]
trees[, 3]
trees[2, ]
trees[, "treetype"]
You can also subset dataframes based on logical tests. Let’s look at all the tree types for the trees over 15ft tall. Then let’s examine all the columns for any rows where the tree product is fruit.
trees[trees$treeheight > 15, c("treetype")]
trees[trees$treeheight > 15, 2]
# Trees[trees$product==fruit,]
What’s wrong with this last line of code? (Hint: 3 things)
trees[trees$treeproduct == "fruit", ]
treeheight treetype treeproduct
1 15 apple fruit
3 12 apple fruit
5 18 pear fruit
Using the != operator, we can also select all rows
that are ‘not equal’ to a given value. Select all the rows where
the tree product is not fruit.
trees[trees$treeproduct != "fruit", ]
treeheight treetype treeproduct
2 20 walnut nut
4 15 hazelnut nut
Using the | operator in between logical checks, we can
also select all rows that meet one condition or another
condition. Select all the rows where the tree type is either apple
or pear. Hint: Run each check for tree type individually, then connect
them with the | operator.
trees[trees$treetype == "apple" | trees$treetype == "pear", ]
treeheight treetype treeproduct
1 15 apple fruit
3 12 apple fruit
5 18 pear fruit
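When checking one variable against several values, base R's %in% operator is a compact alternative to chaining == tests with |, and & keeps only rows where both tests are true. A sketch that rebuilds the small trees dataframe so it runs on its own (fruit_rows and tall_fruit are hypothetical names):

```r
trees <- data.frame(
  treeheight = c(15, 20, 12, 15, 18),
  treetype = c("apple", "walnut", "apple", "hazelnut", "pear"),
  treeproduct = c("fruit", "nut", "fruit", "nut", "fruit")
)

# Rows where the tree type is apple or pear
fruit_rows <- trees[trees$treetype %in% c("apple", "pear"), ]

# & keeps rows where both tests are TRUE: fruit trees taller than 12
tall_fruit <- trees[trees$treeproduct == "fruit" & trees$treeheight > 12, ]
tall_fruit$treetype
# [1] "apple" "pear"
```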
We can also run calculations on whole vectors/variables at once. Keep in mind that when two vectors differ in length, R recycles the shorter one, repeating its values until it matches the length of the longer one. R only warns about this when the longer length is not a multiple of the shorter length, so recycling can easily go unnoticed.
trees$treeheight
[1] 15 20 12 15 18
trees$treeheight/2
[1] 7.5 10.0 6.0 7.5 9.0
trees$treeheight/c(10, 1) # how does R treat the two vectors during this operation?
[1] 1.5 20.0 1.2 15.0 1.8
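The recycling rule can be seen directly with short vectors. In this sketch (arbitrary example values), the shorter vector is reused silently when its length divides the longer one evenly, and with a warning when it does not:

```r
# Length 4 and length 2: c(10, 1) is reused twice, no warning
c(20, 30, 40, 50) / c(10, 1)
# [1]  2 30  4 50

# Length 5 and length 2: R still recycles, but warns that the longer
# length is not a multiple of the shorter one. tryCatch() captures that
# warning so we can inspect it.
result <- tryCatch(
  c(15, 20, 12, 15, 18) / c(10, 1),
  warning = function(w) conditionMessage(w)
)
result  # holds the warning message rather than the numbers
```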
There are several ways of quickly assessing the basic attributes of a character vector/variable.
unique(trees$treetype) # returns the name of each unique type of tree
[1] "apple" "walnut" "hazelnut" "pear"
length(unique(trees$treetype)) # how many unique tree types are there?
[1] 4
table(trees$treetype) # returns a table of the number of times each tree type appears in the dataframe
apple hazelnut pear walnut
2 1 1 1
In an ideal world, every data cell would be filled in every data table…but this is rarely the case. Sometimes (ok, frequently) we encounter missing values. But what is a missing value and how does R deal with them? How do you know a missing value when you see it?
R codes missing values as NA (not “NA” which is a
character/string element). Having missing values in a dataframe can
cause some functions to fail. Check out the following example.
missingparts <- c(1, 2, 3, NA)
mean(missingparts) # what is the result?
[1] NA
mean(missingparts, na.rm = TRUE) # we can tell the function to ignore any NA values in the data
[1] 2
How do you know you have missing values rather than another issue in
your code? There are a few functions that allow us to pick out the NAs.
Try examining the missingparts vector with
str(), summary(), and is.na().
What is the result of each of these functions and how might this output
be useful?
str(missingparts)
num [1:4] 1 2 3 NA
summary(missingparts)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.0 1.5 2.0 2.0 2.5 3.0 1
is.na(missingparts)
[1] FALSE FALSE FALSE TRUE
missingparts[is.na(missingparts)] #you can also subset out only the values that are equal to NA. This is not so useful here, but can be useful when you want to isolate rows in a dataframe that have missing values in particular columns.
[1] NA
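The same tools extend to dataframes. A minimal sketch with a hypothetical dataframe named plots, showing how to count missing values in a column and keep only the complete rows:

```r
plots <- data.frame(
  plot_id = c("a", "b", "c", "d"),
  height = c(12, NA, 18, 20)
)

sum(is.na(plots$height))      # TRUE counts as 1, so this counts the NAs
# [1] 1
plots[is.na(plots$height), ]  # rows where height is missing

# complete.cases() is TRUE for rows with no NA in any column
complete_plots <- plots[complete.cases(plots), ]
nrow(complete_plots)
# [1] 3
```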
Base R has many useful functions, but where R really shines is through the 22,977 (and counting) packages that you can download to enhance R’s functionality.
Let’s install and then load the tidyverse suite of
packages. You only need to install a package once, but you have to load
the library every time you start a new R session.
# install.packages('tidyverse')
library(tidyverse)
Packages can also be installed by using the “Tools” –> “Install Packages” menu in RStudio.
Let’s start working with some real data. Here we will work with the Open Data DC Urban Forestry Street Trees dataset. First, download the dataset and then we will load it into R. For the tutorial, I will be loading an older version of this file that I have uploaded online. This means our output may look a bit different.
Details on how to read data files from a Windows operating system: intro2r link.
urbantrees <- read.csv("https://maddiebrown.github.io/ANTH630/data/Urban_Forestry_Street_Trees_2024.csv")
Let’s examine the structure of our dataset.
str(urbantrees)
'data.frame': 211117 obs. of 54 variables:
$ X : num -77 -77 -77 -77 -77 ...
$ Y : num 38.9 38.9 38.9 38.9 38.9 ...
$ SCI_NM : chr "Quercus montana" "Acer rubrum" "Quercus robur fastigiata" "Tilia americana" ...
$ CMMN_NM : chr "Rock chestnut oak" "Red maple" "Columnar English oak" "American linden" ...
$ GENUS_NAME : chr "Quercus" "Acer" "Quercus" "Tilia" ...
$ FAM_NAME : chr "Fagaceae" "Sapindaceae" "Fagaceae" "Tiliaceae" ...
$ DATE_PLANT : chr "2018/02/01 18:50:34+00" "" "" "" ...
$ FACILITYID : chr "31982-090-3001-0269-000" "31982-100-3005-0155-000" "10150-300-3001-0050-000" "32691-092-3001-0105-000" ...
$ VICINITY : chr "922 C ST SE" "1017 C ST SE" "3029 15TH ST NW" "904 D ST SE" ...
$ WARD : int 6 6 1 6 6 1 6 1 6 1 ...
$ TBOX_L : num 99 8 6 9 8 99 12 9 9 12 ...
$ TBOX_W : num 7 4 3 4 4 4 4 3 4 5 ...
$ WIRES : chr "None" "None" "None" "None" ...
$ CURB : chr "Permanent" "Permanent" "Permanent" "Permanent" ...
$ SIDEWALK : chr "Permanent" "Permanent" "Permanent" "Permanent" ...
$ TBOX_STAT : chr "Plant" "Plant" "Plant" "Plant" ...
$ RETIREDDT : chr "" "" "" "" ...
$ DBH : num 5.7 17.7 10.9 13.4 11.9 9.3 1.6 5.5 24.5 21 ...
$ DISEASE : chr "" "" "" "" ...
$ PESTS : chr "" "" "" "" ...
$ CONDITION : chr "Excellent" "Fair" "Fair" "Good" ...
$ CONDITIODT : chr "2024/02/28 23:57:09+00" "2021/02/17 22:21:46+00" "2021/09/13 18:55:03+00" "2020/02/14 01:33:24+00" ...
$ OWNERSHIP : chr "UFA" "UFA" "UFA" "UFA" ...
$ TREE_NOTES : chr "Elevated street side. Feb 2024." "P dead wood only and r small mulberry at base, be careful of roots" "" "" ...
$ MBG_WIDTH : num 13.1 39.4 29.5 29.5 39.4 ...
$ MBG_LENGTH : num 19.7 45.9 46.5 45.9 45.9 ...
$ MBG_ORIENTATION : num 90 90 163 0 90 ...
$ MAX_CROWN_HEIGHT: num 18.9 45.9 37.4 41.5 32.6 ...
$ MAX_MEAN : num 14.3 30.7 21.3 22.6 21.2 ...
$ MIN_CROWN_BASE : num 0.0533 -0.1557 -0.2178 0.1589 -0.1809 ...
$ DTM_MEAN : num 82.3 81.2 202.9 77 81.1 ...
$ PERIM : num 65.6 183.7 170.6 164 177.2 ...
$ CROWN_AREA : num 215 1259 743 1130 1119 ...
$ CICADA_SURVEY : chr "" "" "" "" ...
$ ONEYEARPHOTO : chr "" "" "" "" ...
$ SPECIALPHOTO : chr "" "" "" "" ...
$ PHOTOREMARKS : chr "" "" "" "" ...
$ ELEVATION : chr "Unknown" "Unknown" "Unknown" "Unknown" ...
$ SIGN : chr "Unknown" "Unknown" "Unknown" "Unknown" ...
$ TRRS : int NA NA NA NA NA NA NA NA NA NA ...
$ WARRANTY : chr "2017-2018" "Unknown" "Unknown" "Unknown" ...
$ CREATED_USER : chr "" "" "" "" ...
$ CREATED_DATE : chr "" "" "" "" ...
$ EDITEDBY : chr "sward" "jchapman" "mmcphee" "sward" ...
$ LAST_EDITED_USER: chr "sward" "jchapman" "mmcphee" "sward" ...
$ LAST_EDITED_DATE: chr "2024/02/28 23:57:52+00" "2021/02/17 22:21:47+00" "2021/09/13 18:54:32+00" "2020/02/14 01:34:14+00" ...
$ GIS_ID : logi NA NA NA NA NA NA ...
$ GLOBALID : chr "{0B358D52-AAD4-41AC-B1AF-B19740DBC02A}" "{0F7845B3-E5DE-480B-96EC-B595354BCA5C}" "{EA1C7F1D-8FF6-4A3A-BFBD-0147BABCA5F7}" "{ADB853B2-E32F-4BB4-B949-DE7B5656DCD5}" ...
$ CREATOR : logi NA NA NA NA NA NA ...
$ CREATED : logi NA NA NA NA NA NA ...
$ EDITOR : logi NA NA NA NA NA NA ...
$ EDITED : logi NA NA NA NA NA NA ...
$ SHAPE : logi NA NA NA NA NA NA ...
$ OBJECTID : int 40100904 40100905 40100906 40100907 40100908 40100909 40100910 40100911 40100912 40101121 ...
In tidyverse, the basic operator for linking functions is
%>%, the pipe operator. It passes the result on its left to the function on its right, so we can use it to string many
functions together.
The basic function for subsetting columns/variables in tidyverse is
select().
urbantrees %>%
select(CMMN_NM)
The basic function for selecting particular rows is
filter().
urbantrees %>%
filter(CMMN_NM == "Red maple" & DISEASE == "Ganoderma Root Rot")
We can also select all the unique observations within a particular variable. For example, we might be interested in knowing what all the unique ward names are.
urbantrees %>%
distinct(WARD)
WARD
1 6
2 1
3 2
4 7
5 8
6 4
7 3
8 5
9 NA
10 10
11 0
12 9
13 88
14 99
We can also ask R to tell us how many distinct values there are within a variable.
n_distinct(urbantrees$FAM_NAME)
[1] 104
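Each of these tidyverse verbs has a rough base R counterpart. The sketch below uses the small trees dataframe from earlier (rebuilt here so the block runs on its own) rather than urbantrees:

```r
trees <- data.frame(
  treeheight = c(15, 20, 12, 15, 18),
  treetype = c("apple", "walnut", "apple", "hazelnut", "pear"),
  treeproduct = c("fruit", "nut", "fruit", "nut", "fruit")
)

# Like select(treetype): drop = FALSE keeps the result as a dataframe
trees[, "treetype", drop = FALSE]

# Like filter(): keep rows that pass both tests
subset(trees, treetype == "apple" & treeheight > 12)

# Like distinct(), except that the result is a vector
unique(trees$treetype)

# Like n_distinct()
length(unique(trees$treetype))
# [1] 4
```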
Recalling what we learned about subsetting dataframes, try to
complete the following tasks using base R and/or
tidyverse.
Filter the trees in the genus Quercus and view the first six rows with head().
urbantrees %>%
filter(GENUS_NAME == "Quercus") %>%
head()
X Y SCI_NM CMMN_NM GENUS_NAME
1 -76.99281 38.88609 Quercus montana Rock chestnut oak Quercus
2 -77.03567 38.92727 Quercus robur fastigiata Columnar English oak Quercus
3 -77.03931 38.92800 Quercus lyrata Overcup oak Quercus
4 -77.00198 38.88539 Quercus palustris Pin oak Quercus
5 -77.04009 38.93254 Quercus phellos Willow oak Quercus
6 -77.04090 38.92535 Quercus palustris Pin oak Quercus
FAM_NAME DATE_PLANT FACILITYID VICINITY
1 Fagaceae 2018/02/01 18:50:34+00 31982-090-3001-0269-000 922 C ST SE
2 Fagaceae 10150-300-3001-0050-000 3029 15TH ST NW
3 Fagaceae 2011/02/17 05:00:00+00 14582-160-3005-0656-000 1653 HOBART ST NW
4 Fagaceae 30030-030-3001-0237-000 OPP 319 3RD ST SE
5 Fagaceae 16890-178-3005-0043-000 1737 PARK RD NW
6 Fagaceae 15408-165-3005-0467-000 1741 LANIER PL NW
WARD TBOX_L TBOX_W WIRES CURB SIDEWALK TBOX_STAT RETIREDDT DBH
1 6 99 7 None Permanent Permanent Plant 5.7
2 1 6 3 None Permanent Permanent Plant 10.9
3 1 99 4 None Permanent Permanent Plant 9.3
4 6 9 4 None Permanent Permanent Plant 24.5
5 1 12 5 None Permanent Permanent Plant 21.0
6 1 99 5 None Permanent Flexipave Plant 28.1
DISEASE PESTS CONDITION CONDITIODT OWNERSHIP
1 Excellent 2024/02/28 23:57:09+00 UFA
2 Fair 2021/09/13 18:55:03+00 UFA
3 Hypoxylon Dead 2023/05/22 19:49:55+00 UFA
4 Fair 2020/11/16 21:32:38+00 UFA
5 Excellent 2022/11/18 21:24:48+00 UFA
6 Fair 2022/08/18 19:26:54+00 UFA
TREE_NOTES
1 Elevated street side. Feb 2024.
2
3
4 Bread loaf-sized Inonatus at base. Three.“Black crust” Kretzschmeria conk, fist-sized, on root flare, edge of sidewalk. Grew one inch DBH since 2017. Another shelf conk at 15’ up. Dieback sprinkled thru crown, June 2019.
5
6 P. Beginning of bls potentiallyWash gas disrupted soil
MBG_WIDTH MBG_LENGTH MBG_ORIENTATION MAX_CROWN_HEIGHT MAX_MEAN MIN_CROWN_BASE
1 13.12336 19.68504 90.00000 18.91814 14.26427 0.05331409
2 29.53926 46.50863 163.30076 37.41346 21.32403 -0.21777124
3 29.52756 32.80840 0.00000 37.61407 18.87533 -0.57492221
4 39.37008 65.61680 90.00000 67.73044 56.71571 0.01390713
5 38.23960 78.39118 150.94540 54.32866 35.59290 -0.13329457
6 55.73578 83.25091 53.74616 61.28306 41.10743 -1.44659974
DTM_MEAN PERIM CROWN_AREA CICADA_SURVEY ONEYEARPHOTO SPECIALPHOTO
1 82.26296 65.6168 215.2780
2 202.87526 170.6037 742.7091
3 187.02505 144.3570 688.8896
4 72.45985 216.5354 1668.4045
5 198.98500 249.3438 1636.1128
6 186.02665 295.2756 2755.5584
PHOTOREMARKS ELEVATION SIGN TRRS WARRANTY CREATED_USER CREATED_DATE
1 Unknown Unknown NA 2017-2018
2 Unknown Unknown NA Unknown
3 Unknown Unknown NA 2010-2011
4 Unknown Unknown NA Unknown
5 Unknown Unknown NA Unknown
6 Unknown Unknown NA
EDITEDBY LAST_EDITED_USER LAST_EDITED_DATE GIS_ID
1 sward sward 2024/02/28 23:57:52+00 NA
2 mmcphee mmcphee 2021/09/13 18:54:32+00 NA
3 jmiller jmiller 2023/05/22 19:50:08+00 NA
4 sward sward 2020/11/16 21:32:41+00 NA
5 jmiller jmiller 2022/11/18 21:23:51+00 NA
6 mmcphee mmcphee 2022/08/18 19:26:19+00 NA
GLOBALID CREATOR CREATED EDITOR EDITED SHAPE
1 {0B358D52-AAD4-41AC-B1AF-B19740DBC02A} NA NA NA NA NA
2 {EA1C7F1D-8FF6-4A3A-BFBD-0147BABCA5F7} NA NA NA NA NA
3 {0BEFB0A1-AAF4-4958-849C-CFBFBA3D4E78} NA NA NA NA NA
4 {CFA5BDF4-B306-4D54-A501-FCE33E3C5146} NA NA NA NA NA
5 {A09B6E85-1A6C-4A13-8011-E93255AEAF21} NA NA NA NA NA
6 {4B386921-E1E9-455D-B878-8488AD418224} NA NA NA NA NA
OBJECTID
1 40100904
2 40100906
3 40100909
4 40100912
5 40101121
6 40101124
Find the distinct common names recorded for trees in the family Rosaceae.
urbantrees %>%
filter(FAM_NAME == "Rosaceae") %>%
distinct(CMMN_NM)
CMMN_NM
1 Bradford callery pear
2 Cherry
3 Shadblow serviceberry
4 Prunus x yedoensis
5 Cherry (Snowgoose)
6 Purple leaf plum
7 Crabapple
8 Alleghany serviceberry
9 Yoshino cherry
10 Chokecherry
11 Okame cherry
12 Kwanzan cherry
13 Downy serviceberry
14
15 Arnold crabapple
16 Golden rain tree
17 Serviceberry
18 Autumn brilliance service berry
19 Donald Wyman Crabapple
20 Adirondack Crabapple
21 Whitehouse callery pear
22 Crimson Cloud hawthorn
23 Honeylocust
24 Crabapple (Harvest Gold)
25 Crape myrtle
26 Radiant crabapple
27 Washington hawthorn
28 Eastern redbud
29 Japanese Apricot
30 Lavalle hawthorn
31 Redbud
32 Other (See Notes)
33 Snowdrift crabapple
34 Prunus x yodoensis
35 American hornbeam
36 Canada Red Chekecherry
37 Winter King Green hawthorn
38 Blackgum
39 Hackberry
40 Snowgoose cherry
41 Ivory Silk Japanese tree lilac
42 Trident maple
43 Chinese pistache
44 Thunder cloud plum
45 Higan Cherry
46 Swamp white oak
47 Kentucky coffeetree
48 Flowering Dogwood
49 Silver maple
50 Yellowwood
51 Red horsechestnut
52 Hardy Rubber Tree
53 Hedge maple
54 River birch
55 Moonglow Sweet Bay Magnolia
56 Elm
57 Autumn Brilliance serviceberry
58 Lilac
59 Chinese flame tree
60 Sweetbay magnolia
61 Bald cypress
62 Deodar cedar
63 Scarlet oak
64 Autumn Brilliance Apple serviceberry
65 Staghorn sumac
66 Japanese zelkova
67 Green Vase Japanese zelkova
68 American sycamore
69 Chinese elm
70 Shademaster honeylocust
71 Dura heat' river birch
72 Carolina silverbell
73 Cornelian Cherry
74 Black Cherry
75 Bur oak
76 Southern magnolia
77 Tuliptree
78 Katsuratree
79 Persimmon
80 Autumn Brilliance Serviceberry
81 Thunder cloud plum
82 Sweetgum
83 Red oak
84 Willow oak
85 London plane tree
Find the distinct family names recorded for trees with the common name Bur oak.
urbantrees %>%
filter(CMMN_NM == "Bur oak") %>%
distinct(FAM_NAME)
FAM_NAME
1 Fagaceae
2 Fagaceae
3 Sapindaceae
4 Ulmaceae
5
6 Rosaceae
7 Null
In order to answer questions about our data, we need to summarize it in various ways. Below are two ways to count the number of trees recorded with each disease.
table(urbantrees$DISEASE)
Armillaria Root Rot B&B BLS
209039 35 2 279
Butt Rot DED Ganoderma Root Rot Hypoxylon
152 144 441 222
jchapman jconlon jmiller mlehtonen
5 1 14 1
mmcphee msampson None present Powdery Mildew
1 4 191 31
Root Rot sdoan smckim sward
74 3 1 8
Trunk Root Trunk Rot
40 429
urbantrees %>%
group_by(DISEASE) %>%
count() %>%
arrange(desc(n))
# A tibble: 22 × 2
# Groups: DISEASE [22]
DISEASE n
<chr> <int>
1 "" 209039
2 "Ganoderma Root Rot" 441
3 "Trunk Rot" 429
4 "BLS" 279
5 "Hypoxylon" 222
6 "None present" 191
7 "Butt Rot" 152
8 "DED" 144
9 "Root Rot" 74
10 "Trunk Root" 40
# ℹ 12 more rows
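For comparison, a sorted count can also be produced in base R by wrapping table() in sort(). A sketch with a hypothetical vector of disease labels, not the real urbantrees data:

```r
disease <- c("BLS", "Trunk Rot", "BLS", "", "DED", "BLS", "Trunk Rot")

# table() counts each label; sort() puts the largest counts first
counts <- sort(table(disease), decreasing = TRUE)
counts

names(counts)[1]  # the most common label
# [1] "BLS"
```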
In tidyverse we can also create new summarized
dataframes, such as the one below that tells us the mean height of the
trees and the tallest height. We can then look up the genus of the
tallest tree. Note that max() returns a height rather than a row position, so
which.max() is used to find the row of the tallest tree.
urbantrees %>%
summarise(meanheight = mean(MAX_CROWN_HEIGHT, na.rm = TRUE), maxheight = max(MAX_CROWN_HEIGHT,
na.rm = TRUE))
meanheight maxheight
1 36.66681 182.9099
# which.max() returns the position of the largest value, skipping NAs
urbantrees %>%
summarise(tallestgenus = GENUS_NAME[which.max(MAX_CROWN_HEIGHT)])
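When looking up which row holds the largest value, keep in mind that max() returns the value itself, while which.max() returns its position. A minimal sketch with a hypothetical dataframe named toy:

```r
toy <- data.frame(
  genus = c("Quercus", "Acer", "Ulmus"),
  height = c(40, NA, 95)
)

max(toy$height, na.rm = TRUE)  # the tallest height
# [1] 95
which.max(toy$height)          # the row that holds it (NAs are skipped)
# [1] 3
toy$genus[which.max(toy$height)]
# [1] "Ulmus"
```

Using the height itself as a row index would ask for row 95 of a three-row dataframe, which is why the position is needed.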
Hint: Use the | operator in your filter() function
to keep all rows matching either condition.
urbantrees %>%
group_by(GENUS_NAME) %>%
count()
table(urbantrees$GENUS_NAME)
urbantrees %>%
group_by(WARD) %>%
count() %>%
arrange(desc(n))
urbantrees %>%
group_by(WARD, CMMN_NM) %>%
filter(CMMN_NM == "Pawpaw" | CMMN_NM == "Hickory") %>%
count()
ifelse() statements
Another common form of logical testing in R is the
ifelse() statement. In this case, you pass a logical test
to R and if the output is true, a certain action is performed, then if
it is false, another action is performed. This can be used to make new
variables, subset data, color points on a graph and much more.
Let’s annotate the urban tree data according to whether or not the tree is in fair condition and located in ward 6.
head(ifelse(urbantrees$CONDITION == "Fair" & urbantrees$WARD == "6", "fair tree in ward 6",
"other"))
[1] "other" "fair tree in ward 6" "other"
[4] "other" "other" "other"
# now we can add this to our tree dataset
urbantrees$wardsixfair <- ifelse(urbantrees$CONDITION == "Fair" & urbantrees$WARD ==
"6", "fair tree in ward 6", "other")
# and take a look at our new variable and double check that it worked as
# intended
urbantrees %>%
select(CMMN_NM, CONDITION, WARD, wardsixfair) %>%
head(10)
CMMN_NM CONDITION WARD wardsixfair
1 Rock chestnut oak Excellent 6 other
2 Red maple Fair 6 fair tree in ward 6
3 Columnar English oak Fair 1 other
4 American linden Good 6 other
5 Norway maple Good 6 other
6 Overcup oak Dead 1 other
7 Redmond American Linden Good 6 other
8 New Harmony elm Excellent 1 other
9 Pin oak Fair 6 fair tree in ward 6
10 Willow oak Excellent 1 other
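One thing to watch for: when the test involves a missing value, ifelse() can return NA instead of either label. A sketch with hypothetical vectors named condition and ward (not the real urbantrees columns):

```r
condition <- c("Fair", "Good", "Fair", "Fair")
ward <- c(6, 6, NA, 2)

label <- ifelse(condition == "Fair" & ward == 6, "fair tree in ward 6", "other")
label
# The third element is NA: TRUE & NA cannot be resolved to TRUE or FALSE

# One option (an assumption about what you want) is to treat an unknown
# ward as "other" by first requiring that ward is not missing
label_no_na <- ifelse(!is.na(ward) & condition == "Fair" & ward == 6,
                      "fair tree in ward 6", "other")
label_no_na
```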
ifelse() statements can also be nested. How might you
write code to output the annotation “fair tree in ward 6” for fair trees
in ward 6, as well as the annotation “good tree in ward 6” for good
trees in ward 6? You can put both ifelse() statements in
the same line of code.
ifelse(urbantrees$CONDITION == "Fair" & urbantrees$WARD == "6", "fair tree in ward 6",
ifelse(urbantrees$CONDITION == "Good" & urbantrees$WARD == "6", "good tree in ward 6",
"other"))
For this tutorial, we will use ggplot2 to plot data. In
this package, you initialize a ggplot() object and then add
aesthetic layers such as color controls, lines, points or text
annotations.
First, we will make a basic scatterplot. This shows the perimeter of the crown by the mean crown height. Points are colored according to ward number.
ggplot(urbantrees, aes(PERIM, MAX_MEAN, color = as.factor(WARD))) + geom_point() +
ggtitle("DC Tree Attributes")
There are multiple aesthetic parameters that can be customized in
ggplots. These include color, fill, linetype, size, shape, font, and
more. Which ones are available depends on the geom you are working with.
We will explore some of these graphical parameters further as this
tutorial introduces different geoms. Here
is a vignette about aesthetic customization in ggplot2.
There are numerous plot types that can be made with
ggplot2. Some examples are included below.
geom_col()
geom_point()
geom_line()
geom_smooth()
geom_histogram()
geom_boxplot()
geom_text()
geom_density()
geom_errorbar()
geom_hline()
geom_abline()
Bar plots are great for showing frequencies or proportions across
different groups. For instance, we may want to calculate the number of
pawpaw trees per ward and then plot this in a bar graph with
ggplot2.
npawpawbyward <- urbantrees %>%
group_by(WARD, CMMN_NM) %>%
filter(CMMN_NM == "Pawpaw") %>%
count()
ggplot(npawpawbyward, aes(x = WARD, y = n)) + geom_col()
We can clean up this plot by reordering the Wards from lowest to highest number of pawpaw trees. Let’s also add custom x and y axis labels and a title.
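# inside aes(), refer to columns by name; ggplot2 looks them up in the data
ggplot(npawpawbyward, aes(x = reorder(WARD, n), y = n)) +
geom_col() + labs(x = "Ward", y = "Number of pawpaw trees") + ggtitle("Prevalence of Pawpaw Trees by Ward")
Make a new dataframe called ndisease that counts the diseases recorded for Red maple trees, leaving out blank entries, "".
ndisease <- urbantrees %>%
group_by(CMMN_NM, DISEASE) %>%
filter(CMMN_NM == "Red maple" & DISEASE != "") %>%
count()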
Use the reorder() function to control the
order of the x axis variable.
Flip the axes with coord_flip().
Add axis labels and a title with the labs() and ggtitle() functions.
ggplot(ndisease, aes(x = reorder(DISEASE, n), y = n)) + geom_col() +
coord_flip() + labs(x = "Disease", y = "Number of trees") + ggtitle("Prevalence of diseases in Red Maples in DC")
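reorder(), used in the plots above, works by turning its first argument into a factor whose levels are sorted by the second argument. A sketch with hypothetical ward labels and counts:

```r
ward <- c("1", "2", "3")
n <- c(30, 5, 12)

ordered_ward <- reorder(ward, n)  # levels sorted from lowest to highest n
levels(ordered_ward)
# [1] "2" "3" "1"

levels(reorder(ward, -n))         # negate the counts for highest to lowest
# [1] "1" "3" "2"
```

ggplot2 draws a discrete axis in level order, which is why reordering the factor reorders the bars.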
R has many built-in colors. You can view them by using the
colors() function.
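A short sketch of a few ways to explore the built-in color names (all_colors is a hypothetical object name):

```r
all_colors <- colors()

head(all_colors)       # the first few color names
"red" %in% all_colors  # check whether a name is available
# [1] TRUE

# Names containing "green"
grep("green", all_colors, value = TRUE)

# The red, green, and blue components (0 to 255) of a named color
col2rgb("red")
```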
Let’s add color to our plot of maple tree diseases. You can directly assign a color as an aesthetic trait in ggplot or assign the colors to a variable.
In the geom_col() function of your previous plot code,
add in colors with both the fill= and color=
arguments.
Then assign fill to the disease variable from
within the aes() argument of your ggplot()
function. What happens?
ggplot(ndisease, aes(x = reorder(DISEASE, n), y = n)) + geom_col(fill = "green",
color = "blue") + coord_flip() + labs(x = "Disease", y = "Number of trees") +
ggtitle("Prevalence of diseases in Red Maples in DC")
ggplot(ndisease, aes(x = reorder(DISEASE, n), y = n, fill = DISEASE)) +
geom_col() + coord_flip() + labs(x = "Disease", y = "Number of trees") + ggtitle("Prevalence of diseases in Red Maples in DC")
As a final introduction to R’s capabilities today, let’s quickly make
an interactive map of our urbantrees dataset. If you run
names(urbantrees) you will notice that there are X and Y
variables that give the spatial locations of the trees. We can use these
in combination with Leaflet
to make a map.
library(leaflet)
pawpaws <- urbantrees %>%
filter(CMMN_NM == "Pawpaw")
leaflet() %>%
addTiles() %>%
addMarkers(pawpaws$X, pawpaws$Y, popup = paste("<B>Name: </B>", pawpaws$CMMN_NM,
"<br>", "<B>Condition: </B>", pawpaws$CONDITION, sep = ""))