This workshop is designed to help anthropologists get started using R. R is an amazing open-source tool for cleaning, analyzing, and visualizing data. You can also use R for web publishing and creating other types of documents. In this workshop, we’ll cover some essentials for sociocultural anthropologists and archaeologists who are beginning their R journeys.
We’ll cover the following topics:
At the end of this page we have listed additional R resources to help you continue to build skills after this workshop.
We will be working with R using the software RStudio. RStudio has a number of helpful tools that make R more user-friendly. To create a new R script in RStudio, navigate to File -> New File -> R Script. An important thing to understand when working with RStudio is that the console and the R script are two different things. The R script is where you will do all of your coding. Think of it as a text document that you write, save, and can return to later. Meanwhile, the console is where the output from your R script will appear. You will be able to tell that your code has finished running and R is ready for the next input when you see a > symbol on the last line of your console.
You can customize the layout of your panes and aesthetics of your code by navigating to Tools -> Global Options.
R code is typically written in the script pane of RStudio, while the output appears in the console. Let’s try inputting some basic arithmetic.
1 + 2
5 - 3
2 * (5-3)
Note that R follows the basic order of operations (PEMDAS).
The primary data types in R are:
String or character data are entered with "" or '' surrounding the characters. Let’s try entering some strings into R.
"book"
"three"
book
Note the error on the last line: Error: object 'book' not found. This means that there is no object called book in our R workspace. We’ll learn about objects in the following section.
R works by running functions on objects and other data. This means that we can use a function to tell R what we want to do with the input data.
For example, let’s ask R to tell us the mean of the numbers from 5 to 8 with the following code:
mean(5:8)
## [1] 6.5
We can also make a quick plot with R:
plot(iris$Petal.Length, iris$Petal.Width)
If you encounter a new function and don’t know exactly how it works, you can pull up the help file by writing ?EXAMPLEFUNCTIONNAME() or, in this case, ?mean(). This help file will tell you what type of data the function can read, the different arguments or options you can adjust in the function, and will usually give several examples of how to use the function in practice.
While you can directly enter data each time you want to use it, R’s power comes from assigning data to named objects or variables. Objects are assigned using the <- symbol, which means “everything on the right is now referred to by the object name on the left”.
house <- c(1,2,3)
house
## [1] 1 2 3
sizes <- seq(1, 5)
sizes
## [1] 1 2 3 4 5
Keep in mind that R is case sensitive, unlike some other languages. This matters for keeping track of different variables and is a common cause of coding errors. For example, we can create different objects referring to trees by changing the capitalization.
Tree <- "tree"
TREE <- "tree again"
Now try entering:
Tree
TREE
tree
Why doesn’t the last line work?
Let’s make an object called tree.
tree <- "a third tree"
There are several different types or classes of data that can be assigned in R. We’ll cover a few basics here. Vectors can hold string, logical, or numeric data, but each vector can only contain one data class. You can check the class of a vector with the class() function. In addition, R has built-in checks for different classes, such as is.numeric().
x <- seq(1,5)
is.numeric(x)
x <- c(1:5)
x <- c(1,10,"eleven", 27)
is.numeric(x)
x <- c(rep(TRUE, 10), FALSE, FALSE, TRUE)
x
Note that when numbers and strings are mixed in one vector, as in c(1,10,"eleven", 27) above, R will automatically convert the numeric data into string data. This is called “coercion”.
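As a quick check of coercion, the short sketch below uses class() to show how a vector's class changes when types are mixed. The object names (nums, mixed, back) are made up for illustration.

```r
# A numeric vector has the class "numeric"
nums <- c(1, 10, 27)
class(nums)

# Adding a single string coerces every element to character
mixed <- c(1, 10, "eleven", 27)
class(mixed)
is.numeric(mixed)

# Converting back gives NA for anything that is not a number.
# as.numeric() would print a warning here; suppressWarnings() just keeps the output tidy.
back <- suppressWarnings(as.numeric(mixed))
back
```

The NA in the third position is a useful signal that a value in a supposedly numeric column was typed in as text.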
Make a new object called “heights” that is composed of the numbers: 2, 4, 6, 8.
If you didn’t already use it, try using the seq() function to recreate the heights object. Hint: try ?seq() to see the help file for this function.
heights <- c(2, 4, 6, 8)
heights <- seq(2,8,2)
The base installation of R has many built-in functions and datasets. However, often we want to extend R’s functionality through installing new packages with additional functions. The cool thing about R is that it is made by the users, so there are always new packages being written to solve new data wrangling, analysis and visualization challenges. As of writing this, there are 20,656 packages listed in CRAN.
Packages only need to be installed one time per user on a computer. However, each time you start a new R session you will need to load the package with library(). I usually install packages using the Tools -> Install Packages menu. Be sure to select “install dependencies”, as this will add any other packages that need to be running in the background in order to use the package you are installing.
For loading packages, it is best to load them using R code. This makes your code reproducible if you send it to someone else and ensures that restarting R and running all your code will work each time.
Let’s take a minute to install and load the following packages:
library(tidyverse)
library(DescTools)
library(leaflet)
library(stringr)
library(tidytext)
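Since the lines above only load the packages, here is one possible way to check which of them still need to be installed. This is a sketch rather than part of the original workshop code: the object names pkgs, installed, and missing_pkgs are made up, and the install line is commented out so that nothing is downloaded until you choose to run it.

```r
# Packages used in this workshop
pkgs <- c("tidyverse", "DescTools", "leaflet", "stringr", "tidytext")

# requireNamespace() returns TRUE if a package is installed and can be loaded,
# without attaching it to the session
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]
missing_pkgs

# Uncomment to install whatever is missing (only needed once per computer)
# install.packages(missing_pkgs, dependencies = TRUE)
```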
There are multiple ways to load data into R. Best practice is to load your data file using code rather than drop down menus. This ensures that a specific named file is associated with your code each time and reduces potential for accidentally loading the wrong data file.
For this workshop, we will read in a file using a url. You can also load data from your computer by pasting the file path into your file reading function. See more information about this here.
DCtrees <- read.csv("https://maddiebrown.github.io/ANTH630/data/Urban_Forestry_Street_Trees_2024.csv")
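Reading a file from your own computer works the same way, with a file path in place of the URL. The sketch below writes a tiny made-up CSV to a temporary folder and reads it back, so it runs anywhere; the file name and its contents are invented for illustration.

```r
# Create a small example file in a temporary folder (a stand-in for your own data file)
example_path <- file.path(tempdir(), "example_trees.csv")
write.csv(data.frame(GENUS_NAME = c("Acer", "Quercus"), DBH = c(17.7, 5.7)),
          example_path, row.names = FALSE)

# Read it with the same function used for the URL
example_trees <- read.csv(example_path)
str(example_trees)
```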
Whenever we load a new data file, the first thing we might want to do is examine the structure of the data. R has a handy function, str(), that allows you to see the number of observations, column and data types, as well as a snippet of the first few observations for each column.
str(DCtrees)
There are a number of functions in R that can help us get a feel for our data and visualize it in many helpful and interesting ways.
For example, we can use summary() to obtain summary statistics and information about the contents of our data frame, such as the class of the variables in each column of the data frame.
summary(DCtrees)
You can use the head() function to view the first few rows of the data frame.
head(DCtrees)
## X Y SCI_NM CMMN_NM GENUS_NAME
## 1 -76.99281 38.88609 Quercus montana Rock chestnut oak Quercus
## 2 -76.99206 38.88599 Acer rubrum Red maple Acer
## 3 -77.03567 38.92727 Quercus robur fastigiata Columnar English oak Quercus
## 4 -76.99334 38.88417 Tilia americana American linden Tilia
## 5 -76.99838 38.88728 Acer platanoides Norway maple Acer
## 6 -77.03931 38.92800 Quercus lyrata Overcup oak Quercus
## FAM_NAME DATE_PLANT FACILITYID VICINITY
## 1 Fagaceae 2018/02/01 18:50:34+00 31982-090-3001-0269-000 922 C ST SE
## 2 Sapindaceae 31982-100-3005-0155-000 1017 C ST SE
## 3 Fagaceae 10150-300-3001-0050-000 3029 15TH ST NW
## 4 Tiliaceae 32691-092-3001-0105-000 904 D ST SE
## 5 Sapindaceae 30060-020-3001-0101-000 208 6TH ST SE
## 6 Fagaceae 2011/02/17 05:00:00+00 14582-160-3005-0656-000 1653 HOBART ST NW
## WARD TBOX_L TBOX_W WIRES CURB SIDEWALK TBOX_STAT RETIREDDT DBH
## 1 6 99 7 None Permanent Permanent Plant 5.7
## 2 6 8 4 None Permanent Permanent Plant 17.7
## 3 1 6 3 None Permanent Permanent Plant 10.9
## 4 6 9 4 None Permanent Permanent Plant 13.4
## 5 6 8 4 None Permanent Permanent Plant 11.9
## 6 1 99 4 None Permanent Permanent Plant 9.3
## DISEASE PESTS CONDITION CONDITIODT OWNERSHIP
## 1 Excellent 2024/02/28 23:57:09+00 UFA
## 2 Fair 2021/02/17 22:21:46+00 UFA
## 3 Fair 2021/09/13 18:55:03+00 UFA
## 4 Good 2020/02/14 01:33:24+00 UFA
## 5 Good 2020/09/16 19:38:17+00 UFA
## 6 Hypoxylon Dead 2023/05/22 19:49:55+00 UFA
## TREE_NOTES MBG_WIDTH
## 1 Elevated street side. Feb 2024. 13.12336
## 2 P dead wood only and r small mulberry at base, be careful of roots 39.37008
## 3 29.53926
## 4 29.52756
## 5 Arborist removed some deadwood and scheduled for pruning on 1/5/17 39.37008
## 6 29.52756
## MBG_LENGTH MBG_ORIENTATION MAX_CROWN_HEIGHT MAX_MEAN MIN_CROWN_BASE DTM_MEAN
## 1 19.68504 90.0000 18.91814 14.26427 0.05331409 82.26296
## 2 45.93176 90.0000 45.90728 30.68867 -0.15571680 81.22527
## 3 46.50863 163.3008 37.41346 21.32403 -0.21777124 202.87526
## 4 45.93176 0.0000 41.53025 22.57074 0.15885049 77.00650
## 5 45.93176 90.0000 32.59466 21.19249 -0.18092014 81.08643
## 6 32.80840 0.0000 37.61407 18.87533 -0.57492221 187.02505
## PERIM CROWN_AREA CICADA_SURVEY ONEYEARPHOTO SPECIALPHOTO PHOTOREMARKS
## 1 65.6168 215.2780
## 2 183.7270 1259.3763
## 3 170.6037 742.7091
## 4 164.0420 1130.2095
## 5 177.1654 1119.4456
## 6 144.3570 688.8896
## ELEVATION SIGN TRRS WARRANTY CREATED_USER CREATED_DATE EDITEDBY
## 1 Unknown Unknown NA 2017-2018 sward
## 2 Unknown Unknown NA Unknown jchapman
## 3 Unknown Unknown NA Unknown mmcphee
## 4 Unknown Unknown NA Unknown sward
## 5 Unknown Unknown NA Unknown sward
## 6 Unknown Unknown NA 2010-2011 jmiller
## LAST_EDITED_USER LAST_EDITED_DATE GIS_ID
## 1 sward 2024/02/28 23:57:52+00 NA
## 2 jchapman 2021/02/17 22:21:47+00 NA
## 3 mmcphee 2021/09/13 18:54:32+00 NA
## 4 sward 2020/02/14 01:34:14+00 NA
## 5 sward 2020/09/16 19:51:11+00 NA
## 6 jmiller 2023/05/22 19:50:08+00 NA
## GLOBALID CREATOR CREATED EDITOR EDITED SHAPE
## 1 {0B358D52-AAD4-41AC-B1AF-B19740DBC02A} NA NA NA NA NA
## 2 {0F7845B3-E5DE-480B-96EC-B595354BCA5C} NA NA NA NA NA
## 3 {EA1C7F1D-8FF6-4A3A-BFBD-0147BABCA5F7} NA NA NA NA NA
## 4 {ADB853B2-E32F-4BB4-B949-DE7B5656DCD5} NA NA NA NA NA
## 5 {300EF1F5-F440-4E16-BBC0-69C6CDD772CA} NA NA NA NA NA
## 6 {0BEFB0A1-AAF4-4958-849C-CFBFBA3D4E78} NA NA NA NA NA
## OBJECTID
## 1 40100904
## 2 40100905
## 3 40100906
## 4 40100907
## 5 40100908
## 6 40100909
Use the tail() function to view the last few rows of your data frame.
tail(DCtrees)
## X Y SCI_NM CMMN_NM GENUS_NAME
## 211112 -76.98323 38.83522 Malus 'Harvest Gold' Other (See Notes)
## 211113 -76.92602 38.88502
## 211114 -77.06645 38.97106
## 211115 -76.92586 38.88641
## 211116 -76.92609 38.88525 Quercus palustris Pin oak Quercus
## 211117 -76.92634 38.88415
## FAM_NAME DATE_PLANT FACILITYID
## 211112
## 211113 30530-025-3001-0151-000
## 211114 10330-615-3001-0203-000
## 211115 30530-015-3005-0089-000
## 211116 Fagaceae 30530-025-3005-0080-000
## 211117 2019/10/28 18:09:28+00 30530-030-3005-0120-000
## VICINITY WARD TBOX_L TBOX_W WIRES CURB SIDEWALK
## 211112 1300 Blk Southern Avenue SE 8 NA NA None None None
## 211113 200 BLK 53RD ST SE 7 99 3 None Permanent Permanent
## 211114 6124 33RD ST NW 4 99 7 Both Permanent Permanent
## 211115 125 53RD ST SE 7 99 3 Both Permanent Permanent
## 211116 300 BLK 53RD ST SE 7 99 3 Both Permanent Permanent
## 211117 OPP 301 53RD ST SE 7 99 3 Both Permanent Permanent
## TBOX_STAT RETIREDDT DBH DISEASE PESTS CONDITION
## 211112 Plant 2.0 Good
## 211113 Conflict NA
## 211114 Open NA Good
## 211115 Conflict NA
## 211116 Plant 17.5 Trunk Root Fair
## 211117 Open NA Good
## CONDITIODT OWNERSHIP
## 211112 2024/03/08 20:41:09+00 UFA
## 211113 2024/03/08 20:40:24+00 UFA
## 211114 2024/03/08 21:56:08+00 UFA
## 211115 2024/03/08 21:56:34+00 UFA
## 211116 2024/03/08 22:08:56+00 UFA
## 211117 2024/03/08 22:25:55+00 UFA
## TREE_NOTES
## 211112 IPMA project
## 211113 Large curb cut and pp trees
## 211114
## 211115 Long guy wire
## 211116 Plant hornbeam. White building near private magnolia
## 211117 Hornbeam. Plant at corner of intersection opposite of school sign. Site lines less important for left turn, one way street.
## MBG_WIDTH MBG_LENGTH MBG_ORIENTATION MAX_CROWN_HEIGHT MAX_MEAN
## 211112 NA NA NA NA NA
## 211113 32.80840 39.37008 0 54.65678 35.175927
## 211114 85.30184 98.42520 90 89.07318 56.363557
## 211115 26.24672 36.08924 0 30.77502 23.386227
## 211116 22.96588 36.08924 0 38.82791 22.407826
## 211117 22.96588 26.24672 90 13.86132 7.454631
## MIN_CROWN_BASE DTM_MEAN PERIM CROWN_AREA CICADA_SURVEY ONEYEARPHOTO
## 211112 NA NA NA NA
## 211113 -0.39570668 160.1070 144.3570 914.9315
## 211114 -2.13418047 360.8612 413.3858 5920.1450
## 211115 -1.63968071 143.1423 124.6719 613.5423
## 211116 -0.01722297 158.3834 118.1102 688.8896
## 211117 -0.14787045 163.6321 131.2336 312.1531
## SPECIALPHOTO PHOTOREMARKS ELEVATION SIGN TRRS WARRANTY CREATED_USER
## 211112 Unknown Unknown NA Unknown cklapthor
## 211113 Unknown Unknown NA Unknown rdelsack
## 211114 Unknown Unknown NA Unknown JCONLON
## 211115 Unknown Unknown NA Unknown rdelsack
## 211116 Unknown Unknown NA Unknown rdelsack
## 211117 Unknown Unknown NA 2019-2020 rdelsack
## CREATED_DATE EDITEDBY LAST_EDITED_USER LAST_EDITED_DATE
## 211112 2024/03/08 20:40:53+00 cklapthor cklapthor 2024/03/08 20:40:53+00
## 211113 2024/03/08 20:45:17+00 rdelsack rdelsack 2024/03/08 20:45:17+00
## 211114 2024/03/08 21:56:08+00 jconlon JCONLON 2024/03/08 21:56:08+00
## 211115 2024/03/08 21:56:09+00 rdelsack rdelsack 2024/03/08 21:56:09+00
## 211116 2024/03/08 22:08:24+00 rdelsack rdelsack 2024/03/08 22:08:24+00
## 211117 2024/03/08 22:29:29+00 rdelsack rdelsack 2024/03/08 22:29:29+00
## GIS_ID GLOBALID CREATOR CREATED EDITOR
## 211112 NA {ACF5EEB0-BF92-464E-8BEC-FD54E3EEAB68} NA NA NA
## 211113 NA {ECBB335F-9F07-4A5A-99B0-EA935CF5E2A4} NA NA NA
## 211114 NA {359E8372-14C0-45A1-97E3-35FE79F91486} NA NA NA
## 211115 NA {18A305FC-FEE6-4419-B6FA-D056A9B0F7C6} NA NA NA
## 211116 NA {36DB1DCD-D49A-4168-9E14-723C7D96A242} NA NA NA
## 211117 NA {97C49852-3CBD-4D08-88B8-67A9B90C6D4A} NA NA NA
## EDITED SHAPE OBJECTID
## 211112 NA NA 40312223
## 211113 NA NA 40312224
## 211114 NA NA 40312225
## 211115 NA NA 40312226
## 211116 NA NA 40312227
## 211117 NA NA 40312228
The unique() function can help you extract unique elements from your data frame. For example, if you wanted to know how many unique tree genus names are included in the data and what they are, you can use this code:
unique(DCtrees$GENUS_NAME)
## [1] "Quercus"
## [2] "Acer"
## [3] "Tilia"
## [4] "Ulmus"
## [5] "Liquidambar"
## [6] ""
## [7] "Platanus"
## [8] "Gleditsia"
## [9] "Ginkgo"
## [10] "Pyrus"
## [11] "Celtis"
## [12] "Zelkova"
## [13] "Liriodendron"
## [14] "Carpinus"
## [15] "Gymnocladus"
## [16] "Pistacia"
## [17] "Other"
## [18] "Nyssa"
## [19] "Syringa"
## [20] "Cercidiphyllum"
## [21] "Cercis"
## [22] "Lagerstroemia"
## [23] "Betula"
## [24] "Fagus"
## [25] "Cladrastis"
## [26] "Robinia"
## [27] "Prunus"
## [28] "Magnolia"
## [29] "Parrotia"
## [30] "Metasequoia"
## [31] "Amelanchier"
## [32] "Taxodium"
## [33] "Koelreuteria"
## [34] "Malus"
## [35] "Ostrya"
## [36] "Catalpa"
## [37] "Morus"
## [38] "Styphnolobium"
## [39] "Fraxinus"
## [40] "Ilex"
## [41] "Cornus"
## [42] "Pinus"
## [43] "Aescululs"
## [44] "Chionanthus"
## [45] "Juniperus"
## [46] "Oxydendrum"
## [47] "Halesia"
## [48] "Juglans"
## [49] "Ailanthus"
## [50] "Crataegus"
## [51] "Diospyros"
## [52] "Sassafras"
## [53] "Salix"
## [54] "Eucommia"
## [55] "Carya"
## [56] "No\nNo\nNo"
## [57] "Aesculus"
## [58] "Stewartia"
## [59] "Maclura"
## [60] "Rhus"
## [61] "No"
## [62] "Populus"
## [63] "Tsuga"
## [64] "Picea"
## [65] "Cryptomeria"
## [66] "Cotinus"
## [67] "Thuja"
## [68] "Laburnum"
## [69] "Maackia"
## [70] "Asimina"
## [71] "Cedrus"
## [72] "Cornus\nCornus\nCornus\nCornus\nCornus\nCornus"
## [73] "Corylus"
## [74] "Alnus"
## [75] "Paulownia"
## [76] "Tetradium"
## [77] "Phellodendron"
## [78] "Taxus"
## [79] "Alibizia"
## [80] "Toona"
## [81] "Viburnum"
## [82] "Amelanchier x"
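unique() lists the values, and wrapping it in length() gives the count directly. Here is a small sketch with a made-up vector (genus_example is not part of the DCtrees data):

```r
genus_example <- c("Acer", "Quercus", "Acer", "", "Tilia", "Quercus")

unique(genus_example)          # the distinct values, in order of first appearance
length(unique(genus_example))  # how many distinct values there are (a blank counts as one)
table(genus_example)           # how often each value occurs
```

With the workshop data, the same idea would be length(unique(DCtrees$GENUS_NAME)).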
In this tutorial, we will mostly use ‘tidyverse’, which we’ve already installed above, to perform exploratory data analysis. Tidyverse is a collection of R packages designed to make data manipulation, visualization, and analysis easier and more efficient.
library(tidyverse)
The ‘dplyr’ package within tidyverse provides a set of functions for efficiently manipulating data frames. Some of the main functions in dplyr are filter(), mutate(), summarize(), and arrange(). These functions allow you to filter rows, create new variables, summarize data, and sort rows.
We can filter the dataset to select specific rows based on conditions using filter(). For example, we can keep only the trees with DBH greater than 10. The %>% pipe passes the object on its left to the function on its right.
DBH_trees <- DCtrees %>%
filter(DBH > 10)
#head(DBH_trees)
We can use the %like% operator from the DescTools package to identify substrings within longer strings. Putting % symbols at the start and end of the string we want to find tells R to look for any occurrence of the string within the data. Read more about DescTools’ %like% operator here.
The code below tells R to look within the DCtrees data frame and filter (keep) all the rows where the common name (CMMN_NM) contains the string “apple”. Then, the next step in the pipe tells R to select (keep) only the column that contains the common names of the plants (CMMN_NM). Finally, because so many trees have apple in their names, we ask R to show only the first 20 results. Keep in mind that R is case sensitive here, so this will only find results that match the all-lowercase search string.
DCtrees %>% filter(CMMN_NM %like% "%apple%") %>% select(CMMN_NM) %>% head(20)
## CMMN_NM
## 1 Crabapple
## 2 Crabapple
## 3 Crabapple
## 4 Crabapple
## 5 Crabapple
## 6 Arnold crabapple
## 7 Crabapple
## 8 Crabapple
## 9 Crabapple
## 10 Crabapple
## 11 Crabapple
## 12 Crabapple
## 13 Donald Wyman Crabapple
## 14 Crabapple
## 15 Adirondack Crabapple
## 16 Crabapple
## 17 Crabapple
## 18 Donald Wyman Crabapple
## 19 Donald Wyman Crabapple
## 20 Crabapple
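If you want a match regardless of capitalization, one option outside DescTools is base R's grepl(), which has an ignore.case argument. The vector below is made up for illustration.

```r
common_names <- c("Crabapple", "Apple serviceberry", "Red maple", "Pawpaw")

# Case-sensitive: only names containing lowercase "apple"
grepl("apple", common_names)

# Case-insensitive: also matches "Apple"
grepl("apple", common_names, ignore.case = TRUE)

# Use the logical vector to keep the matching names
apple_names <- common_names[grepl("apple", common_names, ignore.case = TRUE)]
apple_names
```

Inside a pipe, this could be written as filter(grepl("apple", CMMN_NM, ignore.case = TRUE)).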
We can also combine filters. Let’s keep only the trees in the genus Acer with DBH greater than 10.
Acer_DBH_trees <- DCtrees %>%
filter(GENUS_NAME=="Acer", DBH > 10)
#head(Acer_DBH_trees)
We can also combine filters using logical operators ‘&’ and ‘|’. ‘&’ is used to specify that both conditions must be true. ‘|’ is used to specify that either condition can be true.
We can use ‘&’ to write the code above, which keeps Acer trees with a DBH greater than 10, in a different way.
Acer_DBH_trees2 <- DCtrees %>%
filter(GENUS_NAME=="Acer" & DBH > 10)
#head(Acer_DBH_trees2)
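We can use a combination of ‘&’ and ‘|’ to keep trees that are in either the genus Acer or Quercus and that have a DBH greater than 10. Because R evaluates ‘&’ before ‘|’, the two genus conditions need to be grouped in parentheses; without them, the filter would keep every Acer tree regardless of DBH.
Acer_Q_DBH <- DCtrees %>%
filter((GENUS_NAME == "Acer" | GENUS_NAME == "Quercus") & DBH > 10)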
#head(Acer_Q_DBH)
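When mixing ‘&’ and ‘|’, keep in mind that R evaluates ‘&’ first, so parentheses change the result. The sketch below shows the difference on a made-up data frame (toy_trees and the result names are not part of the workshop data).

```r
library(dplyr)

# Made-up example data
toy_trees <- data.frame(
  GENUS_NAME = c("Acer", "Acer", "Quercus", "Quercus", "Tilia"),
  DBH        = c(4, 15, 6, 20, 30)
)

# Without parentheses, & is evaluated first:
# keeps ALL Acer rows, plus Quercus rows with DBH > 10
no_parens <- toy_trees %>%
  filter(GENUS_NAME == "Acer" | GENUS_NAME == "Quercus" & DBH > 10)

# With parentheses: keeps Acer or Quercus rows, and only those with DBH > 10
with_parens <- toy_trees %>%
  filter((GENUS_NAME == "Acer" | GENUS_NAME == "Quercus") & DBH > 10)

# %in% is a compact alternative to chaining several | comparisons
with_in <- toy_trees %>%
  filter(GENUS_NAME %in% c("Acer", "Quercus"), DBH > 10)

nrow(no_parens)
nrow(with_parens)
nrow(with_in)
```

The version without parentheses lets the small Acer tree (DBH 4) through, which is easy to miss in a dataset with over 200,000 rows.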
How would you filter all the rows where the FAM_NAME is “Rosaceae”?
How would you filter all the rows where the CMMN_NM contains the string apple anywhere in it and the CONDITION of the tree is “Excellent”?
DCtrees %>% filter(FAM_NAME == "Rosaceae")
DCtrees %>% filter(CMMN_NM %like% "%apple%" & CONDITION == "Excellent")
filter() selects rows, but select() is used to select columns.
For example, we could use select() if we only wanted the facility ID (FACILITYID) and vicinity (VICINITY) of the trees that are in genus Acer and have a DBH greater than 10.
Facility_ID_Large_Acer <- Acer_DBH_trees2 %>%
select(FACILITYID, VICINITY)
#head(Facility_ID_Large_Acer)
You can rename columns in your dataset using rename(). For example, if we wanted to change VICINITY to ADDRESS, we could use the following code:
DCtrees_Address <- DCtrees %>%
rename(ADDRESS = VICINITY)
#head(DCtrees_Address)
You can add new columns with adjusted values using mutate(). For example, we can add a column with the circumference of each tree, calculated as its diameter (DBH) multiplied by pi.
Circumference_DCtrees <- DCtrees %>%
mutate(CIRCUMFERENCE = DBH * pi )
#head(Circumference_DCtrees)
Tidyverse can also be used to produce descriptive statistics from your dataset. summarize() is a good tool for producing descriptive statistics like mean(), median(), sd(), and sum().
For example, the code below will calculate the mean, median, and standard deviation of DBH in each genus. We will also need to use the group_by() function to do this.
DCtrees_Stats <- DCtrees %>%
group_by(GENUS_NAME) %>%
summarize(mean_DBH = mean(DBH), median_DBH = median(DBH), sd_DBH = sd(DBH))
#DCtrees_Stats
As you can see, some of the statistics in the table above come back as ‘NA’ because the DBH of some trees was not measured, so those trees have ‘NA’ in the DBH column. We can fix this issue by ignoring ‘NA’ values and calculating our statistics with only the trees we have data for. To do this, we will use ‘na.rm = TRUE’.
DCtrees_Stats <- DCtrees %>%
group_by(GENUS_NAME) %>%
summarize(mean_DBH = mean(DBH, na.rm = TRUE), median_DBH = median(DBH, na.rm = TRUE), sd_DBH = sd(DBH, na.rm = TRUE))
#DCtrees_Stats
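The effect of ‘na.rm = TRUE’ is easiest to see on a very small example. The vector below is made up for illustration.

```r
dbh_example <- c(5.7, 17.7, NA, 10.9)

mean(dbh_example)                # NA, because one value is missing
mean(dbh_example, na.rm = TRUE)  # mean of the three measured values
sum(is.na(dbh_example))          # how many values are missing
```

Checking the number of missing values alongside the statistic helps you judge how much data each summary is actually based on.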
The ‘ggplot2’ package in tidyverse is a great tool for visualizing data. You can use it to make many different kinds of plots. ggplot() is the function we will use to initialize a plot. Inside it, aes() maps columns of the data to aesthetics of the plot, such as the x and y axes. Geometric layers like points, lines, and bars are then added to the plot with + and a geom function such as geom_point(). You can also add titles, labels, and themes to your plot.
Let’s make a scatterplot:
ggplot(DCtrees, aes(x=MBG_WIDTH, y=MBG_LENGTH)) +
geom_point(size=2, shape=23) +
labs(x = "MBG Width", y = "MBG Length",
title = "MBG Length vs Width")
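To see how the pieces of a ggplot fit together, the sketch below builds a plot one layer at a time using the built-in iris data, which was also used earlier with plot(). The object names p_base, p_points, and p_final are made up.

```r
library(ggplot2)

# 1. ggplot() sets the data and the aesthetic mapping; on its own it draws an empty panel
p_base <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width))

# 2. A geom layer, added with +, draws the data
p_points <- p_base + geom_point()

# 3. Labels and a theme are added the same way
p_final <- p_points +
  labs(x = "Petal length", y = "Petal width", title = "Iris petals") +
  theme_minimal()

p_final
```

Saving intermediate plots as objects like this is optional, but it makes it easy to try different geoms or themes on the same base.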
A box and whisker plot showing DBH for Acer, Quercus, Magnolia, and Juniperus trees:
Genus_selected <- DCtrees %>%
filter(GENUS_NAME=="Acer" | GENUS_NAME== "Quercus" | GENUS_NAME == "Magnolia" | GENUS_NAME== "Juniperus")
ggplot(Genus_selected, aes(x = GENUS_NAME, y = DBH)) + geom_boxplot() + labs(x = "Genus of Trees", y = "Diameter at Breast Height (DBH)", title = "Comparison of DBH Across Tree Genera") + theme_minimal()
In the code below, I have hidden the outlier points and zoomed the y-axis to the range 0 to 50. But the data still needs to be cleaned further.
ggplot(Genus_selected, aes(x = GENUS_NAME, y = DBH)) +
geom_boxplot(outlier.shape = NA, fill = "lightblue", color = "blue") +
coord_cartesian(ylim = c(0,50)) +
labs(x = "Genus of Trees", y = "Diameter at Breast Height (DBH)",
title = "Comparison of DBH Across Tree Genera") +
theme_minimal()
A bar graph showing mean DBH of Acer, Juniperus, Magnolia, and Quercus genera:
mean_dbh <- aggregate(DBH ~ GENUS_NAME, data = Genus_selected, FUN = mean)
ggplot(mean_dbh, aes(x = GENUS_NAME, y = DBH)) +
geom_bar(stat = "identity", fill = "skyblue", width = 0.5) + # Create bar plot
labs(x = "Genus", y = "Mean DBH") + # Add labels for axes
ggtitle("Mean Diameter at Breast Height (DBH) by Genus") + # Add title
theme_minimal()
Leaflet is a package that enables the rapid creation of interactive maps using R. If you didn’t previously load the leaflet library, do so now with the following code: library(leaflet). Note: you may also need to install the package before loading it.
Leaflet maps are created in layers, first adding the basemap and then the interactive data layers on top. Here, we will make a map of the location of the pawpaw trees in DC using leaflet.
First, we can isolate all the observations that contain the string Pawpaw in their common name.
pawpaws <- DCtrees %>% filter(CMMN_NM %like% "%Pawpaw%")
Next we can make a map of the pawpaw trees. We can pipe all the functions to create the map in the same line of code. When adding markers to the map, it is important to specify which columns refer to the latitude and longitude.
leaflet(pawpaws) %>% addTiles() %>% addCircleMarkers(lng=pawpaws$X, lat=pawpaws$Y)
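Another way to point leaflet at the coordinate columns is its formula notation, where ~X means "the column X in the data passed to leaflet()". The sketch below uses two made-up points near Washington, DC; toy_points and toy_map are hypothetical names.

```r
library(leaflet)

# Made-up points (X = longitude, Y = latitude, matching the DCtrees column names)
toy_points <- data.frame(
  X = c(-77.0366, -77.0091),
  Y = c(38.8951, 38.8899),
  CMMN_NM = c("Pawpaw", "Pawpaw")
)

# With the data passed to leaflet(), ~X and ~Y refer to columns of that data frame
toy_map <- leaflet(toy_points) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~X, lat = ~Y, popup = ~CMMN_NM)

toy_map
```

Writing pawpaws$X and pawpaws$Y works just as well; the formula form mainly saves typing when the markers come from the same data frame given to leaflet().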
Suppose we wanted to add multiple layers to the leaflet map. Similar to adding layers to a plot in ggplot2, we can add a second addCircleMarkers() function to our original leaflet map.
Let’s select all the apple trees that are marked as being in “Excellent” condition.
excellentapples <- DCtrees %>% filter(CMMN_NM %like% "%apple%" & CONDITION == "Excellent")
Next, we can add this layer to our map. Here I am also adding in the argument “popup” for the excellent apples layer. The popup here displays the common name of the tree.
leaflet(pawpaws) %>%
addTiles() %>%
addCircleMarkers(lng=pawpaws$X, lat=pawpaws$Y) %>%
addCircleMarkers(lng=excellentapples$X, lat=excellentapples$Y, popup=excellentapples$CMMN_NM, color="green", radius=2)
Add a third layer to the map, this time selecting all the apples marked as being in “Good” condition. First, create a new data frame of good apples and then add it to our existing map. Finally, add a popup detailing the scientific name of the tree.
First, make a good apples data frame.
goodapples <- DCtrees %>% filter(CMMN_NM %like% "%apple%" & CONDITION == "Good")
Now make the map with all layers added in.
leaflet(pawpaws) %>%
addTiles() %>%
addCircleMarkers(lng=pawpaws$X, lat=pawpaws$Y) %>%
addCircleMarkers(lng=excellentapples$X, lat=excellentapples$Y, label=excellentapples$CMMN_NM, popup=excellentapples$CMMN_NM, color="green", radius=2) %>%
addCircleMarkers(lng=goodapples$X, lat=goodapples$Y, color="red", radius=2, label=goodapples$SCI_NM, popup = goodapples$SCI_NM)
In leaflet we can also control whether or not each layer is included on the map. Using the addLayersControl() function, we can add a menu for turning each layer on or off. Within each layer, add an argument for the group. These group labels will then be included in the baseGroups and overlayGroups arguments. Note that baseGroups enables toggling between layers (or basemaps), showing one at a time, while overlayGroups enables turning specific layers on and off independently.
#https://rstudio.github.io/leaflet/articles/showhide.html
leaflet(pawpaws) %>%
addTiles() %>%
addCircleMarkers(lng=pawpaws$X, lat=pawpaws$Y, group="Pawpaws") %>%
addCircleMarkers(lng=excellentapples$X, lat=excellentapples$Y, popup=excellentapples$CMMN_NM, color="green", radius=2, group="Excellent apples") %>%
addCircleMarkers(lng=goodapples$X, lat=goodapples$Y, popup=goodapples$CMMN_NM, color="red", radius=2, group="Good apples") %>%
addLayersControl(
baseGroups = c("Excellent apples", "Good apples"),
overlayGroups = c("Pawpaws"),
options = layersControlOptions(collapsed = FALSE)
)
Analyzing text data can yield exciting insights into our research questions. We’ll cover how to combine qualitative and structured approaches to text analysis through analyzing the TREE_NOTES column in our data. First, let’s examine the first 20 rows. Do any patterns or themes stand out to you?
head(DCtrees$TREE_NOTES, 20)
## [1] "Elevated street side. Feb 2024."
## [2] "P dead wood only and r small mulberry at base, be careful of roots"
## [3] ""
## [4] ""
## [5] "Arborist removed some deadwood and scheduled for pruning on 1/5/17"
## [6] ""
## [7] "Two leaders, both kinds horizontal, Oct 2022., resident at 313 says we are basically a waste of tax payer money so try and be nice to him "
## [8] ""
## [9] "Bread loaf-sized Inonatus at base. Three.“Black crust” Kretzschmeria conk, fist-sized, on root flare, edge of sidewalk. Grew one inch DBH since 2017. Another shelf conk at 15’ up. Dieback sprinkled thru crown, June 2019."
## [10] ""
## [11] ""
## [12] ""
## [13] "P. Beginning of bls potentiallyWash gas disrupted soil"
## [14] ""
## [15] ""
## [16] " "
## [17] "Early defoliation, likely bacterial or fungal leaf"
## [18] "Multiple vertical leaders."
## [19] ""
## [20] "sprouts, elevate"
Let’s go through some different approaches to analyzing these open ended text responses. We’ll cover:
One of the first things we might examine in text analysis is the frequency with which particular words occur. In R, we can convert longer string data into individual words using the unnest_tokens() function.
Let’s pull out the individual words found in the TREE_NOTES column of the DCtrees data.
#first pull out only the columns needed for this analysis
treewords<- DCtrees %>% select(OBJECTID, TREE_NOTES)
#unnest the words
treewords <- treewords %>% unnest_tokens(output=word, input=TREE_NOTES)
#take a look at the first 20 rows
head(treewords, 20)
## OBJECTID word
## 1 40100904 elevated
## 2 40100904 street
## 3 40100904 side
## 4 40100904 feb
## 5 40100904 2024
## 6 40100905 p
## 7 40100905 dead
## 8 40100905 wood
## 9 40100905 only
## 10 40100905 and
## 11 40100905 r
## 12 40100905 small
## 13 40100905 mulberry
## 14 40100905 at
## 15 40100905 base
## 16 40100905 be
## 17 40100905 careful
## 18 40100905 of
## 19 40100905 roots
## 20 40100908 arborist
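To see exactly what unnest_tokens() does, it can help to run it on a couple of short, made-up notes first. The names toy_notes and toy_words are hypothetical.

```r
library(dplyr)
library(tidytext)

# Two made-up notes
toy_notes <- data.frame(
  id   = c(1, 2),
  note = c("Prune deadwood.", "Pruned in Feb 2024")
)

# One row per word; by default words are lowercased and punctuation is dropped
toy_words <- toy_notes %>% unnest_tokens(output = word, input = note)
toy_words
```

Each word keeps the id of the note it came from, which is why OBJECTID was kept alongside TREE_NOTES before tokenizing.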
Now let’s examine the top 20 words in these notes.
#look at the top 20 words in the document
treewords %>% count(word,sort=T) %>% top_n(20)
## Selecting by n
## word n
## 1 prune 20711
## 2 pruned 10536
## 3 tree 9341
## 4 to 8726
## 5 and 7031
## 6 for 5846
## 7 by 5375
## 8 of 5303
## 9 deadwood 5031
## 10 in 4762
## 11 elevate 4649
## 12 planted 4008
## 13 warranty 3978
## 14 p 3684
## 15 the 3424
## 16 clearance 3108
## 17 remove 3078
## 18 sidewalk 2887
## 19 on 2852
## 20 no 2756
Looking at these most frequently occurring words, some of them clearly refer to the condition of the trees (e.g. prune, deadwood, planted), while others mention additional spatial features associated with the point (e.g. sidewalk, clearance). However, some words are also different forms of the same word (e.g. prune and pruned).
Still other words are less informative (e.g. to, and, by). We consider these filler words to be “stop words” and can remove them systematically from the text. We’ll use existing stop word libraries today, but know that you can also make custom stop word lists.
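First, let’s take a look at the stop words available in R. Note that because stop_words has no numeric column, top_n() ranks the rows by the last column (lexicon), so the result below is all 174 rows of the snowball lexicon rather than 50 rows.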
data(stop_words)
stop_words %>% top_n(50)
## Selecting by lexicon
## # A tibble: 174 × 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # ℹ 164 more rows
Then, we can use an anti-join to remove the stopwords from our dataset.
treewordstidy <- treewords %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
You’ll notice that there are still some words that are not very helpful here, including numbers. Let’s remove the numbers as well:
#you can also make a custom stopwords list: https://www.tidytextmining.com/nasa
# adapted from: https://bookdown.org/psonkin18/berkshire/tokenize.html
treewordstidy <- treewordstidy %>% filter(!grepl('[0-9]', word))
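A custom stop word list is just a data frame with a word column, removed with the same anti-join. The sketch below uses made-up words; toy_counts and custom_stops are hypothetical names, and which words count as uninformative is a judgment call for your own project.

```r
library(dplyr)

# Made-up words standing in for the tokenized notes
toy_counts <- tibble(word = c("prune", "p", "r", "deadwood", "casey"))

# A hypothetical custom list: single-letter codes that carry little meaning on their own
custom_stops <- tibble(word = c("p", "r"), lexicon = "custom")

# anti_join() keeps only the rows of toy_counts with no match in custom_stops
toy_clean <- toy_counts %>% anti_join(custom_stops, by = "word")
toy_clean
```

To use a custom list alongside the built-in one, the two could be stacked with bind_rows(stop_words, custom_stops) before the anti-join.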
Select and display the top 50 most frequently occurring terms in the tree notes using the new, tidy dataframe. Display the terms from most frequent to least frequent. Hint: Try count() and top_n().
treewordstidy %>% count(word,sort=T) %>% top_n(50)
## Selecting by n
## word n
## 1 prune 20711
## 2 pruned 10536
## 3 tree 9341
## 4 deadwood 5031
## 5 elevate 4649
## 6 planted 4008
## 7 warranty 3978
## 8 clearance 3108
## 9 remove 3078
## 10 sidewalk 2887
## 11 plant 2563
## 12 street 2487
## 13 removed 2342
## 14 elevation 2084
## 15 trees 2050
## 16 stem 1975
## 17 resident 1971
## 18 limb 1816
## 19 trunk 1766
## 20 stump 1645
## 21 elevated 1634
## 22 planting 1555
## 23 elm 1549
## 24 private 1537
## 25 passed 1517
## 26 box 1514
## 27 reduce 1442
## 28 close 1438
## 29 casey 1427
## 30 limbs 1335
## 31 dead 1318
## 32 stems 1307
## 33 crown 1293
## 34 structural 1281
## 35 removal 1272
## 36 building 1259
## 37 property 1252
## 38 project 1240
## 39 multi 1216
## 40 damage 1179
## 41 codominant 1177
## 42 dbh 1176
## 43 arborist 1148
## 44 pruning 1140
## 45 base 1112
## 46 sign 1104
## 47 root 1088
## 48 low 1065
## 49 construction 1032
## 50 light 1026
Word frequency can be clearly displayed using barcharts made in ggplot2.
#plot top words from tokenized data
treewordstidy %>%
  count(word, sort = TRUE) %>%
  top_n(50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(y = "Count", x = "Unique words", title = "Top words in DC tree notes")
## Selecting by n
We can also visualize frequency with wordclouds.
library(wordcloud)
## Loading required package: RColorBrewer
treewordstidy %>% count(word) %>% with(wordcloud(word, n, max.words = 100))
Sentiment analysis enables us to tag individual terms according to particular themes or emotions. For analyzing the positivity/negativity of a text or the emotions associated with it, we can use existing libraries or lexicons that include tagged terms.
First, let’s use the NRC sentiment lexicon. This lexicon tags terms with 10 categories: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) and two sentiments (negative and positive).
# load textdata, which get_sentiments() uses to download the NRC lexicon
library(textdata)
# store the NRC lexicon as a dataframe
nrcdf <- get_sentiments("nrc")
# count the tagged terms in each sentiment category
nrcdf %>% count(sentiment, sort = TRUE)
## # A tibble: 10 × 2
## sentiment n
## <chr> <int>
## 1 negative 3316
## 2 positive 2308
## 3 fear 1474
## 4 anger 1245
## 5 trust 1230
## 6 sadness 1187
## 7 disgust 1056
## 8 anticipation 837
## 9 joy 687
## 10 surprise 532
Most of the tagged terms are in the negative and positive categories, while the fewest tagged terms are in the surprise category.
Let’s apply this lexicon to our TREE_NOTES terms. We can join the dataframes using inner_join().
# merge our datasets
treewordstidy_sentiment <- treewordstidy %>% inner_join(get_sentiments("nrc"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 21 of `x` matches multiple rows in `y`.
## ℹ Row 1098 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
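The warning above is expected here: a word can occur many times in the notes, and the lexicon can tag the same word with more than one sentiment, so one row on each side can match several rows on the other. A minimal sketch with hypothetical `toy_words` and `toy_lexicon` tables shows the effect, and shows how the `relationship` argument (available in dplyr 1.1.0 and later) declares that this is intended.

```r
library(dplyr)

# Hypothetical tokens: "remove" appears twice in the notes
toy_words <- tibble(word = c("remove", "remove", "tree", "damage"))

# Hypothetical lexicon: "remove" carries two tags, "tree" has none
toy_lexicon <- tibble(
  word      = c("remove", "remove", "damage"),
  sentiment = c("sadness", "negative", "negative")
)

# Declare the many-to-many match explicitly so no warning is raised
toy_tagged <- toy_words %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

nrow(toy_tagged)
```

Each "remove" token is repeated once per tag, and "tree" drops out because inner_join() keeps only words found in both tables. Keep this duplication in mind when reading counts from the joined dataframe.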
Next, we might want to examine the words in our TREE_NOTES that are most frequently associated with particular sentiments. The following code pulls up the most frequently occurring terms that are associated with the sentiment sadness.
treewordstidy_sentiment %>% filter(sentiment == "sadness") %>% count(word, sort = TRUE) %>% top_n(30)
## Selecting by n
## word n
## 1 remove 3078
## 2 damage 1179
## 3 wound 953
## 4 decay 912
## 5 failure 554
## 6 pine 516
## 7 lost 477
## 8 wildfire 443
## 9 conflict 440
## 10 emergency 399
## 11 broken 355
## 12 lowest 319
## 13 bad 265
## 14 strip 263
## 15 fell 227
## 16 lower 182
## 17 missing 174
## 18 broke 157
## 19 leave 147
## 20 hollow 133
## 21 cancel 132
## 22 rot 123
## 23 buried 115
## 24 tough 113
## 25 late 112
## 26 poison 103
## 27 dying 91
## 28 fall 91
## 29 hanging 82
## 30 badly 74
We can also visualize these results. More information on the next steps in text analysis with R can be found in Text Mining with R: A Tidy Approach.
# drawn from https://www.tidytextmining.com/sentiment
treewordstidy_sentiment %>%
  group_by(sentiment) %>%
  count(word, sentiment, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y")
From the previous analysis, we saw that the word “decay” is frequently used in the TREE_NOTES. Let’s see how often the word “decay” is used in notes on trees in each genus.
DCtrees %>% filter(grepl("decay", TREE_NOTES)) %>% group_by(GENUS_NAME) %>% summarise(n=n()) %>% arrange(desc(n))
## # A tibble: 37 × 2
## GENUS_NAME n
## <chr> <int>
## 1 "Acer" 233
## 2 "Quercus" 231
## 3 "" 37
## 4 "Ulmus" 35
## 5 "Prunus" 32
## 6 "Tilia" 29
## 7 "Liriodendron" 19
## 8 "Ginkgo" 17
## 9 "Platanus" 17
## 10 "Cercis" 10
## # ℹ 27 more rows
# Double check results and examine the TREE_NOTES for Acer trees
#DCtrees %>% filter(grepl("decay", TREE_NOTES)) %>% filter(GENUS_NAME=="Acer") %>% select(GENUS_NAME, TREE_NOTES)
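One thing to keep in mind is that grepl() is case-sensitive by default, while the tokenized words counted earlier do not depend on the capitalization in the raw notes in the same way. The sketch below uses a hypothetical `toy_notes` vector to show the difference; whether it changes the genus counts depends on how the notes are capitalized in the dataset, which is not checked here.

```r
# Hypothetical notes standing in for the TREE_NOTES column
toy_notes <- c("Decay at base", "some decay in trunk", "healthy", NA)

# Case-sensitive (the default): misses the capitalized "Decay"
strict_match <- grepl("decay", toy_notes)

# Case-insensitive: matches both spellings; NA is treated as no match
loose_match <- grepl("decay", toy_notes, ignore.case = TRUE)

sum(strict_match)
sum(loose_match)
```

If the two sums differ on your real data, adding `ignore.case = TRUE` to the filter is a simple way to catch the extra rows.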
Finally, let’s pull everything we’ve learned together and make a map of the trees described by “decay” in DC. Since Acer and Quercus are the two genera with the most mentions of “decay”, let’s color them differently from the other genera.
# keep only the trees whose notes mention "decay"
decay <- DCtrees %>% filter(grepl("decay", TREE_NOTES))
leaflet(decay) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = decay$X, lat = decay$Y, group = "trees", radius = 2, opacity = .5,
    # purple for Acer, blue for Quercus, dark green for every other genus
    color = ifelse(decay$GENUS_NAME == "Acer", "purple",
                   ifelse(decay$GENUS_NAME == "Quercus", "blue", "darkgreen")),
    label = decay$SCI_NM,
    popup = paste("<b>Name:</b>", decay$CMMN_NM, "<br/>",
                  "<b>Genus:</b>", decay$GENUS_NAME, "<br/>",
                  "<b>Tree notes:</b>", decay$TREE_NOTES, sep = " ")
  )
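Nested ifelse() calls become hard to read as more genera are added. As one possible alternative, the color rule can be written with dplyr's case_when(). The `genus_color()` helper below is hypothetical (it is not part of leaflet), and the genus names passed to it are made-up test values.

```r
library(dplyr)

# Hypothetical helper: map a vector of genus names to marker colors
genus_color <- function(genus) {
  case_when(
    genus == "Acer"    ~ "purple",
    genus == "Quercus" ~ "blue",
    TRUE               ~ "darkgreen"  # every other genus, including blanks
  )
}

toy_colors <- genus_color(c("Acer", "Ulmus", "Quercus", ""))
toy_colors
```

In the map code, this could then be supplied as `color = genus_color(decay$GENUS_NAME)`, assuming the genus column is named GENUS_NAME as in the summary table above.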