Overview

This workshop is designed to help anthropologists get started using R. R is an amazing open-source tool for cleaning, analyzing, and visualizing data. You can also use R for web publishing and creating other types of documents. In this workshop, we’ll cover some essentials for sociocultural anthropologists and archaeologists who are beginning their R journeys.

We’ll cover the following topics:

  • Basic R operations
  • Exploratory data analysis
  • Creating interactive maps
  • Introductory text analysis

At the end of this page we have listed additional R resources to help you continue to build skills after this workshop.

Getting started

Working with RStudio

We will be working with R using the software RStudio. RStudio has a number of helpful tools that make R more user-friendly. To create a new R script in RStudio, navigate to File -> New File -> R Script. An important thing to understand when working with RStudio is that the console and the R script are two different things. The R script is where you will write your code. Think of it as a text document that you write, save, and can return to later. To run a line from the script, place your cursor on it and press Ctrl+Enter (Cmd+Return on a Mac). The console is where the output from your R script appears. You can tell that your code has finished running and R is ready for the next input when you see a > symbol on the last line of the console.

You can customize the layout of your panes and aesthetics of your code by navigating to Tools -> Global Options.

Writing code

R code is typically written in the script pane of RStudio, while the output appears in the console. Let’s try inputting some basic arithmetic.

1 + 2

5 - 3 

2 * (5-3)

Note that R follows the basic order of operations (PEMDAS).
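As a quick check of this rule, the lines below (an illustrative sketch, not part of the original exercises) show how parentheses and exponents change a result:

```r
# Multiplication happens before subtraction
2 * 5 - 3      # 10 - 3 = 7

# Parentheses are evaluated first
2 * (5 - 3)    # 2 * 2 = 4

# Exponents come before multiplication
2 * 3^2        # 2 * 9 = 18
```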

The primary data types in R are:

  1. numeric
  2. string or character
  3. logical
  4. factor
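To see which type R has assigned to a value, you can pass it to class(). The values below are made up for illustration; a factor stores categories, such as tree condition ratings.

```r
class(3.5)                                  # "numeric"
class("book")                               # "character"
class(TRUE)                                 # "logical"
class(factor(c("Good", "Fair", "Good")))    # "factor"
```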

String or character data are entered with "" or '' surrounding the characters. Let’s try entering some strings into R.

"book"
"three"
book

Note the error on the last line: Error: object 'book' not found. This means that there is no object called book in our R workspace. We’ll learn about objects in the following section.

Objects and functions

R works by running functions on objects and other data. This means that we can use a function to tell R what we want to do with the input data.

For example, let’s ask R to tell us the mean of the numbers from 5 to 8 with the following code:

mean(5:8)
## [1] 6.5

We can also make a quick plot with R:

plot(iris$Petal.Length, iris$Petal.Width)

If you encounter a new function and don’t know exactly how it works, you can pull up its help file by writing ?EXAMPLEFUNCTIONNAME() or, in this case, ?mean(). The help file will tell you what type of data the function can read and which arguments or options you can adjust, and it will usually give several examples of how to use the function in practice.

While you can directly enter data each time you want to use it, R’s power comes from assigning data to named objects or variables. Objects are assigned using the <- symbol, which means “everything on the right is now referred to by the object name on the left”.

house <- c(1,2,3)
house
## [1] 1 2 3
sizes <- seq(1, 5)
sizes
## [1] 1 2 3 4 5
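Once a value has a name, you can pass that name to functions or use it in arithmetic. A brief sketch reusing the house object from above (the name house_doubled is introduced here only for illustration):

```r
house <- c(1, 2, 3)

length(house)   # number of elements: 3
mean(house)     # average: 2
house * 2       # each element doubled: 2 4 6

# Store a result under a new name so it can be reused later
house_doubled <- house * 2
house_doubled
```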

R is case sensitive

Something very important to keep in mind with R is that it is case sensitive, unlike some other languages. Paying attention to capitalization helps you keep track of different variables, and mismatched capitalization is a common cause of coding errors. For example, we can create different objects referring to trees by changing the capitalization.

Tree <- "tree"
TREE <- "tree again"

Now try entering:

Tree
TREE
tree

Why doesn’t the last line work?

Let’s make an object called tree.

tree <- "a third tree"

Data classes

There are several different types or classes of data that can be assigned in R. We’ll cover a few basics here. Vectors can hold string, logical, or numeric data, but each vector can only contain one data class. You can check the class of a vector with the class() function. In addition, R has built-in checks for different classes, such as is.numeric().

x <- seq(1,5)
is.numeric(x)
x <- c(1:5)
x <- c(1,10,"eleven", 27) 
is.numeric(x)
x <- c(rep(T, 10), F, F, T)
x

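Note that when numbers and a string are combined in one vector, as in c(1,10,"eleven", 27), R automatically converts the numeric data into string data, because a vector can only contain one data class. This is called “coercion”.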
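The short sketch below makes the coercion visible with class(). The object name mixed is chosen here for illustration.

```r
mixed <- c(1, 10, "eleven", 27)
class(mixed)               # "character"
mixed                      # "1" "10" "eleven" "27"

# Converting back turns anything that is not a number into NA (with a warning)
as.numeric(mixed)          # 1 10 NA 27

# Logical values mixed with numbers become numeric: TRUE is 1, FALSE is 0
c(TRUE, FALSE, 5)          # 1 0 5
```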

Try it

  1. Make a new object called “heights” that is composed of the numbers: 2, 4, 6, 8.

  2. If you didn’t already use it, try using the seq() function to recreate the heights object. Hint: try ?seq() to see the help file for this function.

Solution
heights <- c(2, 4, 6, 8)

heights <- seq(2,8,2)

Installing and loading packages

The base installation of R has many built-in functions and datasets. However, we often want to extend R’s functionality by installing new packages with additional functions. The cool thing about R is that it is made by its users, so there are always new packages being written to solve new data wrangling, analysis, and visualization challenges. As of writing this, there are 20,656 packages listed on CRAN.

Packages only need to be installed one time per user on a computer. However, each time you start a new R session you will need to load the packages you want to use with library(). I usually install packages using the Tools -> Install Packages menu. Be sure to select “install dependencies”, as this will also install any other packages that the package you are installing needs in order to run.

For loading packages, it is best to load them using R code. This makes your code reproducible if you send it to someone else and ensures that restarting R and running all your code will work each time.

Let’s take a minute to install and load the following packages:

library(tidyverse)
library(DescTools)
library(leaflet)
library(stringr)
library(tidytext)
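If you prefer to install from code rather than the menu, one possible approach is sketched below. It checks which of the workshop packages are missing and installs only those. The object names workshop_pkgs and to_install are made up for this example, and the interactive() check is an added precaution, not something the workshop requires.

```r
# Packages used in this workshop
workshop_pkgs <- c("tidyverse", "DescTools", "leaflet", "stringr", "tidytext")

# Which of these are not yet installed on this computer?
to_install <- setdiff(workshop_pkgs, rownames(installed.packages()))
to_install

# Install only the missing ones. interactive() is TRUE in RStudio, so nothing
# is downloaded when the script runs unattended.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}
```

After installing, run the library() lines above to load the packages.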

Reading data into R

There are multiple ways to load data into R. Best practice is to load your data file using code rather than drop down menus. This ensures that a specific named file is associated with your code each time and reduces potential for accidentally loading the wrong data file.

For this workshop, we will read in a file using a url. You can also load data from your computer by pasting the file path into your file reading function. See more information about this here.

DCtrees <- read.csv("https://maddiebrown.github.io/ANTH630/data/Urban_Forestry_Street_Trees_2024.csv")
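Reading a file stored on your own computer works the same way, with a file path in place of the URL. The sketch below is self-contained: it first writes a tiny made-up CSV to a temporary folder and then reads it back. The file name and its contents are hypothetical; in practice you would pass the path to your own file.

```r
# Create a small example file (a stand-in for a file you already have)
example_path <- file.path(tempdir(), "example_trees.csv")
write.csv(
  data.frame(GENUS_NAME = c("Acer", "Quercus"), DBH = c(17.7, 5.7)),
  example_path,
  row.names = FALSE
)

# Read it in by giving read.csv() the file path
example_trees <- read.csv(example_path)
example_trees
```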

Whenever we load a new data file, the first thing we might want to do is examine the structure of the data. R has a handy function, str(), that lets you see the number of observations, the column names and data types, and a snippet of the first few observations for each column.

str(DCtrees)
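A few other base R functions answer narrower questions about a data frame's size and column names. The sketch below uses the built-in iris data so that it runs without a download; the same calls should work on DCtrees.

```r
dim(iris)     # rows and columns: 150 5
nrow(iris)    # number of observations: 150
ncol(iris)    # number of columns: 5
names(iris)   # column names
```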

Exploratory data analysis

There are a number of functions in R that can help us get a feel for our data and visualize it in many helpful and interesting ways.

For example, we can use summary() to obtain summary statistics and information about the contents of our data frame, such as the class of the variables in each column of the data frame.

summary(DCtrees)

You can use the head() function to view the first few rows of the data frame.

head(DCtrees)
##           X        Y                   SCI_NM              CMMN_NM GENUS_NAME
## 1 -76.99281 38.88609          Quercus montana    Rock chestnut oak    Quercus
## 2 -76.99206 38.88599              Acer rubrum            Red maple       Acer
## 3 -77.03567 38.92727 Quercus robur fastigiata Columnar English oak    Quercus
## 4 -76.99334 38.88417          Tilia americana      American linden      Tilia
## 5 -76.99838 38.88728         Acer platanoides         Norway maple       Acer
## 6 -77.03931 38.92800           Quercus lyrata          Overcup oak    Quercus
##      FAM_NAME             DATE_PLANT              FACILITYID          VICINITY
## 1    Fagaceae 2018/02/01 18:50:34+00 31982-090-3001-0269-000       922 C ST SE
## 2 Sapindaceae                        31982-100-3005-0155-000      1017 C ST SE
## 3    Fagaceae                        10150-300-3001-0050-000   3029 15TH ST NW
## 4   Tiliaceae                        32691-092-3001-0105-000       904 D ST SE
## 5 Sapindaceae                        30060-020-3001-0101-000     208 6TH ST SE
## 6    Fagaceae 2011/02/17 05:00:00+00 14582-160-3005-0656-000 1653 HOBART ST NW
##   WARD TBOX_L TBOX_W WIRES      CURB  SIDEWALK TBOX_STAT RETIREDDT  DBH
## 1    6     99      7  None Permanent Permanent     Plant            5.7
## 2    6      8      4  None Permanent Permanent     Plant           17.7
## 3    1      6      3  None Permanent Permanent     Plant           10.9
## 4    6      9      4  None Permanent Permanent     Plant           13.4
## 5    6      8      4  None Permanent Permanent     Plant           11.9
## 6    1     99      4  None Permanent Permanent     Plant            9.3
##     DISEASE PESTS CONDITION             CONDITIODT OWNERSHIP
## 1                 Excellent 2024/02/28 23:57:09+00       UFA
## 2                      Fair 2021/02/17 22:21:46+00       UFA
## 3                      Fair 2021/09/13 18:55:03+00       UFA
## 4                      Good 2020/02/14 01:33:24+00       UFA
## 5                      Good 2020/09/16 19:38:17+00       UFA
## 6 Hypoxylon            Dead 2023/05/22 19:49:55+00       UFA
##                                                           TREE_NOTES MBG_WIDTH
## 1                                    Elevated street side. Feb 2024.  13.12336
## 2 P dead wood only and r small mulberry at base, be careful of roots  39.37008
## 3                                                                     29.53926
## 4                                                                     29.52756
## 5 Arborist removed some deadwood and scheduled for pruning on 1/5/17  39.37008
## 6                                                                     29.52756
##   MBG_LENGTH MBG_ORIENTATION MAX_CROWN_HEIGHT MAX_MEAN MIN_CROWN_BASE  DTM_MEAN
## 1   19.68504         90.0000         18.91814 14.26427     0.05331409  82.26296
## 2   45.93176         90.0000         45.90728 30.68867    -0.15571680  81.22527
## 3   46.50863        163.3008         37.41346 21.32403    -0.21777124 202.87526
## 4   45.93176          0.0000         41.53025 22.57074     0.15885049  77.00650
## 5   45.93176         90.0000         32.59466 21.19249    -0.18092014  81.08643
## 6   32.80840          0.0000         37.61407 18.87533    -0.57492221 187.02505
##      PERIM CROWN_AREA CICADA_SURVEY ONEYEARPHOTO SPECIALPHOTO PHOTOREMARKS
## 1  65.6168   215.2780                                                     
## 2 183.7270  1259.3763                                                     
## 3 170.6037   742.7091                                                     
## 4 164.0420  1130.2095                                                     
## 5 177.1654  1119.4456                                                     
## 6 144.3570   688.8896                                                     
##   ELEVATION    SIGN TRRS  WARRANTY CREATED_USER CREATED_DATE EDITEDBY
## 1   Unknown Unknown   NA 2017-2018                              sward
## 2   Unknown Unknown   NA   Unknown                           jchapman
## 3   Unknown Unknown   NA   Unknown                            mmcphee
## 4   Unknown Unknown   NA   Unknown                              sward
## 5   Unknown Unknown   NA   Unknown                              sward
## 6   Unknown Unknown   NA 2010-2011                            jmiller
##   LAST_EDITED_USER       LAST_EDITED_DATE GIS_ID
## 1            sward 2024/02/28 23:57:52+00     NA
## 2         jchapman 2021/02/17 22:21:47+00     NA
## 3          mmcphee 2021/09/13 18:54:32+00     NA
## 4            sward 2020/02/14 01:34:14+00     NA
## 5            sward 2020/09/16 19:51:11+00     NA
## 6          jmiller 2023/05/22 19:50:08+00     NA
##                                 GLOBALID CREATOR CREATED EDITOR EDITED SHAPE
## 1 {0B358D52-AAD4-41AC-B1AF-B19740DBC02A}      NA      NA     NA     NA    NA
## 2 {0F7845B3-E5DE-480B-96EC-B595354BCA5C}      NA      NA     NA     NA    NA
## 3 {EA1C7F1D-8FF6-4A3A-BFBD-0147BABCA5F7}      NA      NA     NA     NA    NA
## 4 {ADB853B2-E32F-4BB4-B949-DE7B5656DCD5}      NA      NA     NA     NA    NA
## 5 {300EF1F5-F440-4E16-BBC0-69C6CDD772CA}      NA      NA     NA     NA    NA
## 6 {0BEFB0A1-AAF4-4958-849C-CFBFBA3D4E78}      NA      NA     NA     NA    NA
##   OBJECTID
## 1 40100904
## 2 40100905
## 3 40100906
## 4 40100907
## 5 40100908
## 6 40100909

Use the tail() function to view the last few rows of your data frame.

tail(DCtrees)
##                X        Y               SCI_NM           CMMN_NM GENUS_NAME
## 211112 -76.98323 38.83522 Malus 'Harvest Gold' Other (See Notes)           
## 211113 -76.92602 38.88502                                                  
## 211114 -77.06645 38.97106                                                  
## 211115 -76.92586 38.88641                                                  
## 211116 -76.92609 38.88525    Quercus palustris           Pin oak    Quercus
## 211117 -76.92634 38.88415                                                  
##        FAM_NAME             DATE_PLANT              FACILITYID
## 211112                                                        
## 211113                                 30530-025-3001-0151-000
## 211114                                 10330-615-3001-0203-000
## 211115                                 30530-015-3005-0089-000
## 211116 Fagaceae                        30530-025-3005-0080-000
## 211117          2019/10/28 18:09:28+00 30530-030-3005-0120-000
##                           VICINITY WARD TBOX_L TBOX_W WIRES      CURB  SIDEWALK
## 211112 1300 Blk Southern Avenue SE    8     NA     NA  None      None      None
## 211113          200 BLK 53RD ST SE    7     99      3  None Permanent Permanent
## 211114             6124 33RD ST NW    4     99      7  Both Permanent Permanent
## 211115              125 53RD ST SE    7     99      3  Both Permanent Permanent
## 211116          300 BLK 53RD ST SE    7     99      3  Both Permanent Permanent
## 211117          OPP 301 53RD ST SE    7     99      3  Both Permanent Permanent
##        TBOX_STAT RETIREDDT  DBH    DISEASE PESTS CONDITION
## 211112     Plant            2.0                       Good
## 211113  Conflict             NA                           
## 211114      Open             NA                       Good
## 211115  Conflict             NA                           
## 211116     Plant           17.5 Trunk Root            Fair
## 211117      Open             NA                       Good
##                    CONDITIODT OWNERSHIP
## 211112 2024/03/08 20:41:09+00       UFA
## 211113 2024/03/08 20:40:24+00       UFA
## 211114 2024/03/08 21:56:08+00       UFA
## 211115 2024/03/08 21:56:34+00       UFA
## 211116 2024/03/08 22:08:56+00       UFA
## 211117 2024/03/08 22:25:55+00       UFA
##                                                                                                                         TREE_NOTES
## 211112                                                                                                               IPMA project 
## 211113                                                                                                 Large curb cut and pp trees
## 211114                                                                                                                            
## 211115                                                                                                               Long guy wire
## 211116                                                                        Plant hornbeam. White building near private magnolia
## 211117 Hornbeam. Plant at corner of intersection opposite of school sign. Site lines less important for left turn, one way street.
##        MBG_WIDTH MBG_LENGTH MBG_ORIENTATION MAX_CROWN_HEIGHT  MAX_MEAN
## 211112        NA         NA              NA               NA        NA
## 211113  32.80840   39.37008               0         54.65678 35.175927
## 211114  85.30184   98.42520              90         89.07318 56.363557
## 211115  26.24672   36.08924               0         30.77502 23.386227
## 211116  22.96588   36.08924               0         38.82791 22.407826
## 211117  22.96588   26.24672              90         13.86132  7.454631
##        MIN_CROWN_BASE DTM_MEAN    PERIM CROWN_AREA CICADA_SURVEY ONEYEARPHOTO
## 211112             NA       NA       NA         NA                           
## 211113    -0.39570668 160.1070 144.3570   914.9315                           
## 211114    -2.13418047 360.8612 413.3858  5920.1450                           
## 211115    -1.63968071 143.1423 124.6719   613.5423                           
## 211116    -0.01722297 158.3834 118.1102   688.8896                           
## 211117    -0.14787045 163.6321 131.2336   312.1531                           
##        SPECIALPHOTO PHOTOREMARKS ELEVATION    SIGN TRRS  WARRANTY CREATED_USER
## 211112                             Unknown Unknown   NA   Unknown    cklapthor
## 211113                             Unknown Unknown   NA   Unknown     rdelsack
## 211114                             Unknown Unknown   NA   Unknown      JCONLON
## 211115                             Unknown Unknown   NA   Unknown     rdelsack
## 211116                             Unknown Unknown   NA   Unknown     rdelsack
## 211117                             Unknown Unknown   NA 2019-2020     rdelsack
##                  CREATED_DATE  EDITEDBY LAST_EDITED_USER       LAST_EDITED_DATE
## 211112 2024/03/08 20:40:53+00 cklapthor        cklapthor 2024/03/08 20:40:53+00
## 211113 2024/03/08 20:45:17+00  rdelsack         rdelsack 2024/03/08 20:45:17+00
## 211114 2024/03/08 21:56:08+00   jconlon          JCONLON 2024/03/08 21:56:08+00
## 211115 2024/03/08 21:56:09+00  rdelsack         rdelsack 2024/03/08 21:56:09+00
## 211116 2024/03/08 22:08:24+00  rdelsack         rdelsack 2024/03/08 22:08:24+00
## 211117 2024/03/08 22:29:29+00  rdelsack         rdelsack 2024/03/08 22:29:29+00
##        GIS_ID                               GLOBALID CREATOR CREATED EDITOR
## 211112     NA {ACF5EEB0-BF92-464E-8BEC-FD54E3EEAB68}      NA      NA     NA
## 211113     NA {ECBB335F-9F07-4A5A-99B0-EA935CF5E2A4}      NA      NA     NA
## 211114     NA {359E8372-14C0-45A1-97E3-35FE79F91486}      NA      NA     NA
## 211115     NA {18A305FC-FEE6-4419-B6FA-D056A9B0F7C6}      NA      NA     NA
## 211116     NA {36DB1DCD-D49A-4168-9E14-723C7D96A242}      NA      NA     NA
## 211117     NA {97C49852-3CBD-4D08-88B8-67A9B90C6D4A}      NA      NA     NA
##        EDITED SHAPE OBJECTID
## 211112     NA    NA 40312223
## 211113     NA    NA 40312224
## 211114     NA    NA 40312225
## 211115     NA    NA 40312226
## 211116     NA    NA 40312227
## 211117     NA    NA 40312228

The unique() function returns the distinct values in a vector, such as a column of your data frame. For example, if you wanted to know which tree genus names appear in the data, and how many there are, you can use this code:

unique(DCtrees$GENUS_NAME)
##  [1] "Quercus"                                       
##  [2] "Acer"                                          
##  [3] "Tilia"                                         
##  [4] "Ulmus"                                         
##  [5] "Liquidambar"                                   
##  [6] ""                                              
##  [7] "Platanus"                                      
##  [8] "Gleditsia"                                     
##  [9] "Ginkgo"                                        
## [10] "Pyrus"                                         
## [11] "Celtis"                                        
## [12] "Zelkova"                                       
## [13] "Liriodendron"                                  
## [14] "Carpinus"                                      
## [15] "Gymnocladus"                                   
## [16] "Pistacia"                                      
## [17] "Other"                                         
## [18] "Nyssa"                                         
## [19] "Syringa"                                       
## [20] "Cercidiphyllum"                                
## [21] "Cercis"                                        
## [22] "Lagerstroemia"                                 
## [23] "Betula"                                        
## [24] "Fagus"                                         
## [25] "Cladrastis"                                    
## [26] "Robinia"                                       
## [27] "Prunus"                                        
## [28] "Magnolia"                                      
## [29] "Parrotia"                                      
## [30] "Metasequoia"                                   
## [31] "Amelanchier"                                   
## [32] "Taxodium"                                      
## [33] "Koelreuteria"                                  
## [34] "Malus"                                         
## [35] "Ostrya"                                        
## [36] "Catalpa"                                       
## [37] "Morus"                                         
## [38] "Styphnolobium"                                 
## [39] "Fraxinus"                                      
## [40] "Ilex"                                          
## [41] "Cornus"                                        
## [42] "Pinus"                                         
## [43] "Aescululs"                                     
## [44] "Chionanthus"                                   
## [45] "Juniperus"                                     
## [46] "Oxydendrum"                                    
## [47] "Halesia"                                       
## [48] "Juglans"                                       
## [49] "Ailanthus"                                     
## [50] "Crataegus"                                     
## [51] "Diospyros"                                     
## [52] "Sassafras"                                     
## [53] "Salix"                                         
## [54] "Eucommia"                                      
## [55] "Carya"                                         
## [56] "No\nNo\nNo"                                    
## [57] "Aesculus"                                      
## [58] "Stewartia"                                     
## [59] "Maclura"                                       
## [60] "Rhus"                                          
## [61] "No"                                            
## [62] "Populus"                                       
## [63] "Tsuga"                                         
## [64] "Picea"                                         
## [65] "Cryptomeria"                                   
## [66] "Cotinus"                                       
## [67] "Thuja"                                         
## [68] "Laburnum"                                      
## [69] "Maackia"                                       
## [70] "Asimina"                                       
## [71] "Cedrus"                                        
## [72] "Cornus\nCornus\nCornus\nCornus\nCornus\nCornus"
## [73] "Corylus"                                       
## [74] "Alnus"                                         
## [75] "Paulownia"                                     
## [76] "Tetradium"                                     
## [77] "Phellodendron"                                 
## [78] "Taxus"                                         
## [79] "Alibizia"                                      
## [80] "Toona"                                         
## [81] "Viburnum"                                      
## [82] "Amelanchier x"
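The list above includes a blank entry and spelling variants such as "Aescululs" alongside "Aesculus", so the number of distinct values is not quite the number of genera. The sketch below uses a short made-up vector (genus is a hypothetical name) to show how to count distinct values and see how often each one occurs.

```r
# Made-up genus values, including a blank and a misspelling
genus <- c("Quercus", "Acer", "", "Acer", "Aescululs", "Aesculus")

unique(genus)           # the distinct values
length(unique(genus))   # how many distinct values: 5

# Count how often each value occurs, most common first
genus_counts <- sort(table(genus), decreasing = TRUE)
genus_counts
```

On the workshop data, the same idea would be length(unique(DCtrees$GENUS_NAME)).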

In this tutorial, we will mostly use ‘tidyverse’, which we’ve already installed above, to perform exploratory data analysis. Tidyverse is a collection of R packages designed to make data manipulation, visualization, and analysis easier and more efficient.

library(tidyverse)

The ‘dplyr’ package within tidyverse provides a set of functions for efficiently manipulating data frames. Some of the main functions in dplyr are filter(), mutate(), summarize(), and arrange(). These functions allow you to keep rows that meet a condition, create new variables, compute summaries, and sort rows.

Filtering subsets of data

We can filter the dataset to select specific rows based on conditions using filter(). For example, we can keep only the rows for trees with DBH greater than 10.

DBH_trees <- DCtrees %>%
  filter(DBH > 10)
#head(DBH_trees)
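One detail worth knowing is that filter() also drops rows where the condition cannot be evaluated because the value is missing. The sketch below shows this on a small made-up data frame (toy_trees and big_trees are hypothetical names).

```r
library(dplyr)

# Hypothetical mini data set; one tree has no DBH measurement
toy_trees <- data.frame(
  GENUS_NAME = c("Quercus", "Acer", "Tilia", "Acer"),
  DBH        = c(5.7, 17.7, NA, 13.4)
)

# Keeps rows 2 and 4; the NA row is dropped because NA > 10 is not TRUE
big_trees <- toy_trees %>% filter(DBH > 10)
big_trees

nrow(toy_trees)   # 4
nrow(big_trees)   # 2
```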

We can use the %like% operator from the DescTools package to identify substrings within longer strings. Putting % symbols at the start and end of the string we want to find tells R to look for any occurrence of the string within the data. Read more about DescTools’ %like% operator here.

The code below tells R to look within the DCtrees data frame and filter (keep) all the rows where the common name (CMMN_NM) contains the string “apple”. Next, select() keeps only the column that contains the common names of the plants (CMMN_NM). Finally, because so many trees have apple in their names, head(20) asks R to show only the first 20 results. Keep in mind that the match is case sensitive, so this will only find names that contain “apple” in all lowercase letters.

DCtrees %>% filter(CMMN_NM %like% "%apple%") %>% select(CMMN_NM) %>% head(20)
##                   CMMN_NM
## 1               Crabapple
## 2               Crabapple
## 3               Crabapple
## 4               Crabapple
## 5               Crabapple
## 6        Arnold crabapple
## 7               Crabapple
## 8               Crabapple
## 9               Crabapple
## 10              Crabapple
## 11              Crabapple
## 12              Crabapple
## 13 Donald Wyman Crabapple
## 14              Crabapple
## 15   Adirondack Crabapple
## 16              Crabapple
## 17              Crabapple
## 18 Donald Wyman Crabapple
## 19 Donald Wyman Crabapple
## 20              Crabapple
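If you want a match that ignores capitalization, one option is str_detect() from the stringr package loaded earlier, combined with regex(ignore_case = TRUE). This is an alternative to %like%, not the approach used in the rest of this workshop, and the names in names_vec are made up for illustration.

```r
library(stringr)

# Made-up common names with mixed capitalization
names_vec <- c("Crabapple", "Apple serviceberry", "Red maple", "APPLE")

# Case sensitive: misses "Apple serviceberry" and "APPLE"
str_detect(names_vec, "apple")
# TRUE FALSE FALSE FALSE

# Case insensitive: matches every capitalization of "apple"
str_detect(names_vec, regex("apple", ignore_case = TRUE))
# TRUE TRUE FALSE TRUE
```

Inside a pipeline, the same test can go in filter(), for example filter(str_detect(CMMN_NM, regex("apple", ignore_case = TRUE))).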

We can also combine filters. Let’s keep only the trees in the genus Acer with DBH greater than 10.

Acer_DBH_trees <- DCtrees %>%
  filter(GENUS_NAME=="Acer", DBH > 10)
#head(Acer_DBH_trees)

We can also combine filters using logical operators ‘&’ and ‘|’. ‘&’ is used to specify that both conditions must be true. ‘|’ is used to specify that either condition can be true.

We can use ‘&’ to rewrite the code above, which selects Acer trees with a DBH greater than 10, in a different way.

Acer_DBH_trees2 <- DCtrees %>%
  filter(GENUS_NAME=="Acer" & DBH > 10)
#head(Acer_DBH_trees2)

We can use a combination of ‘&’ and ‘|’ to select trees in either the genus Acer or the genus Quercus that have a DBH greater than 10. Because R evaluates ‘&’ before ‘|’, the two genus conditions need to be wrapped in parentheses.

Acer_Q_DBH <- DCtrees %>%
  filter((GENUS_NAME == "Acer" | GENUS_NAME == "Quercus") & DBH > 10)
#head(Acer_Q_DBH)
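When ‘&’ and ‘|’ appear in the same condition, R evaluates ‘&’ first, so parentheses change which rows are kept. The sketch below shows the difference on a small made-up data frame (all object names are hypothetical) and also shows %in% as a compact way to test for several genera.

```r
library(dplyr)

toy_trees <- data.frame(
  GENUS_NAME = c("Acer", "Acer", "Quercus", "Quercus", "Tilia"),
  DBH        = c(4, 12, 6, 20, 15)
)

# Without parentheses, & binds first: every Acer, plus Quercus over 10
no_parens <- toy_trees %>%
  filter(GENUS_NAME == "Acer" | GENUS_NAME == "Quercus" & DBH > 10)
nrow(no_parens)      # 3

# With parentheses: Acer or Quercus, and in both cases DBH over 10
with_parens <- toy_trees %>%
  filter((GENUS_NAME == "Acer" | GENUS_NAME == "Quercus") & DBH > 10)
nrow(with_parens)    # 2

# %in% expresses the same genus test more compactly
with_in <- toy_trees %>%
  filter(GENUS_NAME %in% c("Acer", "Quercus"), DBH > 10)
nrow(with_in)        # 2
```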

Try it

  1. How would you filter all the rows where the FAM_NAME is “Rosaceae”?

  2. How would you filter all the rows where the CMMN_NM contains the string apple anywhere in it and the CONDITION of the tree is “Excellent”?

Solution
DCtrees %>% filter(FAM_NAME == "Rosaceae")

DCtrees %>% filter(CMMN_NM %like% "%apple%" &  CONDITION == "Excellent")

Selecting specific columns

filter() selects rows, but select() is used to select columns.

For example, we could use select() if we only wanted the facility ID (FACILITYID) and vicinity (VICINITY) of the trees that are in genus Acer and have a DBH greater than 10.

Facility_ID_Large_Acer <- Acer_DBH_trees2 %>%
  select(FACILITYID, VICINITY)
#head(Facility_ID_Large_Acer)

You can rename columns in your dataset using rename(). For example, if we wanted to change VICINITY to ADDRESS, we could use the following code:

DCtrees_Address <- DCtrees %>%
  rename(ADDRESS = VICINITY)
#head(DCtrees_Address)

You can add new columns with adjusted values using mutate(). For example, we can add a column with the circumference of each tree.

Circumference_DCtrees <- DCtrees %>%
  mutate(CIRCUMFERENCE = DBH * pi )
#head(Circumference_DCtrees)
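Once a new column exists, arrange() can sort the rows by it. The sketch below chains mutate() and arrange() on a small made-up data frame; toy_trees and sorted_trees are hypothetical names.

```r
library(dplyr)

toy_trees <- data.frame(
  GENUS_NAME = c("Quercus", "Acer", "Tilia"),
  DBH        = c(5.7, 17.7, 13.4)
)

sorted_trees <- toy_trees %>%
  mutate(CIRCUMFERENCE = DBH * pi) %>%   # circumference = diameter * pi
  arrange(desc(CIRCUMFERENCE))           # largest trees first

sorted_trees
```

Leaving out desc() sorts from smallest to largest instead.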

Descriptive statistics

Tidyverse can also be used to produce descriptive statistics from your dataset. summarize() is a good tool for producing descriptive statistics like mean(), median(), sd(), and sum().

For example, the code below will calculate the mean, median, and standard deviation of DBH in each genus. We will also need to use the group_by() function to do this.

DCtrees_Stats <- DCtrees %>%
  group_by(GENUS_NAME) %>%
  summarize(mean_DBH = mean(DBH), median_DBH = median(DBH), sd_DBH = sd(DBH))

#DCtrees_Stats

As you can see when you print DCtrees_Stats, some of the statistics come back as ‘NA’ because the DBH of some trees was not measured, so those trees have ‘NA’ in the DBH column. We can fix this issue by ignoring ‘NA’ values and calculating our statistics with only the trees we have data for. To do this, we will use ‘na.rm = TRUE’.

DCtrees_Stats <- DCtrees %>%
  group_by(GENUS_NAME) %>%
  summarize(mean_DBH = mean(DBH, na.rm = TRUE), median_DBH = median(DBH, na.rm = TRUE), sd_DBH = sd(DBH, na.rm = TRUE))
#DCtrees_Stats
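When missing values are dropped, it helps to report how many trees each statistic is based on. The sketch below adds two counts to a grouped summary using n() and sum(!is.na()). The data and column names n_trees and n_measured are made up for illustration.

```r
library(dplyr)

toy_trees <- data.frame(
  GENUS_NAME = c("Acer", "Acer", "Acer", "Quercus", "Quercus"),
  DBH        = c(10, 14, NA, 20, NA)
)

toy_stats <- toy_trees %>%
  group_by(GENUS_NAME) %>%
  summarize(
    n_trees    = n(),                     # all rows in the group
    n_measured = sum(!is.na(DBH)),        # rows with a DBH value
    mean_DBH   = mean(DBH, na.rm = TRUE)  # mean of the measured trees only
  )

toy_stats
```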

Plotting data

The ‘ggplot2’ package in tidyverse is a great tool for visualizing data. You can use it to make many different kinds of plots. ggplot() is the function we will use to initialize a plot. Inside it, aes() maps columns of the data to aesthetics of the plot, such as the x and y axes. Geometric layers like points, lines, and bars are then added with + and a geom function such as geom_point() or geom_boxplot(). You can also add titles, labels, and themes to your plot.

Let’s make a scatterplot:

ggplot(DCtrees, aes(x=MBG_WIDTH, y=MBG_LENGTH)) +
  geom_point(size=2, shape=23) +
  labs(x = "MBG Width", y = "MBG Length",
       title = "MBG Length vs Width")

A box and whisker plot showing DBH for Acer, Quercus, Magnolia, and Juniperus trees:

Genus_selected <- DCtrees %>%
  filter(GENUS_NAME=="Acer" | GENUS_NAME== "Quercus" | GENUS_NAME == "Magnolia" | GENUS_NAME== "Juniperus")


ggplot(Genus_selected, aes(x = GENUS_NAME, y = DBH)) +
  geom_boxplot() +
  labs(x = "Genus of Trees", y = "Diameter at Breast Height (DBH)",
       title = "Comparison of DBH Across Tree Genera") +
  theme_minimal()

In the code below, I have hidden the outlier points (outlier.shape = NA) and changed the visible y-axis range. The outlying values are still in the data, which still needs to be cleaned further.

ggplot(Genus_selected, aes(x = GENUS_NAME, y = DBH)) +
  geom_boxplot(outlier.shape = NA, fill = "lightblue", color = "blue") +
    coord_cartesian(ylim = c(0,50)) +
  labs(x = "Genus of Trees", y = "Diameter at Breast Height (DBH)",
       title = "Comparison of DBH Across Tree Genera") +
  theme_minimal()
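The choice of coord_cartesian() matters here. It zooms the view without discarding data, whereas setting limits with ylim() removes values outside the range before the boxes are computed, which can change the medians and quartiles. The sketch below contrasts the two on made-up data; toy_trees, p_zoom, and p_drop are hypothetical names.

```r
library(ggplot2)

# Made-up DBH values; the 120 is an extreme measurement
toy_trees <- data.frame(
  GENUS_NAME = rep(c("Acer", "Quercus"), each = 5),
  DBH        = c(8, 10, 12, 14, 120, 15, 18, 20, 22, 25)
)

# Zooms the view; the box statistics still use every value
p_zoom <- ggplot(toy_trees, aes(x = GENUS_NAME, y = DBH)) +
  geom_boxplot(outlier.shape = NA) +
  coord_cartesian(ylim = c(0, 50))

# Drops values outside 0 to 50 before the boxes are computed
# (ggplot2 warns about the removed rows when the plot is drawn)
p_drop <- ggplot(toy_trees, aes(x = GENUS_NAME, y = DBH)) +
  geom_boxplot(outlier.shape = NA) +
  ylim(0, 50)
```

Type p_zoom or p_drop in the console to draw each plot and compare the Acer boxes.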

A bar graph showing mean DBH of Acer, Juniperus, Magnolia, and Quercus genera:

mean_dbh <- aggregate(DBH ~ GENUS_NAME, data = Genus_selected, FUN = mean)

ggplot(mean_dbh, aes(x = GENUS_NAME, y = DBH)) +
  geom_bar(stat = "identity", fill = "skyblue", width = 0.5) +  # Create bar plot
  labs(x = "Genus", y = "Mean DBH") +  # Add labels for axes
  ggtitle("Mean Diameter at Breast Height (DBH) by Genus") +  # Add title
  theme_minimal()

Mapping with Leaflet

Leaflet is a package that enables the rapid creation of interactive maps using R. If you didn’t previously load the leaflet library, do so now with the following code: library(leaflet). Note: you may also need to install the package before loading it.

Leaflet maps are created in layers, first adding the basemap and then the interactive data layers on top. Here, we will make a map of the location of the pawpaw trees in DC using leaflet.

Where are the Pawpaw trees in DC?

First, we can isolate all the observations whose common name contains the string Pawpaw.

pawpaws <- DCtrees %>% filter(CMMN_NM %like% "%Pawpaw%")

Next we can make a map of the pawpaw trees. We can pipe all the functions to create the map in the same line of code. When adding markers to the map, it is important to specify which columns refer to the latitude and longitude.

leaflet(pawpaws) %>% addTiles() %>% addCircleMarkers(lng=pawpaws$X, lat=pawpaws$Y)
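Because the data frame is already passed to leaflet(), the coordinates can also be given as one-sided formulas (~X, ~Y), which leaflet evaluates inside that data frame. This is a minimal sketch with made-up coordinates (toy_points is hypothetical):

```r
library(leaflet)

# Made-up coordinates near Washington, DC (for illustration only)
toy_points <- data.frame(
  X = c(-77.03, -77.01, -77.05),  # longitude
  Y = c(38.90, 38.92, 38.89)      # latitude
)

# ~X and ~Y refer to columns of the data frame given to leaflet()
toy_map <- leaflet(toy_points) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~X, lat = ~Y)

toy_map
```

The formula style avoids repeating the data frame name and keeps each layer tied to the data it was given.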

Making multi-layered maps

Suppose we wanted to add multiple layers to the leaflet map. Similar to adding layers to a plot in ggplot2, we can add a second addCircleMarkers() function to our original leaflet map.

Let’s select all the apple trees that are marked as being in “Excellent” condition.

excellentapples <- DCtrees %>% filter(CMMN_NM %like% "%apple%" &  CONDITION == "Excellent")

Next, we can add this layer to our map. Here I am also adding in the argument “popup” for the excellent apples layer. The popup here displays the common name of the tree.

leaflet(pawpaws) %>% 
  addTiles() %>%
  addCircleMarkers(lng=pawpaws$X, lat=pawpaws$Y) %>%
  addCircleMarkers(lng=excellentapples$X, lat=excellentapples$Y, popup=excellentapples$CMMN_NM, color="green", radius=2)

Try it

Add a third layer to the map, this time selecting all the apples marked as being in “Good” condition. First, create a new data frame of good apples and then add it to our existing map. Finally, add a popup showing the scientific name of the tree.

Click to see a solution

First, make a good apples data frame.

goodapples <- DCtrees %>% filter(CMMN_NM %like% "%apple%" &  CONDITION == "Good")

Now make the map with all layers added in.

leaflet(pawpaws) %>% 
  addTiles() %>%
  addCircleMarkers(lng=pawpaws$X, lat=pawpaws$Y) %>%
  addCircleMarkers(lng=excellentapples$X, lat=excellentapples$Y, label=excellentapples$CMMN_NM, popup=excellentapples$CMMN_NM, color="green", radius=2) %>%
  addCircleMarkers(lng=goodapples$X, lat=goodapples$Y, color="red", radius=2, label=goodapples$SCI_NM, popup = goodapples$SCI_NM)

Controlling map layers

In leaflet we can also control whether each layer is shown on the map. Using the addLayersControl() function, we can add a menu for turning layers on or off. First, add a group argument to each layer. These group labels are then passed to the baseGroups and overlayGroups arguments. Note that baseGroups are shown as radio buttons, so only one of them is visible at a time (this is typically used for switching between basemaps), while overlayGroups are shown as checkboxes that can each be turned on and off independently.

#https://rstudio.github.io/leaflet/articles/showhide.html

leaflet(pawpaws) %>% 
  addTiles() %>%
  addCircleMarkers(lng=pawpaws$X, lat=pawpaws$Y, group="Pawpaws") %>%
  addCircleMarkers(lng=excellentapples$X, lat=excellentapples$Y, popup=excellentapples$CMMN_NM, color="green", radius=2, group="Excellent apples") %>%
  addCircleMarkers(lng=goodapples$X, lat=goodapples$Y, popup=goodapples$CMMN_NM, color="red", radius=2, group="Good apples") %>%
  addLayersControl(
    baseGroups = c("Excellent apples", "Good apples"),
    overlayGroups = c("Pawpaws"),
    options = layersControlOptions(collapsed = FALSE)
  )
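In the map above, the two apple layers are base groups, so the menu shows only one of them at a time. If you would rather toggle every layer independently, all groups can be listed under overlayGroups. This sketch uses made-up points and hypothetical layer names ("Layer A", "Layer B"):

```r
library(leaflet)

# Two made-up sets of points near Washington, DC
toy_a <- data.frame(X = c(-77.03, -77.01), Y = c(38.90, 38.92))
toy_b <- data.frame(X = c(-77.05, -77.02), Y = c(38.89, 38.91))

toy_map <- leaflet() %>%
  addTiles() %>%
  addCircleMarkers(data = toy_a, lng = ~X, lat = ~Y,
                   color = "green", radius = 2, group = "Layer A") %>%
  addCircleMarkers(data = toy_b, lng = ~X, lat = ~Y,
                   color = "red", radius = 2, group = "Layer B") %>%
  addLayersControl(
    overlayGroups = c("Layer A", "Layer B"),  # checkboxes, toggled independently
    options = layersControlOptions(collapsed = FALSE)
  ) %>%
  hideGroup("Layer B")  # start with Layer B switched off

toy_map
```

hideGroup() is optional; it only sets which layers are visible when the map first loads.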

Additional mapping resources

Introductory text analysis

Analyzing text data can yield exciting insights into our research questions. We’ll cover how to combine qualitative and structured approaches to text analysis through analyzing the TREE_NOTES column in our data. First, let’s examine the first 20 rows. Do any patterns or themes stand out to you?

head(DCtrees$TREE_NOTES, 20)
##  [1] "Elevated street side. Feb 2024."                                                                                                                                                                                               
##  [2] "P dead wood only and r small mulberry at base, be careful of roots"                                                                                                                                                            
##  [3] ""                                                                                                                                                                                                                              
##  [4] ""                                                                                                                                                                                                                              
##  [5] "Arborist removed some deadwood and scheduled for pruning on 1/5/17"                                                                                                                                                            
##  [6] ""                                                                                                                                                                                                                              
##  [7] "Two leaders, both kinds horizontal, Oct 2022., resident at 313 says we are basically a waste of tax payer money so try and be nice to him "                                                                                    
##  [8] ""                                                                                                                                                                                                                              
##  [9] "Bread loaf-sized Inonatus at base. Three.“Black crust” Kretzschmeria conk, fist-sized, on root flare, edge of sidewalk.  Grew one inch DBH since 2017. Another shelf conk at 15’ up.  Dieback sprinkled thru crown, June 2019."
## [10] ""                                                                                                                                                                                                                              
## [11] ""                                                                                                                                                                                                                              
## [12] ""                                                                                                                                                                                                                              
## [13] "P. Beginning of bls potentiallyWash gas disrupted soil"                                                                                                                                                                        
## [14] ""                                                                                                                                                                                                                              
## [15] ""                                                                                                                                                                                                                              
## [16] " "                                                                                                                                                                                                                             
## [17] "Early defoliation, likely bacterial or fungal leaf"                                                                                                                                                                            
## [18] "Multiple vertical leaders."                                                                                                                                                                                                    
## [19] ""                                                                                                                                                                                                                              
## [20] "sprouts, elevate"
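Many of these notes are empty strings, and row 16 contains only a space. Before analyzing the text, it can help to know how many notes actually contain something. The sketch below uses a small made-up vector (notes is hypothetical; with the workshop data you would use DCtrees$TREE_NOTES instead):

```r
# Made-up notes, including an empty string, a lone space, and a missing value
notes <- c("Elevated street side. Feb 2024.", "", " ", "sprouts, elevate", NA)

# trimws() strips leading and trailing whitespace, so " " counts as blank
is_blank <- is.na(notes) | trimws(notes) == ""

sum(is_blank)   # number of blank notes
mean(is_blank)  # share of blank notes
```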

Let’s go through some different approaches to analyzing these open-ended text responses. We’ll cover:

  • Word frequency
  • Sentiment analysis

Word Frequency

One of the first things we might examine in text analysis is the frequency with which particular words occur. In R, we can split longer strings into individual words (tokens) using the unnest_tokens() function from the tidytext package.

Let’s pull out the individual words found in the TREE_NOTES column of the DCtrees data.

#first pull out only the columns needed for this analysis
treewords<- DCtrees %>% select(OBJECTID, TREE_NOTES)

#unnest the words
treewords <- treewords %>% unnest_tokens(output=word, input=TREE_NOTES) 

#take a look at the first 20 rows
head(treewords, 20)
##    OBJECTID     word
## 1  40100904 elevated
## 2  40100904   street
## 3  40100904     side
## 4  40100904      feb
## 5  40100904     2024
## 6  40100905        p
## 7  40100905     dead
## 8  40100905     wood
## 9  40100905     only
## 10 40100905      and
## 11 40100905        r
## 12 40100905    small
## 13 40100905 mulberry
## 14 40100905       at
## 15 40100905     base
## 16 40100905       be
## 17 40100905  careful
## 18 40100905       of
## 19 40100905    roots
## 20 40100908 arborist
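To see what unnest_tokens() does to each note, here is a small sketch on made-up rows (toy_notes is hypothetical). With its default settings the function lowercases the text, strips punctuation, and produces no rows for empty notes, which is consistent with the gaps in OBJECTID in the output above.

```r
library(dplyr)
library(tidytext)

# Three made-up notes; the second one is empty
toy_notes <- data.frame(
  OBJECTID = c(1, 2, 3),
  TREE_NOTES = c("Elevated street side. Feb 2024.", "", "Sprouts, elevate"),
  stringsAsFactors = FALSE
)

# One row per word; OBJECTID is repeated so each word stays linked to its tree
toy_words <- toy_notes %>%
  unnest_tokens(output = word, input = TREE_NOTES)

toy_words
```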

Now let’s examine the top 20 words in these notes.

#look at the top 20 words in the document
treewords %>% count(word,sort=T) %>% top_n(20)
## Selecting by n
##         word     n
## 1      prune 20711
## 2     pruned 10536
## 3       tree  9341
## 4         to  8726
## 5        and  7031
## 6        for  5846
## 7         by  5375
## 8         of  5303
## 9   deadwood  5031
## 10        in  4762
## 11   elevate  4649
## 12   planted  4008
## 13  warranty  3978
## 14         p  3684
## 15       the  3424
## 16 clearance  3108
## 17    remove  3078
## 18  sidewalk  2887
## 19        on  2852
## 20        no  2756

Looking at these most frequently occurring words, some of them clearly refer to the condition of the trees (e.g. prune, deadwood, planted), while others mention additional spatial features associated with the point (e.g. sidewalk, clearance). However, some words are also different forms of the same word (e.g. prune and pruned).
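One simple way to combine different forms of the same word is a hand-made lookup table that maps each variant to the form you want to keep. The sketch below uses base R and a made-up lookup (canonical is hypothetical and would need to be built for your own data). A more systematic alternative is stemming, for example with the wordStem() function in the SnowballC package.

```r
# A few tokens similar to those in the output above
words <- c("prune", "pruned", "pruning", "tree", "trees", "deadwood")

# Hypothetical lookup: names are the variants, values are the form to keep
canonical <- c(pruned = "prune", pruning = "prune", trees = "tree")

# Replace a word when it appears in the lookup, otherwise keep it unchanged
merged <- unname(ifelse(words %in% names(canonical), canonical[words], words))

table(merged)
```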

Still other words are less informative (e.g. to, and, by). We consider these filler words to be “stop words” and can remove them systematically from the text. We’ll use existing stop word libraries today, but know that you can also make custom stop word lists.

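First, let’s take a look at the stop words that come with the tidytext package. Note that top_n() ranks by the last column (here lexicon), so the call below returns all 174 entries of the snowball lexicon rather than 50 rows; head(stop_words, 50) would show just the first 50 rows.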

data(stop_words)
stop_words %>% top_n(50)
## Selecting by lexicon
## # A tibble: 174 × 2
##    word      lexicon 
##    <chr>     <chr>   
##  1 i         snowball
##  2 me        snowball
##  3 my        snowball
##  4 myself    snowball
##  5 we        snowball
##  6 our       snowball
##  7 ours      snowball
##  8 ourselves snowball
##  9 you       snowball
## 10 your      snowball
## # ℹ 164 more rows

Then, we can use an anti-join to remove the stopwords from our dataset.

treewordstidy <- treewords  %>% anti_join(stop_words)
## Joining with `by = join_by(word)`
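An anti-join keeps only the rows of the first table that have no match in the second. The same mechanism works with a custom stop word list. The sketch below uses made-up tokens and a hypothetical custom list (custom_stops) that includes abbreviations such as "p" and "r" alongside general filler words:

```r
library(dplyr)

# Made-up tokens
toy_words <- data.frame(
  word = c("prune", "to", "p", "deadwood", "and", "r"),
  stringsAsFactors = FALSE
)

# Hypothetical custom stop word list
custom_stops <- data.frame(
  word = c("to", "and", "p", "r"),
  stringsAsFactors = FALSE
)

# Keep only the words that do NOT appear in custom_stops
toy_tidy <- toy_words %>% anti_join(custom_stops, by = "word")
toy_tidy
```

Naming the join column with by = "word" also avoids the "Joining with" message.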

You’ll notice that there are still some words that are not very helpful here, including numbers. Let’s remove the numbers as well:

#you can also make a custom stopwords list: https://www.tidytextmining.com/nasa
# adapted from: https://bookdown.org/psonkin18/berkshire/tokenize.html

treewordstidy <- treewordstidy %>% filter(!grepl('[0-9]', word))    
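The pattern '[0-9]' matches any token that contains at least one digit, so it removes mixed tokens such as "15th" as well as pure numbers. If you only want to drop tokens made up entirely of digits, the pattern can be anchored. A small base R sketch on made-up tokens:

```r
# Made-up tokens
tokens <- c("prune", "2024", "15th", "dbh", "1", "elm")

# Pattern used above: drops any token containing a digit
no_digit_tokens <- tokens[!grepl("[0-9]", tokens)]
no_digit_tokens

# Stricter alternative: drops only tokens that are digits from start to end
no_pure_numbers <- tokens[!grepl("^[0-9]+$", tokens)]
no_pure_numbers
```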

Try it

Select and display the top 50 most frequently occurring terms in the tree notes using the new, tidy dataframe. Display the terms from most frequent to least frequent. Hint: Try count() and top_n().

Click to see a solution
treewordstidy %>% count(word,sort=T) %>% top_n(50)
## Selecting by n
##            word     n
## 1         prune 20711
## 2        pruned 10536
## 3          tree  9341
## 4      deadwood  5031
## 5       elevate  4649
## 6       planted  4008
## 7      warranty  3978
## 8     clearance  3108
## 9        remove  3078
## 10     sidewalk  2887
## 11        plant  2563
## 12       street  2487
## 13      removed  2342
## 14    elevation  2084
## 15        trees  2050
## 16         stem  1975
## 17     resident  1971
## 18         limb  1816
## 19        trunk  1766
## 20        stump  1645
## 21     elevated  1634
## 22     planting  1555
## 23          elm  1549
## 24      private  1537
## 25       passed  1517
## 26          box  1514
## 27       reduce  1442
## 28        close  1438
## 29        casey  1427
## 30        limbs  1335
## 31         dead  1318
## 32        stems  1307
## 33        crown  1293
## 34   structural  1281
## 35      removal  1272
## 36     building  1259
## 37     property  1252
## 38      project  1240
## 39        multi  1216
## 40       damage  1179
## 41   codominant  1177
## 42          dbh  1176
## 43     arborist  1148
## 44      pruning  1140
## 45         base  1112
## 46         sign  1104
## 47         root  1088
## 48          low  1065
## 49 construction  1032
## 50        light  1026
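top_n() still works, but the dplyr documentation marks it as superseded in favor of slice_max() and slice_min(), which name the ranking column explicitly and let you control how ties are handled. A sketch on made-up counts (toy_counts is hypothetical):

```r
library(dplyr)

# Made-up word counts; "tree" and "elm" are tied
toy_counts <- data.frame(
  word = c("prune", "tree", "elm", "oak"),
  n = c(10, 7, 7, 2),
  stringsAsFactors = FALSE
)

# Rank by column n and ask for 2 rows; tied rows are kept by default
top_with_ties <- toy_counts %>% slice_max(n, n = 2)
top_with_ties

# with_ties = FALSE returns exactly 2 rows
top_exact <- toy_counts %>% slice_max(n, n = 2, with_ties = FALSE)
top_exact
```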

Visualizing term frequency

Word frequency can be clearly displayed using bar charts made in ggplot2.

#plot top words from tokenized data
treewordstidy %>%
  count(word, sort = TRUE) %>%
  top_n(50) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  coord_flip() +
  labs(y = "Count", x = "Unique words", title = "Top words in DC tree notes")
## Selecting by n

We can also visualize frequency with word clouds.

library(wordcloud)
## Loading required package: RColorBrewer
treewordstidy %>% count(word) %>%  with(wordcloud(word, n, max.words = 100))

Sentiment analysis

Sentiment analysis enables us to tag individual terms according to particular themes or emotions. For analyzing the positivity/negativity of a text or the emotions associated with it, we can use existing libraries or lexicons that include tagged terms.

First, let’s use the NRC sentiment lexicon. This lexicon tags terms with 10 categories: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, and trust) and two sentiments (negative and positive).

# load the textdata package, which get_sentiments() uses to download the NRC lexicon
library(textdata)

# store the NRC lexicon as a data frame
nrcdf <- get_sentiments("nrc") 

#examine a table of the different emotions
nrcdf %>% count(sentiment,sort=T)
## # A tibble: 10 × 2
##    sentiment        n
##    <chr>        <int>
##  1 negative      3316
##  2 positive      2308
##  3 fear          1474
##  4 anger         1245
##  5 trust         1230
##  6 sadness       1187
##  7 disgust       1056
##  8 anticipation   837
##  9 joy            687
## 10 surprise       532

Most of the tagged terms are in the negative and positive categories, while the surprise category has the fewest tagged terms.

Let’s apply this tagging library to our TREE_NOTES terms. We can join the dataframes using inner_join().

# merge our datasets
treewordstidy_sentiment <- treewordstidy %>% inner_join(get_sentiments("nrc"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., get_sentiments("nrc")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 21 of `x` matches multiple rows in `y`.
## ℹ Row 1098 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
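The many-to-many warning above is expected here: a word can appear many times in the notes, and the lexicon can tag the same word with more than one category, so each occurrence is repeated once per tag. This sketch uses a made-up mini-lexicon (toy_lexicon and its tags are hypothetical, not the real NRC entries) and the relationship argument available in dplyr 1.1.0 and later:

```r
library(dplyr)

# Made-up tokens: "decay" occurs twice
toy_tokens <- data.frame(
  word = c("decay", "tree", "decay"),
  stringsAsFactors = FALSE
)

# Hypothetical mini-lexicon: "decay" carries two tags
toy_lexicon <- data.frame(
  word = c("decay", "decay", "tree"),
  sentiment = c("negative", "sadness", "positive"),
  stringsAsFactors = FALSE
)

# Declaring the relationship silences the warning
toy_joined <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word", relationship = "many-to-many")

toy_joined
```

Keep this in mind when counting: a single word in the notes can contribute to several sentiment totals.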

Next, we might want to examine the words in our TREE_NOTES that are most frequently associated with particular sentiments. The following code pulls up the most frequently occurring terms that are associated with the sentiment sadness.

treewordstidy_sentiment %>% filter(sentiment=="sadness") %>% count(word, sort=T) %>% top_n(30)
## Selecting by n
##         word    n
## 1     remove 3078
## 2     damage 1179
## 3      wound  953
## 4      decay  912
## 5    failure  554
## 6       pine  516
## 7       lost  477
## 8   wildfire  443
## 9   conflict  440
## 10 emergency  399
## 11    broken  355
## 12    lowest  319
## 13       bad  265
## 14     strip  263
## 15      fell  227
## 16     lower  182
## 17   missing  174
## 18     broke  157
## 19     leave  147
## 20    hollow  133
## 21    cancel  132
## 22       rot  123
## 23    buried  115
## 24     tough  113
## 25      late  112
## 26    poison  103
## 27     dying   91
## 28      fall   91
## 29   hanging   82
## 30     badly   74

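Keep in mind that the lexicon tags words without regard to context: in these notes, “pine” most likely refers to the tree rather than to longing. We can also visualize these results. More information on the next steps in text analysis with R can be found in Tidy Text Mining with R.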

# drawn from https://www.tidytextmining.com/sentiment

treewordstidy_sentiment %>%
  group_by(sentiment) %>%
  count(word, sentiment, sort = TRUE) %>%
  slice_max(n, n = 15) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y")

Using text searches in analysis

From the previous analysis, we saw that the word “decay” is frequently used in the TREE_NOTES. Let’s see how often the word “decay” is used in notes on trees in each genus.

DCtrees %>% filter(grepl("decay", TREE_NOTES)) %>% group_by(GENUS_NAME) %>% summarise(n=n()) %>% arrange(desc(n))
## # A tibble: 37 × 2
##    GENUS_NAME         n
##    <chr>          <int>
##  1 "Acer"           233
##  2 "Quercus"        231
##  3 ""                37
##  4 "Ulmus"           35
##  5 "Prunus"          32
##  6 "Tilia"           29
##  7 "Liriodendron"    19
##  8 "Ginkgo"          17
##  9 "Platanus"        17
## 10 "Cercis"          10
## # ℹ 27 more rows
# Double check results and examine the TREE_NOTES for Acer trees

#DCtrees %>% filter(grepl("decay", TREE_NOTES)) %>% filter(GENUS_NAME=="Acer") %>% select(GENUS_NAME, TREE_NOTES)
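By default, grepl("decay", ...) is a case-sensitive substring match: it misses "Decay" at the start of a note and also counts longer words such as "decayed". Depending on the question, you may want to loosen or tighten the pattern. A sketch on made-up notes (notes is hypothetical):

```r
# Made-up notes
notes <- c("Decay at base", "some decay in crown", "decayed limb", "healthy")

# Case-sensitive substring match, as used above
match_default <- grepl("decay", notes)

# Also matches "Decay"
match_any_case <- grepl("decay", notes, ignore.case = TRUE)

# Whole word only: \\b marks a word boundary, so "decayed" is excluded
match_whole_word <- grepl("\\bdecay\\b", notes, ignore.case = TRUE, perl = TRUE)

data.frame(notes, match_default, match_any_case, match_whole_word)
```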

Finally, let’s pull everything we’ve learned together and make a map of the trees described by “decay” in DC. Since Acer and Quercus are the two genera with the most mentions of “decay” let’s color them differently from the other genera.

decay <- DCtrees %>% filter(grepl("decay", TREE_NOTES)) 

leaflet(decay) %>% 
  addTiles() %>%
  addCircleMarkers(
    lng = decay$X, lat = decay$Y, group = "trees", radius = 2, opacity = .5,
    # purple for Acer, blue for Quercus, dark green for all other genera
    color = ifelse(decay$GENUS_NAME == "Acer", "purple",
                   ifelse(decay$GENUS_NAME == "Quercus", "blue", "darkgreen")),
    label = decay$SCI_NM,
    popup = paste("<b>Name:</b>", decay$CMMN_NM, "<br/>",
                  "<b>Genus:</b>", decay$GENUS_NAME, "<br/>",
                  "<b>Tree notes:</b>", decay$TREE_NOTES, sep = " ")
  )
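Nested ifelse() calls become hard to read as more genera get their own color. One alternative is a named lookup vector with a fallback color, sketched here in base R on made-up genus values (genus and genus_colors are hypothetical):

```r
# Made-up genus values, including an empty string as in the data
genus <- c("Acer", "Quercus", "Ulmus", "", "Acer")

# Named vector: names are genera, values are marker colors
genus_colors <- c(Acer = "purple", Quercus = "blue")

# Look up each genus; genera without an entry come back as NA
marker_colors <- unname(genus_colors[genus])

# Fallback color for everything else
marker_colors[is.na(marker_colors)] <- "darkgreen"
marker_colors
```

The resulting vector has one color per row and could be passed to the color argument of addCircleMarkers().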