## Installing and loading packages

R comes with basic functions, but you need to install packages in order to really maximize R’s functionality. You can install a package as follows:
install.packages("psych")
Packages can also be installed by using the “Tools” or “Install Packages” menu in RStudio.
Packages only need to be installed once, but must be loaded with each new session of R. Let’s load the `psych` package. We will use this later to produce summary statistics. Use the `library()` function to load a package.
library(psych)
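If a script will be shared or rerun, it can help to install a package only when it is missing. The following is a minimal sketch and not part of the original tutorial: `load_or_install()` is a hypothetical helper name, and the example call uses `stats` (which ships with R) so that it runs without a network connection. For this tutorial you would pass `"psych"` instead.

```r
# Hypothetical helper: install a package only if it is not already installed,
# then attach it for the current session.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # needs an internet connection and a CRAN mirror
  }
  # character.only = TRUE lets us pass the package name as a string
  library(pkg, character.only = TRUE)
}

load_or_install("stats")  # swap in "psych" for this tutorial
```

The check with `requireNamespace()` avoids reinstalling a package every time the script runs, which matches the "install once, load every session" rule described above.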
The working directory is where R will automatically look to find files to load into R and where any files you create will be exported to. You can save a lot of time by setting and using a working directory.
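getwd()  # shows you the current working directory
setwd("path")  # replace "path" with the path to a folder on your computer
# Windows: 'C:/Users/username/Desktop' (use / or \\, since a single \ is an escape character in R strings)
# Mac: '/Users/username/Desktop'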
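To see how the working directory affects file lookups, here is a small sketch that can be run safely on any machine. It uses R's per-session temporary folder (`tempdir()`) as a stand-in working directory, and `demo.csv` is a hypothetical file name chosen for illustration.

```r
old_wd <- getwd()                   # remember the current working directory
setwd(tempdir())                    # tempdir() is a scratch folder R creates for each session

writeLines("a,b\n1,2", "demo.csv")  # hypothetical file, written into the working directory
file.exists("demo.csv")             # a bare file name is looked up in the working directory

# file.path() joins path pieces with the correct separator on any operating system
demo_path <- file.path(getwd(), "demo.csv")

setwd(old_wd)                       # go back to where we started
file.exists(demo_path)              # a full path works from any working directory
```

Building paths with `file.path()` sidesteps the forward slash versus backslash differences between Windows and Mac.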
To read in a data file, we can use the `read.csv()` function. If the data file is within your working directory, you can simply refer to the file name. On a Mac you can also right-click on the file and hold down the Option key to copy the file pathname.
# CityInfo <- read.csv('pathname/CityInfo.csv')
# CityInfo<-read.csv('CityInfo.csv')
CityInfo <- read.csv("https://maddiebrown.github.io/ANTH630/data/Cityinfo.csv")
# write.csv(CityInfo, 'CityInfo2.csv')
Check out this new `CityInfo` dataframe.
CityInfo
City State Region Populationmetro_2016 Sunshine_per Bikecom_per
1 New York NY NE 20685000 58 1.1
2 Los Angeles CA W 15135000 73 1.3
3 San Francisco CA W 5955000 66 4.4
4 Chicago IL MW 9185000 54 1.7
5 Minneapolis MN MW 2795000 58 4.6
6 Portland OR W 2000000 48 7.2
7 Dallas TX S 6280000 61 0.2
8 Philadelphia PA NE 5595000 56 1.9
9 Boston MA NE 4490000 58 2.4
10 Detroit MI MW 3660000 53 0.8
11 Denver CO W 2600000 69 2.5
12 Seattle WA W 3475000 47 3.7
13 Phoenix AZ W 4295000 85 0.8
14 Atlanta GA S 5120000 60 0.7
15 New Orleans LA S 925000 57 3.4
16 Houston TX S 6005000 59 0.6
17 Washington DC DC S 4950000 56 3.9
18 Miami FL S 5820000 70 0.9
19 Milwaukee WI MW 1415000 54 1.0
Year Sportsteam_num
1 1625 9
2 1781 8
3 1776 6
4 1803 5
5 1867 4
6 1845 1
7 1841 4
8 1682 4
9 1630 4
10 1701 4
11 1858 4
12 1851 2
13 1868 4
14 1843 3
15 1718 2
16 1837 3
17 1790 4
18 1896 4
19 1833 2
Before we do anything else, let’s make a quick boxplot of the sunshine across regions in our dataframe.
boxplot(Sunshine_per ~ Region, data = CityInfo, xlab = "Region", ylab = "Sunshine %",
main = "Sunshine by US region", col = "goldenrod")
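The thick line inside each box marks that region's median. As a numeric companion to the plot, here is a small sketch that computes those medians with `tapply()`. To keep it self-contained, the two vectors below are typed in from the `CityInfo` printout above rather than read from the file. With the dataframe loaded you could pass `CityInfo$Sunshine_per` and `CityInfo$Region` directly.

```r
# Values copied from the CityInfo printout, in row order
sunshine <- c(58, 73, 66, 54, 58, 48, 61, 56, 58, 53,
              69, 47, 85, 60, 57, 59, 56, 70, 54)
region <- c("NE", "W", "W", "MW", "MW", "W", "S", "NE", "NE", "MW",
            "W", "W", "W", "S", "S", "S", "S", "S", "MW")

# tapply() splits the first vector by the groups in the second
# and applies the function to each group
region_medians <- tapply(sunshine, region, median)
region_medians
#   MW   NE    S    W
# 54.0 58.0 59.5 67.5
```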
Now let’s examine our dataframe. Look at the first six rows with `head()`.
head(CityInfo)
City State Region Populationmetro_2016 Sunshine_per Bikecom_per Year
1 New York NY NE 20685000 58 1.1 1625
2 Los Angeles CA W 15135000 73 1.3 1781
3 San Francisco CA W 5955000 66 4.4 1776
4 Chicago IL MW 9185000 54 1.7 1803
5 Minneapolis MN MW 2795000 58 4.6 1867
6 Portland OR W 2000000 48 7.2 1845
Sportsteam_num
1 9
2 8
3 6
4 5
5 4
6 1
Examine the dataframe structure with `str()`.
str(CityInfo)
'data.frame': 19 obs. of 8 variables:
$ City : Factor w/ 19 levels "Atlanta","Boston",..: 13 8 17 3 11 16 4 14 2 6 ...
$ State : Factor w/ 17 levels "AZ","CA","CO",..: 12 2 2 7 11 13 15 14 9 10 ...
$ Region : Factor w/ 4 levels "MW","NE","S",..: 2 4 4 1 1 4 3 2 2 1 ...
$ Populationmetro_2016: int 20685000 15135000 5955000 9185000 2795000 2000000 6280000 5595000 4490000 3660000 ...
$ Sunshine_per : int 58 73 66 54 58 48 61 56 58 53 ...
$ Bikecom_per : num 1.1 1.3 4.4 1.7 4.6 7.2 0.2 1.9 2.4 0.8 ...
$ Year : int 1625 1781 1776 1803 1867 1845 1841 1682 1630 1701 ...
$ Sportsteam_num : int 9 8 6 5 4 1 4 4 4 4 ...
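The output above lists `City`, `State`, and `Region` as factors. Since R 4.0.0, `read.csv()` keeps text columns as character vectors by default, so on a recent R installation `str()` may show `chr` for these columns instead. A minimal sketch of the difference, using a small hypothetical two-row file written to a temporary location:

```r
# Hypothetical mini dataset, written to a temporary CSV file
demo <- data.frame(City = c("Boston", "Denver"), Sunshine_per = c(58, 69))
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)  # row.names = FALSE avoids an extra index column

demo_default <- read.csv(tmp)                        # text columns: character in R >= 4.0.0
demo_factor <- read.csv(tmp, stringsAsFactors = TRUE)  # text columns: always factors

class(demo_default$City)
class(demo_factor$City)
levels(demo_factor$City)
```

Setting `stringsAsFactors` explicitly makes a script behave the same way regardless of the R version it runs on.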
And look at summary statistics for the dataframe, first with base R’s `summary()` and then with `describe()` from the `psych` package.
summary(CityInfo)
City State Region Populationmetro_2016 Sunshine_per
Atlanta: 1 CA : 2 MW:4 Min. : 925000 Min. :47.00
Boston : 1 TX : 2 NE:3 1st Qu.: 3135000 1st Qu.:55.00
Chicago: 1 AZ : 1 S :6 Median : 4950000 Median :58.00
Dallas : 1 CO : 1 W :6 Mean : 5809737 Mean :60.11
Denver : 1 DC : 1 3rd Qu.: 5980000 3rd Qu.:63.50
Detroit: 1 FL : 1 Max. :20685000 Max. :85.00
(Other):13 (Other):11
Bikecom_per Year Sportsteam_num
Min. :0.200 Min. :1625 Min. :1.000
1st Qu.:0.850 1st Qu.:1747 1st Qu.:3.000
Median :1.700 Median :1833 Median :4.000
Mean :2.268 Mean :1792 Mean :4.053
3rd Qu.:3.550 3rd Qu.:1848 3rd Qu.:4.000
Max. :7.200 Max. :1896 Max. :9.000
describe(CityInfo)
vars n mean sd median trimmed
City* 1 19 10.00 5.63 10.0 10.00
State* 2 19 8.95 5.23 9.0 8.94
Region* 3 19 2.74 1.15 3.0 2.76
Populationmetro_2016 4 19 5809736.84 4786299.16 4950000.0 5222058.82
Sunshine_per 5 19 60.11 9.13 58.0 59.41
Bikecom_per 6 19 2.27 1.83 1.7 2.10
Year 7 19 1791.84 82.35 1833.0 1795.53
Sportsteam_num 8 19 4.05 1.96 4.0 3.94
mad min max range skew kurtosis
City* 7.41 1.0 19.0 18 0.00 -1.39
State* 7.41 1.0 17.0 16 -0.02 -1.51
Region* 1.48 1.0 4.0 3 -0.35 -1.40
Populationmetro_2016 1971858.00 925000.0 20685000.0 19760000 1.82 2.82
Sunshine_per 5.93 47.0 85.0 38 1.00 0.68
Bikecom_per 1.33 0.2 7.2 7 1.03 0.29
Year 51.89 1625.0 1896.0 271 -0.78 -0.77
Sportsteam_num 1.48 1.0 9.0 8 0.94 0.59
se
City* 1.29
State* 1.20
Region* 0.26
Populationmetro_2016 1098052.33
Sunshine_per 2.09
Bikecom_per 0.42
Year 18.89
Sportsteam_num 0.45
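Several of the `describe()` columns can be reproduced with base R, which helps clarify what they mean. This sketch retypes the `Sunshine_per` values from the printout above so that it runs on its own. The variable name `sunshine` is an assumption for illustration.

```r
# Sunshine_per values copied from the CityInfo printout, in row order
sunshine <- c(58, 73, 66, 54, 58, 48, 61, 56, 58, 53,
              69, 47, 85, 60, 57, 59, 56, 70, 54)
n <- length(sunshine)

mean(sunshine)              # "mean" column
median(sunshine)            # "median" column
mean(sunshine, trim = 0.1)  # "trimmed": mean after dropping the lowest and highest 10%
mad(sunshine)               # "mad": median absolute deviation (scaled by 1.4826 by default)
diff(range(sunshine))       # "range": max minus min
sd(sunshine) / sqrt(n)      # "se": standard error of the mean
```

Rounded to two decimals, these should line up with the `Sunshine_per` row of the `describe()` output.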
Recalling what we learned about subsetting dataframes, try to complete the following tasks:
1. Select the `Year` column.
2. Select the 5th element of the `Year` column.
3. Select the 5th row of the `CityInfo` dataframe.
4. Select the 5th and 6th rows.

Solution:
CityInfo$Year
CityInfo[, 7]
CityInfo$Year[5]
CityInfo[5, ]
CityInfo[c(5, 6), ]
You can subset dataframes in numerous ways. Last week we discussed the `subset()`, `$`, and `[,]` functions. We can also use logical tests and specific functions to subset dataframes based on conditionals.
CityInfo[CityInfo$Region == "W" & CityInfo$State == "CA", ]
City State Region Populationmetro_2016 Sunshine_per Bikecom_per Year
2 Los Angeles CA W 15135000 73 1.3 1781
3 San Francisco CA W 5955000 66 4.4 1776
Sportsteam_num
2 8
3 6
CityInfo[CityInfo$Region == "W" | CityInfo$State == "CA", ]
City State Region Populationmetro_2016 Sunshine_per Bikecom_per
2 Los Angeles CA W 15135000 73 1.3
3 San Francisco CA W 5955000 66 4.4
6 Portland OR W 2000000 48 7.2
11 Denver CO W 2600000 69 2.5
12 Seattle WA W 3475000 47 3.7
13 Phoenix AZ W 4295000 85 0.8
Year Sportsteam_num
2 1781 8
3 1776 6
6 1845 1
11 1858 4
12 1851 2
13 1868 4
CityInfo[CityInfo$Region == "W" & CityInfo$State != "CA", ]
City State Region Populationmetro_2016 Sunshine_per Bikecom_per Year
6 Portland OR W 2000000 48 7.2 1845
11 Denver CO W 2600000 69 2.5 1858
12 Seattle WA W 3475000 47 3.7 1851
13 Phoenix AZ W 4295000 85 0.8 1868
Sportsteam_num
6 1
11 4
12 2
13 4
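The same logical tests can be written with `subset()`, which lets you refer to columns by name without repeating the dataframe name. The sketch below uses a small five-row dataframe typed in from the printout above (named `cities` here purely for illustration) so that it runs on its own.

```r
# A few rows copied from the CityInfo printout
cities <- data.frame(
  City   = c("New York", "Los Angeles", "San Francisco", "Portland", "Dallas"),
  State  = c("NY", "CA", "CA", "OR", "TX"),
  Region = c("NE", "W", "W", "W", "S"),
  stringsAsFactors = FALSE
)

# & means both conditions must be TRUE
west_ca <- subset(cities, Region == "W" & State == "CA")

# != means "not equal to"; select = keeps only the named column(s)
west_not_ca <- subset(cities, Region == "W" & State != "CA", select = City)

west_ca
west_not_ca
```

Unlike `[ , ]` with a single column, `subset()` with `select =` still returns a dataframe rather than a plain vector.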
Suppose we want to subset the city names based on whether they are in the NE or W regions. We can use `%in%`.
CityInfo$Region %in% c("NE", "W") # what does this return?
[1] TRUE TRUE TRUE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE
[13] TRUE FALSE FALSE FALSE FALSE FALSE FALSE
CityInfo[CityInfo$Region %in% c("NE", "W"), "City"] # what does this return?
[1] New York Los Angeles San Francisco Portland Philadelphia
[6] Boston Denver Seattle Phoenix
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC
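It is tempting to write this test with `==` instead of `%in%`, but the two behave differently when the right-hand side has more than one value. A short sketch with a hypothetical six-element vector:

```r
region <- c("NE", "W", "W", "MW", "W", "NE")

# %in% asks, for each element, "is it one of these values?"
region %in% c("NE", "W")
# [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

# == compares element by element and recycles the shorter vector,
# so region is compared against NE, W, NE, W, NE, W in turn
region == c("NE", "W")
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
```

Because the recycled comparison raises no error here, the wrong result is easy to miss, which is why `%in%` is the safer choice for matching against a set of values.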
There are also additional functions that allow you to match subsets of dataframes based on particular values.
CityInfo
City State Region Populationmetro_2016 Sunshine_per Bikecom_per
1 New York NY NE 20685000 58 1.1
2 Los Angeles CA W 15135000 73 1.3
3 San Francisco CA W 5955000 66 4.4
4 Chicago IL MW 9185000 54 1.7
5 Minneapolis MN MW 2795000 58 4.6
6 Portland OR W 2000000 48 7.2
7 Dallas TX S 6280000 61 0.2
8 Philadelphia PA NE 5595000 56 1.9
9 Boston MA NE 4490000 58 2.4
10 Detroit MI MW 3660000 53 0.8
11 Denver CO W 2600000 69 2.5
12 Seattle WA W 3475000 47 3.7
13 Phoenix AZ W 4295000 85 0.8
14 Atlanta GA S 5120000 60 0.7
15 New Orleans LA S 925000 57 3.4
16 Houston TX S 6005000 59 0.6
17 Washington DC DC S 4950000 56 3.9
18 Miami FL S 5820000 70 0.9
19 Milwaukee WI MW 1415000 54 1.0
Year Sportsteam_num
1 1625 9
2 1781 8
3 1776 6
4 1803 5
5 1867 4
6 1845 1
7 1841 4
8 1682 4
9 1630 4
10 1701 4
11 1858 4
12 1851 2
13 1868 4
14 1843 3
15 1718 2
16 1837 3
17 1790 4
18 1896 4
19 1833 2
which(CityInfo$Populationmetro_2016 > 3475000) # this returns the positions (row indices) of the values that meet the condition
[1] 1 2 3 4 7 8 9 10 13 14 16 17 18
CityInfo[which(CityInfo$Populationmetro_2016 > 3475000), ] # this subsets the whole dataframe for only the cities with populations over 3475000.
City State Region Populationmetro_2016 Sunshine_per Bikecom_per
1 New York NY NE 20685000 58 1.1
2 Los Angeles CA W 15135000 73 1.3
3 San Francisco CA W 5955000 66 4.4
4 Chicago IL MW 9185000 54 1.7
7 Dallas TX S 6280000 61 0.2
8 Philadelphia PA NE 5595000 56 1.9
9 Boston MA NE 4490000 58 2.4
10 Detroit MI MW 3660000 53 0.8
13 Phoenix AZ W 4295000 85 0.8
14 Atlanta GA S 5120000 60 0.7
16 Houston TX S 6005000 59 0.6
17 Washington DC DC S 4950000 56 3.9
18 Miami FL S 5820000 70 0.9
Year Sportsteam_num
1 1625 9
2 1781 8
3 1776 6
4 1803 5
7 1841 4
8 1682 4
9 1630 4
10 1701 4
13 1868 4
14 1843 3
16 1837 3
17 1790 4
18 1896 4
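One practical difference between indexing with `which()` and indexing with the logical test directly shows up when data are missing. `CityInfo` has no missing values, so the sketch below uses a hypothetical population vector with one `NA` added for illustration.

```r
# Hypothetical vector: the NA is invented to show the behavior
pop <- c(20685000, NA, 5955000, 2000000)

pop > 3475000                 # the comparison with NA is itself NA
# [1]  TRUE    NA  TRUE FALSE

with_logical <- pop[pop > 3475000]       # NA in the index produces an NA in the result
with_which <- pop[which(pop > 3475000)]  # which() keeps only positions that are TRUE

with_logical
with_which
```

Here `with_logical` has three elements (including an `NA`), while `with_which` has only the two values that actually pass the test.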
Text can be searched using a standardized pattern syntax called regular expressions. Read more on regular expressions and how to [use them](https://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions) for data science. Regular expressions are also helpful when pasting labels or programming figure titles/captions in a standardized way (for example, adding new lines). Here let’s use `str_detect()` to pull out some strings with regular expressions.
library(tidyverse) # loads stringr, which provides str_detect()
# any cities ending in 's'
CityInfo$City[str_detect(CityInfo$City, "s$")]
[1] Los Angeles Minneapolis Dallas New Orleans
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC
# which rows have a city ending in 's'
str_detect(CityInfo$City, "s$")
[1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# any cities with a space in the name
CityInfo$City[str_detect(CityInfo$City, ". .")]
[1] New York Los Angeles San Francisco New Orleans Washington DC
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC
Some basics with regular expressions:

- `$` means the end of a string (the end of the whole string, not a position within it)
- `^` means the start of a string
- `[]` means to search for any of the included characters within that bracket at that position. Simply listing characters will lead to direct matches, while you can also create ranges and exclusions.
- `.` can stand in for any character except a return (new line)
- `\n` means a new line
- `|` means “or”, just as in base R.

To match a special character directly (such as `.` and `\`), you need to escape the character first. See here for details.

Solution:
# 1. any cities starting with 's'
CityInfo$City[str_detect(CityInfo$City, "^s")]
factor(0)
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC
# why doesn't this work? R is case sensitive
CityInfo$City[str_detect(CityInfo$City, "^S")]
[1] San Francisco Seattle
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC
# 2. Print all city names for cities founded in the 1600s or 1700s
CityInfo$City[str_detect(CityInfo$Year, "1[6-7].")]
[1] New York Los Angeles San Francisco Philadelphia Boston
[6] Detroit New Orleans Washington DC
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC
# 3. Print the name and year of founding for all cities with either 'as' or 'il'
# in the name.
CityInfo[str_detect(CityInfo$City, "as|il"), c("City", "Year")]
City Year
7 Dallas 1841
8 Philadelphia 1682
17 Washington DC 1790
19 Milwaukee 1833
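Escaping and anchoring are easier to see on a small example. The sketch below uses base R's `grepl()`, swapped in for `str_detect()` so that it runs without extra packages. Note that `grepl()` takes the pattern first and the strings second. The strings in `x` are hypothetical and chosen only to show the behavior.

```r
# Hypothetical strings for illustration
x <- c("St. Louis", "Stockton", "1682", "1816", "2016")

# An unescaped dot matches any character, so every non-empty string matches
any_char <- grepl(".", x)

# Inside an R string the backslash itself must be doubled: "\\." matches a literal dot
literal_dot <- grepl("\\.", x)

# fixed = TRUE treats the whole pattern as plain text, another way to match a literal dot
literal_dot_fixed <- grepl(".", x, fixed = TRUE)

# Without an anchor, "16" matches anywhere in the string
contains_16 <- grepl("16", x)

# ^ anchors the match to the start: strings beginning with 16 or 17
starts_16_or_17 <- grepl("^1[67]", x)

data.frame(x, any_char, literal_dot, contains_16, starts_16_or_17)
```

Anchoring with `^` is a more defensive way to test "founded in the 1600s or 1700s" than an unanchored pattern, since it cannot match digits that appear later in the string.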
Sports teams in the Big 4 (NFL, MLB, NBA, NHL) and metro population estimates
Sunniest cities in the US (using National Oceanic and Atmospheric Administration data; average percent of possible sunshine)