Loading packages and data

Installing and loading packages

R comes with basic functions, but you need to install packages in order to really maximize R’s functionality. You can install a package as follows:

install.packages("psych")

Packages can also be installed by using the Tools > Install Packages… menu in RStudio.

Packages only need to be installed once, but must be loaded with each new session of R. Let’s load the psych package. We will use this later to produce summary statistics. Use the library() function to load a package.

library(psych)
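
If a script will be shared with others, it can check whether a package is already installed before trying to install it. The following is only a sketch and not part of the original tutorial: the helper name ensure_package is hypothetical, and it assumes a CRAN mirror is reachable whenever an installation is actually needed.

```r
# Hypothetical helper: install a package only if it is missing, then load it.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (instead of stopping) when pkg is absent
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept a name stored in a variable
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# 'stats' ships with R, so this call loads it without installing anything
ensure_package("stats")
```

Calling ensure_package("psych") would behave the same way as the two separate steps shown in this section.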

Setting the working directory

The working directory is where R will automatically look to find files to load into R and where any files you create will be exported to. You can save a lot of time by setting and using a working directory.

getwd()  # shows you the current working directory
setwd("path")  # replace "path" with the path to a folder, not a file
# windows: 'C:/Users/username/Desktop' (use forward slashes or doubled backslashes)
# mac: '/Users/username/Desktop'
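
The sketch below shows one possible way to change the working directory and then put it back. It is an illustration under assumptions: tempdir() stands in for a real project folder, and filename.csv is a placeholder name rather than a file that exists.

```r
# Remember the current working directory so it can be restored afterwards
old_wd <- getwd()

# tempdir() is used only so the example runs anywhere; substitute your own folder
setwd(tempdir())

# file.path() joins folder and file names with the correct separator
csv_path <- file.path(getwd(), "filename.csv")  # placeholder file name
print(csv_path)

# list.files() shows which files R can see in the working directory
print(list.files())

# Restore the original working directory
setwd(old_wd)
```

Building paths with file.path() avoids typing slashes by hand, which is where Windows and Mac paths usually differ.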

Reading and writing data

To read in a data file, we can use the read.csv() function. If the data file is within your working directory, you can simply refer to the file name. On a Mac you can also right click on the file and hold down the option key to copy the file pathname. To export a dataframe to a file, use write.csv().

# CityInfo <- read.csv('pathname/CityInfo.csv')
# CityInfo<-read.csv('CityInfo.csv')

CityInfo <- read.csv("https://maddiebrown.github.io/ANTH630/data/Cityinfo.csv")

# write.csv(CityInfo, 'CityInfo2.csv')
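
To see writing and reading work together, here is a small round-trip sketch. It assumes nothing about your folders: the three rows are copied from the CityInfo table as stand-in data, and tempfile() supplies a throwaway path.

```r
# Three rows copied from the CityInfo table, used here as stand-in data
cities <- data.frame(
  City = c("New York", "Los Angeles", "San Francisco"),
  Sunshine_per = c(58L, 73L, 66L)
)

# tempfile() gives a throwaway path so nothing in your folders is overwritten
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE stops write.csv() from adding a column of row numbers
write.csv(cities, out_file, row.names = FALSE)

# Read the file back in and check that the columns survived the round trip
cities_again <- read.csv(out_file)
print(names(cities_again))
print(nrow(cities_again))
```

Without row.names = FALSE, the exported file gains an extra first column of row numbers, which read.csv() then imports as a column named X.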

Exploratory data analysis

Check out this new CityInfo dataframe.

CityInfo
            City State Region Populationmetro_2016 Sunshine_per Bikecom_per
1       New York    NY     NE             20685000           58         1.1
2    Los Angeles    CA      W             15135000           73         1.3
3  San Francisco    CA      W              5955000           66         4.4
4        Chicago    IL     MW              9185000           54         1.7
5    Minneapolis    MN     MW              2795000           58         4.6
6       Portland    OR      W              2000000           48         7.2
7         Dallas    TX      S              6280000           61         0.2
8   Philadelphia    PA     NE              5595000           56         1.9
9         Boston    MA     NE              4490000           58         2.4
10       Detroit    MI     MW              3660000           53         0.8
11        Denver    CO      W              2600000           69         2.5
12       Seattle    WA      W              3475000           47         3.7
13       Phoenix    AZ      W              4295000           85         0.8
14       Atlanta    GA      S              5120000           60         0.7
15   New Orleans    LA      S               925000           57         3.4
16       Houston    TX      S              6005000           59         0.6
17 Washington DC    DC      S              4950000           56         3.9
18         Miami    FL      S              5820000           70         0.9
19     Milwaukee    WI     MW              1415000           54         1.0
   Year Sportsteam_num
1  1625              9
2  1781              8
3  1776              6
4  1803              5
5  1867              4
6  1845              1
7  1841              4
8  1682              4
9  1630              4
10 1701              4
11 1858              4
12 1851              2
13 1868              4
14 1843              3
15 1718              2
16 1837              3
17 1790              4
18 1896              4
19 1833              2

Before we do anything else, let’s make a quick boxplot of the sunshine across regions in our dataframe.

boxplot(Sunshine_per ~ Region, data = CityInfo, xlab = "Region", ylab = "Sunshine %", 
    main = "Sunshine by US region", col = "goldenrod")
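
The thick line inside each box is the group median. As a rough check on the plot, the medians can be computed directly; this sketch retypes the Region and Sunshine_per columns from the printed table so that it runs on its own, and the object names are chosen only for illustration.

```r
# Region and Sunshine_per columns, typed in from the CityInfo printout
region <- c("NE", "W", "W", "MW", "MW", "W", "S", "NE", "NE", "MW",
            "W", "W", "W", "S", "S", "S", "S", "S", "MW")
sunshine <- c(58, 73, 66, 54, 58, 48, 61, 56, 58, 53,
              69, 47, 85, 60, 57, 59, 56, 70, 54)

# tapply() applies median() separately to the values within each region
region_medians <- tapply(sunshine, region, median)
print(region_medians)
```

With the full dataframe loaded, tapply(CityInfo$Sunshine_per, CityInfo$Region, median) should give the same four numbers.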

Now let’s examine our dataframe. Look at the first six rows.

head(CityInfo)
           City State Region Populationmetro_2016 Sunshine_per Bikecom_per Year
1      New York    NY     NE             20685000           58         1.1 1625
2   Los Angeles    CA      W             15135000           73         1.3 1781
3 San Francisco    CA      W              5955000           66         4.4 1776
4       Chicago    IL     MW              9185000           54         1.7 1803
5   Minneapolis    MN     MW              2795000           58         4.6 1867
6      Portland    OR      W              2000000           48         7.2 1845
  Sportsteam_num
1              9
2              8
3              6
4              5
5              4
6              1

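Examine the dataframe structure. The output below shows City, State, and Region stored as factors, which was the read.csv() default before R 4.0; in R 4.0 and later these columns are read as character vectors unless you set stringsAsFactors = TRUE.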

str(CityInfo)
'data.frame':   19 obs. of  8 variables:
 $ City                : Factor w/ 19 levels "Atlanta","Boston",..: 13 8 17 3 11 16 4 14 2 6 ...
 $ State               : Factor w/ 17 levels "AZ","CA","CO",..: 12 2 2 7 11 13 15 14 9 10 ...
 $ Region              : Factor w/ 4 levels "MW","NE","S",..: 2 4 4 1 1 4 3 2 2 1 ...
 $ Populationmetro_2016: int  20685000 15135000 5955000 9185000 2795000 2000000 6280000 5595000 4490000 3660000 ...
 $ Sunshine_per        : int  58 73 66 54 58 48 61 56 58 53 ...
 $ Bikecom_per         : num  1.1 1.3 4.4 1.7 4.6 7.2 0.2 1.9 2.4 0.8 ...
 $ Year                : int  1625 1781 1776 1803 1867 1845 1841 1682 1630 1701 ...
 $ Sportsteam_num      : int  9 8 6 5 4 1 4 4 4 4 ...

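And look at summary statistics for the dataframe, first with summary() from base R and then with describe() from the psych package we loaded earlier.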

summary(CityInfo)
      City        State    Region Populationmetro_2016  Sunshine_per  
 Atlanta: 1   CA     : 2   MW:4   Min.   :  925000     Min.   :47.00  
 Boston : 1   TX     : 2   NE:3   1st Qu.: 3135000     1st Qu.:55.00  
 Chicago: 1   AZ     : 1   S :6   Median : 4950000     Median :58.00  
 Dallas : 1   CO     : 1   W :6   Mean   : 5809737     Mean   :60.11  
 Denver : 1   DC     : 1          3rd Qu.: 5980000     3rd Qu.:63.50  
 Detroit: 1   FL     : 1          Max.   :20685000     Max.   :85.00  
 (Other):13   (Other):11                                              
  Bikecom_per         Year      Sportsteam_num 
 Min.   :0.200   Min.   :1625   Min.   :1.000  
 1st Qu.:0.850   1st Qu.:1747   1st Qu.:3.000  
 Median :1.700   Median :1833   Median :4.000  
 Mean   :2.268   Mean   :1792   Mean   :4.053  
 3rd Qu.:3.550   3rd Qu.:1848   3rd Qu.:4.000  
 Max.   :7.200   Max.   :1896   Max.   :9.000  
                                               
describe(CityInfo)
                     vars  n       mean         sd    median    trimmed
City*                   1 19      10.00       5.63      10.0      10.00
State*                  2 19       8.95       5.23       9.0       8.94
Region*                 3 19       2.74       1.15       3.0       2.76
Populationmetro_2016    4 19 5809736.84 4786299.16 4950000.0 5222058.82
Sunshine_per            5 19      60.11       9.13      58.0      59.41
Bikecom_per             6 19       2.27       1.83       1.7       2.10
Year                    7 19    1791.84      82.35    1833.0    1795.53
Sportsteam_num          8 19       4.05       1.96       4.0       3.94
                            mad      min        max    range  skew kurtosis
City*                      7.41      1.0       19.0       18  0.00    -1.39
State*                     7.41      1.0       17.0       16 -0.02    -1.51
Region*                    1.48      1.0        4.0        3 -0.35    -1.40
Populationmetro_2016 1971858.00 925000.0 20685000.0 19760000  1.82     2.82
Sunshine_per               5.93     47.0       85.0       38  1.00     0.68
Bikecom_per                1.33      0.2        7.2        7  1.03     0.29
Year                      51.89   1625.0     1896.0      271 -0.78    -0.77
Sportsteam_num             1.48      1.0        9.0        8  0.94     0.59
                             se
City*                      1.29
State*                     1.20
Region*                    0.26
Populationmetro_2016 1098052.33
Sunshine_per               2.09
Bikecom_per                0.42
Year                      18.89
Sportsteam_num             0.45
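
Several of the describe() columns can be reproduced by hand, which helps in reading the table. The sketch below does this for Sunshine_per, with the values retyped from the printed dataframe; it assumes that the se column is the standard error of the mean (sd divided by the square root of n), which agrees with the numbers shown.

```r
# Sunshine_per values, typed in from the CityInfo printout
sunshine <- c(58, 73, 66, 54, 58, 48, 61, 56, 58, 53,
              69, 47, 85, 60, 57, 59, 56, 70, 54)

n <- length(sunshine)
sunshine_mean <- mean(sunshine)
sunshine_sd <- sd(sunshine)
sunshine_median <- median(sunshine)

# Standard error of the mean: standard deviation divided by sqrt(n)
sunshine_se <- sunshine_sd / sqrt(n)

print(round(c(n = n, mean = sunshine_mean, sd = sunshine_sd,
              median = sunshine_median, se = sunshine_se), 2))
```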

Try it

Recalling what we learned about subsetting dataframes, try to complete the following tasks:

Select the Year column.
Select the 5th element of the Year column.
Select the 5th row of the CityInfo dataframe.
Select the 5th and 6th rows.

Solution

CityInfo$Year
CityInfo[, 7]
CityInfo$Year[5]
CityInfo[5, ]
CityInfo[c(5, 6), ]

Logical tests and dataframe subsetting

You can subset dataframes in numerous ways. Last week we discussed the subset() function and the $ and [ , ] operators. We can also use logical tests to subset dataframes based on conditions: & means “and”, | means “or”, and != means “not equal to”.

CityInfo[CityInfo$Region == "W" & CityInfo$State == "CA", ]
           City State Region Populationmetro_2016 Sunshine_per Bikecom_per Year
2   Los Angeles    CA      W             15135000           73         1.3 1781
3 San Francisco    CA      W              5955000           66         4.4 1776
  Sportsteam_num
2              8
3              6
CityInfo[CityInfo$Region == "W" | CityInfo$State == "CA", ]
            City State Region Populationmetro_2016 Sunshine_per Bikecom_per
2    Los Angeles    CA      W             15135000           73         1.3
3  San Francisco    CA      W              5955000           66         4.4
6       Portland    OR      W              2000000           48         7.2
11        Denver    CO      W              2600000           69         2.5
12       Seattle    WA      W              3475000           47         3.7
13       Phoenix    AZ      W              4295000           85         0.8
   Year Sportsteam_num
2  1781              8
3  1776              6
6  1845              1
11 1858              4
12 1851              2
13 1868              4
CityInfo[CityInfo$Region == "W" & CityInfo$State != "CA", ]
       City State Region Populationmetro_2016 Sunshine_per Bikecom_per Year
6  Portland    OR      W              2000000           48         7.2 1845
11   Denver    CO      W              2600000           69         2.5 1858
12  Seattle    WA      W              3475000           47         3.7 1851
13  Phoenix    AZ      W              4295000           85         0.8 1868
   Sportsteam_num
6               1
11              4
12              2
13              4
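
The same logical tests can be written with subset(), which lets you refer to columns by name without repeating the dataframe name. This is a sketch on a cut-down copy of the data (seven rows and three columns retyped from the table; the object name west_demo is made up for the example).

```r
# A cut-down copy of CityInfo: the six W cities plus Dallas for contrast
west_demo <- data.frame(
  City = c("Los Angeles", "San Francisco", "Portland", "Denver",
           "Seattle", "Phoenix", "Dallas"),
  State = c("CA", "CA", "OR", "CO", "WA", "AZ", "TX"),
  Region = c("W", "W", "W", "W", "W", "W", "S"),
  stringsAsFactors = FALSE
)

# Bracket form: the logical test goes in the row position
by_bracket <- west_demo[west_demo$Region == "W" & west_demo$State != "CA", ]

# subset() form: column names are looked up inside the dataframe
by_subset <- subset(west_demo, Region == "W" & State != "CA")

print(by_subset)
print(identical(by_bracket$City, by_subset$City))
```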

Suppose we want to subset the city names based on whether they are in either the NE or the W region. We can use %in%, which tests whether each element on the left matches any of the values on the right.

CityInfo$Region %in% c("NE", "W")  # what does this return?
 [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
[13]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
CityInfo[CityInfo$Region %in% c("NE", "W"), "City"]  # what does this return?
[1] New York      Los Angeles   San Francisco Portland      Philadelphia 
[6] Boston        Denver        Seattle       Phoenix      
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC
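
One way to think about %in% is as shorthand for several == tests joined with |. The sketch below checks that equivalence on the Region column (retyped from the table so the code runs on its own) and also shows how ! negates the test.

```r
# Region column, typed in from the CityInfo printout
region <- c("NE", "W", "W", "MW", "MW", "W", "S", "NE", "NE", "MW",
            "W", "W", "W", "S", "S", "S", "S", "S", "MW")

# %in% asks, for each element, whether it matches any value on the right
in_ne_or_w <- region %in% c("NE", "W")

# The same test written out with == and |
same_with_or <- region == "NE" | region == "W"

print(identical(in_ne_or_w, same_with_or))
print(sum(in_ne_or_w))  # number of cities in NE or W

# Negating the test keeps everything that is NOT in NE or W
other_regions <- unique(region[!(region %in% c("NE", "W"))])
print(other_regions)
```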

There are also additional functions that allow you to match subsets of dataframes based on particular values.

which(CityInfo$Populationmetro_2016 > 3475000)  # this returns the positions (indices) of the elements that meet the condition
 [1]  1  2  3  4  7  8  9 10 13 14 16 17 18

CityInfo[which(CityInfo$Populationmetro_2016 > 3475000), ]  # this subsets the whole dataframe for only the cities with populations over 3475000.
            City State Region Populationmetro_2016 Sunshine_per Bikecom_per
1       New York    NY     NE             20685000           58         1.1
2    Los Angeles    CA      W             15135000           73         1.3
3  San Francisco    CA      W              5955000           66         4.4
4        Chicago    IL     MW              9185000           54         1.7
7         Dallas    TX      S              6280000           61         0.2
8   Philadelphia    PA     NE              5595000           56         1.9
9         Boston    MA     NE              4490000           58         2.4
10       Detroit    MI     MW              3660000           53         0.8
13       Phoenix    AZ      W              4295000           85         0.8
14       Atlanta    GA      S              5120000           60         0.7
16       Houston    TX      S              6005000           59         0.6
17 Washington DC    DC      S              4950000           56         3.9
18         Miami    FL      S              5820000           70         0.9
   Year Sportsteam_num
1  1625              9
2  1781              8
3  1776              6
4  1803              5
7  1841              4
8  1682              4
9  1630              4
10 1701              4
13 1868              4
14 1843              3
16 1837              3
17 1790              4
18 1896              4
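
For this dataset, which() gives the same rows as putting the logical test directly inside the brackets. The two differ when a column has missing values, as the sketch below illustrates. The data are four cities from the table, with one population deliberately replaced by NA purely for illustration; the real dataset has no missing value there.

```r
# Four cities from the table; the Los Angeles value is set to NA on purpose
# for this illustration only
cities <- data.frame(
  City = c("New York", "Los Angeles", "San Francisco", "Portland"),
  Population = c(20685000, NA, 5955000, 2000000)
)

# A logical test returns NA for the missing value, and the NA produces
# an extra row filled with NAs in the result
by_logical <- cities[cities$Population > 3475000, ]

# which() keeps only the positions that are TRUE, so the NA is dropped
by_which <- cities[which(cities$Population > 3475000), ]

print(nrow(by_logical))
print(nrow(by_which))
print(by_which)
```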

Regular Expressions

Text can be searched using a standardized pattern syntax called regular expressions. Read more on regular expressions and how to [use them](https://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions) for data science. Regular expressions are also helpful when pasting labels or programming figure titles and captions in a standardized way (for example, adding new lines). Here let’s use str_detect(), from the stringr package that loads with the tidyverse, to pull out some strings with regular expressions.

library(tidyverse)

# any cities ending in 's'
CityInfo$City[str_detect(CityInfo$City, "s$")]
[1] Los Angeles Minneapolis Dallas      New Orleans
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC

# which rows have a city ending in 's'
str_detect(CityInfo$City, "s$")
 [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE

# any cities with a space in the name
CityInfo$City[str_detect(CityInfo$City, ". .")]
[1] New York      Los Angeles   San Francisco New Orleans   Washington DC
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC

Some basics with regular expressions:

$ means the end of a string (the end of the whole string, not a position within it)
^ means the start of a string
[] means to match any one of the characters listed inside the brackets at that position. Listing characters matches them directly, and you can also specify ranges (e.g. [a-z]) and exclusions (e.g. [^a-z]).
. can stand in for any character except a new line
\n means a new line
| means “or”, just as in base R
Because certain characters have a special meaning in regular expressions (e.g. . and \), you need to escape them with a backslash in order to match them literally. Inside an R string the backslash itself must be doubled, so "\\." matches a literal period.
You can look for patterns that repeat or restrict a search to only digits or letters. This can be helpful for data validation or searching for standardized strings such as zip codes, phone numbers, or names.
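
The rules in this list can be tried out on small vectors. The sketch below uses base R's grepl() so that it runs without extra packages; note that grepl(pattern, x) takes its arguments in the opposite order from str_detect(string, pattern). The city names come from the dataset, while the file names and zip-code strings are made up for the example.

```r
# City names, typed in from the CityInfo printout
cities <- c("New York", "Los Angeles", "San Francisco", "Chicago",
            "Minneapolis", "Portland", "Dallas", "Philadelphia", "Boston",
            "Detroit", "Denver", "Seattle", "Phoenix", "Atlanta",
            "New Orleans", "Houston", "Washington DC", "Miami", "Milwaukee")

# ^ anchors the match to the start of the string
starts_with_d <- cities[grepl("^D", cities)]

# [] matches any one of the listed characters at that position
starts_b_or_m <- cities[grepl("^[BM]", cities)]

# An unescaped . matches any character; "\\." matches only a literal period
files <- c("CityInfo.csv", "CityInfo_csv")  # made-up file names
unescaped <- grepl(".csv$", files)
escaped <- grepl("\\.csv$", files)

# {5} repeats the preceding item five times: exactly five digits, nothing else
zips <- c("20742", "2074", "2074a")  # made-up strings for validation
valid_zip <- grepl("^[0-9]{5}$", zips)

print(starts_with_d)
print(starts_b_or_m)
print(unescaped)
print(escaped)
print(valid_zip)
```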

Try it

Select all city names starting with “s”.
Print all city names for cities founded in the 1600s or 1700s.
Print the name and year of founding for all cities with either “as” or “il” in the name.

Solution

# 1.  any cities starting with 's'
CityInfo$City[str_detect(CityInfo$City, "^s")]
factor(0)
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC
# why doesn't this return anything? Pattern matching is case sensitive, so 's' does not match 'S'
CityInfo$City[str_detect(CityInfo$City, "^S")]
[1] San Francisco Seattle      
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC

# 2. Print all city names for cities founded in the 1600s or 1700s
CityInfo$City[str_detect(as.character(CityInfo$Year), "^1[67]")]
[1] New York      Los Angeles   San Francisco Philadelphia  Boston       
[6] Detroit       New Orleans   Washington DC
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC

# 3. Print the name and year of founding for all cities with either 'as' or 'il'
# in the name.
CityInfo[str_detect(CityInfo$City, "as|il"), c("City", "Year")]
            City Year
7         Dallas 1841
8   Philadelphia 1682
17 Washington DC 1790
19     Milwaukee 1833
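
Two variations on these solutions may be useful; both are sketches rather than the tutorial's own answers. The first ignores case instead of changing the pattern, using base R's grepl() with ignore.case = TRUE (in stringr, wrapping the pattern as regex("^s", ignore_case = TRUE) does the same job). The second treats Year as a number and uses a numeric comparison in place of a regular expression. The vectors are retyped from the printed dataframe so that the code runs on its own.

```r
# City and Year columns, typed in from the CityInfo printout
cities <- c("New York", "Los Angeles", "San Francisco", "Chicago",
            "Minneapolis", "Portland", "Dallas", "Philadelphia", "Boston",
            "Detroit", "Denver", "Seattle", "Phoenix", "Atlanta",
            "New Orleans", "Houston", "Washington DC", "Miami", "Milwaukee")
year <- c(1625, 1781, 1776, 1803, 1867, 1845, 1841, 1682, 1630, 1701,
          1858, 1851, 1868, 1843, 1718, 1837, 1790, 1896, 1833)

# Variation 1: match a leading 's' or 'S' by ignoring case
s_cities <- cities[grepl("^s", cities, ignore.case = TRUE)]

# Variation 2: years are numbers, so a range test works without a pattern
early_cities <- cities[year >= 1600 & year < 1800]

print(s_cities)
print(early_cities)
```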

Data were obtained from several sources:

Sports teams in the Big 4 (NFL, MLB, NBA, NHL) and metro population estimates

Bike commuters 1

Bike commuters 2

Year of Foundation

Sunniest cities in US (using National Oceanic and Atmospheric Administration data; average percent of possible sunshine)

Regions census divisions

ANTH630: R Tutorial - Packages and data wrangling with base R

Madeline Brown