
Loading packages and data

Installing and loading packages

R comes with a set of basic functions, but installing additional packages greatly extends what it can do. You can install a package as follows:

install.packages("psych")

Packages can also be installed through the "Tools" > "Install Packages..." menu in RStudio.

Packages only need to be installed once, but must be loaded with each new session of R. Let's load the psych package. We will use this later to produce summary statistics. Use the library() function to load a package.

library(psych)
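Because installing is a one-time step and loading happens every session, it can help to check whether a package is already installed before loading it. The following is a minimal sketch; needs_install() is a hypothetical helper written for this example, not a built-in R function.

```r
# Hypothetical helper: TRUE if the package is NOT installed yet.
# requireNamespace() returns FALSE (instead of raising an error) when a package is missing.
needs_install <- function(pkg) {
  !requireNamespace(pkg, quietly = TRUE)
}

if (needs_install("psych")) {
  # Installing is left as a manual step so this example has no side effects.
  message("psych is not installed; run install.packages(\"psych\") first")
} else {
  library(psych)  # load the package for this session
}
```

The check only reports what is missing, so it is safe to run at the top of any script.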

Setting the working directory

The working directory is the folder where R looks for files to load by default and where any files you create are saved. Setting a working directory saves time because you can refer to files by name instead of by full path.

getwd()  # shows the current working directory
setwd("path")  # replace "path" with the path to a folder, not a file
# Windows: 'C:/Users/username/Desktop' (use forward slashes or doubled
# backslashes) Mac: '/Users/username/Desktop'
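As a small illustration of changing and restoring the working directory, the sketch below switches to R's temporary folder and then switches back. tempdir() is used only so the example runs on any computer; in practice you would use your own project folder.

```r
old_wd <- getwd()       # remember the current working directory

setwd(tempdir())        # switch to a temporary folder (stand-in for your own folder)
wd_during <- getwd()    # the working directory is now the temporary folder
list.files()            # lists the files R can now find by name alone

setwd(old_wd)           # switch back to where we started
```

Saving the original directory first makes it easy to undo the change at the end of a script.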

Reading and writing data

To read in a data file, we can use the read.csv() function. If the data file is in your working directory, you can refer to it by file name alone; otherwise, give the full path. On a Mac, you can right-click the file and hold down the Option key to copy its pathname.

# CityInfo <- read.csv('pathname/CityInfo.csv')  # read using a full path
# CityInfo <- read.csv('CityInfo.csv')  # read from the working directory
# write.csv(CityInfo, 'CityInfo2.csv')  # write the dataframe to a new file
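To see reading and writing work together, here is a minimal round-trip sketch. The two-row dataframe and the file name demo_cities.csv are made up for illustration (the values are taken from the CityInfo data), and the file is written to a temporary folder so nothing in your own directories is touched.

```r
# A tiny stand-in dataframe (not the full CityInfo file)
demo <- data.frame(City = c("Portland", "Denver"),
                   Sunshine_per = c(48, 69))

csv_path <- file.path(tempdir(), "demo_cities.csv")  # hypothetical file name

# row.names = FALSE stops write.csv() from adding an extra column of row numbers
write.csv(demo, csv_path, row.names = FALSE)

demo_back <- read.csv(csv_path)  # read the file back in
demo_back
```

Without row.names = FALSE, the file gains an unnamed first column that read.csv() would bring back as a column called X.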

Exploratory data analysis

Check out this new CityInfo dataframe.

CityInfo
            City State Region Populationmetro_2016 Sunshine_per Bikecom_per
1       New York    NY     NE             20685000           58         1.1
2    Los Angeles    CA      W             15135000           73         1.3
3  San Francisco    CA      W              5955000           66         4.4
4        Chicago    IL     MW              9185000           54         1.7
5    Minneapolis    MN     MW              2795000           58         4.6
6       Portland    OR      W              2000000           48         7.2
7         Dallas    TX      S              6280000           61         0.2
8   Philadelphia    PA     NE              5595000           56         1.9
9         Boston    MA     NE              4490000           58         2.4
10       Detroit    MI     MW              3660000           53         0.8
11        Denver    CO      W              2600000           69         2.5
12       Seattle    WA      W              3475000           47         3.7
13       Phoenix    AZ      W              4295000           85         0.8
14       Atlanta    GA      S              5120000           60         0.7
15   New Orleans    LA      S               925000           57         3.4
16       Houston    TX      S              6005000           59         0.6
17 Washington DC    DC      S              4950000           56         3.9
18         Miami    FL      S              5820000           70         0.9
19     Milwaukee    WI     MW              1415000           54         1.0
   Year Sportsteam_num
1  1625              9
2  1781              8
3  1776              6
4  1803              5
5  1867              4
6  1845              1
7  1841              4
8  1682              4
9  1630              4
10 1701              4
11 1858              4
12 1851              2
13 1868              4
14 1843              3
15 1718              2
16 1837              3
17 1790              4
18 1896              4
19 1833              2

Before we do anything else, let's make a quick boxplot of the sunshine across regions in our dataframe.

boxplot(Sunshine_per ~ Region, data = CityInfo, xlab = "Region", ylab = "Sunshine %", 
    main = "Sunshine by US region", col = "goldenrod")
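The thick line in each box of a boxplot marks the group median. As a rough numeric companion to the plot, the sketch below computes those medians with tapply(). So that it runs on its own, it rebuilds just the Region and Sunshine_per columns from the table above in a dataframe named sun, which is a name chosen for this example.

```r
# Region and Sunshine_per copied from the CityInfo printout, in row order
sun <- data.frame(
  Region = c("NE", "W", "W", "MW", "MW", "W", "S", "NE", "NE", "MW",
             "W", "W", "W", "S", "S", "S", "S", "S", "MW"),
  Sunshine_per = c(58, 73, 66, 54, 58, 48, 61, 56, 58, 53,
                   69, 47, 85, 60, 57, 59, 56, 70, 54)
)

# Apply median() to Sunshine_per separately within each Region
region_medians <- tapply(sun$Sunshine_per, sun$Region, median)
region_medians
```

With the full CityInfo dataframe loaded, the same call would be tapply(CityInfo$Sunshine_per, CityInfo$Region, median).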

Now let's examine our dataframe. Look at the first six rows.

head(CityInfo)
           City State Region Populationmetro_2016 Sunshine_per Bikecom_per Year
1      New York    NY     NE             20685000           58         1.1 1625
2   Los Angeles    CA      W             15135000           73         1.3 1781
3 San Francisco    CA      W              5955000           66         4.4 1776
4       Chicago    IL     MW              9185000           54         1.7 1803
5   Minneapolis    MN     MW              2795000           58         4.6 1867
6      Portland    OR      W              2000000           48         7.2 1845
  Sportsteam_num
1              9
2              8
3              6
4              5
5              4
6              1

Examine the dataframe structure.

str(CityInfo)
'data.frame':   19 obs. of  8 variables:
 $ City                : Factor w/ 19 levels "Atlanta","Boston",..: 13 8 17 3 11 16 4 14 2 6 ...
 $ State               : Factor w/ 17 levels "AZ","CA","CO",..: 12 2 2 7 11 13 15 14 9 10 ...
 $ Region              : Factor w/ 4 levels "MW","NE","S",..: 2 4 4 1 1 4 3 2 2 1 ...
 $ Populationmetro_2016: int  20685000 15135000 5955000 9185000 2795000 2000000 6280000 5595000 4490000 3660000 ...
 $ Sunshine_per        : int  58 73 66 54 58 48 61 56 58 53 ...
 $ Bikecom_per         : num  1.1 1.3 4.4 1.7 4.6 7.2 0.2 1.9 2.4 0.8 ...
 $ Year                : int  1625 1781 1776 1803 1867 1845 1841 1682 1630 1701 ...
 $ Sportsteam_num      : int  9 8 6 5 4 1 4 4 4 4 ...
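The str() output above shows City, State, and Region stored as factors. Whether read.csv() produces factors depends on the stringsAsFactors argument, whose default changed to FALSE in R 4.0.0, so on a recent version of R these columns may show up as chr instead. The sketch below uses a few made-up lines of CSV text to show both behaviors.

```r
# Three rows of CSV text, used here in place of a file
csv_text <- "City,Region\nPortland,W\nBoston,NE\nDenver,W"

as_chr <- read.csv(text = csv_text, stringsAsFactors = FALSE)
as_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)

class(as_chr$Region)   # character
class(as_fct$Region)   # factor
levels(as_fct$Region)  # the distinct categories, sorted alphabetically
```

Setting stringsAsFactors explicitly makes a script behave the same way regardless of the R version.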

And look at summary statistics for the dataframe. summary() is part of base R; describe() comes from the psych package we loaded earlier.

summary(CityInfo)
      City        State    Region Populationmetro_2016  Sunshine_per  
 Atlanta: 1   CA     : 2   MW:4   Min.   :  925000     Min.   :47.00  
 Boston : 1   TX     : 2   NE:3   1st Qu.: 3135000     1st Qu.:55.00  
 Chicago: 1   AZ     : 1   S :6   Median : 4950000     Median :58.00  
 Dallas : 1   CO     : 1   W :6   Mean   : 5809737     Mean   :60.11  
 Denver : 1   DC     : 1          3rd Qu.: 5980000     3rd Qu.:63.50  
 Detroit: 1   FL     : 1          Max.   :20685000     Max.   :85.00  
 (Other):13   (Other):11                                              
  Bikecom_per         Year      Sportsteam_num 
 Min.   :0.200   Min.   :1625   Min.   :1.000  
 1st Qu.:0.850   1st Qu.:1747   1st Qu.:3.000  
 Median :1.700   Median :1833   Median :4.000  
 Mean   :2.268   Mean   :1792   Mean   :4.053  
 3rd Qu.:3.550   3rd Qu.:1848   3rd Qu.:4.000  
 Max.   :7.200   Max.   :1896   Max.   :9.000  
                                               
describe(CityInfo)
                     vars  n       mean         sd    median    trimmed
City*                   1 19      10.00       5.63      10.0      10.00
State*                  2 19       8.95       5.23       9.0       8.94
Region*                 3 19       2.74       1.15       3.0       2.76
Populationmetro_2016    4 19 5809736.84 4786299.16 4950000.0 5222058.82
Sunshine_per            5 19      60.11       9.13      58.0      59.41
Bikecom_per             6 19       2.27       1.83       1.7       2.10
Year                    7 19    1791.84      82.35    1833.0    1795.53
Sportsteam_num          8 19       4.05       1.96       4.0       3.94
                            mad      min        max    range  skew kurtosis
City*                      7.41      1.0       19.0       18  0.00    -1.39
State*                     7.41      1.0       17.0       16 -0.02    -1.51
Region*                    1.48      1.0        4.0        3 -0.35    -1.40
Populationmetro_2016 1971858.00 925000.0 20685000.0 19760000  1.82     2.82
Sunshine_per               5.93     47.0       85.0       38  1.00     0.68
Bikecom_per                1.33      0.2        7.2        7  1.03     0.29
Year                      51.89   1625.0     1896.0      271 -0.78    -0.77
Sportsteam_num             1.48      1.0        9.0        8  0.94     0.59
                             se
City*                      1.29
State*                     1.20
Region*                    0.26
Populationmetro_2016 1098052.33
Sunshine_per               2.09
Bikecom_per                0.42
Year                      18.89
Sportsteam_num             0.45
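Several columns of the describe() output can be reproduced with base R, which helps in reading the table. The sketch below recomputes the mean, sd, and se for Sunshine_per, assuming se is the standard error of the mean (sd divided by the square root of n). The values are copied from the CityInfo printout.

```r
# Sunshine_per values copied from the CityInfo printout
sunshine <- c(58, 73, 66, 54, 58, 48, 61, 56, 58, 53,
              69, 47, 85, 60, 57, 59, 56, 70, 54)

n        <- length(sunshine)   # number of observations
sun_mean <- mean(sunshine)
sun_sd   <- sd(sunshine)       # sample standard deviation
sun_se   <- sun_sd / sqrt(n)   # standard error of the mean

round(c(mean = sun_mean, sd = sun_sd, se = sun_se), 2)
```

The rounded results should match the Sunshine_per row above: 60.11, 9.13, and 2.09.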

Try it

Recalling what we learned about subsetting dataframes, try to complete the following tasks:

  1. Select the Year column.

  2. Select the 5th element of the Year column.

  3. Select the 5th row of the CityInfo dataframe.

  4. Select the 5th and 6th rows.

Solution

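CityInfo$Year  # 1. the Year column, selected by name
CityInfo[, 7]  # 1. the same column, selected by position
CityInfo$Year[5]  # 2. the 5th element of the Year column
CityInfo[5, ]  # 3. the 5th row
CityInfo[c(5, 6), ]  # 4. the 5th and 6th rows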
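There is usually more than one way to write each of these selections. The sketch below shows a few equivalent forms on a six-row stand-in dataframe named mini (a name chosen for this example), built from the first six rows of CityInfo.

```r
# City and Year for the first six rows of CityInfo
mini <- data.frame(
  City = c("New York", "Los Angeles", "San Francisco",
           "Chicago", "Minneapolis", "Portland"),
  Year = c(1625, 1781, 1776, 1803, 1867, 1845)
)

mini[["Year"]]     # same column as mini$Year, selected by name
mini[5, "Year"]    # 5th element of Year, using row and column together
mini[5:6, ]        # 5:6 is shorthand for c(5, 6)
```

Selecting by column name rather than by position keeps the code working if the column order changes.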

Logical tests and dataframe subsetting

You can subset dataframes in numerous ways. Last week we discussed subset(), $, and [ , ]. We can also use logical tests, combined with & (and), | (or), and != (not equal), to subset dataframes based on conditions.

CityInfo[CityInfo$Region == "W" & CityInfo$State == "CA", ]
           City State Region Populationmetro_2016 Sunshine_per Bikecom_per Year
2   Los Angeles    CA      W             15135000           73         1.3 1781
3 San Francisco    CA      W              5955000           66         4.4 1776
  Sportsteam_num
2              8
3              6
CityInfo[CityInfo$Region == "W" | CityInfo$State == "CA", ]
            City State Region Populationmetro_2016 Sunshine_per Bikecom_per
2    Los Angeles    CA      W             15135000           73         1.3
3  San Francisco    CA      W              5955000           66         4.4
6       Portland    OR      W              2000000           48         7.2
11        Denver    CO      W              2600000           69         2.5
12       Seattle    WA      W              3475000           47         3.7
13       Phoenix    AZ      W              4295000           85         0.8
   Year Sportsteam_num
2  1781              8
3  1776              6
6  1845              1
11 1858              4
12 1851              2
13 1868              4
CityInfo[CityInfo$Region == "W" & CityInfo$State != "CA", ]
       City State Region Populationmetro_2016 Sunshine_per Bikecom_per Year
6  Portland    OR      W              2000000           48         7.2 1845
11   Denver    CO      W              2600000           69         2.5 1858
12  Seattle    WA      W              3475000           47         3.7 1851
13  Phoenix    AZ      W              4295000           85         0.8 1868
   Sportsteam_num
6               1
11              4
12              2
13              4
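It can help to look at the logical vectors themselves before using them inside the brackets. The sketch below does this on a five-row stand-in dataframe named cities (a name chosen for this example), and also shows the same filter written with subset().

```r
# Five rows taken from CityInfo, enough to show each logical test
cities <- data.frame(
  City   = c("Los Angeles", "San Francisco", "Portland", "Dallas", "Boston"),
  State  = c("CA", "CA", "OR", "TX", "MA"),
  Region = c("W", "W", "W", "S", "NE")
)

is_west <- cities$Region == "W"   # TRUE for each row in the West
not_ca  <- cities$State != "CA"   # TRUE for each row outside California
is_west & not_ca                  # TRUE only where both tests are TRUE

west_not_ca <- cities[is_west & not_ca, ]                       # bracket form
same_rows   <- subset(cities, Region == "W" & State != "CA")    # subset() form
```

Rows where the combined vector is TRUE are kept, so both forms return only Portland here.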

Suppose we want to subset the city names based on whether they are in either the NE or the W region. We can use %in%, which tests each value for membership in a set.

CityInfo$Region %in% c("NE", "W")  # what does this return?
 [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
[13]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
CityInfo[CityInfo$Region %in% c("NE", "W"), "City"]  # what does this return?
[1] New York      Los Angeles   San Francisco Portland      Philadelphia 
[6] Boston        Denver        Seattle       Phoenix      
19 Levels: Atlanta Boston Chicago Dallas Denver Detroit Houston ... Washington DC
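%in% is a compact way of writing several == tests joined by |. The sketch below checks that the two forms give the same answer, using the Region column copied from the CityInfo printout into a plain vector named region (a name chosen for this example).

```r
# Region column copied from the CityInfo printout, in row order
region <- c("NE", "W", "W", "MW", "MW", "W", "S", "NE", "NE", "MW",
            "W", "W", "W", "S", "S", "S", "S", "S", "MW")

in_ne_w <- region %in% c("NE", "W")          # membership test
by_or   <- region == "NE" | region == "W"    # the same test written out

identical(in_ne_w, by_or)   # do the two forms agree?
sum(in_ne_w)                # TRUE counts as 1, so this counts matching cities
```

The advantage of %in% grows with the number of values being matched, since each extra value would otherwise need its own == test.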

Other functions, such as which(), return the positions of the elements that satisfy a condition. Those positions can then be used to subset a dataframe.

CityInfo
            City State Region Populationmetro_2016 Sunshine_per Bikecom_per
1       New York    NY     NE             20685000           58         1.1
2    Los Angeles    CA      W             15135000           73         1.3
3  San Francisco    CA      W              5955000           66         4.4
4        Chicago    IL     MW              9185000           54         1.7
5    Minneapolis    MN     MW              2795000           58         4.6
6       Portland    OR      W              2000000           48         7.2
7         Dallas    TX      S              6280000           61         0.2
8   Philadelphia    PA     NE              5595000           56         1.9
9         Boston    MA     NE              4490000           58         2.4
10       Detroit    MI     MW              3660000           53         0.8
11        Denver    CO      W              2600000           69         2.5
12       Seattle    WA      W              3475000           47         3.7
13       Phoenix    AZ      W              4295000           85         0.8
14       Atlanta    GA      S              5120000           60         0.7
15   New Orleans    LA      S               925000           57         3.4
16       Houston    TX      S              6005000           59         0.6
17 Washington DC    DC      S              4950000           56         3.9
18         Miami    FL      S              5820000           70         0.9
19     Milwaukee    WI     MW              1415000           54         1.0
   Year Sportsteam_num
1  1625              9
2  1781              8
3  1776              6
4  1803              5
5  1867              4
6  1845              1
7  1841              4
8  1682              4
9  1630              4
10 1701              4
11 1858              4
12 1851              2
13 1868              4
14 1843              3
15 1718              2
16 1837              3
17 1790              4
18 1896              4
19 1833              2

which(CityInfo$Populationmetro_2016 > 3475000)  # returns the positions of the elements that meet the condition
 [1]  1  2  3  4  7  8  9 10 13 14 16 17 18

CityInfo[which(CityInfo$Populationmetro_2016 > 3475000), ]  # keeps only the rows for cities with metro populations over 3475000
            City State Region Populationmetro_2016 Sunshine_per Bikecom_per
1       New York    NY     NE             20685000           58         1.1
2    Los Angeles    CA      W             15135000           73         1.3
3  San Francisco    CA      W              5955000           66         4.4
4        Chicago    IL     MW              9185000           54         1.7
7         Dallas    TX      S              6280000           61         0.2
8   Philadelphia    PA     NE              5595000           56         1.9
9         Boston    MA     NE              4490000           58         2.4
10       Detroit    MI     MW              3660000           53         0.8
13       Phoenix    AZ      W              4295000           85         0.8
14       Atlanta    GA      S              5120000           60         0.7
16       Houston    TX      S              6005000           59         0.6
17 Washington DC    DC      S              4950000           56         3.9
18         Miami    FL      S              5820000           70         0.9
   Year Sportsteam_num
1  1625              9
2  1781              8
3  1776              6
4  1803              5
7  1841              4
8  1682              4
9  1630              4
10 1701              4
13 1868              4
14 1843              3
16 1837              3
17 1790              4
18 1896              4
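One practical difference between which() and a bare logical test appears when data are missing. The CityInfo data have no missing values, so the sketch below uses a short made-up vector named pop with one NA added purely for illustration.

```r
# Three populations from CityInfo plus one hypothetical missing value
pop <- c(20685000, NA, 2000000, 5955000)

big <- pop > 3475000   # TRUE NA FALSE TRUE: the comparison with NA is NA
which(big)             # positions of the TRUE values only; the NA is dropped

pop[big]               # logical indexing keeps an NA in the result
pop[which(big)]        # indexing by position returns only the real matches
```

Wrapping the test in which() is therefore a simple way to avoid NA entries (or all-NA rows, in a dataframe) showing up in a subset.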

Data were obtained from several sources:

Sports teams in the Big 4 (NFL, MLB, NBA, NHL) and metro population estimates

Bike commuters 1

Bike commuters 2

Year of Foundation

Sunniest cities in the US (using National Oceanic and Atmospheric Administration data; average percent of possible sunshine)

Regions (census divisions)