ANTH630 Home

Exploratory Data Analysis

In exploratory data analysis, we often begin by calculating summary statistics for and plotting key variables in our data. The CityInfo dataframe has both factor and numeric data. For example: it might be interesting to examine the relationship between population size and the number of sports teams or between the amount of sunshine and the number of bike commuters. Or maybe we are interested in looking at regional variation in sunshine in our focal cities.

Load in data and examine the structure.

CityInfo <- read.csv("https://maddiebrown.github.io/ANTH630/data/Cityinfo.csv")
str(CityInfo)
'data.frame':   19 obs. of  8 variables:
 $ City                : Factor w/ 19 levels "Atlanta","Boston",..: 13 8 17 3 11 16 4 14 2 6 ...
 $ State               : Factor w/ 17 levels "AZ","CA","CO",..: 12 2 2 7 11 13 15 14 9 10 ...
 $ Region              : Factor w/ 4 levels "MW","NE","S",..: 2 4 4 1 1 4 3 2 2 1 ...
 $ Populationmetro_2016: int  20685000 15135000 5955000 9185000 2795000 2000000 6280000 5595000 4490000 3660000 ...
 $ Sunshine_per        : int  58 73 66 54 58 48 61 56 58 53 ...
 $ Bikecom_per         : num  1.1 1.3 4.4 1.7 4.6 7.2 0.2 1.9 2.4 0.8 ...
 $ Year                : int  1625 1781 1776 1803 1867 1845 1841 1682 1630 1701 ...
 $ Sportsteam_num      : int  9 8 6 5 4 1 4 4 4 4 ...

Let’s calculate the mean amount of sunshine for midwestern cities.

mean(CityInfo[CityInfo$Region == "MW", "Sunshine_per"])
[1] 54.75

Now let’s subset the dataframe according to regions and calculate the mean sunshine for all regions. Try using the aggregate() function.

aggregate(Sunshine_per ~ Region, CityInfo, mean)
  Region Sunshine_per
1     MW     54.75000
2     NE     57.33333
3      S     60.50000
4      W     64.66667

Try it

Often, we want to calculate numerous descriptive statistics simultaneously. There are numerous functions which can accomplish this task. Here, we’ll try using functions from the psych package we downloaded earlier.

  1. Make sure you downloaded and loaded the psych package

  2. Using the describe() function, calculate summary statistics for the entire CityInfo dataframe

  3. Calculate summary statistics for only the number of sports teams per city

  4. The describeBy() function is useful when you want to summarize numerous variables according to the same grouping variable. Use describeBy() to generate summary statistics about the number of sports teams oer city by region. Assign the results to a new object called Sportstats.

Click for solution

library(psych)
describe(CityInfo)
                     vars  n       mean         sd    median    trimmed
City*                   1 19      10.00       5.63      10.0      10.00
State*                  2 19       8.95       5.23       9.0       8.94
Region*                 3 19       2.74       1.15       3.0       2.76
Populationmetro_2016    4 19 5809736.84 4786299.16 4950000.0 5222058.82
Sunshine_per            5 19      60.11       9.13      58.0      59.41
Bikecom_per             6 19       2.27       1.83       1.7       2.10
Year                    7 19    1791.84      82.35    1833.0    1795.53
Sportsteam_num          8 19       4.05       1.96       4.0       3.94
                            mad      min        max    range  skew kurtosis
City*                      7.41      1.0       19.0       18  0.00    -1.39
State*                     7.41      1.0       17.0       16 -0.02    -1.51
Region*                    1.48      1.0        4.0        3 -0.35    -1.40
Populationmetro_2016 1971858.00 925000.0 20685000.0 19760000  1.82     2.82
Sunshine_per               5.93     47.0       85.0       38  1.00     0.68
Bikecom_per                1.33      0.2        7.2        7  1.03     0.29
Year                      51.89   1625.0     1896.0      271 -0.78    -0.77
Sportsteam_num             1.48      1.0        9.0        8  0.94     0.59
                             se
City*                      1.29
State*                     1.20
Region*                    0.26
Populationmetro_2016 1098052.33
Sunshine_per               2.09
Bikecom_per                0.42
Year                      18.89
Sportsteam_num             0.45
describe(CityInfo$Sportsteam_num)
   vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 19 4.05 1.96      4    3.94 1.48   1   9     8 0.94     0.59 0.45
describeBy(CityInfo$Sportsteam_num, CityInfo$Region)

 Descriptive statistics by group 
group: MW
   vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
X1    1 4 3.75 1.26      4    3.75 0.74   2   5     3 -0.42    -1.82 0.63
------------------------------------------------------------ 
group: NE
   vars n mean   sd median trimmed mad min max range skew kurtosis   se
X1    1 3 5.67 2.89      4    5.67   0   4   9     5 0.38    -2.33 1.67
------------------------------------------------------------ 
group: S
   vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
X1    1 6 3.33 0.82    3.5    3.33 0.74   2   4     2 -0.48    -1.58 0.33
------------------------------------------------------------ 
group: W
   vars n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 6 4.17 2.56      4    4.17 2.97   1   8     7  0.2    -1.65 1.05
Sportstats <- describeBy(CityInfo$Sportsteam_num, CityInfo$Region)
Sportstats

 Descriptive statistics by group 
group: MW
   vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
X1    1 4 3.75 1.26      4    3.75 0.74   2   5     3 -0.42    -1.82 0.63
------------------------------------------------------------ 
group: NE
   vars n mean   sd median trimmed mad min max range skew kurtosis   se
X1    1 3 5.67 2.89      4    5.67   0   4   9     5 0.38    -2.33 1.67
------------------------------------------------------------ 
group: S
   vars n mean   sd median trimmed  mad min max range  skew kurtosis   se
X1    1 6 3.33 0.82    3.5    3.33 0.74   2   4     2 -0.48    -1.58 0.33
------------------------------------------------------------ 
group: W
   vars n mean   sd median trimmed  mad min max range skew kurtosis   se
X1    1 6 4.17 2.56      4    4.17 2.97   1   8     7  0.2    -1.65 1.05

Measures of dispersion

In addition to central tendency metrics, it is also often useful to examine the spread or dispersion of observations within a dataset.

# Make a histogram of the population data
hist(CityInfo$Populationmetro_2016)


# What is the variance in population?
var(CityInfo$Populationmetro_2016)
[1] 2.290866e+13

# you can also look at a range of summary statistics for populations
describeBy(CityInfo$Populationmetro_2016)
   vars  n    mean      sd  median trimmed     mad    min      max    range
X1    1 19 5809737 4786299 4950000 5222059 1971858 925000 20685000 19760000
   skew kurtosis      se
X1 1.82     2.82 1098052

Another tool for quickly examining the spread of the data is the table() function. How many cities are in each region?

table(CityInfo$Region)

MW NE  S  W 
 4  3  6  6 

Grouping variables together

Oftentimes, data are not in the most useful format for particular analyses. We have to reshape and summarize data in order to draw insights from them. For this next section, we will use data from: Tiffany Stephens and Ginny Eckert. 2019. Boat-based counts of sea otters at specific sites in Southeast Alaska. Knowledge Network for Biocomplexity. urn:uuid:b910f74b-171b-4d2b-b065-fb21823a8e84. https://knb.ecoinformatics.org/view/urn%3Auuid%3Ab910f74b-171b-4d2b-b065-fb21823a8e84#urn%3Auuid%3A7eba259b-eeb5-4375-9596-e797bbb0b27d. These data are sea otter counts from 2017 and 2018 in Alaska.

First read in and examine the data.

otter <- read.csv("https://maddiebrown.github.io/ANTH630/data/sea_otter_counts_2017&2018_CLEANDATA.csv")
str(otter)
'data.frame':   1337 obs. of  8 variables:
 $ region     : Factor w/ 1 level "west prince of wales island, alaska": 1 1 1 1 1 1 1 1 1 1 ...
 $ site_name  : Factor w/ 43 levels "Big Clam Bay",..: 1 1 2 2 2 2 2 2 2 2 ...
 $ latitude_N : num  55.2 55.2 55.6 55.6 55.6 ...
 $ longitude_E: num  -133 -133 -133 -133 -133 ...
 $ date_DDMMYY: Factor w/ 48 levels "1/8/18","10/8/18",..: 11 30 11 11 11 11 11 11 11 11 ...
 $ year       : int  2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
 $ replicate  : int  1 2 1 1 1 1 1 1 1 1 ...
 $ n_otter    : int  1 0 1 2 1 1 3 1 13 1 ...

This dataset has 8 variables: region, site_name, latitude_N, latitude_E, data_DDMMYY, year, replicate, and n_otter.

Let’s aggregate the mean number of otters observed on a given day by site.

# data are aggregated based on site name and the function 'mean' is applied to
# them
aggregate(otter, by = list(otter$site_name), FUN = mean, na.rm = TRUE)

The above applied the mean() function to all columns. This can be useful when you have many numeric variables you wish to wuickly summarize. Other times however, it can be helpful to isolate specific variables for comparison.

aggregate(formula = n_otter ~ site_name, FUN = mean, data = otter)
           site_name   n_otter
1       Big Clam Bay 0.5000000
2       Big Tree Bay 1.9876543
3     Blanquizal Bay 2.6428571
4     Chusini Cove 1 1.7704918
5     Chusini Cove 2 2.9166667
6       Dunbar Inlet 1.6818182
7       Farallon Bay 0.0000000
8        Garcia Cove 2.2131148
9   Goat Mouth Inlet 0.0000000
10      Guktu Cove 1 3.0454545
11      Guktu Cove 2 3.7083333
12      Hauti Island 1.9583333
13        Hetta Cove 0.0000000
14        Kaguk Cove 2.5044248
15      Kinani Point 4.2325581
16   Mushroom Island 0.0000000
17 N Fish Egg Island 1.8800000
18   Natzuhini Bay 1 0.0000000
19   Natzuhini Bay 2 0.0000000
20   Natzuhini Bay 3 0.0000000
21       Naukati Bay 1.1276596
22      North Pass 1 0.0000000
23      North Pass 2 0.0000000
24      Nossuk Bay 1 1.5000000
25      Nossuk Bay 2 1.6923077
26      Nossuk Bay 3 1.6206897
27      Port Caldera 0.0000000
28      Port Refugio 0.4000000
29 S Fish Egg Island 1.8750000
30 S Wadleigh Island 1.7000000
31               S16 5.9696970
32               S23 3.4210526
33               S25 1.0000000
34                S3 1.1818182
35               S32 2.4411765
36               S33 2.2291667
37   Salt Lake Bay 1 2.7368421
38   Salt Lake Bay 2 2.0243902
39               SHB 1.0000000
40     Shinaku Inlet 2.7640449
41          Soda Bay 2.2325581
42   Sukkwan Narrows 0.5000000
43     Trocadero Bay 0.6666667

Try it

  1. How could you find the total number of otters observed by site?

  2. Using order() rearrange the dataframe so by the total number of otters per site so that the sites with the most otters are at the top of the data. Look at the help file if you are stuck.

  3. What are the top 3 sites for otter sightings? Select out just these cases

Click for solution

notterpersite <- aggregate(formula = n_otter ~ site_name, FUN = sum, data = otter)
notterpersite
           site_name n_otter
1       Big Clam Bay       1
2       Big Tree Bay     161
3     Blanquizal Bay     222
4     Chusini Cove 1     108
5     Chusini Cove 2     140
6       Dunbar Inlet      37
7       Farallon Bay       0
8        Garcia Cove     135
9   Goat Mouth Inlet       0
10      Guktu Cove 1     134
11      Guktu Cove 2     178
12      Hauti Island      47
13        Hetta Cove       0
14        Kaguk Cove     283
15      Kinani Point     182
16   Mushroom Island       0
17 N Fish Egg Island      47
18   Natzuhini Bay 1       0
19   Natzuhini Bay 2       0
20   Natzuhini Bay 3       0
21       Naukati Bay      53
22      North Pass 1       0
23      North Pass 2       0
24      Nossuk Bay 1      87
25      Nossuk Bay 2      44
26      Nossuk Bay 3      47
27      Port Caldera       0
28      Port Refugio       2
29 S Fish Egg Island      60
30 S Wadleigh Island      34
31               S16     197
32               S23      65
33               S25       1
34                S3      13
35               S32      83
36               S33     107
37   Salt Lake Bay 1     208
38   Salt Lake Bay 2     166
39               SHB       1
40     Shinaku Inlet     246
41          Soda Bay      96
42   Sukkwan Narrows       1
43     Trocadero Bay       2
# reorder dataframe
notterpersite <- notterpersite[order(notterpersite$n_otter, decreasing = T), ]
notterpersite[1:3, ]
        site_name n_otter
14     Kaguk Cove     283
40  Shinaku Inlet     246
3  Blanquizal Bay     222

Try it

Practice your data wrangling and exploratory data analysis skills by answering the following. Provide all code required to have only the required information as the output.

  1. Select the latitude and longitude of the site with the highest number of otter sightings on any single day.

  2. How many observations come from Big Tree Day?

  3. What was the total number of otters observed over the course of 2017?

Click for solution

otter[otter$n_otter == max(otter$n_otter), c("latitude_N", "longitude_E")]
    latitude_N longitude_E
899   54.88742    -132.836

length(otter$site_name[otter$site_name == "Big Tree Bay"])
[1] 81

sum(otter[otter$year == "2017", "n_otter"])
[1] 1560

Using ifelse() statements

Another common form of logical testing in R is the ifelse() statement. In this case, you pass a logical test to R and if the output is true, a certain action is performed, then if it is false, another action is performed. This can be used to make new variables, subset data, color points on a graph and much more.

Using the CityInfo data and ifelse() we can mark each row based on whether or not the region is MW.

ifelse(CityInfo$Region == "MW", "yea", "nope")
 [1] "nope" "nope" "nope" "yea"  "yea"  "nope" "nope" "nope" "nope" "yea" 
[11] "nope" "nope" "nope" "nope" "nope" "nope" "nope" "nope" "yea" 

We can also see which cities are both in the midwest and have over 2% of commuters as bike commuters.

ifelse(CityInfo$Region == "MW" & CityInfo$Bikecom_per > 2, "yea", "nope")
 [1] "nope" "nope" "nope" "nope" "yea"  "nope" "nope" "nope" "nope" "nope"
[11] "nope" "nope" "nope" "nope" "nope" "nope" "nope" "nope" "nope"

Try it

ifelse() statements can also be nested. How might you write code to produce the output “sunny” for all cities with over 70 percent sunshine, and “kind of sunny” for cities with between 55 and 70 percent sunshine and “not sunny” for cities with less than 55 percent sunshine.

Click for solution

CityInfo$sunlevel <- ifelse(CityInfo$Sunshine_per > 70, "sunny", ifelse(CityInfo$Sunshine_per > 
    55, "kind of sunny", "not sunny"))

# check your work
cbind(CityInfo$Sunshine_per, CityInfo$sunlevel)
      [,1] [,2]           
 [1,] "58" "kind of sunny"
 [2,] "73" "sunny"        
 [3,] "66" "kind of sunny"
 [4,] "54" "not sunny"    
 [5,] "58" "kind of sunny"
 [6,] "48" "not sunny"    
 [7,] "61" "kind of sunny"
 [8,] "56" "kind of sunny"
 [9,] "58" "kind of sunny"
[10,] "53" "not sunny"    
[11,] "69" "kind of sunny"
[12,] "47" "not sunny"    
[13,] "85" "sunny"        
[14,] "60" "kind of sunny"
[15,] "57" "kind of sunny"
[16,] "59" "kind of sunny"
[17,] "56" "kind of sunny"
[18,] "70" "kind of sunny"
[19,] "54" "not sunny"    

The paste() function

Strings can be modified directly using the paste() function. This can be useful for creating new columns or systematically changing string data.

Fr example, we know that the CityInfo$Year column refers to the year each city was founded. Let’s add the string “Founded in” before each year name.

paste("Founded in", CityInfo$Year)
 [1] "Founded in 1625" "Founded in 1781" "Founded in 1776" "Founded in 1803"
 [5] "Founded in 1867" "Founded in 1845" "Founded in 1841" "Founded in 1682"
 [9] "Founded in 1630" "Founded in 1701" "Founded in 1858" "Founded in 1851"
[13] "Founded in 1868" "Founded in 1843" "Founded in 1718" "Founded in 1837"
[17] "Founded in 1790" "Founded in 1896" "Founded in 1833"

Now let’s paste together the city and state names.

paste(CityInfo$City, CityInfo$State)
 [1] "New York NY"      "Los Angeles CA"   "San Francisco CA" "Chicago IL"      
 [5] "Minneapolis MN"   "Portland OR"      "Dallas TX"        "Philadelphia PA" 
 [9] "Boston MA"        "Detroit MI"       "Denver CO"        "Seattle WA"      
[13] "Phoenix AZ"       "Atlanta GA"       "New Orleans LA"   "Houston TX"      
[17] "Washington DC DC" "Miami FL"         "Milwaukee WI"    

An important feature of the paste() function is that you can customize the characters separating each value.


paste("Actor", "Model", "Writer", sep = "-")
[1] "Actor-Model-Writer"

paste(CityInfo$City, CityInfo$State, sep = ": ")
 [1] "New York: NY"      "Los Angeles: CA"   "San Francisco: CA"
 [4] "Chicago: IL"       "Minneapolis: MN"   "Portland: OR"     
 [7] "Dallas: TX"        "Philadelphia: PA"  "Boston: MA"       
[10] "Detroit: MI"       "Denver: CO"        "Seattle: WA"      
[13] "Phoenix: AZ"       "Atlanta: GA"       "New Orleans: LA"  
[16] "Houston: TX"       "Washington DC: DC" "Miami: FL"        
[19] "Milwaukee: WI"    

Plotting in base R

A number of different types of charts can be made with the base graphics package.

  1. Scatter plots can be made using the plot() function.
plot(CityInfo$Populationmetro_2016, CityInfo$Sportsteam_num)

  1. Line plots are made with the plot() function, while setting the type to line.
plot(CityInfo$Populationmetro_2016, CityInfo$Sportsteam_num, type = "l")

# This plot also conveys how R will sometimes give you an answer, even when the
# output is meaningless. It's up to you to distinguish meaningful results from
# nonsense.
  1. Histograms can be made with the hist() function.
hist(CityInfo$Bikecom_per)

  1. Boxplots can be made with the boxplot() function.
boxplot(Bikecom_per ~ Region, data = CityInfo)

Adding labels

Adding descriptive labels is an important part of data visualization. In base R plotting, labels can be added with the main=, xlab= and ylab= arguments.

boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region", 
    xlab = "Region", ylab = "Bike Commuters (%)")

Sometimes we also want to add annotations to plots noting things like sample size, key observations of interest, or summary statistics. Text can be added to an existing plot with thetext() function. The position is set with x and y coordinates at the scale of the plot and the text can be customized in numerous ways.

boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region", 
    xlab = "Region", ylab = "Bike Commuters (%)")
text(1, 6, "Added text", col = "blue")

Text can also be added to plots based on the position of data points in the plot. Here, we add text above each regional boxplot.

boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region", 
    xlab = "Region", ylab = "Bike Commuters (%)")
text(CityInfo$Region, 6, "Added text")

Add text over each boxplot at the mean point on each plot.

boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region", 
    xlab = "Region", ylab = "Bike Commuters (%)")
text(1:4, aggregate(Bikecom_per ~ Region, FUN = mean, data = CityInfo)$Bikecom_per, 
    "Added text", cex = 0.7, col = "blue")

Try it

Label the bike commuter boxplot with the number of cities per region (that is, the n for each boxplot). Set the label up so that it will automatically update if the number of observations change. That is, find the number algorithmically rather than manually setting the number.

Click for solution

boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region", 
    xlab = "Region", ylab = "Bike Commuters (%)")
text(1:4, 6, paste("N=", aggregate(City ~ Region, FUN = length, data = CityInfo)$City, 
    sep = ""))


# aggregate(City~Region,FUN=length, data=CityInfo)

Adjusting graphical parameters

By adjusting graphical parameters such as point shape, color, and size, you can represent multiple relationships in a single figure. In the figures below, we can differentiate the cities’ regions using different shapes and color.

Below is a basic plot, with no customized graphical parameters.

plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per)

The pch= argument sets the shape of the points. In this case, we define the point shape based on the region each city point is in.

plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, pch = c(CityInfo$Region))

The col= argument sets the color of the points.

plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, col = CityInfo$Region)

The cex= argument sets the size of the points.

plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, col = CityInfo$Region, pch = 19, 
    cex = 3)

Try it

In the figures above, how do we know which symbol or color refers to which region? We have to make a legend. Try using the legend() to add a legend to your scatterplot. If you’re stuck, take a look at the legend() help file.

Click for solution What happened with the code below? How might we fix it?

plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, col = CityInfo$Region, pch = 19, 
    cex = 1.2)
legend(80, 7, legend = CityInfo$Region, fill = CityInfo$Region)

The Region variable is a factor. This means it has unique properties we can use to our advantage., such as referring to the individual levels. It is also important to note that values are stored in the factor according to their level number, so if you want to convert the data to other classes, you must first adjust for this.

plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, col = CityInfo$Region, pch = 19, 
    cex = 1.2)
legend(80, 7, legend = levels(CityInfo$Region), fill = 1:length(levels(CityInfo$Region)), 
    cex = 0.7)

Colors

R has 657 built-in colors. You can view them by using the colors() function. Try it!

colors()

How did I know how many named colors there are?

length(colors())
[1] 657

At any given time, R will use a specific palette whenever colors are required. You can view this with the palette() function.

palette()
[1] "black"   "red"     "green3"  "blue"    "cyan"    "magenta" "yellow" 
[8] "gray"   

This is the default palette. The colors are arranged sequentially, such that the first color used will be black, then red, then green3, etc. This is why in our plotting example above the colors could be referred to with their numerical index.

The RColorBrewer package makes beautiful palettes for data visualization in R. Check out the palettes here: ColorBrewer2.0.

We can also explore a few more useful palettes.

rpal <- rainbow(4)
hpal <- heat.colors(4)
tpal <- topo.colors(4)

par(mfrow = c(2, 2))
plot(CityInfo$Populationmetro_2016, CityInfo$Bikecom_per, col = CityInfo$Region, 
    pch = 19, cex = 1.2)
plot(CityInfo$Populationmetro_2016, CityInfo$Bikecom_per, col = rpal[CityInfo$Region], 
    pch = 19, cex = 1.2)
plot(CityInfo$Populationmetro_2016, CityInfo$Bikecom_per, col = hpal[CityInfo$Region], 
    pch = 19, cex = 1.2)
plot(CityInfo$Populationmetro_2016, CityInfo$Bikecom_per, col = tpal[CityInfo$Region], 
    pch = 19, cex = 1.2)

par(mfrow = c(1, 1))

Symbol shapes

There are 25 different shapes available in base R. Some can have different colors for the border and fill, try them out as you continue exploring plotting in R. Read more here

Exploring data with graphs

The pairs() function quickly shows the correlations between all variables in a dataframe. This can reveal potential combinations of variables that warrant further investigation.

pairs(CityInfo[4:8])