In exploratory data analysis, we often begin by calculating summary statistics for and plotting key variables in our data. The CityInfo dataframe has both factor and numeric data. For example: it might be interesting to examine the relationship between population size and the number of sports teams or between the amount of sunshine and the number of bike commuters. Or maybe we are interested in looking at regional variation in sunshine in our focal cities.
Load in data and examine the structure.
CityInfo <- read.csv("https://maddiebrown.github.io/ANTH630/data/Cityinfo.csv")
str(CityInfo)
'data.frame': 19 obs. of 8 variables:
$ City : Factor w/ 19 levels "Atlanta","Boston",..: 13 8 17 3 11 16 4 14 2 6 ...
$ State : Factor w/ 17 levels "AZ","CA","CO",..: 12 2 2 7 11 13 15 14 9 10 ...
$ Region : Factor w/ 4 levels "MW","NE","S",..: 2 4 4 1 1 4 3 2 2 1 ...
$ Populationmetro_2016: int 20685000 15135000 5955000 9185000 2795000 2000000 6280000 5595000 4490000 3660000 ...
$ Sunshine_per : int 58 73 66 54 58 48 61 56 58 53 ...
$ Bikecom_per : num 1.1 1.3 4.4 1.7 4.6 7.2 0.2 1.9 2.4 0.8 ...
$ Year : int 1625 1781 1776 1803 1867 1845 1841 1682 1630 1701 ...
$ Sportsteam_num : int 9 8 6 5 4 1 4 4 4 4 ...
Let’s calculate the mean amount of sunshine for midwestern cities.
mean(CityInfo[CityInfo$Region == "MW", "Sunshine_per"])
[1] 54.75
Now let’s subset the dataframe according to regions and calculate the mean sunshine for all regions. Try using the aggregate()
function.
aggregate(Sunshine_per ~ Region, CityInfo, mean)
Region Sunshine_per
1 MW 54.75000
2 NE 57.33333
3 S 60.50000
4 W 64.66667
Often, we want to calculate numerous descriptive statistics simultaneously. There are numerous functions which can accomplish this task. Here, we’ll try using functions from the psych
package we downloaded earlier.
Make sure you downloaded and loaded the psych
package
Using the describe()
function, calculate summary statistics for the entire CityInfo
dataframe
Calculate summary statistics for only the number of sports teams per city
The describeBy()
function is useful when you want to summarize numerous variables according to the same grouping variable. Use describeBy()
to generate summary statistics about the number of sports teams oer city by region. Assign the results to a new object called Sportstats
.
Click for solution
library(psych)
describe(CityInfo)
vars n mean sd median trimmed
City* 1 19 10.00 5.63 10.0 10.00
State* 2 19 8.95 5.23 9.0 8.94
Region* 3 19 2.74 1.15 3.0 2.76
Populationmetro_2016 4 19 5809736.84 4786299.16 4950000.0 5222058.82
Sunshine_per 5 19 60.11 9.13 58.0 59.41
Bikecom_per 6 19 2.27 1.83 1.7 2.10
Year 7 19 1791.84 82.35 1833.0 1795.53
Sportsteam_num 8 19 4.05 1.96 4.0 3.94
mad min max range skew kurtosis
City* 7.41 1.0 19.0 18 0.00 -1.39
State* 7.41 1.0 17.0 16 -0.02 -1.51
Region* 1.48 1.0 4.0 3 -0.35 -1.40
Populationmetro_2016 1971858.00 925000.0 20685000.0 19760000 1.82 2.82
Sunshine_per 5.93 47.0 85.0 38 1.00 0.68
Bikecom_per 1.33 0.2 7.2 7 1.03 0.29
Year 51.89 1625.0 1896.0 271 -0.78 -0.77
Sportsteam_num 1.48 1.0 9.0 8 0.94 0.59
se
City* 1.29
State* 1.20
Region* 0.26
Populationmetro_2016 1098052.33
Sunshine_per 2.09
Bikecom_per 0.42
Year 18.89
Sportsteam_num 0.45
describe(CityInfo$Sportsteam_num)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 19 4.05 1.96 4 3.94 1.48 1 9 8 0.94 0.59 0.45
describeBy(CityInfo$Sportsteam_num, CityInfo$Region)
Descriptive statistics by group
group: MW
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 4 3.75 1.26 4 3.75 0.74 2 5 3 -0.42 -1.82 0.63
------------------------------------------------------------
group: NE
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 3 5.67 2.89 4 5.67 0 4 9 5 0.38 -2.33 1.67
------------------------------------------------------------
group: S
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 6 3.33 0.82 3.5 3.33 0.74 2 4 2 -0.48 -1.58 0.33
------------------------------------------------------------
group: W
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 6 4.17 2.56 4 4.17 2.97 1 8 7 0.2 -1.65 1.05
Sportstats <- describeBy(CityInfo$Sportsteam_num, CityInfo$Region)
Sportstats
Descriptive statistics by group
group: MW
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 4 3.75 1.26 4 3.75 0.74 2 5 3 -0.42 -1.82 0.63
------------------------------------------------------------
group: NE
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 3 5.67 2.89 4 5.67 0 4 9 5 0.38 -2.33 1.67
------------------------------------------------------------
group: S
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 6 3.33 0.82 3.5 3.33 0.74 2 4 2 -0.48 -1.58 0.33
------------------------------------------------------------
group: W
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 6 4.17 2.56 4 4.17 2.97 1 8 7 0.2 -1.65 1.05
In addition to central tendency metrics, it is also often useful to examine the spread or dispersion of observations within a dataset.
# Make a histogram of the population data
hist(CityInfo$Populationmetro_2016)
# What is the variance in population?
var(CityInfo$Populationmetro_2016)
[1] 2.290866e+13
# you can also look at a range of summary statistics for populations
describeBy(CityInfo$Populationmetro_2016)
vars n mean sd median trimmed mad min max range
X1 1 19 5809737 4786299 4950000 5222059 1971858 925000 20685000 19760000
skew kurtosis se
X1 1.82 2.82 1098052
Another tool for quickly examining the spread of the data is the table()
function. How many cities are in each region?
table(CityInfo$Region)
MW NE S W
4 3 6 6
Oftentimes, data are not in the most useful format for particular analyses. We have to reshape and summarize data in order to draw insights from them. For this next section, we will use data from: Tiffany Stephens and Ginny Eckert. 2019. Boat-based counts of sea otters at specific sites in Southeast Alaska. Knowledge Network for Biocomplexity. urn:uuid:b910f74b-171b-4d2b-b065-fb21823a8e84. https://knb.ecoinformatics.org/view/urn%3Auuid%3Ab910f74b-171b-4d2b-b065-fb21823a8e84#urn%3Auuid%3A7eba259b-eeb5-4375-9596-e797bbb0b27d. These data are sea otter counts from 2017 and 2018 in Alaska.
First read in and examine the data.
otter <- read.csv("https://maddiebrown.github.io/ANTH630/data/sea_otter_counts_2017&2018_CLEANDATA.csv")
str(otter)
'data.frame': 1337 obs. of 8 variables:
$ region : Factor w/ 1 level "west prince of wales island, alaska": 1 1 1 1 1 1 1 1 1 1 ...
$ site_name : Factor w/ 43 levels "Big Clam Bay",..: 1 1 2 2 2 2 2 2 2 2 ...
$ latitude_N : num 55.2 55.2 55.6 55.6 55.6 ...
$ longitude_E: num -133 -133 -133 -133 -133 ...
$ date_DDMMYY: Factor w/ 48 levels "1/8/18","10/8/18",..: 11 30 11 11 11 11 11 11 11 11 ...
$ year : int 2018 2018 2018 2018 2018 2018 2018 2018 2018 2018 ...
$ replicate : int 1 2 1 1 1 1 1 1 1 1 ...
$ n_otter : int 1 0 1 2 1 1 3 1 13 1 ...
This dataset has 8 variables: region, site_name, latitude_N, latitude_E, data_DDMMYY, year, replicate, and n_otter.
Let’s aggregate the mean number of otters observed on a given day by site.
# data are aggregated based on site name and the function 'mean' is applied to
# them
aggregate(otter, by = list(otter$site_name), FUN = mean, na.rm = TRUE)
The above applied the mean()
function to all columns. This can be useful when you have many numeric variables you wish to wuickly summarize. Other times however, it can be helpful to isolate specific variables for comparison.
aggregate(formula = n_otter ~ site_name, FUN = mean, data = otter)
site_name n_otter
1 Big Clam Bay 0.5000000
2 Big Tree Bay 1.9876543
3 Blanquizal Bay 2.6428571
4 Chusini Cove 1 1.7704918
5 Chusini Cove 2 2.9166667
6 Dunbar Inlet 1.6818182
7 Farallon Bay 0.0000000
8 Garcia Cove 2.2131148
9 Goat Mouth Inlet 0.0000000
10 Guktu Cove 1 3.0454545
11 Guktu Cove 2 3.7083333
12 Hauti Island 1.9583333
13 Hetta Cove 0.0000000
14 Kaguk Cove 2.5044248
15 Kinani Point 4.2325581
16 Mushroom Island 0.0000000
17 N Fish Egg Island 1.8800000
18 Natzuhini Bay 1 0.0000000
19 Natzuhini Bay 2 0.0000000
20 Natzuhini Bay 3 0.0000000
21 Naukati Bay 1.1276596
22 North Pass 1 0.0000000
23 North Pass 2 0.0000000
24 Nossuk Bay 1 1.5000000
25 Nossuk Bay 2 1.6923077
26 Nossuk Bay 3 1.6206897
27 Port Caldera 0.0000000
28 Port Refugio 0.4000000
29 S Fish Egg Island 1.8750000
30 S Wadleigh Island 1.7000000
31 S16 5.9696970
32 S23 3.4210526
33 S25 1.0000000
34 S3 1.1818182
35 S32 2.4411765
36 S33 2.2291667
37 Salt Lake Bay 1 2.7368421
38 Salt Lake Bay 2 2.0243902
39 SHB 1.0000000
40 Shinaku Inlet 2.7640449
41 Soda Bay 2.2325581
42 Sukkwan Narrows 0.5000000
43 Trocadero Bay 0.6666667
How could you find the total number of otters observed by site?
Using order()
rearrange the dataframe so by the total number of otters per site so that the sites with the most otters are at the top of the data. Look at the help file if you are stuck.
What are the top 3 sites for otter sightings? Select out just these cases
Click for solution
notterpersite <- aggregate(formula = n_otter ~ site_name, FUN = sum, data = otter)
notterpersite
site_name n_otter
1 Big Clam Bay 1
2 Big Tree Bay 161
3 Blanquizal Bay 222
4 Chusini Cove 1 108
5 Chusini Cove 2 140
6 Dunbar Inlet 37
7 Farallon Bay 0
8 Garcia Cove 135
9 Goat Mouth Inlet 0
10 Guktu Cove 1 134
11 Guktu Cove 2 178
12 Hauti Island 47
13 Hetta Cove 0
14 Kaguk Cove 283
15 Kinani Point 182
16 Mushroom Island 0
17 N Fish Egg Island 47
18 Natzuhini Bay 1 0
19 Natzuhini Bay 2 0
20 Natzuhini Bay 3 0
21 Naukati Bay 53
22 North Pass 1 0
23 North Pass 2 0
24 Nossuk Bay 1 87
25 Nossuk Bay 2 44
26 Nossuk Bay 3 47
27 Port Caldera 0
28 Port Refugio 2
29 S Fish Egg Island 60
30 S Wadleigh Island 34
31 S16 197
32 S23 65
33 S25 1
34 S3 13
35 S32 83
36 S33 107
37 Salt Lake Bay 1 208
38 Salt Lake Bay 2 166
39 SHB 1
40 Shinaku Inlet 246
41 Soda Bay 96
42 Sukkwan Narrows 1
43 Trocadero Bay 2
# reorder dataframe
notterpersite <- notterpersite[order(notterpersite$n_otter, decreasing = T), ]
notterpersite[1:3, ]
site_name n_otter
14 Kaguk Cove 283
40 Shinaku Inlet 246
3 Blanquizal Bay 222
Practice your data wrangling and exploratory data analysis skills by answering the following. Provide all code required to have only the required information as the output.
Select the latitude and longitude of the site with the highest number of otter sightings on any single day.
How many observations come from Big Tree Day?
What was the total number of otters observed over the course of 2017?
Click for solution
otter[otter$n_otter == max(otter$n_otter), c("latitude_N", "longitude_E")]
latitude_N longitude_E
899 54.88742 -132.836
length(otter$site_name[otter$site_name == "Big Tree Bay"])
[1] 81
sum(otter[otter$year == "2017", "n_otter"])
[1] 1560
Another common form of logical testing in R is the ifelse()
statement. In this case, you pass a logical test to R and if the output is true, a certain action is performed, then if it is false, another action is performed. This can be used to make new variables, subset data, color points on a graph and much more.
Using the CityInfo
data and ifelse()
we can mark each row based on whether or not the region is MW.
ifelse(CityInfo$Region == "MW", "yea", "nope")
[1] "nope" "nope" "nope" "yea" "yea" "nope" "nope" "nope" "nope" "yea"
[11] "nope" "nope" "nope" "nope" "nope" "nope" "nope" "nope" "yea"
We can also see which cities are both in the midwest and have over 2% of commuters as bike commuters.
ifelse(CityInfo$Region == "MW" & CityInfo$Bikecom_per > 2, "yea", "nope")
[1] "nope" "nope" "nope" "nope" "yea" "nope" "nope" "nope" "nope" "nope"
[11] "nope" "nope" "nope" "nope" "nope" "nope" "nope" "nope" "nope"
ifelse()
statements can also be nested. How might you write code to produce the output “sunny” for all cities with over 70 percent sunshine, and “kind of sunny” for cities with between 55 and 70 percent sunshine and “not sunny” for cities with less than 55 percent sunshine.
Click for solution
CityInfo$sunlevel <- ifelse(CityInfo$Sunshine_per > 70, "sunny", ifelse(CityInfo$Sunshine_per >
55, "kind of sunny", "not sunny"))
# check your work
cbind(CityInfo$Sunshine_per, CityInfo$sunlevel)
[,1] [,2]
[1,] "58" "kind of sunny"
[2,] "73" "sunny"
[3,] "66" "kind of sunny"
[4,] "54" "not sunny"
[5,] "58" "kind of sunny"
[6,] "48" "not sunny"
[7,] "61" "kind of sunny"
[8,] "56" "kind of sunny"
[9,] "58" "kind of sunny"
[10,] "53" "not sunny"
[11,] "69" "kind of sunny"
[12,] "47" "not sunny"
[13,] "85" "sunny"
[14,] "60" "kind of sunny"
[15,] "57" "kind of sunny"
[16,] "59" "kind of sunny"
[17,] "56" "kind of sunny"
[18,] "70" "kind of sunny"
[19,] "54" "not sunny"
Strings can be modified directly using the paste()
function. This can be useful for creating new columns or systematically changing string data.
Fr example, we know that the CityInfo$Year
column refers to the year each city was founded. Let’s add the string “Founded in” before each year name.
paste("Founded in", CityInfo$Year)
[1] "Founded in 1625" "Founded in 1781" "Founded in 1776" "Founded in 1803"
[5] "Founded in 1867" "Founded in 1845" "Founded in 1841" "Founded in 1682"
[9] "Founded in 1630" "Founded in 1701" "Founded in 1858" "Founded in 1851"
[13] "Founded in 1868" "Founded in 1843" "Founded in 1718" "Founded in 1837"
[17] "Founded in 1790" "Founded in 1896" "Founded in 1833"
Now let’s paste together the city and state names.
paste(CityInfo$City, CityInfo$State)
[1] "New York NY" "Los Angeles CA" "San Francisco CA" "Chicago IL"
[5] "Minneapolis MN" "Portland OR" "Dallas TX" "Philadelphia PA"
[9] "Boston MA" "Detroit MI" "Denver CO" "Seattle WA"
[13] "Phoenix AZ" "Atlanta GA" "New Orleans LA" "Houston TX"
[17] "Washington DC DC" "Miami FL" "Milwaukee WI"
An important feature of the paste()
function is that you can customize the characters separating each value.
paste("Actor", "Model", "Writer", sep = "-")
[1] "Actor-Model-Writer"
paste(CityInfo$City, CityInfo$State, sep = ": ")
[1] "New York: NY" "Los Angeles: CA" "San Francisco: CA"
[4] "Chicago: IL" "Minneapolis: MN" "Portland: OR"
[7] "Dallas: TX" "Philadelphia: PA" "Boston: MA"
[10] "Detroit: MI" "Denver: CO" "Seattle: WA"
[13] "Phoenix: AZ" "Atlanta: GA" "New Orleans: LA"
[16] "Houston: TX" "Washington DC: DC" "Miami: FL"
[19] "Milwaukee: WI"
A number of different types of charts can be made with the base graphics package.
plot()
function.plot(CityInfo$Populationmetro_2016, CityInfo$Sportsteam_num)
plot()
function, while setting the type to line.plot(CityInfo$Populationmetro_2016, CityInfo$Sportsteam_num, type = "l")
# This plot also conveys how R will sometimes give you an answer, even when the
# output is meaningless. It's up to you to distinguish meaningful results from
# nonsense.
hist()
function.hist(CityInfo$Bikecom_per)
boxplot()
function.boxplot(Bikecom_per ~ Region, data = CityInfo)
Adding descriptive labels is an important part of data visualization. In base R plotting, labels can be added with the main=
, xlab=
and ylab=
arguments.
boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region",
xlab = "Region", ylab = "Bike Commuters (%)")
Sometimes we also want to add annotations to plots noting things like sample size, key observations of interest, or summary statistics. Text can be added to an existing plot with thetext()
function. The position is set with x and y coordinates at the scale of the plot and the text can be customized in numerous ways.
boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region",
xlab = "Region", ylab = "Bike Commuters (%)")
text(1, 6, "Added text", col = "blue")
Text can also be added to plots based on the position of data points in the plot. Here, we add text above each regional boxplot.
boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region",
xlab = "Region", ylab = "Bike Commuters (%)")
text(CityInfo$Region, 6, "Added text")
Add text over each boxplot at the mean point on each plot.
boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region",
xlab = "Region", ylab = "Bike Commuters (%)")
text(1:4, aggregate(Bikecom_per ~ Region, FUN = mean, data = CityInfo)$Bikecom_per,
"Added text", cex = 0.7, col = "blue")
Label the bike commuter boxplot with the number of cities per region (that is, the n for each boxplot). Set the label up so that it will automatically update if the number of observations change. That is, find the number algorithmically rather than manually setting the number.
Click for solution
boxplot(Bikecom_per ~ Region, data = CityInfo, main = "Bike commuters by region",
xlab = "Region", ylab = "Bike Commuters (%)")
text(1:4, 6, paste("N=", aggregate(City ~ Region, FUN = length, data = CityInfo)$City,
sep = ""))
# aggregate(City~Region,FUN=length, data=CityInfo)
By adjusting graphical parameters such as point shape, color, and size, you can represent multiple relationships in a single figure. In the figures below, we can differentiate the cities’ regions using different shapes and color.
Below is a basic plot, with no customized graphical parameters.
plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per)
The pch=
argument sets the shape of the points. In this case, we define the point shape based on the region each city point is in.
plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, pch = c(CityInfo$Region))
The col=
argument sets the color of the points.
plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, col = CityInfo$Region)
The cex=
argument sets the size of the points.
plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, col = CityInfo$Region, pch = 19,
cex = 3)
In the figures above, how do we know which symbol or color refers to which region? We have to make a legend. Try using the legend()
to add a legend to your scatterplot. If you’re stuck, take a look at the legend()
help file.
Click for solution
What happened with the code below? How might we fix it?
plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, col = CityInfo$Region, pch = 19,
cex = 1.2)
legend(80, 7, legend = CityInfo$Region, fill = CityInfo$Region)
The Region
variable is a factor. This means it has unique properties we can use to our advantage., such as referring to the individual levels. It is also important to note that values are stored in the factor according to their level number, so if you want to convert the data to other classes, you must first adjust for this.
plot(CityInfo$Sunshine_per, CityInfo$Bikecom_per, col = CityInfo$Region, pch = 19,
cex = 1.2)
legend(80, 7, legend = levels(CityInfo$Region), fill = 1:length(levels(CityInfo$Region)),
cex = 0.7)
R has 657 built-in colors. You can view them by using the colors()
function. Try it!
colors()
How did I know how many named colors there are?
length(colors())
[1] 657
At any given time, R will use a specific palette whenever colors are required. You can view this with the palette()
function.
palette()
[1] "black" "red" "green3" "blue" "cyan" "magenta" "yellow"
[8] "gray"
This is the default palette. The colors are arranged sequentially, such that the first color used will be black, then red, then green3, etc. This is why in our plotting example above the colors could be referred to with their numerical index.
The RColorBrewer
package makes beautiful palettes for data visualization in R. Check out the palettes here: ColorBrewer2.0.
We can also explore a few more useful palettes.
rpal <- rainbow(4)
hpal <- heat.colors(4)
tpal <- topo.colors(4)
par(mfrow = c(2, 2))
plot(CityInfo$Populationmetro_2016, CityInfo$Bikecom_per, col = CityInfo$Region,
pch = 19, cex = 1.2)
plot(CityInfo$Populationmetro_2016, CityInfo$Bikecom_per, col = rpal[CityInfo$Region],
pch = 19, cex = 1.2)
plot(CityInfo$Populationmetro_2016, CityInfo$Bikecom_per, col = hpal[CityInfo$Region],
pch = 19, cex = 1.2)
plot(CityInfo$Populationmetro_2016, CityInfo$Bikecom_per, col = tpal[CityInfo$Region],
pch = 19, cex = 1.2)
par(mfrow = c(1, 1))
There are 25 different shapes available in base R. Some can have different colors for the border and fill, try them out as you continue exploring plotting in R. Read more here
The pairs()
function quickly shows the correlations between all variables in a dataframe. This can reveal potential combinations of variables that warrant further investigation.
pairs(CityInfo[4:8])