Unlike plotting with base R, ggplot2 builds a figure by adding layers to a plot one at a time. The data layers are called geoms and draw points, lines, bars, text, and other marks; axes, labels, and themes are added to the plot in the same way. The aes() function maps variables in the data to the visual properties (aesthetics) of the geoms, such as position, color, and size.
library(ggplot2)
In ggplot2, figures are created one layer at a time. Let’s work with the diamonds dataset to explore ggplot2’s functions.
ggplot(diamonds, aes(carat, price))
This output gives us an empty plot. Notice that the axis limits are already set from the variables referenced in aes(), even though no points are plotted yet.
Now we can add in points with the geom_point() function.
ggplot(diamonds, aes(carat, price)) + geom_point()
You can also plot multiple layers of data on the same plot.
ggplot(diamonds, aes(carat, price)) + geom_point() + geom_smooth()
Let’s finish the plot with some descriptive labels.
ggplot(diamonds, aes(carat, price)) + geom_point() + geom_smooth() + ggtitle("Diamond prices by carat") +
labs(x = "Carat", y = "Price (USD)")
Now you’ve mastered building plots in layers.
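Because each + returns a new plot object, a plot can be stored in a variable and extended later. The following is a minimal sketch of that idea; the variable names (base_plot, with_points, with_smooth) are made up for illustration.

```r
library(ggplot2)

# Store the empty plot: data and aesthetic mapping, but no geoms yet.
base_plot <- ggplot(diamonds, aes(carat, price))

# Each `+` returns a new plot object with one more layer added.
with_points <- base_plot + geom_point(alpha = 0.2)
with_smooth <- with_points + geom_smooth()

# Printing a plot object draws it; base_plot itself is unchanged.
with_smooth
```

Storing intermediate plots this way makes it easy to try several variations without retyping the shared parts.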
Many aesthetic parameters can be customized in ggplot2 plots, including color, fill, linetype, size, shape, font, and more. Which ones are available depends on the geom you are working with. We will explore some of these graphical parameters further as this tutorial introduces different geoms. Here is a vignette about aesthetic customization in ggplot2.
geom_col()
geom_point()
geom_line()
geom_smooth()
geom_histogram()
geom_boxplot()
geom_text()
geom_density()
geom_errorbar()
geom_hline()
geom_abline()
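As a rough illustration of how the available aesthetics differ between geoms, the sketch below styles a point layer and a smoothing layer with different parameters. The random subset, its size, and the specific colors are arbitrary choices made for this example.

```r
library(ggplot2)

# Use a random subset so the example draws quickly (the sample size is arbitrary).
set.seed(1)
small <- diamonds[sample(nrow(diamonds), 2000), ]

styled <- ggplot(small, aes(carat, price)) +
  # point geoms understand shape, size, alpha, and color
  geom_point(shape = 17, size = 2, alpha = 0.4, color = "steelblue") +
  # line-like geoms understand linetype and color
  geom_smooth(linetype = "dashed", color = "black")
styled
```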
Bar plots are great for showing frequencies or proportions across different groups. Let’s return to the otter dataset we analyzed last week. First, let’s calculate the total number of otters observed per site and then plot this in a bar graph with ggplot2.
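# stringsAsFactors = TRUE keeps site_name as a factor (the default changed to FALSE in R 4.0)
otter <- read.csv("https://maddiebrown.github.io/ANTH630/data/sea_otter_counts_2017&2018_CLEANDATA.csv", stringsAsFactors = TRUE)
# total otters per site; the formula is passed unnamed so the call works across R versions
notterpersite <- aggregate(n_otter ~ site_name, FUN = sum, data = otter)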
ggplot(notterpersite, aes(x = site_name, y = n_otter)) + geom_col()
Hmm…there are way too many columns in the plot above to make it a meaningful barchart. Subset out the top 5 sites with the most otter sightings and recreate the above chart.
Solution:
notterpersite
site_name n_otter
1 Big Clam Bay 1
2 Big Tree Bay 161
3 Blanquizal Bay 222
4 Chusini Cove 1 108
5 Chusini Cove 2 140
6 Dunbar Inlet 37
7 Farallon Bay 0
8 Garcia Cove 135
9 Goat Mouth Inlet 0
10 Guktu Cove 1 134
11 Guktu Cove 2 178
12 Hauti Island 47
13 Hetta Cove 0
14 Kaguk Cove 283
15 Kinani Point 182
16 Mushroom Island 0
17 N Fish Egg Island 47
18 Natzuhini Bay 1 0
19 Natzuhini Bay 2 0
20 Natzuhini Bay 3 0
21 Naukati Bay 53
22 North Pass 1 0
23 North Pass 2 0
24 Nossuk Bay 1 87
25 Nossuk Bay 2 44
26 Nossuk Bay 3 47
27 Port Caldera 0
28 Port Refugio 2
29 S Fish Egg Island 60
30 S Wadleigh Island 34
31 S16 197
32 S23 65
33 S25 1
34 S3 13
35 S32 83
36 S33 107
37 Salt Lake Bay 1 208
38 Salt Lake Bay 2 166
39 SHB 1
40 Shinaku Inlet 246
41 Soda Bay 96
42 Sukkwan Narrows 1
43 Trocadero Bay 2
notterpersite <- notterpersite[order(notterpersite$n_otter, decreasing = TRUE), ]
top5 <- notterpersite[1:5, ]
ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col()
Alright, this figure is looking better. But suppose we want to reorder the categories on the x-axis so that they run from the fewest to the most otter sightings.
The site names are currently a factor column, which means the variable has a fixed set of levels that determines the plotting order. We can reorder the levels of the factor according to the value in the n_otter column. First, inspect the structure of the site_name data. How many factor levels are there? Why do you think this is the case?
Our first step is to drop all the unused levels. You’ll notice that the remaining levels are in alphabetical order.
str(top5$site_name)
Factor w/ 43 levels "Big Clam Bay",..: 14 40 3 37 31
top5$site_name <- droplevels(top5$site_name)
top5$site_name <- reorder(top5$site_name, top5$n_otter)
p1 <- ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col()
p1
You can also reorder factors with the levels argument of the factor() function.
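A small sketch of the factor() approach, using a made-up data frame (counts, with hypothetical sites "A", "B", and "C") rather than the otter data:

```r
# Toy data: hypothetical site names and counts, not the otter dataset.
counts <- data.frame(site = c("B", "A", "C"), n = c(30, 10, 20),
                     stringsAsFactors = FALSE)

# Option 1: list the levels explicitly in the order they should be drawn.
manual <- factor(counts$site, levels = c("C", "B", "A"))
levels(manual)

# Option 2: derive the order from the data, here sites sorted by increasing count.
by_count <- factor(counts$site, levels = counts$site[order(counts$n)])
levels(by_count)
```

The second form gives the same ordering that reorder() produced above, but spells out where the order comes from.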
ggplot2 will usually choose sensible default x and y limits, but sometimes we want to adjust these ranges manually. Let’s set the upper y limit of our otter plot to 500. In reality, we wouldn’t want to adjust the axes in this case, but it illustrates the principle.
p1 + ylim(0, 500)
We can also flip the x and y coordinates.
p1 + coord_flip()
Using the breaks and labels arguments in the scale_y_continuous() function, change the labels in the otter plot to any text of your choosing. Finish your plot with a ggtitle().
Solution:
p1 + scale_y_continuous(breaks = c(50, 100, 150, 200, 250, 300), labels = c("a few",
"some", "more", "a bunch", "a lot", "a ton")) + ggtitle("Otter Counts")
With ggplot you can customize the colors of different components of the graph in numerous ways. Let’s work with the otters graph we just made to see how ggplot understands color arguments.
Using the fill and color arguments in geom_col, add colors to the otter plot.
What happens if you put the same arguments into the aes() argument of the ggplot() function?
What happens if you assign “fill” to the site name variable?
Solution:
ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col(fill = "firebrick", color = "lightsalmon2")
ggplot(top5, aes(x = site_name, y = n_otter, fill = "firebrick", color = "lightsalmon2")) +
geom_col()
ggplot(top5, aes(x = site_name, y = n_otter, fill = site_name)) + geom_col()
ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col(aes(fill = site_name))
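There is a major difference between assigning color inside aes() and outside of it. Outside aes(), a value such as "firebrick" is used directly as the color. Inside aes(), the same value is treated as data: ggplot2 maps it to a color from its default palette and adds a legend entry for it, so the bars are not drawn in firebrick.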
Stacked bar chart
Suppose we wanted to take the top 5 otter sites and show the different observation counts for each year. Try using the fill argument to create this figure. You will also likely need to re-aggregate the data so that it includes all three columns of interest.
Solution:
notterperyearsite <- aggregate(n_otter ~ site_name + year, FUN = sum, data = otter)
# keep only the top 5 sites; treat year as a category so each year gets its own fill color
top5peryear <- notterperyearsite[notterperyearsite$site_name %in% top5$site_name, ]
ggplot(top5peryear, aes(site_name, n_otter, fill = factor(year))) + geom_col()
What if we want to use color to represent a scale?
# colors mapped to categorical variable
ggplot(diamonds, aes(carat, price, color = cut)) + geom_point()
# ggplot creates an automatic scale
ggplot(diamonds, aes(x, carat, color = price)) + geom_point()
# you can add a manual scale with two colors on either end
ggplot(diamonds, aes(x, carat, color = price)) + geom_point() + scale_colour_gradient(low = "lightpink",
high = "darkmagenta")
# you can add a manual scale with two colors on either end and a clear midpoint
ggplot(diamonds, aes(x, carat, color = price)) + geom_point() + scale_colour_gradient2(low = "deeppink4",
mid = "white", high = "lightblue", midpoint = median(diamonds$price))
There are also a variety of built-in palettes in R that are useful.
# viridis is great for making color-blind friendly plots
ggplot(diamonds, aes(carat, price, color = price)) + geom_point() + scale_colour_viridis_c()
# ColorBrewer has convenient built-in palettes and is also useful for base R plotting
ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_brewer(palette = "Oranges")
ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_brewer(palette = "Set3")
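# we can also make a custom rainbow palette, as we did with base R last week.
rpal <- rainbow(8)
# pass the palette to a manual scale so each clarity level gets one of the rainbow colors
ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_manual(values = rpal)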
Adding text
We can add text to a plot using the annotate() function.
ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_manual(values = rpal) +
annotate("text", x = 4, y = 5000, label = "Diamond prices")
Using the same diamonds plot as before, make a new text annotation in the upper left corner that is large, bold, in a serif font, and in a color other than black.
Solution:
ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_manual(values = rpal) +
annotate("text", x = 1, y = 15000, label = "Diamond prices", family = "serif",
fontface = "bold", size = 6, color = "purple")
We can also add text labels to points, bars, or other values in a plot.
ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col(aes(fill = site_name)) +
geom_text(aes(label = n_otter), vjust = 1.5, color = "bisque")
Sometimes it is helpful to split a plot into multiple panels that share the same axes, with one panel for each level of another variable. This is called faceting.
library(MASS)
ggplot(Cars93, aes(Horsepower, Price)) + geom_point() + facet_wrap(~Type)
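The layout of the panels can also be adjusted. The sketch below, with illustrative variable names, uses the ncol and scales arguments of facet_wrap() to set the number of columns and to let each panel pick its own y-axis range.

```r
library(ggplot2)
library(MASS)  # provides the Cars93 data

# One panel per car type, laid out in 3 columns with shared axes.
by_type <- ggplot(Cars93, aes(Horsepower, Price)) +
  geom_point() +
  facet_wrap(~Type, ncol = 3)
by_type

# Same panels, but each one chooses its own y-axis range.
by_type_free <- ggplot(Cars93, aes(Horsepower, Price)) +
  geom_point() +
  facet_wrap(~Type, ncol = 3, scales = "free_y")
by_type_free
```

Shared axes make panels directly comparable, while free scales make it easier to see the pattern within each panel.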
ggplot2 automatically produces a legend whenever an aesthetic such as color is mapped to a variable. Often we need to customize this legend further.
ggplot(Cars93, aes(Horsepower, Price, color = Type)) + geom_point()
You can change the order of items in the legend.
ggplot(Cars93, aes(Horsepower, Price, color = Type)) + geom_point() + scale_color_discrete(breaks = c("Van",
"Compact", "Large", "Midsize", "Small", "Sporty"))
You can also hide titles or the legend as a whole. More information is available in the [Cookbook for R](http://www.cookbook-r.com/Graphs/Legends_(ggplot2)).
ggplot(Cars93, aes(Horsepower, Price, color = Type)) + geom_point() + guides(color = guide_legend(title = NULL))
# guides(color = FALSE) is deprecated in recent ggplot2 versions; use "none" instead
ggplot(Cars93, aes(Horsepower, Price, color = Type)) + geom_point() + guides(color = "none")
When plotting in base R, we can change the parameters of the graphics output to include multiple plots. This is accomplished with par(mfrow = c(nrows, ncols)), where nrows and ncols are the number of rows and columns of plots. It is usually best practice to reset the parameters after changing them, as otherwise all subsequent plots will also follow those parameters.
par(mfrow = c(1, 2))
boxplot(cars$speed, main = "Car speeds")
boxplot(cars$speed, main = "Car speeds")
abline(h = mean(cars$speed), col = "red", lty = 2)
abline(h = median(cars$speed), col = "blue", lty = 2)
par(mfrow = c(1, 1))
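An alternative to resetting mfrow by hand is to save the previous settings and restore them afterwards. par() returns the old values of any parameters it changes, so the sketch below (old_par is just an illustrative name) works regardless of which layout was active before.

```r
# par() returns the previous settings invisibly, so they can be saved and restored.
old_par <- par(mfrow = c(1, 2))

hist(cars$speed, main = "Car speeds")
hist(cars$dist, main = "Stopping distances")

# Restore whatever layout was active before the change.
par(old_par)
```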
ggplot2 draws its plots with the grid graphics system, so changing par(mfrow) has no effect on them. Instead, we can use functions from gridExtra to put multiple different plots in the same output page. Here is the example from today’s lecture slides, where two bar charts are plotted next to one another (don’t copy the poor y-axis limits seen here!).
library(gridExtra)
# example adapted from field 2012, common idea in data visualization
samp <- data.frame(month = c("May", "June"), beachvisitors = c(100, 175))
p1 <- ggplot(data = samp, aes(x = month, y = beachvisitors)) + geom_col() + ggtitle("Beach visitors per month")
p2 <- ggplot(data = samp, aes(x = month, y = beachvisitors)) + geom_col() + ylim(0,
700) + ggtitle("Beach visitors per month")
grid.arrange(p1, p2, nrow = 1)
After making a beautiful graph, you can save it in a couple of different ways. First, in RStudio you can save the plot directly from the plots tab. Otherwise, you can also use several functions for saving graphical output.
samp <- data.frame(month = c("May", "June"), beachvisitors = c(100, 175))
p1 <- ggplot(data = samp, aes(x = month, y = beachvisitors)) + geom_col() + ggtitle("Beach visitors per month")
# save as a pdf
# pdf('beach.pdf')
# ggplot(data=samp, aes(x=month, y=beachvisitors)) + geom_col() + ggtitle('Beach visitors per month')
# dev.off()
# saves the last ggplot to a file in the working directory
# ggsave('beach2.pdf')
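ggsave() can also be given the plot object and the output dimensions explicitly, which avoids depending on whichever plot was displayed last. The sketch below writes to a temporary directory so the working directory is left untouched; the file name and the 6 by 4 inch size are arbitrary choices for this example.

```r
library(ggplot2)

samp <- data.frame(month = c("May", "June"), beachvisitors = c(100, 175))
p <- ggplot(samp, aes(x = month, y = beachvisitors)) +
  geom_col() +
  ggtitle("Beach visitors per month")

# Write to a temporary location (hypothetical file name).
out_file <- file.path(tempdir(), "beach_example.pdf")

# Passing `plot =` saves this specific plot; the format is inferred from the extension.
ggsave(out_file, plot = p, width = 6, height = 4, units = "in")
file.exists(out_file)
```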
ggplot(diamonds, aes(carat, price)) + geom_point() + theme_bw()
ggplot(diamonds, aes(carat, price)) + geom_point() + theme_classic()
ggplot(diamonds, aes(carat, price)) + geom_point() + theme_minimal()
Even more themes are available with ggthemes. Here is a helpful gallery.
library(ggthemes)
ggplot(diamonds, aes(carat, price)) + geom_point() + theme_wsj()
ggplot(diamonds, aes(carat, price)) + geom_point() + theme_fivethirtyeight()
ggplot(diamonds, aes(carat, price)) + geom_point() + theme_gdocs()