Getting started with ggplot2

Unlike plotting with base R, ggplot2 builds a figure by adding layers to a plot one at a time. Most layers are geoms (geometric objects such as points, lines, or bars); axes, labels, and other components are added in the same way. The variables that a geom displays are specified with the aes() function, and its appearance can be customized further.

library(ggplot2)

Creating a plot in layers

In ggplot2, figures are created one layer at a time. Let’s work with the diamonds dataset to explore ggplot2’s functions.

ggplot(diamonds, aes(carat, price))

This output gives us an empty plot. Notice that the axis limits are already set from the variables referenced in aes(), even though no points are plotted yet.

Now we can add in points with the geom_point() function.

ggplot(diamonds, aes(carat, price)) + geom_point()
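
Because each + returns a new plot object, the layered structure can be inspected directly. The following is a small sketch rather than part of the original exercise; the object names p_empty, p_points, and p_both are made up for illustration.

```r
library(ggplot2)

# Data and aesthetic mappings only: no layers yet
p_empty <- ggplot(diamonds, aes(carat, price))

# Adding a geom returns a new plot object with one layer
p_points <- p_empty + geom_point()

# Adding a second geom stacks another layer on top
p_both <- p_points + geom_smooth()

# Count the layers stored in each plot object
length(p_empty$layers)   # 0
length(p_points$layers)  # 1
length(p_both$layers)    # 2
```

Storing intermediate plots like this makes it easy to reuse one base plot with several different layers.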

You can also plot multiple layers of data on the same plot.


ggplot(diamonds, aes(carat, price)) + geom_point() + geom_smooth()

Let’s finish the plot with some descriptive labels.


ggplot(diamonds, aes(carat, price)) + geom_point() + geom_smooth() + ggtitle("Diamond prices by carat") + 
    labs(x = "Carat", y = "Price (USD)")

Now you’ve mastered building plots in layers.

Customizing aesthetics

Many aesthetic parameters can be customized in ggplot2, including color, fill, linetype, size, shape, font, and more. Which ones apply depends on the geom you are working with. We will explore some of these graphical parameters further as this tutorial introduces different geoms. Here is a vignette about aesthetic customization in ggplot2.

Different geoms


geom_col()
geom_point()
geom_line()
geom_smooth()
geom_histogram()
geom_boxplot()
geom_text()
geom_density()
geom_errorbar()
geom_hline()
geom_abline()

Bar plots

Bar plots are great for showing frequencies or proportions across different groups. Let’s return to the otter dataset we analyzed last week. First, let’s calculate the total number of otters observed per site, then plot the totals as a bar graph with ggplot2.


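# stringsAsFactors = TRUE keeps site_name as a factor (the default before R 4.0)
otter <- read.csv("https://maddiebrown.github.io/ANTH630/data/sea_otter_counts_2017&2018_CLEANDATA.csv", 
    stringsAsFactors = TRUE)
# the formula is passed unnamed because the name of this argument differs across R versions
notterpersite <- aggregate(n_otter ~ site_name, FUN = sum, data = otter)
ggplot(notterpersite, aes(x = site_name, y = n_otter)) + geom_col()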

Try it

Hmm…there are way too many columns in the plot above to make it a meaningful barchart. Subset out the top 5 sites with the most otter sightings and recreate the above chart.

Solution

notterpersite
           site_name n_otter
1       Big Clam Bay       1
2       Big Tree Bay     161
3     Blanquizal Bay     222
4     Chusini Cove 1     108
5     Chusini Cove 2     140
6       Dunbar Inlet      37
7       Farallon Bay       0
8        Garcia Cove     135
9   Goat Mouth Inlet       0
10      Guktu Cove 1     134
11      Guktu Cove 2     178
12      Hauti Island      47
13        Hetta Cove       0
14        Kaguk Cove     283
15      Kinani Point     182
16   Mushroom Island       0
17 N Fish Egg Island      47
18   Natzuhini Bay 1       0
19   Natzuhini Bay 2       0
20   Natzuhini Bay 3       0
21       Naukati Bay      53
22      North Pass 1       0
23      North Pass 2       0
24      Nossuk Bay 1      87
25      Nossuk Bay 2      44
26      Nossuk Bay 3      47
27      Port Caldera       0
28      Port Refugio       2
29 S Fish Egg Island      60
30 S Wadleigh Island      34
31               S16     197
32               S23      65
33               S25       1
34                S3      13
35               S32      83
36               S33     107
37   Salt Lake Bay 1     208
38   Salt Lake Bay 2     166
39               SHB       1
40     Shinaku Inlet     246
41          Soda Bay      96
42   Sukkwan Narrows       1
43     Trocadero Bay       2
notterpersite <- notterpersite[order(notterpersite$n_otter, decreasing = TRUE), ]
top5 <- notterpersite[1:5, ]
ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col()

Alright, this figure is looking better. But suppose we want to reorder the categories on the X-axis such that they are in order from fewest to greatest number of otter sightings?

The site names are currently a factor column. This means the variable has a defined set of levels, and ggplot2 arranges the categories on the axis in level order. We can reorder the levels of the factor according to the value in the n_otter column. First, inspect the structure of the site_name data. How many factor levels are there? Why do you think this is the case?

Our first step is to drop all the unused levels. You’ll notice that the remaining levels are in alphabetical order.

str(top5$site_name)
 Factor w/ 43 levels "Big Clam Bay",..: 14 40 3 37 31
top5$site_name <- droplevels(top5$site_name)
top5$site_name <- reorder(top5$site_name, top5$n_otter)
p1 <- ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col()
p1

You can also reorder factors with the levels argument of the factor() function.
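
As a rough sketch of that alternative, the example below rebuilds the factor with an explicit level order. The data frame top5_demo is a hypothetical stand-in that copies the five largest counts from the printed table, so the example runs without downloading the otter data.

```r
# Toy copy of the top five sites, using the counts printed earlier
top5_demo <- data.frame(
  site_name = c("Kaguk Cove", "Shinaku Inlet", "Blanquizal Bay", "Salt Lake Bay 1", "S16"),
  n_otter = c(283, 246, 222, 208, 197)
)

# Site names sorted from fewest to most sightings
site_order <- top5_demo$site_name[order(top5_demo$n_otter)]

# Pass that ordering to the levels argument of factor()
top5_demo$site_name <- factor(top5_demo$site_name, levels = site_order)

levels(top5_demo$site_name)
```

Compared with reorder(), this approach is handy when the desired order is not determined by another column, for example a custom ordering typed out by hand.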

Working with axes and labels

ggplot2 chooses default x and y limits, but sometimes we want to adjust these ranges manually. Let’s set the upper y-axis limit of our otter plot to 500. In practice we wouldn’t want to adjust the axes in this case, but it illustrates the principle.

p1 + ylim(0, 500)
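
One caveat when narrowing limits rather than widening them: ylim() sets the limits of the scale, so values outside the range are dropped, while coord_cartesian() only zooms the visible window. The sketch below illustrates the difference with a hypothetical toy data frame (counts) built from the five largest counts printed earlier.

```r
library(ggplot2)

# Toy data: counts for the top five sites printed earlier
counts <- data.frame(
  site_name = c("Kaguk Cove", "Shinaku Inlet", "Blanquizal Bay", "Salt Lake Bay 1", "S16"),
  n_otter = c(283, 246, 222, 208, 197)
)
p_demo <- ggplot(counts, aes(x = site_name, y = n_otter)) + geom_col()

# ylim() changes the scale limits: the bar taller than 250 falls outside
# the scale and is removed (with a warning) when the plot is drawn
p_limited <- p_demo + ylim(0, 250)

# coord_cartesian() only changes the visible window, so all five bars are kept
# and the tallest one is simply cut off at the top of the panel
p_zoomed <- p_demo + coord_cartesian(ylim = c(0, 250))
```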

We can also flip the x and y coordinates.

p1 + coord_flip()

Try it

Using the breaks and labels arguments in the scale_y_continuous() function, change the y-axis labels in the otter plot to any text of your choosing. Finish your plot with a ggtitle().

Solution

p1 + scale_y_continuous(breaks = c(50, 100, 150, 200, 250, 300), labels = c("a few", 
    "some", "more", "a bunch", "a lot", "a ton")) + ggtitle("Otter Counts")

Working with colors

With ggplot you can customize the colors of different components of the graph in numerous ways. Let’s work with the otters graph we just made to see how ggplot understands color arguments.

Try it

  1. Using the fill and color arguments in geom_col, add colors to the otter plot.

  2. What happens if you put the same arguments inside aes() in the ggplot() call?

  3. What happens if you assign “fill” to the site name variable?

Solution

ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col(fill = "firebrick", color = "lightsalmon2")


ggplot(top5, aes(x = site_name, y = n_otter, fill = "firebrick", color = "lightsalmon2")) + 
    geom_col()


ggplot(top5, aes(x = site_name, y = n_otter, fill = site_name)) + geom_col()


ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col(aes(fill = site_name))

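There is a major difference between assigning color inside aes() and outside of it. Inside aes(), the value is treated as data and mapped to a color scale, so “firebrick” becomes a one-level category that ggplot2 colors with its own defaults and lists in a legend. Outside aes(), the value is applied directly as the color.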
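
One way to check this is to look at the fill values ggplot2 actually computes for each version. This is a minimal sketch with a hypothetical three-row data frame (counts) taken from the printed otter totals; layer_data() is the ggplot2 helper that returns the data a layer will be drawn from.

```r
library(ggplot2)

counts <- data.frame(
  site_name = c("Kaguk Cove", "Shinaku Inlet", "S16"),
  n_otter = c(283, 246, 197)
)

# Setting: fill is a fixed property of the layer, so the bars are drawn in firebrick
p_set <- ggplot(counts, aes(site_name, n_otter)) + geom_col(fill = "firebrick")

# Mapping: "firebrick" is treated as a data value with a single category,
# so ggplot2 picks a colour from its default palette and adds a legend
p_map <- ggplot(counts, aes(site_name, n_otter, fill = "firebrick")) + geom_col()

# Compare the fill values computed for each plot
unique(layer_data(p_set)$fill)  # "firebrick"
unique(layer_data(p_map)$fill)  # a default palette colour, not "firebrick"
```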

Try it

Stacked bar chart

Suppose we wanted to take the top 5 otter sites and show the observation counts for each year. Try using the fill argument to create this figure. You will also likely need to re-aggregate the data so that it keeps the three columns of interest.

Solution

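notterperyearsite <- aggregate(n_otter ~ site_name + year, FUN = sum, data = otter)
# keep the top 5 sites; factor(year) gives each year its own discrete fill color
top5peryear <- notterperyearsite[notterperyearsite$site_name %in% top5$site_name, ]
ggplot(top5peryear, aes(site_name, n_otter, fill = factor(year))) + geom_col()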

Working with color palettes

What if we want to use color to represent a scale?

# colors mapped to categorical variable

ggplot(diamonds, aes(carat, price, color = cut)) + geom_point()


# ggplot creates an automatic scale
ggplot(diamonds, aes(x, carat, color = price)) + geom_point()


# you can add a manual scale with two colors on either end
ggplot(diamonds, aes(x, carat, color = price)) + geom_point() + scale_colour_gradient(low = "lightpink", 
    high = "darkmagenta")


# you can add a manual scale with two colors on either end and a clear midpoint
ggplot(diamonds, aes(x, carat, color = price)) + geom_point() + scale_colour_gradient2(low = "deeppink4", 
    mid = "white", high = "lightblue", midpoint = median(diamonds$price))

There are also a variety of built-in palettes in R that are useful.


# viridis is great for making color-blind friendly plots
ggplot(diamonds, aes(carat, price, color = price)) + geom_point() + scale_colour_viridis_c()


# color brewer has convenient built-in palettes and is also useful for base R plotting
ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_brewer(palette = "Oranges")


ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_brewer(palette = "Set3")


# we can also make a custom rainbow palette, as we did with base R last week.
# scale_colour_manual() assigns the palette colors to the clarity levels
rpal <- rainbow(8)
ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_manual(values = rpal)

Adding text

We can add text to a plot using the annotate() function.

ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_manual(values = rpal) + 
    annotate("text", x = 4, y = 5000, label = "Diamond prices")

Try it

Using the same diamonds plot as before, make a new text annotation in the upper left corner that is large, bold, set in a serif font, and a color other than black.

Solution

ggplot(diamonds, aes(carat, price, color = clarity)) + geom_point() + scale_colour_manual(values = rpal) + 
    annotate("text", x = 1, y = 15000, label = "Diamond prices", family = "serif", 
        fontface = "bold", size = 6, color = "purple")

Adding text based on values in plot

We can also add text labels to points, bars, or other values in a plot.

ggplot(top5, aes(x = site_name, y = n_otter)) + geom_col(aes(fill = site_name)) + 
    geom_text(aes(label = n_otter), vjust = 1.5, color = "bisque")

Facet plots

Sometimes it is helpful to split one plot into multiple panels that share the same axes, with the data grouped by another variable. This is called faceting.

library(MASS)

ggplot(Cars93, aes(Horsepower, Price)) + geom_point() + facet_wrap(~Type)
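
The panel layout and axis sharing can be adjusted through arguments of facet_wrap(). The sketch below shows two common variations using the same Cars93 data; the object names p_wrap and p_free are only for illustration.

```r
library(ggplot2)
library(MASS)  # provides the Cars93 dataset

# One panel per vehicle type, arranged in two rows
p_wrap <- ggplot(Cars93, aes(Horsepower, Price)) + geom_point() + 
    facet_wrap(~Type, nrow = 2)

# Same panels, but each one gets its own y-axis range instead of a shared one
p_free <- ggplot(Cars93, aes(Horsepower, Price)) + geom_point() + 
    facet_wrap(~Type, scales = "free_y")
```

Shared axes make panels directly comparable, so free scales are usually best reserved for cases where the groups differ so much in range that detail would otherwise be lost.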

Customizing the legend

ggplot2 automatically produces a legend for aesthetics mapped inside aes(). Often we need to customize this legend further.

ggplot(Cars93, aes(Horsepower, Price, color = Type)) + geom_point()

You can change the order of items in the legend.

ggplot(Cars93, aes(Horsepower, Price, color = Type)) + geom_point() + scale_color_discrete(breaks = c("Van", 
    "Compact", "Large", "Midsize", "Small", "Sporty"))

You can also hide titles or the legend as a whole. More information is available in the [Cookbook for R](http://www.cookbook-r.com/Graphs/Legends_(ggplot2)).

ggplot(Cars93, aes(Horsepower, Price, color = Type)) + geom_point() + guides(color = guide_legend(title = NULL))

ggplot(Cars93, aes(Horsepower, Price, color = Type)) + geom_point() + guides(color = "none")
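
Legends can also be adjusted through labs() and theme(). The following sketch shows a few such adjustments on the same Cars93 plot; the title text "Vehicle type" and the object names are made up for this example.

```r
library(ggplot2)
library(MASS)  # provides the Cars93 dataset

p_legend <- ggplot(Cars93, aes(Horsepower, Price, color = Type)) + geom_point()

# Rename the legend title through the label of the color aesthetic
p_titled <- p_legend + labs(color = "Vehicle type")

# Move the legend below the plot
p_bottom <- p_legend + theme(legend.position = "bottom")

# Remove every legend at once through the theme
p_hidden <- p_legend + theme(legend.position = "none")
```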

Putting multiple plots together

When plotting in base R, we can change the parameters of the graphics output to include multiple plots. This is accomplished with par(mfrow = c(nrow, ncol)), where nrow and ncol are the number of rows and columns of plots. It is usually best practice to reset the parameters after changing them, as otherwise all subsequent plots will also follow those parameters.

par(mfrow = c(1, 2))
boxplot(cars$speed, main = "Car speeds")
boxplot(cars$speed, main = "Car speeds")
abline(h = mean(cars$speed), col = "red", lty = 2)
abline(h = median(cars$speed), col = "blue", lty = 2)

par(mfrow = c(1, 1))

Putting multiple plots together with ggplot

ggplot2 figures are not affected by par(mfrow), because ggplot2 draws with the grid graphics system rather than base graphics. Instead, we can use functions from gridExtra to place multiple plots on the same output page. Here is the example from today’s lecture slides, where two bar charts are plotted next to one another (don’t copy the poor y-axis limits seen here!).

library(gridExtra)
# example adapted from field 2012, common idea in data visualization
samp <- data.frame(month = c("May", "June"), beachvisitors = c(100, 175))
p1 <- ggplot(data = samp, aes(x = month, y = beachvisitors)) + geom_col() + ggtitle("Beach visitors per month")
p2 <- ggplot(data = samp, aes(x = month, y = beachvisitors)) + geom_col() + ylim(0, 
    700) + ggtitle("Beach visitors per month")
grid.arrange(p1, p2, nrow = 1)

Saving graphical output

After making a beautiful graph, you can save it in a couple of different ways. First, in RStudio you can save the plot directly from the plots tab. Otherwise, you can also use several functions for saving graphical output.

samp <- data.frame(month = c("May", "June"), beachvisitors = c(100, 175))
p1 <- ggplot(data = samp, aes(x = month, y = beachvisitors)) + geom_col() + ggtitle("Beach visitors per month")

# save as a pdf
# pdf('beach.pdf')
# ggplot(data=samp, aes(x=month, y=beachvisitors)) + geom_col() + ggtitle('Beach visitors per month')
# dev.off()

# saves the last ggplot to a file in the working directory
# ggsave('beach2.pdf')
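
By default ggsave() saves the last plot displayed, at the size of the current graphics device. For reproducible output it can help to pass the plot and dimensions explicitly. The sketch below writes to a temporary file; the file name beach_demo.pdf and the 6 x 4 inch size are arbitrary choices for illustration.

```r
library(ggplot2)

samp <- data.frame(month = c("May", "June"), beachvisitors = c(100, 175))
p_beach <- ggplot(samp, aes(x = month, y = beachvisitors)) + geom_col() + 
    ggtitle("Beach visitors per month")

# Write to a temporary folder so the example does not clutter the working directory
out_file <- file.path(tempdir(), "beach_demo.pdf")

# Pass the plot explicitly and fix the size so the output does not depend on the device
ggsave(out_file, plot = p_beach, width = 6, height = 4, units = "in")

file.exists(out_file)
```

The file type is inferred from the extension, so changing the file name to end in .png would produce a PNG instead.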

Themes


ggplot(diamonds, aes(carat, price)) + geom_point() + theme_bw()


ggplot(diamonds, aes(carat, price)) + geom_point() + theme_classic()


ggplot(diamonds, aes(carat, price)) + geom_point() + theme_minimal()

Even more themes are available in the ggthemes package. Here is a helpful gallery.

library(ggthemes)

ggplot(diamonds, aes(carat, price)) + geom_point() + theme_wsj()


ggplot(diamonds, aes(carat, price)) + geom_point() + theme_fivethirtyeight()


ggplot(diamonds, aes(carat, price)) + geom_point() + theme_gdocs()

Further practice

An interesting exploratory data analysis and graphing tutorial using median household incomes in Maryland is linked here.

A pretty bubble plot tutorial can be found here.