This course introduces graduate students to theory and methods in quantitative anthropological and archaeological research. This is accomplished through three main themes threaded throughout the semester: 1) asking quantitative questions in anthropology, 2) statistics / data science theory, and 3) data analysis and management. In the first theme, we consider how to design quantitative anthropological research studies, scaling from question design through ethical best practices. In the second theme, students will be introduced to basic statistical theories and contemporary trends in quantitative social science. Finally, the third theme creates space for students to apply new skills in data analysis, management and visualization using R. Collectively, these themes enable students to build theoretical and methodological foundations for conducting independent quantitative anthropological research.
Below are webpages with answer guides and handouts from the books used in this course. You are on your honor to not look up Problem Set answers online. All solutions must be presented in your own words.
No class session this week. Instead, find a time to meet with your peer review partner to discuss your final project progress. Details in ELMS.
sf package and spatial joins
Note: This week alone you may submit the Problem Set as a Word document. In subsequent weeks you must submit a clean, knit .html file documenting your results. Code printouts alone will not be accepted.
library(datasets)
data(beavers)
beaver1
The data should be in the datasets package (it ships with R and is usually attached already; if not, load it with library(datasets)).
dplyr in this chapter. Don’t worry if pipes don’t make sense yet; we will cover them in detail when we cover tidyverse. You can complete this Problem Set with either base R or dplyr.
Your research prospectus should be a 250-500 word description of your research project. In your description, please include the following: 1) your proposed question/hypothesis; 2) what methods you plan to use, including your sampling strategy; 3) why this study is significant (both intellectually and in the “real world”). In addition to the above, you may also include the following, as appropriate for your project: 1) a list of potential questions or data sources; 2) a list of groups, agencies, etc. to be sampled; 3) a schedule of data collection; 4) citations of papers motivating your study. The more specific you are in your research prospectus, the more feedback you will be able to receive at this stage of your project.
With a text of your choice, complete the following analyses and produce a brief data analysis report. Be sure to document and justify any decisions that you make during the data cleaning and subsetting process. If you are struggling to find a text from your own work, you can download one from Project Gutenberg or examine a different open-ended question from the permafrost survey we covered in class.
Wrangle and tidy up the text by removing stop words, missing values and any extra characters that do not add to the analysis. Produce a table of the top 50 words and bigrams. How do you interpret these results?
If there are multiple groups, sections or documents in your text, compare the word frequencies across these different subgroups. Alternatively, you may tag parts of speech and compare the frequency of different parts of speech in your text.
Produce two different figures of your choice based on your analysis of word frequencies. Make sure each figure is accompanied by a descriptive caption.
Either analyze the sentiments or create a topic model based on your text. Create one figure or table based on this analysis. Explain how you chose which sentiments to focus on or how you created the topic model.
Interpret the results from your analyses and figures. Did you find anything surprising or noteworthy from this analysis? What further questions do you have about the text that remain unanswered? What additional data or analyses would you need in order to answer those questions?
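The word-frequency, bigram, subgroup, and sentiment steps above can be sketched as follows. This is a minimal sketch, assuming the tidytext, dplyr, and tidyr packages are installed. The text_df data frame and its section and text columns are hypothetical stand-ins for your own text, and the Bing lexicon is only one of several possible sentiment lexicons.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical stand-in text; replace with your own document(s)
text_df <- data.frame(
  section = c("one", "one", "two"),
  text = c(
    "The permafrost is thawing and the river is rising.",
    "Thawing permafrost damages houses near the river.",
    "Houses near the river need repairs every spring."
  )
)

# One word per row (lowercased by default), dropping NAs and stop words
words <- text_df %>%
  unnest_tokens(word, text) %>%
  filter(!is.na(word)) %>%
  anti_join(stop_words, by = "word")

# Top 50 words overall
top_words <- words %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 50)

# Top 50 bigrams; split each bigram so stop words can be dropped from either side
top_bigrams <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  count(word1, word2, sort = TRUE) %>%
  slice_head(n = 50)

# Word frequencies compared across subgroups (here, the hypothetical sections)
words_by_section <- words %>%
  count(section, word, sort = TRUE)

# Positive/negative word counts using the Bing lexicon bundled with tidytext
sentiment_counts <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)
```

Whichever cleaning choices you make (for example, which stop word list to use), record them in your report, since they change the resulting tables.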
This Problem Set is an opportunity to practice working with real-world datasets. You can use a dataset of your choice (e.g. from social media, web scraping, a database, or whitepapers) as an example. With your dataset, complete at least 3 of the following exercises. Document the process of transforming and cleaning the data.
Make a new column based on subsetting or grouping the original data. Use string searches to help with this.
Pivot all or part of the dataframe into either wide or long format.
Using cleaned and wrangled data either a) make a table of the new categories or b) analyze word frequencies.
With an unstructured text field, convert text to lowercase, remove stopwords and analyze the sentiments. Make a custom stopword list to augment an existing stopword list.
Create a clearly labeled, multi-color plot based on your dataset.
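Several of the exercises above (a new column from a string search, a pivot, a summary table, and a multi-color plot) might look something like this. It is only a sketch under stated assumptions: the posts data frame, its column names, and the "flooding"/"cycling" categories are all invented for illustration and should be replaced with your own data and groupings.

```r
library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)

# Hypothetical dataset; every column name and value here is illustrative only
posts <- data.frame(
  id = 1:4,
  text = c("Flooding on Main St", "New bike lane opened",
           "flood warning issued", "Bike share expands"),
  likes = c(10, 4, 7, 2),
  shares = c(3, 1, 5, 0)
)

# New grouping column built from case-insensitive string searches
posts <- posts %>%
  mutate(topic = case_when(
    str_detect(text, regex("flood", ignore_case = TRUE)) ~ "flooding",
    str_detect(text, regex("bike", ignore_case = TRUE)) ~ "cycling",
    TRUE ~ "other"
  ))

# Long format: one row per post and engagement measure
posts_long <- posts %>%
  pivot_longer(c(likes, shares), names_to = "measure", values_to = "count")

# Table of the new categories
topic_table <- posts_long %>%
  group_by(topic, measure) %>%
  summarise(total = sum(count), .groups = "drop")

# Clearly labeled plot with one color per measure
topic_plot <- ggplot(topic_table, aes(x = topic, y = total, fill = measure)) +
  geom_col(position = "dodge") +
  labs(title = "Engagement by topic (toy data)",
       x = "Topic", y = "Total count", fill = "Measure")
```

Printing topic_plot draws the figure; keeping each step in its own named object makes it easier to document the transformation process in your write-up.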
For this Problem Set you will be obtaining and analyzing data from Open Data DC about Urban Forestry Street Trees. You will work in your coding team to produce a polished, clear report analyzing these data. At minimum, include the following sections:
Using the Thawing Permafrost and Rural Communities Survey, answer the following questions and create the associated graphics. I’ve written the problems below with reference to question 8, but you might also consider using question 54 (What do you see as the 3 biggest changes you will have to make in your day-to-day life because of thawing permafrost?).
The following code might help you get started:
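library(dplyr)
# Keep the respondent ID, village, and the three question 8 response columns (assumes surv is already loaded)
pfissues <- surv %>% select(ID, Village,
  X8.1..PF.Issue, X8.2..PF.Issue, X8.3..PF.Issue)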
What are the top 10 most frequently mentioned issues in response to the question: “8) What do you think are the 3 biggest issues that will result from thawing permafrost in this area?” Looking at this list, are there any issues that stand out to you? How useful is this first round of analysis on the raw data?
Remove missing values, convert to lowercase, and categorize the responses into meaningful groups. We don’t know what the overall goal of the researchers might have been, but we can create our own classification systems for these responses. This means that you are grouping responses according to some shared criteria such as “water issues” or “changes in animal behavior”. Briefly explain how you decided to group the responses and then calculate the salience for these issue groups. Produce a bar chart of the Smith’s S values for the most frequently occurring issues, with a vertical line marking Smith’s S = 0.1.
If you are stuck on how to get started, look back at the lesson on text analysis and string data wrangling.
This sample contains individuals from two different villages. Recalculate the issue salience metrics, grouping responses by community. Produce a faceted figure detailing the most salient issues for each community.
Briefly reflect on what you have learned about the perception of issues related to thawing permafrost in each of the communities in the survey. In what ways is this type of analysis meaningful or less informative compared to other analytical techniques you might use on this same set of question responses?
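One way to approach the salience calculations above is sketched here. It assumes the usual definition of Smith's S: each mention is weighted by (L - r + 1) / L, where L is the length of that respondent's list and r is the order of mention, and the weights for an issue are summed and divided by the number of respondents. The toy data frame and its issue1 to issue3 columns are hypothetical, already-categorized responses, not the survey data. The sketch also assumes each respondent mentions a given group at most once and counts only respondents who gave at least one answer.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical, already-categorized responses in order of mention
toy <- data.frame(
  ID = 1:4,
  Village = c("A", "A", "B", "B"),
  issue1 = c("water", "housing", "water", "animals"),
  issue2 = c("housing", "water", NA, "water"),
  issue3 = c("animals", NA, NA, "housing")
)

# Long format: rank is the order of mention within each respondent's list
ranked <- toy %>%
  pivot_longer(starts_with("issue"), names_to = "position", values_to = "issue") %>%
  filter(!is.na(issue)) %>%
  group_by(ID) %>%
  mutate(rank = row_number(),
         list_len = n(),
         weight = (list_len - rank + 1) / list_len) %>%
  ungroup()

# Overall Smith's S: summed weights divided by the number of respondents
smith_s <- ranked %>%
  group_by(issue) %>%
  summarise(S = sum(weight) / n_distinct(ranked$ID), .groups = "drop") %>%
  arrange(desc(S))

# Smith's S within each village, using each village's own respondent count
smith_s_village <- ranked %>%
  group_by(Village) %>%
  mutate(n_resp = n_distinct(ID)) %>%
  group_by(Village, issue) %>%
  summarise(S = sum(weight) / first(n_resp), .groups = "drop")

# Horizontal bars, a vertical reference line at S = 0.1, one panel per village
salience_plot <- ggplot(smith_s_village, aes(x = S, y = issue)) +
  geom_col() +
  geom_vline(xintercept = 0.1, linetype = "dashed") +
  facet_wrap(~ Village) +
  labs(x = "Smith's S", y = "Issue group")
```

If a respondent can mention the same group more than once after you recode the responses, decide how to handle the duplicates (for example, keep only the earliest mention) before computing the weights, and explain that choice in your write-up.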
Note: If you have a dataset from your own research that you would like to analyze in place of the provided datasets, please contact me in advance and we can discuss alternatives.
In this Problem Set, we will analyze the network of character interactions from two classic Indiana Jones films: Raiders of the Lost Ark and The Last Crusade. Data are taken from the MovieGalaxies database. Using these networks, answer the following questions. Data can be downloaded here and here.
Create a summary table of the two networks. At minimum, include the following information:
Plot each network with nodes sized based on a centrality metric of choice and a clear title. Can you compare the node centrality measures across each movie? Why or why not? What do you notice about the overall structure of each movie’s network? Who are the more central nodes and who are more peripheral? How do you interpret these patterns?
Calculate the network densities for each movie. What does this tell you about the character interactions? Can you compare the density for these networks? Why or why not?
Run a community detection algorithm of choice on each network and compare the number of communities and their composition for each film. What do you think drives the modularity in these networks? You may need to look into the plot of each film in order to interpret these results.
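The summary, centrality, density, and community detection steps above could be started along these lines. This is a sketch, assuming the igraph package and an undirected edge list. The edges data frame below is an invented toy network with placeholder node names; the MovieGalaxies files will have their own format and column names, and Louvain is just one of several community detection algorithms you could choose.

```r
library(igraph)

# Hypothetical toy edge list; node names are placeholders, not film characters
edges <- data.frame(
  from = c("A", "A", "A", "B", "C", "D"),
  to   = c("B", "C", "D", "C", "E", "B")
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Raw degree depends on network size; the normalized version is easier
# to compare between networks with different numbers of nodes
deg <- degree(g)
deg_norm <- degree(g, normalized = TRUE)

# One row of a summary table; build one per film and combine with rbind()
net_summary <- data.frame(
  nodes = vcount(g),
  edges = ecount(g),
  density = edge_density(g),
  mean_degree = mean(deg)
)

# Community detection (Louvain) and the modularity of the resulting partition
comm <- cluster_louvain(g)
n_communities <- length(comm)
mod <- modularity(comm)

# Plot with nodes sized by degree and colored by community membership
plot(g,
     vertex.size = 10 + 5 * deg,
     vertex.color = membership(comm),
     main = "Toy network: node size = degree")
```

Because Louvain involves random tie-breaking, the communities can differ between runs on some networks; calling set.seed() beforehand makes your report reproducible.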
Note: If you have a network dataset from your own research that you would like to analyze in place of the provided datasets, please contact me in advance and we can discuss alternatives.
This problem set asks you to step out on your own to analyze and present data using the tools you have learned throughout the semester. You may use the dataset that you plan to present in your final project or a different dataset of your choosing. The goal is that these graphics and analyses will be able to be used in your final project.
Create and interpret a static graphic.
Create and interpret an interactive graphic.
Create and interpret a summary table, regression model or other statistical analysis.