Introduction

How can we approach qualitative data with structured and/or quantitative methods? This lesson walks through some of the ways anthropologists and other social scientists can add new layers of analysis to their data interpretation using data science tools.

Structuring data with tags

When working with unstructured or qualitative data, one way to add further layers of analysis is to code the data based on pieces of information included in the texts. Tagging terms or phrases is similar to adding hashtags to blog or social media posts, organizing files into labeled folders, or any other system where tags become shorthand for clustering similar entities.

Thematic tags

Much basic text analysis relies on thematic tagging of words and phrases. Sentiment analysis and part-of-speech tagging are both examples of thematic tagging.

As with the two examples above, numerous tagging libraries and lexicons already exist. For example, CrisisLex contains tools and tagged datasets for using social media to address crisis or disaster situations. You can also make your own custom tagging framework. A few example approaches are described below.

Coding based on terms in your data

You might constrain the set of terms to be coded based on what terms are present in your existing dataset. For example, this technique might be useful if you are interested in how fishermen think about conservation and have a series of interviews focused on this topic. You could begin by tokenizing the data into words or pairs of words, sorting the tokens by frequency, and going through the list to add tags to each token. You may decide in advance which types of themes are important (e.g. biophysical issues, political issues, economic issues) and give each response one or more tags based on these themes.

Alternatively, you can start by tagging terms in sequence, going back through the terms as themes begin to emerge. For example, you might begin by marking terms in your fishermen dataset with codes such as “ecological” or “social”, but over time go back through the dataset and refine these codes into “water quality” and “resource access” or “family” and “church”. This approach also enables you to easily make nested or multilevel tags, with broader and more specific categories (e.g. family and church could both be nested in the broader category of social). This approach takes more time than the others, but the results can be worth it.
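One simple way to represent nested tags is a mapping from each specific code to its broader parent category, so you can analyze the data at either level. The codes and terms below are hypothetical:

```python
# Each refined code maps to the broader category it was split out from
nested_codes = {
    "water quality": "ecological",
    "resource access": "ecological",
    "family": "social",
    "church": "social",
}

# Hypothetical terms already tagged with the specific codes
tagged_terms = {"runoff": "water quality", "parish": "church"}

# Roll the specific tags up to their broader categories
broad = {term: nested_codes[code] for term, code in tagged_terms.items()}
print(broad)  # {'runoff': 'ecological', 'parish': 'social'}
```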

Coding based on external sources

When tagging data, it can also be helpful to link your analysis to existing theory or conceptual frameworks. This might make your analysis more readily comparable to other scholarly or applied work, as well as help guide the structure of your analysis. Two approaches to using external sources for content tagging are described below.

First, you might use an external source to define the thematic codes used to tag your data. In this case, you might find a model or theory that describes a phenomenon and then think through your own data using this framework. For example, you might use Maslow’s hierarchy of needs, Bourdieu’s theory of capital, or the United Nations Environment Programme’s “21 Issues for the 21st Century” as a framework. Then you can go through your list of terms, see whether each one applies to one or more of these categories (e.g. “cognitive needs” or “social capital”), and tag them accordingly. You may find that a number of terms do not apply to any themes, and that is fine. This can happen when the dataset covers multiple topics, not all of which relate to the external theory.
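In code, this amounts to checking each term against category word lists borrowed from the external framework. The category names echo the examples above, but the word lists themselves are purely illustrative:

```python
# Hypothetical word lists for two framework categories
framework = {
    "cognitive needs": {"learning", "curiosity", "meaning"},
    "social capital": {"network", "trust", "community"},
}

terms = ["trust", "learning", "tackle"]

# Record which categories (possibly none) each term falls under
results = {}
for term in terms:
    results[term] = [cat for cat, words in framework.items() if term in words]
print(results)  # "tackle" matches no category, which is fine
```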

Second, you might take an external list of terms and apply it directly to your own dataset. There are a few ways to approach this depending on the level of detail available in the list of terms chosen. The simplest approach is binary (Y/N) tagging. For example, you might identify a list of terms and tag your dataset based on the presence or absence of a match between datasets. You might use something like the California Department of Toxic Substances Control’s “Glossary of Environmental Terms” or a list of Olympic sports. These lists can be directly matched with your dataset. This method can also be powerful if you have multiple similar or divergent lists of terms. For example, you could compare which environmental issues are considered pressing by different agencies and how well these align with the data in your text.
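Binary tagging reduces to a set-membership check: for each term on the external list, record whether it appears in your dataset. Both lists below are small invented samples:

```python
# Hypothetical external glossary and terms drawn from your own dataset
external_terms = {"aquifer", "runoff", "emissions", "biodiversity"}
dataset_terms = {"runoff", "biodiversity", "permit", "harbor"}

# Y/N flag for each external term: is it present in the dataset?
presence = {term: term in dataset_terms for term in sorted(external_terms)}
print(presence)
```

With several external lists, you can repeat this check per list and compare how well each one aligns with your data.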

Georeferenced tags

Sometimes geographic places are encoded in your data. These can be extracted and used to make maps based on data or theme distributions. You might look for lists of cities, counties, countries, or other geographic scales depending on your dataset. These can then be matched with your dataset and geocoded accordingly. Check out this helpful tutorial for more information: Finding Places in Text with the World Historical Gazetteer, by Susan Grunewald and Andrew Janco.
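At its simplest, geocoding means matching place names found in your text against a list of known places with coordinates. The tiny gazetteer and text below are illustrative only; in practice you would use a resource like the World Historical Gazetteer:

```python
# Hypothetical mini-gazetteer: place name -> (latitude, longitude)
gazetteer = {
    "Boston": (42.36, -71.06),
    "Seattle": (47.61, -122.33),
}

text = "Fishermen in Boston and Gloucester described similar concerns."

# Keep only the places that actually appear in the text
found = {place: coords for place, coords in gazetteer.items() if place in text}
print(found)  # Gloucester is missed because it is not in our gazetteer
```

Note that simple substring matching will miss places absent from your list and can mismatch ambiguous names, which is why dedicated gazetteer tools are worth the effort for real projects.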

Relational tags

Relational tagging can be approached in several ways. First, if your data includes multiple individuals talking to one another, you might make a network dataset based on interactions. Second, you could look at co-occurrence of ideas or terms in smaller units of your data. For example, you might have multiple newspaper articles, tweets, or interview responses and want to understand how often the word “change” co-occurs with “climate” or “social”.
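Counting co-occurrence within units is straightforward once the data is split into those units. The sketch below uses invented one-line units and checks how often “change” appears in the same unit as “climate” versus “social”:

```python
# Hypothetical small units of text (e.g. tweets or responses)
units = [
    "climate change is accelerating",
    "social change takes generations",
    "climate policy and social change",
    "the climate is warming",
]

def cooccurs(unit, a, b):
    """True if both words appear in the same unit."""
    words = unit.split()
    return a in words and b in words

# Count units where each pair of terms appears together
cooccurrence = {}
for a, b in [("change", "climate"), ("change", "social")]:
    cooccurrence[(a, b)] = sum(cooccurs(u, a, b) for u in units)
print(cooccurrence)
```

These pairwise counts can then feed into a co-occurrence network, where terms are nodes and counts are edge weights.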

Creating structured variables

Even when working with qualitative data, there are underlying ways to extract structured variables. This can be as simple as looking through your interview notes and noting how many people, and which groups of people, mentioned a certain theme or idea. This is slightly different from the text analysis methods described above: in this case, you are cherry-picking through the whole dataset in search of one specific variable. These types of descriptions can help contextualize your results for readers. Other methods are described below.
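Extracting one such variable can be as simple as checking each respondent's notes for a theme. The notes below are hypothetical:

```python
# Hypothetical interview notes keyed by respondent
notes = {
    "respondent_1": "worried about water quality near the docks",
    "respondent_2": "focused on fuel prices and market access",
    "respondent_3": "mentioned water quality and conservation rules",
}

# One structured variable: which respondents mentioned "water quality"?
mentioned = [rid for rid, text in notes.items() if "water quality" in text]
print(len(mentioned), "of", len(notes), "respondents mentioned water quality")
```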

Summaries based on data structure

In some cases, your data may have built-in quantitative or structured summary variables that aren’t initially apparent. For example, if your dataset contains newspapers from several cities in the US over a defined time period, you could imagine describing your dataset in several ways.

“This research uses newspaper articles from across major US cities published during the 20th century”

“This research uses 60 articles published in four major newspapers (LA Times [n=30], NY Times [n=10], Washington Post [n=15], Chicago Tribune [n=5]) between 1925 and 2000, with an average publication date of 1970.”
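Summary variables like those in the second description can be computed directly from a list of article records. The records below are invented stand-ins for a real dataset:

```python
from collections import Counter
from statistics import mean

# Hypothetical article records: source newspaper and publication year
articles = [
    {"paper": "LA Times", "year": 1948},
    {"paper": "NY Times", "year": 1972},
    {"paper": "LA Times", "year": 1990},
    {"paper": "Chicago Tribune", "year": 1965},
]

# Count articles per newspaper and summarize the publication years
counts = Counter(a["paper"] for a in articles)
years = [a["year"] for a in articles]
print(dict(counts))
print("range:", min(years), "-", max(years), "mean:", round(mean(years)))
```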

Each of these descriptions may be adequate for describing your dataset, but think about the different levels of information conveyed in each. With large datasets where individual data sources have multiple attributes (e.g. tweets, museum object tags, environmental policies) you can also try running cluster analyses on these sources to group them based on similar attributes. This can help you better understand the structure of your data.