9  Creating and Managing Corpora

9.1 Objectives

In this chapter, we cover the corpus object. We explain why you need a text corpus for text analysis and how the selection of texts can affect your results and inferences. We also outline approaches for changing the unit of analysis (reshaping and segmenting corpora), how to filter a text corpus based on variables associated with texts, how to retrieve the raw texts from a text corpus, and how to manage metadata about a text corpus.

9.2 Methods

Every text analysis project begins with a set of texts, grouped in a collection known in text analysis as a corpus. A corpus is a body of documents generally collected for a common purpose. Each corpus usually consists of three elements: the documents containing text, variables specific to each document, and metadata about the corpus as a whole.

Creating a text corpus starts with defining a sample of the available texts, out of all possible texts you could have selected. A text corpus could include all articles published on immigration in Irish newspapers, with each article recorded as one document in the text corpus. A corpus could also be the reviews about one specific hotel, or a random sample of reviews about hotels in Europe. Researchers need to consider and justify why certain texts are (not) included in a corpus, since the generalisability and validity of findings can depend on the documents selected for analysis.

The principles of your specific project or research design should guide your decisions to include or exclude documents from analysis. For example, if you want to study rhetoric during televised debates, the corpus would be limited to transcripts of televised debates. When comparing issue salience in televised debates and campaign speeches, the corpus will contain debate transcripts and speeches. Thus, the research question should drive document selection.

In some text analysis applications, the sample could constitute the entirety of available text documents. Analysing all texts released by the actor(s) of interest does not necessarily mean that an analysis is without problems. Selection issues can drive the information that is recorded. Texts or information not transcribed or published cannot be included in the analysis. When analysing budget speeches, speaker selection is an important caveat. Parties strategically select Members of Parliament who can express their opinion on government policy and budget decisions (Herzog and Benoit 2015). The positions of politicians not selected for speaking at a budget debate cannot be considered in the textual analysis. Researchers should therefore consider potential selection effects or systematic missingness of information when assembling a text corpus.

Including and comparing texts with very different characteristics may affect your analysis. For example, written speeches often differ from spoken speeches (Benoit, Munger, and Spirling 2019). Debates follow a different data-generating process than campaign speeches. Debates rely on questions and spontaneous responses, while politicians or their campaign teams draft campaign speeches well in advance. This does not mean that different types of text cannot or should not be compared, since such a comparison can reveal structural differences based on the medium of communication. However, we would strongly advise you to identify structural differences between types of texts by comparing different groups of texts. We discuss how to analyse differences in word usage across groups in Chapter 17. Chapter 24 shows how to identify variation in topic prevalence for two or more groups of documents.

Besides the raw text, corpora (usually) include attributes that distinguish texts. We call these attributes document-level variables or, in the language of quanteda, “docvars”. Docvars contain additional information on each document and allow researchers to differentiate between texts in their analysis. Examples of document-level variables are the name of the author of a text, the newspaper in which an article was published, the hotel which was reviewed on TripAdvisor, or the date when the document was created.

A text corpus also typically contains metadata recording important information about the corpus as a whole. Metadata that is generally useful to record in a corpus includes the source where the corpus was obtained, possibly including the URL; the author of the corpus (if from a single source); a title for the corpus; and possibly important keywords that might be used later in categorising the corpus. Corpus metadata can be quite general, potentially including a codebook, instructions on how to cite the corpus, a license for the use of the corpus, or copyright information.

While no universal standard exists for which metadata to record, there are guidelines for corpus objects that are followed in the quanteda packages that provide example corpora, including those in the package that accompanies this book. These all record the following metadata fields: title and description, providing a title and short text description of the corpus respectively; source and url, documenting in plain text and as a web address where the corpus was obtained; author, even when there was no single author of the documents; and keywords, a vector of keywords categorising the corpus. We recommend using this scheme for new corpus objects that you create.

Thus far we have been using the terminology of “document” and “text” interchangeably, but the definition of what constitutes a document is more deliberate and more consequential, since documents define the unit of analysis. Just as in making the decision to select which documents should be included in the corpus, the text analyst must also think carefully about what will define a document. Sometimes, this decision is natural, such as in collecting individual product reviews, individual speeches (as in the Presidential inaugural speech corpus), or individual social media posts. Because the definition of a document can be fluid, however, not all such decisions are so clear cut. A user might wish to cut some longer documents into sub-sections (like chapters of a book, or paragraphs of a speech) that will form documents. Or, when facing lots of smaller texts such as posts on Twitter, a user might wish to aggregate these by user, or by day or week, to define new documents.

As with the decision to select texts for a corpus, the decision as to precisely how documents should be defined will depend on a user’s needs for a specific project. Sometimes, this will produce a need to redefine the document units in which texts were collected, into document units that resemble the units that the user will analyse for the purposes of a broader study. In their study of the association between emotions and real-time voter reactions, for instance, Boussalis et al. (2021) redefine the speech-level documents in which campaign speeches were found into specific utterances, which were later analysed as statements associated with varying levels of sentiment. In reviews of hotels, the unit could be an individual review of a hotel, the mention of the hotel and its immediate context, or all texts that review a hotel. The point is that the document units in which texts are collected may not be the same as the document units that a text researcher will need for the purposes of their analysis. In the section that follows, we will show how to reshape, split, and combine document units from a corpus, allowing a flexible redefinition of document-level units to meet specific analytic needs. We also cover how to access and modify all elements of the corpus mentioned in this section.

Some tools for text analysis call for “cleaning” a corpus, sometimes by removing elements that are deemed to be unwanted, such as punctuation. We strongly discourage this, because such radical interventions lessen the generality of a corpus. We take the same approach to defining documents: We prefer that a corpus contain natural units of analysis, such as individual reviews, rather than analytic units of analysis such as combined or aggregated reviews. Chapter 11 and Chapter 12 show how to combine documents into the analytic unit of analysis using tokens_group() or dfm_group(). Cleaning should be limited to removing textual cruft, such as page numbers left over when a document was converted from PDF format, when these are not of interest.

9.3 Applications

In this section, we will demonstrate how to create a corpus from different input sources, how to access and assign docvars and corpus metadata, how to subset documents from a corpus based on document-level variables, and how to draw a random sample of documents from a text corpus.

9.3.1 Creating a Corpus from Different Input Sources

We can create a quanteda corpus from different input sources, for example a data frame, a tm corpus object, or a keyword-in-context object (kwic).

In Chapter 5 we show how to use the readtext package for importing texts stored in various formats, e.g., PDF documents, Word documents, or spreadsheets. The output of the readtext() function is always a data frame, which can be easily transformed to a corpus object using corpus().
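
As a minimal sketch of this two-step pattern (the folder of PDF files here is hypothetical, used only for illustration):

library("readtext")

# read all PDF files from a (hypothetical) folder; readtext() returns a
# data frame with a doc_id column and a text column
dat_reviews <- readtext("data/reviews/*.pdf")

# convert the data frame to a corpus object
corp_reviews <- corpus(dat_reviews)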

Creating a corpus from a data frame or tibble object is straightforward. We use corpus() and specify in text_field which column contains the text. By default, all remaining variables of the data frame will be added as docvars.

library("quanteda")

# create data frame for illustration purposes
dat <- data.frame(
    letter_factor = factor(rep(letters[1:3], each = 2)),
    some_ints = 1L:6L,
    some_text = paste0("This is text number ", 1:6, "."),
    stringsAsFactors = FALSE,
    row.names = paste0("from_df_", 1:6)
)

head(dat)
          letter_factor some_ints              some_text
from_df_1             a         1 This is text number 1.
from_df_2             a         2 This is text number 2.
from_df_3             b         3 This is text number 3.
from_df_4             b         4 This is text number 4.
from_df_5             c         5 This is text number 5.
from_df_6             c         6 This is text number 6.
# create corpus
corp_dataframe <- corpus(dat, text_field = "some_text")

summary(corp_dataframe)
Corpus consisting of 6 documents, showing 6 documents:

      Text Types Tokens Sentences letter_factor some_ints
 from_df_1     6      6         1             a         1
 from_df_2     6      6         1             a         2
 from_df_3     6      6         1             b         3
 from_df_4     6      6         1             b         4
 from_df_5     6      6         1             c         5
 from_df_6     6      6         1             c         6

We can also create a quanteda corpus from a tm corpus object. The following example uses the crude corpus from the tm package, which contains 20 example news articles.

# load in a tm example VCorpus object
data(crude, package = "tm")

# create quanteda corpus
corp_newsarticles <- corpus(crude)

In addition, we can create a text corpus from a kwic() object. Keywords-in-context is covered extensively in Chapter 15.

For example, we could extract all mentions of “freedom”, “war”, and terms starting with “econom” (such as “economy” or “economic”), together with the immediate context of 20 tokens, from the corpus of US presidential inaugural speeches, and convert the output to a new text corpus.

# create keyword-in-context object for three terms
# and a window of ±20 tokens
kw_inaugural <- data_corpus_inaugural |>
    tokens(remove_separators = FALSE) |>
    kwic(pattern = c("freedom", "war", "econom*"),
         window = 20, separator = " ")

# check number of matches
nrow(kw_inaugural)
[1] 478
# check how often each pattern was matched
table(kw_inaugural$pattern)

freedom     war econom* 
    185     181     112 
# convert kwic object to a new text corpus
corp_kwic <- corpus(kw_inaugural, split_context = FALSE)

print(corp_kwic, max_ndoc = 5, max_nchar = 40)
Corpus consisting of 478 documents and 1 docvar.
1789-Washington.L1841 :
"truth   more   thoroughly   established ..."

1797-Adams.L367 :
"The   zeal   and   ardor   of   the   pe..."

1801-Jefferson.L2619 :
"best   reliance   in   peace   and   for..."

1801-Jefferson.L2652 :
"  the   supremacy   of   the   civil   o..."

1801-Jefferson.L2756 :
"  all   abuses   at   the   bar   of   t..."

[ reached max_ndoc ... 473 more documents ]

9.3.2 Inspecting Document-Level Variables and Metadata of a Corpus

Document-level variables are crucial for many text analysis projects. For instance, if we want to compare differences in issue emphasis in inaugural addresses before and after World War II, we need a document-level variable specifying the year when a speech was delivered. We may also want to create a binary variable indicating whether a speech was delivered before or after World War II. Below, we provide examples of how to inspect document-level variables and how to create new docvars.

First, we load the packages and inspect the text corpus using summary().

# provide summary of corpus and print document-level variables for the first six documents
summary(data_corpus_inaugural, n = 6)
Corpus consisting of 59 documents, showing 6 documents:

            Text Types Tokens Sentences Year  President FirstName
 1789-Washington   625   1537        23 1789 Washington    George
 1793-Washington    96    147         4 1793 Washington    George
      1797-Adams   826   2577        37 1797      Adams      John
  1801-Jefferson   717   1923        41 1801  Jefferson    Thomas
  1805-Jefferson   804   2380        45 1805  Jefferson    Thomas
    1809-Madison   535   1261        21 1809    Madison     James
                 Party
                  none
                  none
            Federalist
 Democratic-Republican
 Democratic-Republican
 Democratic-Republican

The corpus consists of 59 documents. Each document is one inaugural speech delivered between 1789 and 2021.

The meta() function returns a named list containing the corpus-level information stored in the corpus, reflecting the standard set of metadata fields that we have chosen to use for quanteda objects. The “keywords” element of this list is a character vector containing five keywords.

meta(data_corpus_inaugural)
$description
[1] "Transcripts of all inaugural addresses delivered by United States Presidents, from Washington 1789 onward.  Data compiled by Gerhard Peters."

$source
[1] "Gerhard Peters and John T. Woolley. The American Presidency Project."

$url
[1] "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/inaugural-addresses"

$author
[1] "(various US Presidents)"

$keywords
[1] "political"     "US politics"   "United States" "presidents"   
[5] "presidency"   

$title
[1] "US presidential inaugural address speeches"

We can also access the names of our docvars.

names(docvars(data_corpus_inaugural))
[1] "Year"      "President" "FirstName" "Party"    

We see that the Year variable stores the year when each speech was delivered. We could create a binary variable PrePostWW2 which distinguishes between speeches delivered before and after 1945.

# use $ operator
data_corpus_inaugural$PrePostWW2 <-
    ifelse(data_corpus_inaugural$Year > 1945,
           "Post World War II", "Pre World War II"
    )

# equivalent to
docvars(data_corpus_inaugural, "PrePostWW") <-
    ifelse(docvars(data_corpus_inaugural, "Year") > 1945,
           "Post World War II", "Pre World War II"
    )

There are multiple ways to access docvars in a quanteda object—here a corpus, but the same applies to other objects derived from a corpus, such as a tokens object or a document-feature matrix. We can use docvars(data_corpus_inaugural, "Party") or, for individual docvars, the $ operator known from lists and data frames, i.e. data_corpus_inaugural$Party.
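
For example, the following two calls return the same vector of party labels, and docvars() without a field name returns all document-level variables as a data frame:

# equivalent ways of retrieving a single docvar
docvars(data_corpus_inaugural, "Party")
data_corpus_inaugural$Party

# all docvars as a data frame
head(docvars(data_corpus_inaugural))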

9.3.3 Subsetting a Corpus

Applying corpus_subset() to a corpus allows us to subset a text corpus based on values of document-level variables. We can filter, for instance, only speeches delivered by Democratic Presidents, speeches delivered after World War II, or speeches from selected presidents.

# subset only speeches delivered by Democratic Presidents
corp_democrats <- corpus_subset(data_corpus_inaugural, Party == "Democratic")

# subset speeches delivered after WW2
corp_postww2 <- corpus_subset(data_corpus_inaugural,
                              PrePostWW2 == "Post World War II")

# equivalently, subset based on the Year docvar
corp_postww2 <- corpus_subset(data_corpus_inaugural, Year > 1945)

# subset speeches delivered by Clinton, Obama, and Biden
corp_cob <- corpus_subset(data_corpus_inaugural,
                          President %in% c("Clinton", "Obama", "Biden"))

# remove Biden's 2021 speech
corp_nobiden <- corpus_subset(data_corpus_inaugural,
                              President != "Biden")

Usually, we use logical operators for subsetting a text corpus or creating a new document-level variable based on certain conditions. The most relevant logical operators are listed below, followed by a short example that combines several of them:

  • <: less than
  • <=: less than or equal to
  • >: greater than
  • >=: greater than or equal to
  • ==: equal
  • !=: not equal
  • !x: not x (negation)
  • x | y: x OR y
  • x & y: x AND y
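
For instance, several of these operators can be combined within a single call to corpus_subset():

# post-war speeches by Democratic or Republican presidents, excluding Trump
corp_combined <- corpus_subset(
    data_corpus_inaugural,
    Party %in% c("Democratic", "Republican") & Year > 1945 & President != "Trump"
)

ndoc(corp_combined)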

9.3.4 Randomly Sampling Documents From a Corpus

Often, users may be interested in taking a random sample of documents from the text corpus. For example, a user may implement and test the workflow for a small, random subset of the documents, and only later run the analysis on the full text corpus.

The corpus_sample() function allows for sampling a specified number of documents (size) with or without replacement (replace), optionally stratified by grouping variables (by) or with probability weights (prob).

# sample 30 inaugural speeches without replacement
set.seed(10987)
corp_sample30 <- corpus_sample(data_corpus_inaugural, size = 30, 
                               replace = FALSE)

# check that the corpus consists of 30 documents
ndoc(corp_sample30)
[1] 30
# sample 15 speeches each from the pre- and post-WW II groups
corp_sample_prepost <- corpus_sample(data_corpus_inaugural,
                                     size = 15,
                                     replace = FALSE, by = PrePostWW2)

# check that corpus contains 15 pre- and 15 post-WW II documents
table(corp_sample_prepost$PrePostWW2)

Post World War II  Pre World War II 
               15                15 

9.4 Advanced

9.4.1 Corpus “Cleaning”

It’s common to speak about “cleaning” a corpus, by removing unwanted elements such as numbers or punctuation characters. Our approach is more conservative, however, since we prefer to preserve the entirety of a document in our corpus, leaving the decision as to which elements to keep or remove to downstream processing. Why do we advocate this approach, and what do we mean by it?

If we recall the text analysis workflow from Chapter 8, the construction of the corpus is the first step, and tokenisation the second. Most of what is “cleaned” from a corpus in some forms of natural language processing consists of removing categories of tokens, such as whitespace, punctuation, numbers, or even specific words (for instance “stopwords”). But this step cannot even take place until a document has been processed by being tokenised. (We view it as ironic, therefore, that the decision on which tokens to remove is often called “pre-processing”.) But the decision as to what should be removed and what should be kept is very unlikely to be universally agreed. Indeed, the decision as to what even constitutes a token may differ (something we cover more in the next two chapters). If there is no universal agreement, then a corpus should preserve the entire text of each document, including all of its linguistic elements, and leave the decision as to what to remove open-ended and adaptable to whatever use to which the corpus is being put.

To put this another way, we think that the very language of “cleaning” texts is misguided. If we frame decisions about what to retain from texts as cleaning, it implies that some elements are a form of dirt to be scrubbed away. But what to one text analyst might be dirt to be wiped away might be rich topsoil to another. Most stopword lists, for instance, remove pronouns, but permanently removing these from a corpus would make it impossible to analyse gendered language, where gendered pronouns are markers for substantive differences — for instance, Monroe, Quinn, and Colaresi (2008) found that gendered pronouns on the Senate floor were far more often used by Democratic than by Republican Party speakers. In their classic study of the authorship of the unsigned Federalist Papers, Mosteller and Wallace (1963) discovered that it was the statistical distribution of common words such as articles, conjunctions, and prepositions that differentiated the articles penned by Hamilton from those by Madison. The lesson is that it is far better, then, to leave the original documents intact rather than to presuppose that any specific form of surgery will always improve them.

Punctuation and capitalisation are often used by natural language processing tools to detect boundaries between sentences or clauses. In the dataset that accompanies Pang, Lee, and Vaithyanathan (2002)’s classic study of the Naive Bayes classifier to detect sentiment from movie reviews, the reviews have been converted to lowercase and the punctuation characters separated from the words they would normally follow. Once this step has been taken, it is very difficult or impossible to reconstruct the original text as it was originally written.

By relegating the decision to “downstream” processing, we mean that decisions on “cleaning” up documents become issues of feature selection. Feature selection is a better framing of the decision to intervene in the content of our documents prior to analysis, since feature selection underscores that the goal is to support a particular type of analysis, not to scrub away elements that are innately undesirable.

An exception to this rule, where we do feel it is appropriate to use the language of cleaning up texts, or at least tidying them, is when they have been converted from other formats that generate content that is not part of the text itself, but rather artefacts of the converted format. When converting from PDF, for instance — a format designed for the printed page — the resulting plain text may frequently contain page numbers, headers, or footers that are not part of a document’s core text, and say more about the printed page and font sizes than about the document content. We would always remove these.

Another exception is when a document contains information at the beginning or end that is more usefully treated as metadata, such as a title, author, or date. These elements we would remove and put into document variables. All Project Gutenberg books, for instance, contain header information about the project and its license, and metadata about the title, author, release date, and language, among other fields. We would always trim this part and separate it into document-level metadata (or corpus-level metadata, if the documents are chapters).

9.4.2 The Structure of a quanteda Corpus Object

As we will see with most of the core object classes in quanteda, its text analytic objects tend to be special versions of common R object types. In the R language, the atomic data type for recording textual data is the object type known as “character”. A character object is an atomic vector that has a length and attributes. In quanteda, a corpus is simply a special type of character object, with its attributes recording the extra information that makes it special, including its metadata, its document-level variables, and other hidden information used by quanteda functions.

The advantage of building a corpus on the character data type is a property known as inheritance: functions that already work on a character object will also work on a quanteda corpus. Furthermore, because R is a language built around vectorised operations, having a corpus built on a type of atomic vector is highly efficient. The other advantage is simplicity: to convert a character vector to a corpus, we only need to add special attributes. To convert a corpus back to a character vector, we can simply drop these attributes (using the coercion method as.character(), which we detail below).
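
A short sketch illustrates this inheritance: base R functions for character vectors operate directly on a corpus, and as.character() drops the corpus attributes again.

# a small corpus for illustration
corp_mini <- corpus(c(doc1 = "A first text.", doc2 = "A second, slightly longer text."))

# character functions work because a corpus inherits from "character"
nchar(corp_mini)

# dropping the special attributes returns a plain named character vector
as.character(corp_mini)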

There are some operations that require a bit of additional care in order to preserve a corpus object’s special attributes. Some low-level functions, especially those that call underlying functions in another language (C or C++), may revert a corpus to a character type if not handled carefully. There is a way around this, however: assign the result back using the subset operator [.

Replacing a character in a corpus with another, for instance, will typically strip its corpus attributes. We can get around this by assigning the result back through the subset operator:

# not a corpus any longer
corp <- corpus(c(letters[1:3]))
corp <- stringr::str_replace_all(corp, "b", "B")
corp
[1] "a" "B" "c"
# still a corpus
corp <- corpus(c(letters[1:3]))
corp[seq_along(corp)] <- stringr::str_replace_all(corp, "b", "B")
corp
Corpus consisting of 3 documents.
text1 :
"a"

text2 :
"B"

text3 :
"c"

9.4.3 Conversion from and to Other Formats

Below, we provide examples of converting a quanteda corpus to other objects. Converting a corpus object to a data frame could be useful, for instance, if you want to reshape a corpus to sentences and hand-code a sample of these sentences for validation purposes or as a training set for supervised classifiers (see Chapter 23). The example below shows how to sample 1,000 hotel reviews and convert this corpus to a data frame.

library("TAUR")

# set seed for reproducibility
set.seed(35)

corp_TAhotels_sample <- corpus_sample(data_corpus_TAhotels,
                                      size = 1000,
                                      replace = FALSE)

# convert corpus to data frame
dat_TAhotels_sample <- convert(corp_TAhotels_sample, to = "data.frame")

The resulting data frame consists of 1,000 randomly sampled reviews and three columns: doc_id, containing the document name; text, containing the actual text of the document; and any docvars that were in the corpus (in our case Rating).

We can easily store this data frame as a spreadsheet, which can be useful when hand-coding a selection of reviews (for instance, based on their sentiment).

# store data frame as csv file with UTF-8 encoding and remove row names
write.csv(dat_TAhotels_sample,
          file = "dat_TAhotels_sample.csv",
          fileEncoding = "UTF-8",
          row.names = FALSE)

# the rio package allows us to store
# the data frame in many different file formats
library("rio")

export(dat_TAhotels_sample,
       file = "dat_TAhotels_sample.xlsx")

9.4.4 Converting or Coercing to Other Object Types

We can coerce an object to a text corpus (as.corpus()) or extract the texts from a corpus (as.character()). as.corpus() can only be applied to a quanteda corpus object and upgrades it to the newest format. This function can be handy for researchers who work with older quanteda objects. For transforming data frames or tm corpus objects into a quanteda corpus, you should use the corpus() function. The function as.character() returns the corpus text as a plain character vector.

# retrieve texts from corpus object
chars_TAhotels <- as.character(data_corpus_TAhotels)

# inspect character object
str(chars_TAhotels)
 Named chr [1:20491] "nice hotel expensive parking got good deal stay hotel anniversary, arrived late evening took advice previous re"| __truncated__ ...
 - attr(*, "names")= chr [1:20491] "text1" "text2" "text3" "text4" ...

9.4.5 Changing the Unit of Analysis

In a corpus, the units are documents, and each document contains text. But what defines a “document”? We have already discussed how we might split a naturally found document into smaller elements, for example splitting a book into new documents consisting of single chapters. Going in the other direction, we might also group naturally occurring documents, such as social media posts, into a user’s combined weekly or daily posts, especially when these are short-format posts such as Tweets.

This ability to redefine document units in terms of split or grouped textual units points to a curious feature of textual data that other data, such as numerical data, do not possess. If we think of the processed elements of documents as features, such as tokens that might be tabulated into a table of token counts by document, then the units of analysis of that table—the documents—are defined in terms of collections of its columns—the token features. Since a document is just an arbitrary collection of terms, however, the more we segment a document into smaller collections, the more it approaches being a feature unit defined by the column dimension of the data. Conversely, grouping documents together does the opposite. Redefining the boundaries of what constitutes a “document”, therefore, involves shifting data from columns into rows or vice versa. This ability to reshape our data matrix, because one dimension is defined in terms of a collection of the other, is unique to text analysis. We could not perform a similar reshaping operation on, say, a survey dataset by spreading an individual’s observed responses across additional rows, because we cannot split an individual as a unit: the individual is defined as a sampled, physical person, not as an arbitrary collection of survey questions.

Ultimately, how we reshape our documentary units by grouping or splitting them will depend on our research question and the needs of our method for analysing the data. Knowing how the sampling procedure used to select the texts relates to the sampling units and to the units of analysis may have implications for subsequent inference, since the units of analysis are not themselves randomly sampled, irrespective of how the sampling units were drawn. Determining which units are most suitable will depend on the nature of the analytical technique and the insight it is designed to yield, and sometimes on the length and nature of the texts themselves.

To illustrate how we can reshape a corpus into smaller units and then re-aggregate them into larger units different from the original units, we will demonstrate the construction of a corpus that requires some reshaping. For this purpose, we will use the corpus of televised debate transcripts from the U.S. Presidential election campaign of 2020. Donald Trump and Joe Biden participated in two televised debates. The first debate took place in Cleveland, Ohio, on 29 September 2020. The two candidates met again in Nashville, Tennessee, on 22 October 2020.1 The TAUR package contains this text corpus as data_corpus_debates.

# inspect text corpus
summary(data_corpus_debates)
Corpus consisting of 2 documents, showing 2 documents:

              Text Types Tokens Sentences             location       date
 Debate 2020-09-29  2454  24836      2039      Cleveland, Ohio 2020-09-29
 Debate 2020-10-22  2403  21726      1488 Nashville, Tennessee 2020-10-22

The output of summary(data_corpus_debates) reveals that the corpus contains only two documents, i.e., the full transcript of each debate. The first document contains the transcript of the debate in Cleveland, consisting of 24,836 tokens, 2,454 types (i.e., unique tokens), and 2,039 sentences. The second debate in Nashville is slightly shorter (21,726 tokens and 2,403 types).

“Types” and “tokens” are specific linguistic terms that refer to the quantity and quality of the linguistic units found in a text. A type is a unique lexical unit, and a token is any lexical unit. Lexical units are most often words, but may also include punctuation characters, numerals, emoji, or even spaces. We cover this in greater detail in Chapter 10.
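
As a quick check of these counts (a sketch only; the exact figures depend on the tokenisation options, which summary() applies with their defaults):

# tokenise the two debate transcripts with default settings
toks_debates <- tokens(data_corpus_debates)

# number of tokens and types per document
ntoken(toks_debates)
ntype(toks_debates)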

The debate corpus contains only two documents, each a transcript of a lengthy debate that included many different statements by the presidential candidates and the moderator. The corpus records each debate as a “natural” document unit, but in the case of transcripts, that is often not the “analytical” document unit that will meet the analyst’s needs.

To make the document unit align with our analytical purposes, we will need to reshape the corpus by segmenting it into individual statements, recording the speaker and the sequence in which the statement occurred, as new document-level variables.

The tool for this segmentation in quanteda is corpus_segment(). As discussed in Chapter 8, the name of this function reflects the object that it takes as an input and produces as an output (a corpus) and the verb element describes the operation it will perform (segmentation). To perform this segmentation, we will rely on the presence in the transcript of regular patterns marking the introduction of a new statement. quanteda can take multiple forms of patterns, including a fixed match, a “glob” match, and a full regular expression match. Because we need to match a very specific pattern with more elements than a simple glob pattern can handle, we will need to use the regular expression “valuetype”. (For a brief primer on pattern matching and regular expressions see Appendix C.)

In the transcript, a statement starts with the speaker’s name in ALL CAPS, followed by a colon. To match this marker of a new statement, we will use the regular expression "\\s*[[:upper:]]+:\\s+". Here \\s* matches zero or more whitespace characters, [[:upper:]]+ matches one or more upper-case letters (the speaker’s name in ALL CAPS), followed by a colon (the literal :), followed by one or more whitespace characters (\\s+). Setting case_insensitive = FALSE tells the pattern matcher to pay attention to case (how a word’s letters are capitalised or not).
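
Before applying the pattern to the full corpus, it can be useful to test it on a short snippet. The snippet below is invented for illustration and mimics the transcript format:

library("stringr")

# an invented snippet mimicking the format of the transcripts
snippet <- "WALLACE: Good evening.\n\nTRUMP: Thank you.\n\nBIDEN: How are you?"

# show what the regular expression matches
str_extract_all(snippet, "\\s*[[:upper:]]+:\\s+")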

By default, corpus_segment() will split the original documents into smaller documents corresponding to the segments preceded by the pattern, and extract the pattern found to a new variable for the split document (a docvar named pattern). It is also smart enough to record the original, longer document from which each split document originally came.

# segment text corpus to level of utterances
data_corpus_debatesseg <- corpus_segment(data_corpus_debates,
                                         pattern = "\\s*[[:upper:]]+:\\s+",
                                         valuetype = "regex",
                                         case_insensitive = FALSE
)

# overview of text corpus; n = 4 prints only the first four documents
summary(data_corpus_debatesseg, n = 4)
Corpus consisting of 1208 documents, showing 4 documents:

                Text Types Tokens Sentences        location       date
 Debate 2020-09-29.1   139    251        13 Cleveland, Ohio 2020-09-29
 Debate 2020-09-29.2     6      6         1 Cleveland, Ohio 2020-09-29
 Debate 2020-09-29.3     5      5         1 Cleveland, Ohio 2020-09-29
 Debate 2020-09-29.4     3      3         1 Cleveland, Ohio 2020-09-29
     pattern
   WALLACE: 
 \n\nBIDEN: 
 \n\nTRUMP: 
 \n\nBIDEN: 

The new, split corpus has turned the two original documents into 1,208 new documents. Note also that we have kept the quanteda naming conventions in assigning the new object a name that identifies it as data, as a corpus, and describes what it is. Because any function in quanteda that begins with corpus_ will always take in and output a corpus, we know that this object will be a corpus.

In the split document, the new docvar pattern contains the pattern that we matched. Because this also included additional elements such as the spaces and the colon, however, it’s not as clean as we would prefer. Let’s turn this into a new speaker document-level variable by cleaning it up and renaming it. To make this easy, we will use the stringr package for string manipulation, and the quanteda.tidy package for applying some dplyr functions from the “tidyverse” to manipulate the variables. Specifically, we will use mutate() to create a new speaker variable, and the stringr functions to remove surrounding whitespace (str_trim()) and the colon (str_remove_all()), and to change the names from UPPER CASE to Title Case (str_to_title()).

library("stringr")
library("quanteda.tidy")

data_corpus_debatesseg <- data_corpus_debatesseg |>
    mutate(speaker = stringr::str_trim(pattern),
           speaker = stringr::str_remove_all(speaker, ":"),
           speaker = stringr::str_to_title(speaker))

Next, we can use simple base R functions to inspect the count of utterances by speaker and debate.

# cross-table of speaker statements by debate
table(data_corpus_debatesseg$location,
      data_corpus_debatesseg$speaker)
                      
                       Biden Trump Wallace Welker
  Cleveland, Ohio        269   341     246      0
  Nashville, Tennessee    84   122       0    146

We could further reshape the corpus to the level of sentences with corpus_reshape() if we are interested, for instance, in sentence-level sentiment or issue salience.

data_corpus_debatessent <- corpus_reshape(data_corpus_debatesseg,
                                          to = "sentences")

ndoc(data_corpus_debatessent)
[1] 3560

The new text corpus moved from 1,208 utterances to 3,560 sentences. Using functions such as textstat_summary() from the quanteda.textstats package, we can retrieve summary statistics of the sentence-level corpus.

library("quanteda.textstats")
dat_summary_sents <- textstat_summary(data_corpus_debatessent)

# aggregated summary statistics
summary(dat_summary_sents)
   document             chars            sents       tokens         types      
 Length:3560        Min.   :  3.00   Min.   :1   Min.   : 1.0   Min.   : 1.00  
 Class :character   1st Qu.: 24.00   1st Qu.:1   1st Qu.: 6.0   1st Qu.: 6.00  
 Mode  :character   Median : 41.00   Median :1   Median :10.0   Median : 9.00  
                    Mean   : 55.32   Mean   :1   Mean   :12.4   Mean   :10.85  
                    3rd Qu.: 73.00   3rd Qu.:1   3rd Qu.:16.0   3rd Qu.:14.00  
                    Max.   :414.00   Max.   :1   Max.   :79.0   Max.   :56.00  
     puncts          numbers           symbols             urls        tags  
 Min.   : 0.000   Min.   :0.00000   Min.   :0.00000   Min.   :0   Min.   :0  
 1st Qu.: 1.000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0   1st Qu.:0  
 Median : 2.000   Median :0.00000   Median :0.00000   Median :0   Median :0  
 Mean   : 2.074   Mean   :0.07893   Mean   :0.01376   Mean   :0   Mean   :0  
 3rd Qu.: 3.000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0   3rd Qu.:0  
 Max.   :15.000   Max.   :4.00000   Max.   :2.00000   Max.   :0   Max.   :0  
     emojis 
 Min.   :0  
 1st Qu.:0  
 Median :0  
 Mean   :0  
 3rd Qu.:0  
 Max.   :0  

Segmenting corpora into smaller units requires a common pattern across the documents. In the example above, we identified utterances based on the combination of a speaker’s surname in capital letters followed by a colon. Other corpora may include markers such as line breaks or headings that can be used to segment a corpus. When segmenting text corpora, we strongly recommend inspecting the resulting text corpus and spot-checking that the segmentation worked as expected.
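
One simple way to spot-check is to print a few randomly sampled segments and check the speaker labels attached to them (a sketch):

# inspect a few randomly drawn segments
corp_check <- corpus_sample(data_corpus_debatesseg, size = 3)
print(corp_check, max_nchar = 60)

# check the speaker recorded for each sampled segment
docvars(corp_check, "speaker")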

9.4.6 Reshaping Corpora after Statistical Analysis of Texts

In many applications, the unit of analysis of the text corpus differs from the dataset used for statistical analyses. For example, Castanho Silva and Proksch (2021) study sentiment on European politics in tweets and parliamentary speeches. The authors construct a corpus of speeches and tweets that mention keywords relating to Europe or the EU and apply a sentiment dictionary to each document. They then aggregate sentiment to the level of all relevant texts by a single politician, moving from over 100,000 Europe-related tweets and 20,000 Europe-related speeches to around 2,500 observations. Each observation stores the sentiment expressed by one Member of Parliament over the period of investigation. These sentiment scores are then used in regression models.

Applying group_by() in combination with summarise() from the dplyr package allows you to reshape the output of a textual analysis, stored as a data frame, to a higher-level unit of analysis, as the sketch below illustrates.
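
A minimal sketch of this pattern, assuming a data frame dat_sentiment with one row per document and (hypothetical) columns speaker and sentiment:

library("dplyr")

# aggregate document-level sentiment to one observation per speaker
# (dat_sentiment, speaker, and sentiment are hypothetical names)
dat_by_speaker <- dat_sentiment |>
    group_by(speaker) |>
    summarise(mean_sentiment = mean(sentiment),
              n_documents = n())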

9.5 Further Reading

  • Selecting documents and considerations of “found data”: Grimmer, Roberts, and Stewart (2022, ch. 4)
  • Definitions of document units and the reasons for and implications of redefining these: Benoit (2020, pp. 480–481, “Defining Documents and Choosing the Unit of Analysis”)
  • Adjusting strings: Wickham, Çetinkaya-Rundel, and Grolemund (2023, ch. 15)

9.6 Exercises

In the exercises below, we use a corpus of TripAdvisor hotel reviews, data_corpus_TAhotels, included in the TAUR package.

  1. Identify the number of documents in data_corpus_TAhotels.
  2. Subset the corpus by selecting only reviews with the maximum rating of 5.
  3. Reshape this subsetted corpus to the level of sentences.
  4. Explore textstat_summary() of the quanteda.textstats package. Apply the function to data_corpus_TAhotels and assign it to an object called tstat_sum_TAhotels.
  5. What are the average, median, minimum, and maximum document lengths?
  6. Advanced: use data_corpus_TAhotels and filter only reviews consisting of at least 300 tokens by combining ntoken() and corpus_subset().
  7. Create a new document-level variable RatingRecoded which splits the ratings into three categories: Negative (Rating 1 and 2), Neutral (Rating 3), and Positive (Rating 4 and 5). Which of the three categories is the most frequent one?

  1. The transcripts are available from the American Presidency Project.↩︎