In this chapter, we cover the basics of tokenisation and the quanteda tokens object. You will learn what to pay attention to when tokenising texts, and how to select, keep, and remove tokens. We explain methods for selecting tokens to remove or to modify, for instance removing “stopwords”, and for reducing words to their base or root form through stemming and lemmatisation. Finally, we show how to manage metadata in a tokens object, which largely mirrors the way metadata is managed in a corpus object. By the end of this chapter, the reader will have a solid understanding of how to create and manage tokens objects in quanteda, which will serve as a foundation for the more advanced tokens manipulation methods in later chapters.
10.2 Methods
After collecting the texts for analysis and storing them in a corpus (see Chapter 9), the next most common step is to tokenise our texts. Tokenisation is the process of segmenting longer texts into individual units known as tokens, based on semantic, linguistic, or lexicographic distinctions. The most common variety of tokens consists of distinct words, but tokens can also be punctuation characters, numeric or alphabetic characters, emoji, or spaces. Tokens can also consist of sequences of these, such as sentences, paragraphs, or word sequences of arbitrary length.
Tokenisation usually works by recognising the delimiters between words, which in most languages take the form of a space. In more technical language, inter-word delimiters are known as whitespace, and include additional machine characters such as newlines, tabs, and space variants. Most languages separate words by whitespace, but some major ones such as Chinese, Japanese, and Korean do not. Tokenising these languages requires a set of rules to recognise word boundaries, usually based on a listing of common word endings. Smart tokenisers will also separate punctuation characters that occur immediately following a word, such as the comma after “word” in this sentence.
In quanteda, the built-in tokenizer provides the option to tokenise our texts to different levels: words, sentences, or individual characters. Most of the time, we tokenise our documents to the level of words, although the default “word” tokeniser also separates out punctuation characters and numerals. Each tokenised document will consist of the list of tokens found in that document, but always still organised into the same document units that define the corpus.
Throughout all tokenisation steps, we know the position of each token in the document, which we can use to identify and compound multiword expressions or to apply a dictionary containing multiword expressions. These aspects will be covered in much more detail in Chapter 11. For now, it is important to keep in mind the main difference between tokens objects and a document-feature matrix: while we know the relative position of each feature in a tokens object, a document-feature matrix reports the counts of features (which can be words, punctuation characters, numbers, or multiword expressions) in each document, but does not allow us to identify where a certain feature appeared in the document.
The next step after the tokenisation of our documents is often described as pre-processing, but we prefer “processing”: processing does not precede the analysis, but is an integral part of the workflow and can influence subsequent results (Denny and Spirling 2018). The most common steps are the lower-casing of text (e.g., “Party” is changed to “party”); the removal of punctuation characters and symbols; the removal of so-called stopwords, which appear frequently throughout all documents but do not add specific meaning; stemming or lemmatisation; and compounding phrases/multiword expressions into a single token. All of these decisions can influence our results. In this chapter, we focus on lower-casing, the removal of punctuation characters, and stopwords. The subsequent chapter covers more advanced tokenisation approaches, including phrases, token replacement, and chunking.
Lower-casing words is a standard procedure in many text analysis projects. The rationale behind this is that “Income” and “income” should be interpreted as the same textual feature due to their shared meaning. Furthermore, it’s a common practice to remove punctuation characters like commas, colons, semi-colons, question marks, and exclamation marks. Though these characters appear prolifically across texts, they often don’t significantly contribute to a quantitative text analysis. However, in certain contexts, punctuation can carry significant weight. For instance, the frequency of question marks can differentiate between positive and negative reviews of movies or hotels. They can also demarcate the rhetoric of opposition versus governing parties in parliamentary debates. Negative reviews might employ more question marks than positive ones, while opposition parties might employ rhetorical questions to criticise the ruling party. Symbols are another category often pruned during text processing.
The removal of stopwords prior to quantitative analysis is another frequent step. The rationale behind removing stopwords might be to shrink the vector space, condense the size of document-feature matrices, or prevent common words from inflating document similarities. It’s pivotal to understand that there’s no one-size-fits-all stopwords list. These lists are usually developed by researchers and tend to be domain-specific. Some words might be redundant for specific research topics but invaluable for others. For instance, feminine pronouns like “she” and “her” are integral when scrutinising partisan bias in abortion debates (Monroe and Schrodt 2008), even though they might appear in many stopwords lists. In another case, the word “will” plays a pivotal role in discerning the temporal direction of a sentence (Müller 2022). Applying stopword lists without close inspection may lead to the removal of essential terms, undermining subsequent analysis. It is imperative that researchers critically evaluate which words to retain or exclude.
Stopword lists often originate from two primary methodologies. The first method involves examining frequent words in text corpora and manually pinpointing non-essential features. The second method leverages automated techniques, like term-frequency-inverse-document-frequency (tf-idf), to detect stopwords (Sarica and Luo 2021; Wilbur and Sirotkin 1992). Refer to Chapter 17 for an in-depth exploration of strategies to discern both informative and non-informative features.
Stemming and lemmatisation serve as strategies to consolidate features. Stemming truncates tokens to their stems. In contrast, lemmatisation transforms a word into its fundamental form. Most stemming methodologies use predefined lists of suffixes and associated rules governing suffix removal. Many languages have these lists readily available. An exemplary rule-based stemming algorithm is the Snowball stemmer, developed by Martin F. Porter (Porter 2001). Lemmatisation, being more nuanced than stemming, ensures that tokens align with their root form. For example, a stemmer might truncate “easily” to “easili” and leave “easier” untouched. In contrast, a lemmatiser would convert both “easily” and “easier” to their root form: “easy”. While stemming in particular, and lemmatisation to a lesser degree, are very popular processing steps, reducing features to their base forms often does not change substantive results. Schofield and Mimno (2016) compare and apply various stemmers before running topic models (Chapter 24). Their careful validation reveals that “stemmers produce no meaningful improvement in likelihood and coherence and in fact can degrade topic stability” (Schofield and Mimno 2016: 287).
10.3 Applications
In this section, we apply the processing steps described above. The examples in this chapter are limited to tokenizing short texts. In practice and in most other chapters, you will be working with much larger text data sets. We always recommend creating a corpus object first and then tokenizing the corpus, rather than moving directly from a character vector or data frame to a tokens object.
10.3.1 Tokenizing and Lowercasing Texts
Let’s start by exploring the tokens() function.
# load the quanteda package
library("quanteda")

# texts for examples
txt <- c(
  doc1 = "A sentence, showing how tokens() works.",
  doc2 = "@quantedainit and #textanalysis https://quanteda.org"
)

# tokenisation without any processing
tokens(txt)
The tokens() function includes several arguments for changing the tokenisation.
# tokenise to sentences (rarely used)
tokens(txt, what = "sentence")
Tokens consisting of 2 documents.
doc1 :
[1] "A sentence, showing how tokens() works."
doc2 :
[1] "@quantedainit and #textanalysis https://quanteda.org"
# tokenise to character level
tokens(txt, what = "character")
Tokens consisting of 2 documents.
doc1 :
[1] "A" "s" "e" "n" "t" "e" "n" "c" "e" "," "s" "h"
[ ... and 22 more ]
doc2 :
[1] "@" "q" "u" "a" "n" "t" "e" "d" "a" "i" "n" "i"
[ ... and 37 more ]
We can lowercase our tokens object by applying the function tokens_tolower().
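For example, applied to the short example texts from above (a minimal sketch):

# tokenise and convert all tokens to lower case
tokens(txt) |>
  tokens_tolower()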
Details on these and further processing options are provided in the documentation for tokens(), which can be accessed by typing ?tokens into the R console.
With large text corpora, it might be difficult to assess whether the tokenisation works as expected. We therefore encourage researchers to work with minimal working examples, e.g., one or two sentences that contain certain features you want to tokenise, remove, keep, or compound. You can run your code on this small example and test whether the tokenisation worked as expected before applying the code to the entire corpus.
10.3.3 Inspecting and Removing Stopwords
The quanteda package contains several functions that process tokens. You start by tokenising your text corpus, possibly apply some of the processing options included in the tokens() function, and proceed by applying more advanced processing steps, all of which use functions whose names start with tokens_.
Let’s start by examining pre-existing stopword lists. We use quanteda’s default Snowball stopword list.
# number of stopwords in the English Snowball stopword list
length(quanteda::stopwords("en"))
[1] 175
# first 5 stopwords of the English Snowball stopword list
head(quanteda::stopwords("en"), 5)
[1] "i" "me" "my" "myself" "we"
# default German Snowball stopword list
length(quanteda::stopwords("de"))
[1] 231
# first 5 stopwords of German Snowball stopword list
head(quanteda::stopwords("de"), 5)
[1] "aber" "alle" "allem" "allen" "aller"
Because quanteda’s stopwords() function is merely a re-export from the same function in the stand-alone stopwords package, we can access the additional stopwords lists defined in that package.
# check the first ten stopwords from an expanded English stopword list
# (note that list includes numbers)
head(stopwords("en", source = "stopwords-iso"), 10)
Finally, you can create your own list of stopwords by storing them in a character vector. The short my_stopwords list below is for illustration purposes only, since many custom lists will be considerably longer.
my_stopwords <- c("a", "an", "the")
In the next step, we apply various stopword lists to our tokens object using tokens_select(x, selection = "remove") and the wrapper function tokens_remove().
# remove English stopwords and inspect output
tokens(txt) |>
  tokens_select(
    pattern = quanteda::stopwords("en"),
    selection = "remove"
  )
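The wrapper function achieves the same result with a shorter call, and it accepts the custom my_stopwords vector defined above in the same way (a minimal sketch):

# equivalent call using the wrapper function
tokens(txt) |>
  tokens_remove(pattern = quanteda::stopwords("en"))

# remove the custom stopword list
tokens(txt) |>
  tokens_remove(pattern = my_stopwords)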
Pattern matching is central when compounding or selecting tokens. Let’s consider the following example: we might want to keep only “president”, “president’s”, and “presidential” in our tokens object. One option is to use fixed pattern matching and only keep the exact matches. We specify the pattern and valuetype in the tokens_select() function and determine whether to treat patterns as case-sensitive or case-insensitive.
Let’s go through this trio systematically. The pattern can be one or more unigrams or multi-word sequences. When including multi-word sequences, make sure to use the phrase() function, as in the short example below. case_insensitive specifies whether or not to ignore the case of terms when matching a pattern. The valuetype can take one of three arguments: "glob" for “glob”-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching.
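As a brief aside on multi-word patterns (a minimal sketch using a made-up sentence), wrapping a sequence in phrase() tells the tokens_* functions to treat it as one pattern spanning several tokens:

# keep only the two-token sequence "prime minister"
tokens("The prime minister spoke to the press.") |>
  tokens_keep(pattern = phrase("prime minister"))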
We start by explaining fixed pattern matching and the behaviour of case_insensitive before moving to “glob”-style pattern matching and matching based on regular expressions. We refer readers to Chapter 4 and Appendix C for details about regular expressions.
# create tokens object
toks_president <- tokens(
  "The President attended the presidential gala where the president's policies were applauded."
)

# fixed (literal) pattern matching
tokens_keep(toks_president,
            pattern = c("president", "presidential", "president's"),
            valuetype = "fixed")
The default pattern match is case_insensitive = TRUE. Therefore, President remains part of the tokens object even though the pattern includes president in lower case. We can change this behaviour by setting case_insensitive = FALSE in tokens_keep().
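The call below (reconstructed from the description above) adds this argument and produces the output shown:

# fixed pattern matching, this time case-sensitive
tokens_keep(toks_president,
            pattern = c("president", "presidential", "president's"),
            valuetype = "fixed",
            case_insensitive = FALSE)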
Tokens consisting of 1 document.
text1 :
[1] "presidential" "president's"
Now only presidential and president's are kept in the tokens object, while the term President is not captured since it does not match the term “president” when selecting tokens in a case-sensitive way.
* and ?: two “glob”-style matches to rule them all
Pattern matching in quanteda defaults to “glob”-style because it’s simpler than regular expression matching and suffices for the majority of user requirements. Moreover, it aligns with fixed pattern matching when wildcard characters (* and ?) aren’t utilised. The implementation in quanteda uses * to match any number of any characters including none, and ? to match any single character.
Let’s take a look at a few examples to explain the behaviour of “glob”-style pattern matching.
# match the token "president" and all terms starting with "president"
tokens_keep(toks_president, pattern = "president*", valuetype = "glob")
# match tokens starting with "p" and ending in "ing"
tokens("buying buy paying pay playing laying lay") |>
  tokens_keep(pattern = "p*ing", valuetype = "glob")
Tokens consisting of 1 document.
text1 :
[1] "paying" "playing"
# match tokens consisting of any single character followed by "ay"
tokens("buying buy paying pay playing laying lay") |>
  tokens_keep(pattern = "?ay", valuetype = "glob")
Tokens consisting of 1 document.
text1 :
[1] "pay" "lay"
# match tokens consisting of any single character, followed by "ay",
# and then zero or more further characters
# (matches "paying", "pay", "laying", and "lay", but not "playing")
tokens("buying buy paying pay playing laying lay") |>
  tokens_keep(pattern = "?ay*", valuetype = "glob")
If you want to have more control over pattern matches, we recommend regular expressions (valuetype = "regex"), which we explain in more detail in Appendix C.
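As a brief illustration (a minimal sketch; the caret anchors the match at the start of a token), the following keeps all tokens beginning with “president”:

# keep tokens beginning with "president", using a regular expression
tokens_keep(toks_president, pattern = "^president", valuetype = "regex")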
10.5 Stemming
The quanteda package includes the function tokens_wordstem(), a wrapper around wordStem() from the SnowballC package. The function uses Martin Porter’s (Porter 2001) algorithm described above. The example below shows how tokens_wordstem() adjusts various words.
# example texts applied to tokens
txt <- c(
  one = "eating eater eaters eats ate",
  two = "taxing taxis taxes taxed my tax return"
)

# create tokens object
tokens(txt)
Tokens consisting of 2 documents.
one :
[1] "eating" "eater" "eaters" "eats" "ate"
two :
[1] "taxing" "taxis" "taxes" "taxed" "my" "tax" "return"
# create tokens object and stem tokens
txt |>
  tokens() |>
  tokens_wordstem()
Tokens consisting of 2 documents.
one :
[1] "eat" "eater" "eater" "eat" "ate"
two :
[1] "tax" "taxi" "tax" "tax" "my" "tax" "return"
Lemmatisation is more complex than stemming since it does not rely on a simple set of pre-defined suffix rules. The spacyr package allows you to lemmatise a text corpus. We describe lemmatisation in the Advanced section below.
10.6 Advanced
10.6.1 Applying Different Tokenisers
quanteda contains several tokenisers, which can be applied in tokens(). Moreover, you can apply tokenisers included in other packages.
The current default tokeniser is word3, included in quanteda version 3 and above. For forward compatibility, there is also a word4 tokeniser, which is even smarter than the default and will be used in major version 4. You can apply the different tokenisers by specifying the what argument in tokens().
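A minimal sketch, assuming the tokeniser names described above are accepted by the what argument (check ?tokens in your installed version for the exact values):

# explicitly request the version-3 word tokeniser
tokens(txt, what = "word3")

# request the newer word4 tokeniser
tokens(txt, what = "word4")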
The tokenizers package includes additional tokenisers (Mullen et al. 2018). These tokenisers can also be applied and transformed to a quanteda tokens object.
# load the tokenizers package
library(tokenizers)

# tokenisation without processing
tokenizers::tokenize_words(txt) |>
  tokens()
Tokens consisting of 2 documents.
one :
[1] "eating" "eater" "eaters" "eats" "ate"
two :
[1] "taxing" "taxis" "taxes" "taxed" "my" "tax" "return"
# tokenisation with processing in both functions
tokenizers::tokenize_words(txt, lowercase = FALSE, strip_punct = FALSE) |>
  tokens(remove_symbols = TRUE)
Tokens consisting of 2 documents.
one :
[1] "eating" "eater" "eaters" "eats" "ate"
two :
[1] "taxing" "taxis" "taxes" "taxed" "my" "tax" "return"
10.6.2 Lemmatisation
While stemming works directly in quanteda using tokens_wordstem(), lemmatisation, i.e., changing tokens to their base forms, requires a different package. You can use the spacyr package, a wrapper around the spaCy Python library, to lemmatise a quanteda tokens object. Note that you will need to install Python and a virtual environment to use spaCy.
# load the spacyr package
library("spacyr")

# use spacy_install() to install spaCy in a new or existing
# virtual environment; check ?spacy_install for details

# initialise and use the English language model
spacy_initialize(model = "en_core_web_sm")
successfully initialized (spaCy Version: 3.7.2, language model: en_core_web_sm)
txt_compare <- c(
  one = "The cats are running quickly.",
  two = "The geese were flying overhead."
)

# parse texts, returning part-of-speech tags and lemmas
toks_spacy <- spacy_parse(txt_compare, pos = TRUE, lemma = TRUE)

# show the first 10 tokens, which are stored as a data frame
head(toks_spacy, 10)
doc_id sentence_id token_id token lemma pos entity
1 one 1 1 The the DET
2 one 1 2 cats cat NOUN
3 one 1 3 are be AUX
4 one 1 4 running run VERB
5 one 1 5 quickly quickly ADV
6 one 1 6 . . PUNCT
7 two 1 1 The the DET
8 two 1 2 geese geese NOUN NORP_B
9 two 1 3 were be AUX
10 two 1 4 flying fly VERB
# transform object to a quanteda tokens object and use the lemmas
as.tokens(toks_spacy, use_lemma = TRUE)
Tokens consisting of 2 documents.
one :
[1] "the" "cat" "be" "run" "quickly" "."
two :
[1] "the" "geese" "be" "fly" "overhead" "."
# compare with the Snowball stemmer
txt_compare |>
  tokens() |>
  tokens_wordstem()
Tokens consisting of 2 documents.
one :
[1] "The" "cat" "are" "run" "quick" "."
two :
[1] "The" "gees" "were" "fli" "overhead" "."
# finalise spaCy and terminate the Python process to free up memory
spacy_finalize()
The code above highlights the differences between stemming and lemmatisation. Stemming can truncate words, resulting in non-real words. Lemmatisation reduces words to their canonical, valid form. The word flying is stemmed to fli, while the lemmatiser changes the word to its base form, fly.
10.6.3 Modifying Stopword Lists
In many cases, you might want to use an existing stopword list but remove or add certain features. You can use quanteda’s char_remove() and base R’s c() function to remove or add features. The examples below show how to remove features from the default English stopword list.
# check if "will" is included in default stopword list"will"%in%stopwords("en")
[1] TRUE
# remove "will" and store output as new stopword liststopw_reduced <-char_remove(stopwords("en"), pattern ="will")# check whether "will" was removed"will"%in% stopw_reduced
[1] FALSE
We use c() from base R to add words to stopword lists. For example, the feature further is included in the default English stopword list, but furthermore and therefore are not included. Let’s add both terms.
# check if terms are included in the stopword list
c("furthermore", "therefore") %in% stopwords("en")
[1] FALSE FALSE
# extend the stopword list
stop_extended <- c(stopwords("en"), "furthermore", "therefore")

# check the last part of the character vector
tail(stop_extended)
As discussed above, tokenisation and processing involve many steps, and we can combine these steps using the base R pipe (|>). The example below shows a typical workflow.
# tokenise data_corpus_inaugural,
# remove punctuation and numbers,
# remove stopwords,
# stem the tokens,
# and transform the object to lowercase
toks_inaugural <- data_corpus_inaugural |>
  tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_remove(pattern = stopwords("en")) |>
  tokens_wordstem() |>
  tokens_tolower()

# inspect the first tokens from the first two speeches
head(toks_inaugural, 2)
Tokens consisting of 2 documents and 4 docvars.
1789-Washington :
[1] "fellow-citizen" "senat" "hous" "repres"
[5] "among" "vicissitud" "incid" "life"
[9] "event" "fill" "greater" "anxieti"
[ ... and 640 more ]
1793-Washington :
[1] "fellow" "citizen" "call" "upon" "voic" "countri"
[7] "execut" "function" "chief" "magistr" "occas" "proper"
[ ... and 50 more ]
The sequence of processing steps during tokenisation is important. For example, if we first stem our tokens and remove stopwords or specific patterns afterwards, we might not remove all desired features. Consider the following example:
txt <-"During my stay in London I visited the museumand attended a very good concert."# remove stopwords before stemming tokenstokens(txt, remove_punct =TRUE) %>%tokens_remove(stopwords("en")) %>%tokens_wordstem()
The first example produces what most users want: it removes all terms from our stopword list (during, my, I, the, and, a, very), while the second example first stems During to dure and very to veri, which changes the terms to tokens that are not included in stopwords("en") (and therefore remain in the tokens object).
10.6.4 Managing Document-Level Variables and Metadata
By default, tokens objects contain the document-level variables and the metadata assigned to your corpus. You can access or modify these variables in the same way as we did in Chapter 9.
# tokenise US inaugural speeches
toks_inaugural <- tokens(data_corpus_inaugural)

# add a document-level variable
toks_inaugural$post_1990 <- ifelse(
  toks_inaugural$Year > 1990, "Post-1990", "Pre-1990"
)

# inspect the new document-level variable
table(toks_inaugural$post_1990)
Post-1990 Pre-1990
8 51
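Object-level metadata carries over from the corpus in the same way and can be read or set with meta(); a minimal sketch, where the field name "notes" is made up for illustration:

# assign and retrieve metadata for the tokens object
meta(toks_inaugural, field = "notes") <- "Tokenised US inaugural addresses"
meta(toks_inaugural, field = "notes")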
10.7 Further Reading
The concept of tokenisation and how to build a custom tokeniser: Hvitfeldt and Silge (2021, ch. 2)
The intuition behind processing and tokenising texts: Grimmer, Roberts, and Stewart (2022, ch. 5.3)
Introduction to the tokenizers package: Mullen et al. (2018)
How processing decisions can influence results: Denny and Spirling (2018)
10.8 Exercises
Denny, Matthew W., and Arthur Spirling. 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It.” Political Analysis 26 (2): 168–89. https://doi.org/10.1017/pan.2017.44.
Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton: Princeton University Press.
Hvitfeldt, Emil, and Julia Silge. 2021. Supervised Machine Learning for Text Analysis in R. Boca Raton: CRC Press. https://smltar.com.
Monroe, Burt L., and Philip A. Schrodt. 2008. “Introduction to the Special Issue: The Statistical Analysis of Political Text.” Political Analysis 16 (4): 351–55. https://doi.org/10.1093/pan/mpn017.
Mullen, Lincoln A., Kenneth Benoit, Os Keyes, Dmitry Selivanov, and Jeffrey Arnold. 2018. “Fast, Consistent Tokenization of Natural Language Text.” Journal of Open Source Software 3: 655. https://doi.org/10.21105/joss.00655.
Müller, Stefan. 2022. “The Temporal Focus of Campaign Communication.” The Journal of Politics 84 (1): 585–90. https://doi.org/10.1086/715165.
Porter, Martin F. 2001. “Snowball: A Language for Stemming Algorithms.”
Schofield, Alexandra, and David Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4: 287–300. https://doi.org/10.1162/tacl_a_00099.
Wilbur, W. John, and Karl Sirotkin. 1992. “The Automatic Identification of Stop Words.” Journal of Information Science 18 (1): 45–55. https://doi.org/10.1177/01655515920180010.