In this chapter, we cover the basics of tokenisation and the quanteda tokens object. You will learn what to pay attention to when tokenising texts, and how to select, keep, and remove tokens. We explain methods for selecting tokens to remove or to modify, for instance removing “stopwords”, and for reducing words to their base or root form through stemming and lemmatisation. Finally, we show how to manage metadata in a tokens object, which largely mirrors the way metadata is managed in a corpus object. By the end of this chapter, the reader will have a solid understanding of how to create and manage tokens objects in quanteda, which will serve as a foundation for the more advanced tokens manipulation methods in later chapters.
10.2 Methods
After collecting the texts for analysis and storing them in a corpus (see Chapter 9), the next most common step is to tokenise our texts. Tokenisation is the process of segmenting longer texts into individual units known as tokens, based on semantic, linguistic, or lexicographic distinctions. The most common variety of tokens consists of distinct words, but tokens can also be punctuation characters, numeric or alphabetic characters, emoji, or spaces. Tokens can also consist of sequences of these, such as sentences, paragraphs, or word sequences of arbitrary length.
Tokenisation usually works by recognising the delimiters between words, which in most languages take the form of a space. In more technical language, inter-word delimiters are known as whitespace, and include additional machine characters such as newlines, tabs, and space variants. Most languages separate words by whitespace, but some major ones such as Chinese, Japanese, and Korean do not. Tokenising these languages requires a set of rules to recognise word boundaries, usually based on a listing of common word endings. Smart tokenisers will also separate punctuation characters that occur immediately following a word, such as the comma after “word” in this sentence.
In quanteda, the built-in tokenizer provides the option to tokenise our texts to different levels: words, sentences, or individual characters. Most of the time, we tokenise our documents to the level of words, although the default “word” tokeniser also separates out punctuation characters and numerals. Each tokenised document will consist of the list of tokens found in that document, but always still organised into the same document units that define the corpus.
Throughout all tokenisation steps, we know the position of each token in the document, which we can use to identify and compound multiword expressions or to apply a dictionary containing multiword expressions. These aspects will be covered in much more detail in Chapter 11. For now, it is important to keep in mind the main difference between tokens objects and a document-feature matrix: while we know the relative position of each feature in a tokens object, a document-feature matrix reports the counts of features (which can be words, punctuation characters, numbers, or multiword expressions) in each document, but does not allow us to identify where a certain feature appeared in the document.
The next step after the tokenisation of our documents is often described as pre-processing, but we prefer “processing”: processing does not precede the analysis, but is an integral part of the workflow and can influence subsequent results (Denny and Spirling 2018). The most common steps are the lower-casing of text (e.g., “Party” is changed to “party”); the removal of punctuation characters and symbols; the removal of so-called stopwords, which appear frequently throughout all documents but do not add specific meaning; stemming or lemmatisation; and compounding phrases/multiword expressions into a single token. All of these decisions can influence our results. In this chapter, we focus on lower-casing, the removal of punctuation characters, and stopwords. The subsequent chapter covers more advanced tokenisation approaches, including phrases, token replacement, and chunking.
Lower-casing words is a standard procedure in many text analysis projects. The rationale behind this is that “Income” and “income” should be interpreted as the same textual feature due to their shared meaning. Furthermore, it’s a common practice to remove punctuation characters like commas, colons, semi-colons, question marks, and exclamation marks. Though these characters appear prolifically across texts, they often don’t significantly contribute to a quantitative text analysis. However, in certain contexts, punctuation can carry significant weight. For instance, the frequency of question marks can differentiate between positive and negative reviews of movies or hotels. They can also demarcate the rhetoric of opposition versus governing parties in parliamentary debates. Negative reviews might employ more question marks than positive ones, while opposition parties might employ rhetorical questions to criticise the ruling party. Symbols are another category often pruned during text processing.
The removal of stopwords prior to quantitative analysis is another frequent step. The rationale behind removing stopwords might be to shrink the vector space, condense the size of document-feature matrices, or prevent common words from inflating document similarities. It’s pivotal to understand that there’s no one-size-fits-all stopwords list. These lists are usually developed by researchers and tend to be domain-specific. Some words might be redundant for specific research topics but invaluable for others. For instance, feminine pronouns like “she” and “her” are integral when scrutinising partisan bias in abortion debates (Monroe and Schrodt 2008), even though they might appear in many stopwords lists. In another case, the word “will” plays a pivotal role in discerning the temporal direction of a sentence (Müller 2022). Applying stopword lists without close inspection may lead to the removal of essential terms, undermining subsequent analysis. It is imperative that researchers critically evaluate which words to retain or exclude.
Stopword lists often originate from two primary methodologies. The first method involves examining frequent words in text corpora and manually pinpointing non-essential features. The second method leverages automated techniques, like term-frequency-inverse-document-frequency (tf-idf), to detect stopwords (Sarica and Luo 2021; Wilbur and Sirotkin 1992). Refer to Chapter 17 for an in-depth exploration of strategies to discern both informative and non-informative features.
Stemming and lemmatisation serve as strategies to consolidate features. Stemming truncates tokens to their stems. In contrast, lemmatisation transforms a word into its fundamental form. Most stemming methodologies use predefined lists of suffixes and associated rules governing suffix removal. Many languages have these lists readily available. An exemplary rule-based stemming algorithm is the Snowball stemmer, developed by Martin F. Porter (Porter 2001). Lemmatisation, being more nuanced than stemming, ensures that tokens align with their root form. For example, a stemmer might truncate “easily” to “easili” and leave “easier” untouched. In contrast, a lemmatiser would convert both “easily” and “easier” to their root form: “easy”. While stemming in particular, and lemmatisation to a lesser degree, are very popular processing steps, reducing features to their base forms often does not change substantive results. Schofield and Mimno (2016) compare and apply various stemmers before running topic models (Chapter 24). Their careful validation reveals that “stemmers produce no meaningful improvement in likelihood and coherence and in fact can degrade topic stability” (Schofield and Mimno 2016: 287).
10.3 Applications
In this section, we apply the processing steps described above. The examples in this chapter are limited to tokenizing short texts. In practice and in most other chapters, you will be working with much larger text data sets. We always recommend creating a corpus object first and then tokenizing the corpus, rather than moving directly from a character vector or data frame to a tokens object.
10.3.1 Tokenizing and Lowercasing Texts
Let’s start by exploring the tokens() function.
# load the quanteda package
library("quanteda")

# texts for examples
txt <- c(
  doc1 = "A sentence, showing how tokens() works.",
  doc2 = "@quantedainit and #textanalysis https://quanteda.org"
)

# tokenisation without any processing
tokens(txt)
The tokens() function includes several arguments for changing the tokenisation.
# tokenise to sentences (rarely used)
tokens(txt, what = "sentence")
Tokens consisting of 2 documents.
doc1 :
[1] "A sentence, showing how tokens() works."
doc2 :
[1] "@quantedainit and #textanalysis https://quanteda.org"
# tokenise to character level
tokens(txt, what = "character")
Tokens consisting of 2 documents.
doc1 :
[1] "A" "s" "e" "n" "t" "e" "n" "c" "e" "," "s" "h"
[ ... and 22 more ]
doc2 :
[1] "@" "q" "u" "a" "n" "t" "e" "d" "a" "i" "n" "i"
[ ... and 37 more ]
We can lowercase our tokens object by applying the function tokens_tolower().
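For example, applied to the short example texts from above (a minimal sketch):

# tokenise and convert all tokens to lower case
tokens(txt) |>
  tokens_tolower()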
Details on these and further processing options are provided in the documentation for tokens(), which can be accessed by typing ?tokens into the R console.
With large text corpora, it might be difficult to assess whether the tokenisation works as expected. We therefore encourage researchers to work with minimal working examples, e.g., one or two sentences that contain certain features you want to tokenise, remove, keep, or compound. You can run your code on this small example and test whether the tokenisation worked as expected before applying the code to the entire corpus.
10.3.3 Inspecting and Removing Stopwords
The quanteda package contains several functions that process tokens. You start by tokenising your text corpus, possibly apply some of the processing options included in the tokens() function, and proceed by applying more advanced processing steps, all of which use functions whose names start with tokens_.
Let’s start by examining pre-existing stopword lists. We use quanteda’s default Snowball stopword list.
# number of stopwords in the English Snowball stopword list
length(quanteda::stopwords("en"))
[1] 175
# first 5 stopwords of the English Snowball stopword list
head(quanteda::stopwords("en"), 5)
[1] "i" "me" "my" "myself" "we"
# default German Snowball stopword list
length(quanteda::stopwords("de"))
[1] 231
# first 5 stopwords of German Snowball stopword list
head(quanteda::stopwords("de"), 5)
[1] "aber" "alle" "allem" "allen" "aller"
Because quanteda’s stopwords() function is merely a re-export from the same function in the stand-alone stopwords package, we can access the additional stopwords lists defined in that package.
# check the first ten stopwords from an expanded English stopword list
# (note that list includes numbers)
head(stopwords("en", source = "stopwords-iso"), 10)
Finally, you can create your own list of stopwords by storing them in a character vector. The short my_stopwords list below is for illustration purposes only, since many custom lists will be considerably longer.
my_stopwords <- c("a", "an", "the")
In the next step, we apply various stopword lists to our tokens object using tokens_select(x, selection = "remove") and the wrapper function tokens_remove().
# remove English stopwords and inspect output
tokens(txt) |>
  tokens_select(
    pattern = quanteda::stopwords("en"),
    selection = "remove"
  )
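The wrapper function achieves the same result with a shorter call, and it accepts the custom my_stopwords vector defined above in the same way (a minimal sketch):

# equivalent call using the wrapper function
tokens(txt) |>
  tokens_remove(pattern = quanteda::stopwords("en"))

# remove the custom stopword list
tokens(txt) |>
  tokens_remove(pattern = my_stopwords)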
Pattern matching is central when compounding or selecting tokens. Let’s consider the following example: we might want to keep only “president”, “president’s”, and “presidential” in our tokens object. One option is to use fixed pattern matching and only keep the exact matches. We specify the pattern and valuetype in the tokens_select() function and determine whether to treat patterns as case-sensitive or case-insensitive.
Let’s go through this trio systematically. The pattern can be one or more unigrams or multi-word sequences. When including multi-word sequences, make sure to use the phrase() function, as in the short example below. case_insensitive specifies whether or not to ignore the case of terms when matching a pattern. The valuetype can take one of three arguments: "glob" for “glob”-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching.
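As a brief aside on multi-word patterns (a minimal sketch using a made-up sentence), wrapping a sequence in phrase() tells the tokens_* functions to treat it as one pattern spanning several tokens:

# keep only the two-token sequence "prime minister"
tokens("The prime minister spoke to the press.") |>
  tokens_keep(pattern = phrase("prime minister"))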
We start by explaining fixed pattern matching and the behaviour of case_insensitive before moving to “glob”-style pattern matching and matching based on regular expressions. We refer readers to Chapter 4 and Appendix C for details about regular expressions.
# create tokens object
toks_president <- tokens(
  "The President attended the presidential gala where the president's policies were applauded."
)

# fixed (literal) pattern matching
tokens_keep(toks_president,
            pattern = c("president", "presidential", "president's"),
            valuetype = "fixed")
The default pattern match is case_insensitive = TRUE. Therefore, President remains part of the tokens object even though the pattern includes president in lower case. We can change this behaviour by setting case_insensitive = FALSE in tokens_keep().
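The call below (reconstructed from the description above) adds this argument and produces the output shown:

# fixed pattern matching, this time case-sensitive
tokens_keep(toks_president,
            pattern = c("president", "presidential", "president's"),
            valuetype = "fixed",
            case_insensitive = FALSE)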
Tokens consisting of 1 document.
text1 :
[1] "presidential" "president's"
Now only presidential and president's are kept in the tokens object, while the term President is not captured since it does not match the term “president” when selecting tokens in a case-sensitive way.
* and ?: two “glob”-style matches to rule them all
Pattern matching in quanteda defaults to “glob”-style because it’s simpler than regular expression matching and suffices for the majority of user requirements. Moreover, it aligns with fixed pattern matching when wildcard characters (* and ?) aren’t utilised. The implementation in quanteda uses * to match any number of any characters including none, and ? to match any single character.
Let’s take a look at a few examples to explain the behaviour of “glob”-style pattern matching.
# match the token "president" and all terms starting with "president"
tokens_keep(toks_president, pattern = "president*", valuetype = "glob")
# match tokens starting with "p" and ending in "ing"
tokens("buying buy paying pay playing laying lay") |>
  tokens_keep(pattern = "p*ing", valuetype = "glob")
Tokens consisting of 1 document.
text1 :
[1] "paying" "playing"
# match tokens consisting of any single character followed by "ay"
tokens("buying buy paying pay playing laying lay") |>
  tokens_keep(pattern = "?ay", valuetype = "glob")
Tokens consisting of 1 document.
text1 :
[1] "pay" "lay"
# match tokens consisting of any single character, followed by "ay",
# and then zero or more further characters
# (matches "paying", "pay", "laying", and "lay", but not "playing")
tokens("buying buy paying pay playing laying lay") |>
  tokens_keep(pattern = "?ay*", valuetype = "glob")
If you want to have more control over pattern matches, we recommend regular expressions (valuetype = "regex"), which we explain in more detail in Appendix C.
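As a brief illustration (a minimal sketch; the caret anchors the match at the start of a token), the following keeps all tokens beginning with “president”:

# keep tokens beginning with "president", using a regular expression
tokens_keep(toks_president, pattern = "^president", valuetype = "regex")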
10.5 Stemming
The quanteda package includes the function tokens_wordstem(), a wrapper around wordStem() from the SnowballC package. The function uses Martin Porter’s (Porter 2001) algorithm described above. The example below shows how tokens_wordstem() adjusts various words.
# example texts applied to tokens
txt <- c(
  one = "eating eater eaters eats ate",
  two = "taxing taxis taxes taxed my tax return"
)

# create tokens object
tokens(txt)
Tokens consisting of 2 documents.
one :
[1] "eating" "eater" "eaters" "eats" "ate"
two :
[1] "taxing" "taxis" "taxes" "taxed" "my" "tax" "return"
# create tokens object and stem tokens
txt |>
  tokens() |>
  tokens_wordstem()
Tokens consisting of 2 documents.
one :
[1] "eat" "eater" "eater" "eat" "ate"
two :
[1] "tax" "taxi" "tax" "tax" "my" "tax" "return"
Lemmatisation is more complex than stemming since it does not rely on a simple set of pre-defined suffix rules. The spacyr package allows you to lemmatise a text corpus. We describe lemmatisation in the Advanced section below.
10.6 Advanced
10.6.1 Applying Different Tokenisers
quanteda contains several tokenisers, which can be applied in tokens(). Moreover, you can apply tokenisers included in other packages.
The current default tokeniser is word3, included in quanteda version 3 and above. For forward compatibility, there is also a word4 tokeniser, which is even smarter than the default and will be used in major version 4. You can apply the different tokenisers by specifying the what argument in tokens().
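A minimal sketch, assuming the tokeniser names described above are accepted by the what argument (check ?tokens in your installed version for the exact values):

# explicitly request the version-3 word tokeniser
tokens(txt, what = "word3")

# request the newer word4 tokeniser
tokens(txt, what = "word4")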
The tokenizers package includes additional tokenisers (Mullen et al. 2018). These tokenisers can also be applied and transformed to a quanteda tokens object.
# load the tokenizers package
library(tokenizers)

# tokenisation without processing
tokenizers::tokenize_words(txt) |>
  tokens()
Tokens consisting of 2 documents.
one :
[1] "eating" "eater" "eaters" "eats" "ate"
two :
[1] "taxing" "taxis" "taxes" "taxed" "my" "tax" "return"
# tokenisation with processing in both functions
tokenizers::tokenize_words(txt, lowercase = FALSE, strip_punct = FALSE) |>
  tokens(remove_symbols = TRUE)
Tokens consisting of 2 documents.
one :
[1] "eating" "eater" "eaters" "eats" "ate"
two :
[1] "taxing" "taxis" "taxes" "taxed" "my" "tax" "return"
10.6.2 Lemmatisation
While stemming works directly in quanteda using tokens_wordstem(), lemmatisation, i.e., changing tokens to their base forms, requires a different package. You can use the spacyr package, a wrapper around the spaCy Python library, to lemmatise a quanteda tokens object. Note that you will need to install Python and a virtual environment to use spaCy.
# load the spacyr package
library("spacyr")

# use spacy_install() to install spaCy in a new or existing
# virtual environment; check ?spacy_install for details

# initialise and use the English language model
spacy_initialize(model = "en_core_web_sm")
successfully initialized (spaCy Version: 3.7.2, language model: en_core_web_sm)
txt_compare <- c(
  one = "The cats are running quickly.",
  two = "The geese were flying overhead."
)

# parse texts, returning part-of-speech tags and lemmas
toks_spacy <- spacy_parse(txt_compare, pos = TRUE, lemma = TRUE)

# show the first 10 tokens, which are stored as a data frame
head(toks_spacy, 10)
doc_id sentence_id token_id token lemma pos entity
1 one 1 1 The the DET
2 one 1 2 cats cat NOUN
3 one 1 3 are be AUX
4 one 1 4 running run VERB
5 one 1 5 quickly quickly ADV
6 one 1 6 . . PUNCT
7 two 1 1 The the DET
8 two 1 2 geese geese NOUN NORP_B
9 two 1 3 were be AUX
10 two 1 4 flying fly VERB
# transform object to a quanteda tokens object and use the lemmas
as.tokens(toks_spacy, use_lemma = TRUE)
Tokens consisting of 2 documents.
one :
[1] "the" "cat" "be" "run" "quickly" "."
two :
[1] "the" "geese" "be" "fly" "overhead" "."
# compare with the Snowball stemmer
txt_compare |>
  tokens() |>
  tokens_wordstem()
Tokens consisting of 2 documents.
one :
[1] "The" "cat" "are" "run" "quick" "."
two :
[1] "The" "gees" "were" "fli" "overhead" "."
# finalise spaCy and terminate the Python process to free up memory
spacy_finalize()
The code above highlights the differences between stemming and lemmatisation. Stemming can truncate words, resulting in non-real words. Lemmatisation reduces words to their canonical, valid form. The word flying is stemmed to fli, while the lemmatiser changes the word to its base form, fly.
10.6.3 Modifying Stopword Lists
In many cases, you might want to use an existing stopword list but remove or add certain features. You can use quanteda’s char_remove() and base R’s c() function to remove or add features. The examples below show how to remove features from the default English stopword list.
# check if "will" is included in default stopword list"will"%in%stopwords("en")
[1] TRUE
# remove "will" and store output as new stopword liststopw_reduced <-char_remove(stopwords("en"), pattern ="will")# check whether "will" was removed"will"%in% stopw_reduced
[1] FALSE
We use c() from base R to add words to stopword lists. For example, the feature further is included in the default English stopword list, but furthermore and therefore are not included. Let’s add both terms.
# check if terms are included in the stopword list
c("furthermore", "therefore") %in% stopwords("en")
[1] FALSE FALSE
# extend the stopword list
stop_extended <- c(stopwords("en"), "furthermore", "therefore")

# check the last part of the character vector
tail(stop_extended)
As discussed above, tokenisation and processing involve many steps, and we can combine these steps using the base R pipe (|>). The example below shows a typical workflow.
# tokenise data_corpus_inaugural,
# remove punctuation and numbers,
# remove stopwords,
# stem the tokens,
# and transform the object to lowercase
toks_inaugural <- data_corpus_inaugural |>
  tokens(remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_remove(pattern = stopwords("en")) |>
  tokens_wordstem() |>
  tokens_tolower()

# inspect the first tokens from the first two speeches
head(toks_inaugural, 2)
Tokens consisting of 2 documents and 4 docvars.
1789-Washington :
[1] "fellow-citizen" "senat" "hous" "repres"
[5] "among" "vicissitud" "incid" "life"
[9] "event" "fill" "greater" "anxieti"
[ ... and 640 more ]
1793-Washington :
[1] "fellow" "citizen" "call" "upon" "voic" "countri"
[7] "execut" "function" "chief" "magistr" "occas" "proper"
[ ... and 50 more ]
The sequence of processing steps during tokenisation is important. For example, if we first stem our tokens and remove stopwords or specific patterns afterwards, we might not remove all desired features. Consider the following example:
txt <-"During my stay in London I visited the museumand attended a very good concert."# remove stopwords before stemming tokenstokens(txt, remove_punct =TRUE) %>%tokens_remove(stopwords("en")) %>%tokens_wordstem()
The first example produces what most users want: it removes all terms from our stopword list (during, my, I, the, and, a, very), while the second example first stems During to dure and very to veri, which changes the terms to tokens that are not included in stopwords("en") (and therefore remain in the tokens object).
10.6.4 Managing Document-Level Variables and Metadata
By default, tokens objects contain the document-level variables and the metadata assigned to your corpus. You can access or modify these variables in the same way as we did in Chapter 9.
# tokenise US inaugural speeches
toks_inaugural <- tokens(data_corpus_inaugural)

# add a document-level variable
toks_inaugural$post_1990 <- ifelse(
  toks_inaugural$Year > 1990, "Post-1990", "Pre-1990"
)

# inspect the new document-level variable
table(toks_inaugural$post_1990)
Post-1990 Pre-1990
8 51
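Object-level metadata carries over from the corpus in the same way and can be read or set with meta(); a minimal sketch, where the field name "notes" is made up for illustration:

# assign and retrieve metadata for the tokens object
meta(toks_inaugural, field = "notes") <- "Tokenised US inaugural addresses"
meta(toks_inaugural, field = "notes")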
10.7 Further Reading
The concept of tokenisation and how to build a custom tokeniser: Hvitfeldt and Silge (2021, ch. 2)
The intuition behind processing and tokenising texts: Grimmer, Roberts, and Stewart (2022, ch. 5.3)
Introduction to the tokenizers package: Mullen et al. (2018)
How processing decisions can influence results: Denny and Spirling (2018)
10.8 Exercises
Denny, Matthew W., and Arthur Spirling. 2018. “Text Preprocessing for Unsupervised Learning: Why It Matters, When It Misleads, and What to Do about It.” Political Analysis 26 (2): 168–89. https://doi.org/10.1017/pan.2017.44.
Grimmer, Justin, Margaret E. Roberts, and Brandon M. Stewart. 2022. Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton: Princeton University Press.
Hvitfeldt, Emil, and Julia Silge. 2021. Supervised Machine Learning for Text Analysis in R. Boca Raton: CRC Press. https://smltar.com.
Monroe, Burt L., and Philip A. Schrodt. 2008. “Introduction to the Special Issue: The Statistical Analysis of Political Text.” Political Analysis 16 (4): 351–55. https://doi.org/10.1093/pan/mpn017.
Mullen, Lincoln A., Kenneth Benoit, Os Keyes, Dmitry Selivanov, and Jeffrey Arnold. 2018. “Fast, Consistent Tokenization of Natural Language Text.” Journal of Open Source Software 3: 655. https://doi.org/10.21105/joss.00655.
Müller, Stefan. 2022. “The Temporal Focus of Campaign Communication.” The Journal of Politics 84 (1): 585–90. https://doi.org/10.1086/715165.
Porter, Martin F. 2001. “Snowball: A Language for Stemming Algorithms.”
Schofield, Alexandra, and David Mimno. 2016. “Comparing Apples to Apple: The Effects of Stemmers on Topic Models.” Transactions of the Association for Computational Linguistics 4: 287–300. https://doi.org/10.1162/tacl_a_00099.
Wilbur, W. John, and Karl Sirotkin. 1992. “The Automatic Identification of Stop Words.” Journal of Information Science 18 (1): 45–55. https://doi.org/10.1177/01655515920180010.