11  Advanced Token Manipulation

11.1 Objectives

The previous chapter introduced you to the basics of tokenisation and the processing of tokens objects. Now we move to advanced token manipulation. We show how to replace tokens and introduce n-grams and skip-grams. We explain why we often want to compound multi-word expressions into a single token. We also outline why you might want to keep specific tokens and a context window around these tokens, and close the chapter with a brief introduction to lookup functions, which we cover much more extensively in Chapter 16.

11.2 Methods

Occasionally, you might want to replace certain tokens. For example, in political texts, Labour spelled with a capital “L” usually implies a mention of the named entity Labour Party, while labour refers to the common noun. Country names are another example: if we want to understand mentions of the United States of America in UN General Debates, we could replace US, USA, and United States with united_states_of_america. Dictionaries, discussed in Chapter 16, are an effective tool for replacing tokens with an overarching “key”, such as united_states.
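
The tokens_replace() function, introduced below, can perform this kind of many-to-one standardisation directly. The following minimal sketch uses a made-up sentence; note that matching is case-insensitive by default, so we set case_insensitive = FALSE to avoid also replacing the pronoun “us”.

# minimal sketch: harmonise several spellings of the same country
toks_us <- tokens("The US, the USA, and the United States refer to the same country.")
tokens_replace(toks_us,
               pattern = phrase(c("US", "USA", "United States")),
               replacement = rep("united_states_of_america", 3),
               case_insensitive = FALSE) # avoid matching the pronoun "us"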

In some applications, token sequences might reveal more information than individual tokens. Before transforming a tokens object to a dfm, many existing studies create n-grams or skip-grams. N-grams are sequences of “n” items. Skip-grams are variations of n-grams which pair non-consecutive tokens. N-grams capture local word patterns, whereas skip-grams capture broader contexts within texts. The sentence I loved my stay in New York. would result in the following bi-grams (sequences of two tokens): "I_loved" "loved_my" "my_stay" "stay_in" "in_New" "New_York" "York_.". Skip-grams of size 2 with skips of 0 and 1 tokens would instead yield: "I_loved" "I_my" "loved_my" "loved_stay" "my_stay" "my_in" "stay_in" "stay_New" "in_New" "in_York" "New_York" "New_." "York_.".
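
You can reproduce this small example with tokens_ngrams() and tokens_skipgrams(), which we cover in detail below; a minimal sketch:

# minimal sketch: bi-grams and skip-grams for the example sentence
toks_ny <- tokens("I loved my stay in New York.")
tokens_ngrams(toks_ny, n = 2)
tokens_skipgrams(toks_ny, n = 2, skip = 0:1)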

These examples highlight the advantages and shortcomings of n-grams and skip-grams. On the one hand, both approaches provide information about the context of each token. On the other hand, n-grams and skip-grams increase the number of types (i.e., unique tokens) in our corpus. For example, the number of types in the corpus of US inaugural speeches more than doubles when creating bi-grams rather than uni-grams and triples when creating tri-grams instead of uni-grams. Instead of creating bi-grams or tri-grams, the manual or automated identification of meaningful multi-word expressions is often sufficient and may even be preferable to n-grams.

So far, we have treated all tokens as so-called uni-grams. We separated tokens by spaces and did not combine two or more tokens that might form a multi-word expression. In languages that form compound words, we do not need to pay much attention to multi-word expressions. For example, the German term “Mehrwertsteuer” is a single word, but its English equivalent consists of three words: “value added tax”. Suppose we are interested in companies’ or politicians’ focus on different forms of taxation. In that case, we want to treat value added tax as a multi-word expression rather than three separate tokens value, added, tax. Identifying multi-word expressions is especially important for document-feature matrices (dfm) (Chapter 12), which contain the counts of features in each document. If we do not explicitly compound value added tax, the words will be included as three separate tokens in our dfm. Compounding the expression during the tokenisation process will ensure that the dfm contains the compound noun value_added_tax.

In many cases, we know multi-word expressions through our domain knowledge. For example, in reviews about hotels in New York, we might want to compound New_York, Madison_Square_Garden, and Wall_Street. In parliamentary speeches, we may want to compound party names: instead of treating the combination green party as separate tokens, we might prefer the single compound token green_party before proceeding with our statistical analysis.

Users need to discover relevant multi-word expressions. We can use approaches such as keywords-in-context (Chapter 15) to explore the context of specific words, or conduct a collocation analysis to automatically identify terms that tend to co-occur. We introduce these methods in more detail later in the book. Having identified multi-word expressions, you can compound these collocations before continuing your textual analysis.
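
As a preview, the following minimal sketch assumes the companion package quanteda.textstats is installed; it identifies frequent two-word collocations in the inaugural speeches (min_count = 10 is an illustrative threshold) and compounds them.

library("quanteda.textstats")

# identify frequent two-word collocations in the inaugural speeches
toks_inaug <- data_corpus_inaugural |> 
    tokens(remove_punct = TRUE) |> 
    tokens_remove(pattern = stopwords("en"), padding = TRUE)
tstat_coll <- textstat_collocations(toks_inaug, size = 2, min_count = 10)

# compound the identified collocations before further analysis
toks_inaug_comp <- tokens_compound(toks_inaug, pattern = tstat_coll)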

Keeping tokens and their context windows is another effective—and sometimes underused—tokenisation operation. We can keep specific tokens and the words around these patterns to refine our research question and focus on specific aspects of our text corpus. Let’s imagine the following example: we are working with all speeches delivered in a parliament over a period of three decades and want to understand how parties’ attention to and positions on climate change have evolved. Most speeches in our corpus will focus on different policies or contain procedural language, but we could create a list of words and phrases relating to the environment and keep these terms with a context of several words. This approach allows for a “targeted” analysis: instead of analysing the full text corpus, we narrow down our documents to the parts relevant to our research question. For example, Lupia, Soroka, and Beatty (2020) limit U.S. Congressional speeches to sentences mentioning the National Science Foundation (NSF). Afterwards, the authors identify which of these context words distinguish Democrats from Republicans, and how the topics (Chapter 24) mentioned in these sentences are moderated by the party of a speaker. Rauh, Bes, and Schoonvelde (2020) extract mentions of European institutions and a context window of three sentences from speeches delivered by European prime ministers. In the next step, the authors measure speech complexity and sentiment in these statements on European institutions. Their results reveal that prime ministers tend to speak more favourably about the European Union when they face a strong Eurosceptic challenger party.

Finally, we briefly introduce the concept of looking up tokens, i.e., matching tokens against a predefined list. This approach requires users to develop “dictionaries”, consisting of one or more “keys” or categories. Each of these keys, in turn, contains various patterns, usually words or multi-word expressions. Sentiment analysis, covered in Chapter 16, often relies on lists of terms and phrases scored as “positive” or “negative” and involves looking up these tokens.

The classification of topics, policy areas, or concepts can also be conducted with a “lookup approach”. For example, Gessler and Hunger (2022) create a dictionary of keywords and phrases related to immigration. Afterwards, the authors apply this dictionary to party press releases and keep all documents containing keywords from their immigration dictionary. Their rule-based approach is more computationally efficient than supervised classification and produces valid results. Subsequent analyses apply scaling methods to this subset of immigration-related press releases to understand how the 2015 “refugee crisis” in Europe changed party positions on migration policy. Chapter 16 provides more details on creating and applying dictionaries.
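
A minimal sketch of such dictionary-based subsetting might look as follows; the corpus object corp_press and the keywords are hypothetical placeholders.

# hypothetical immigration dictionary (illustrative keywords only)
dict_immig <- dictionary(list(immigration = c("immigra*", "asylum", "refugee*")))

# look up dictionary matches in the tokenised press releases
toks_press <- tokens(corp_press)
toks_matched <- tokens_lookup(toks_press, dictionary = dict_immig)

# keep only documents containing at least one dictionary match
corp_immig <- corpus_subset(corp_press, ntoken(toks_matched) > 0)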

11.3 Examples

In this section, we rely on short sentences and text corpora of political speeches and hotel reviews to explain how to replace tokens, how to create n-grams and skip-grams, how to compound multi-word expressions, and how to select tokens and their context.

11.3.1 Replacing and Looking Up Tokens

In some cases, users may want to substitute tokens. Reasons for replacing tokens include standardising terms, accounting for synonyms or acronyms, and fixing typographical errors. For example, it may be reasonable to harmonise “EU” and “European Union” in political texts. The function tokens_replace() allows us to conduct one-to-one matching and replace European Union with EU (or vice versa).

toks_eu_uk <- tokens("The European Union negotiated with the UK.")

# important: use phrase if you want to detect a multi-word expression
tokens_replace(toks_eu_uk, 
               pattern = phrase("European Union"),
               replacement = "EU")
Tokens consisting of 1 document.
text1 :
[1] "The"        "EU"         "negotiated" "with"       "the"       
[6] "UK"         "."         
# we can also replace "UK" with a multi-word expression "United Kingdom"
tokens_replace(toks_eu_uk, 
               pattern = "UK", 
               replacement = phrase("United Kingdom"))
Tokens consisting of 1 document.
text1 :
[1] "The"        "European"   "Union"      "negotiated" "with"      
[6] "the"        "United"     "Kingdom"    "."         
# if we want to treat United Kingdom and European Union 
# as multi-word expressions across all texts,
# we can compound it after the replacement
toks_eu_uk |> 
    tokens_replace(pattern = "UK", replacement = phrase("United Kingdom")) |> 
    tokens_compound(pattern = phrase(c("United Kingdom",
                                       "European Union")))
Tokens consisting of 1 document.
text1 :
[1] "The"            "European_Union" "negotiated"     "with"          
[5] "the"            "United_Kingdom" "."             
How to declare multi-word expressions

We need to declare explicitly when we work with multi-word expressions. The phrase() function declares a pattern to be a sequence of separate patterns. By using phrase() you make explicit that the elements should be used for matching multi-word sequences rather than individual matches to single words. It is vital to use phrase() in all functions involving multi-word expressions, including tokens_compound().1

# make phrases from characters
phrase(c("natural language processing"))
[[1]]
[1] "natural"    "language"   "processing"
# show that replacements of multi-word expressions
# require phrase()
tokens("quantitative text analysis with quanteda") |> 
    tokens_replace(pattern = phrase(c("quantitative text analysis")),
                   replacement = "QTA")
Tokens consisting of 1 document.
text1 :
[1] "QTA"      "with"     "quanteda"
# replacement does not work without phrase()
tokens("quantitative text analysis with quanteda") |> 
    tokens_replace(pattern = "quantitative text analysis",
                   replacement = "QTA")
Tokens consisting of 1 document.
text1 :
[1] "quantitative" "text"         "analysis"     "with"         "quanteda"    

More common than one-to-one replacements is the conversion of tokens into equivalence classes defined by the values of a dictionary object. Dictionaries, covered in much greater detail in Chapter 16, allow us to look up uni-grams or multi-word expressions and replace these terms with the dictionary “key”. We introduce the intuition behind tokens_lookup() with a simple example. The example below replaces selected European institutions with their dictionary key eu_institution.

# create a dictionary (covered more extensively in Chapter 16)
dict_euinst <- dictionary(
    list(eu_institution = c("european commission", "ecb")))

# tokenise a sentence 
toks_eu <- tokens("The European Commission is based in Brussels 
                  and the ECB in Frankfurt.")

# look up institutions (default behaviour)
tokens_lookup(toks_eu, dictionary = dict_euinst)
Tokens consisting of 1 document.
text1 :
[1] "eu_institution" "eu_institution"
# show unmatched tokens
tokens_lookup(toks_eu, dictionary = dict_euinst,
              nomatch = "_UNMATCHED")
Tokens consisting of 1 document.
text1 :
 [1] "_UNMATCHED"     "eu_institution" "_UNMATCHED"     "_UNMATCHED"    
 [5] "_UNMATCHED"     "_UNMATCHED"     "_UNMATCHED"     "_UNMATCHED"    
 [9] "eu_institution" "_UNMATCHED"     "_UNMATCHED"     "_UNMATCHED"    

By default, unmatched tokens are omitted, but we can assign a custom term to unmatched tokens. What is more, we can use tokens_lookup() as a more sophisticated form of tokens_replace(): setting exclusive = FALSE in tokens_lookup() replaces dictionary matches but leaves the other features unaffected.

# replace dictionary matches and keep other features
tokens_lookup(toks_eu, 
              dictionary = dict_euinst,
              exclusive = FALSE)
Tokens consisting of 1 document.
text1 :
 [1] "The"            "EU_INSTITUTION" "is"             "based"         
 [5] "in"             "Brussels"       "and"            "the"           
 [9] "EU_INSTITUTION" "in"             "Frankfurt"      "."             

11.3.2 Pattern Matching: pattern, valuetype, and case_insensitive

Pattern matching is central when compounding or selecting tokens. Let’s consider the following example: we might want to keep only “president”, “president’s”, and “presidential” in our tokens object. One option is to use fixed pattern matching and keep only the exact matches. We specify the pattern and valuetype in the tokens_select() function and determine whether patterns should be treated as case-sensitive or case-insensitive.

Let’s go through this trio systematically. The pattern can be one or more unigrams or multi-word sequences. When including multi-word sequences, make sure to use the phrase() function as described above. case_insensitive specifies whether or not to ignore the case of terms when matching a pattern. The valuetype argument can take one of three values: "glob" for “glob”-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching.

We start by explaining fixed pattern matching and the behaviour of case_insensitive before moving to “glob”-style pattern matching and matching based on regular expressions. We refer readers to Chapter 4 and Appendix C for details about regular expressions.

# create tokens object
toks_president <- tokens("The President attended the presidential gala
                         where the president's policies were applauded.")

# fixed (literal) pattern matching
tokens_keep(toks_president, pattern = c("president", "presidential",
                                        "president's"),
            valuetype = "fixed")
Tokens consisting of 1 document.
text1 :
[1] "President"    "presidential" "president's" 

The default pattern match is case_insensitive = TRUE. Therefore, President remains part of the tokens object even though the pattern includes president in lower case. We can change this behaviour by setting tokens_keep(x, case_insensitive = FALSE).

# fixed (literal) pattern matching: case-sensitive
tokens_keep(toks_president, pattern = c("president", "presidential",
                                        "president's"),
            valuetype = "fixed",
            case_insensitive = FALSE)
Tokens consisting of 1 document.
text1 :
[1] "presidential" "president's" 

Now only presidential and president's are kept in the tokens object, while the term President is not captured since it does not match the pattern “president” when selecting tokens in a case-sensitive way.

* and ?: two “glob”-style matches to rule them all

Pattern matching in quanteda defaults to “glob”-style because it is simpler than regular expression matching and suffices for most user requirements. Moreover, it is equivalent to fixed pattern matching when the wildcard characters (* and ?) are not used. The implementation in quanteda uses * to match any number of characters, including none, and ? to match exactly one character. Let’s take a look at a few examples to explain the behaviour of wildcard pattern matches.

# match the token "president" and all terms starting with "president"
tokens_keep(toks_president, pattern = "president*",
            valuetype = "glob")
Tokens consisting of 1 document.
text1 :
[1] "President"    "presidential" "president's" 
# match tokens ending in "ing"
tokens("buying buy paying pay playing laying lay") |> 
    tokens_keep(pattern = "*ing", valuetype = "glob")
Tokens consisting of 1 document.
text1 :
[1] "buying"  "paying"  "playing" "laying" 
# match tokens starting with "p" and ending on "ing"
tokens("buying buy paying pay playing laying lay") |> 
    tokens_keep(pattern = "p*ing", valuetype = "glob")
Tokens consisting of 1 document.
text1 :
[1] "paying"  "playing"
# match tokens consisting of one character followed by "ay"
tokens("buying buy paying pay playing laying lay") |> 
    tokens_keep(pattern = "?ay", valuetype = "glob")
Tokens consisting of 1 document.
text1 :
[1] "pay" "lay"
# match tokens consisting of one character, followed by "ay" and zero or more characters
tokens("buying buy paying pay playing laying lay") |> 
    tokens_keep(pattern = "?ay*", valuetype = "glob")
Tokens consisting of 1 document.
text1 :
[1] "paying" "pay"    "laying" "lay"   

If you want to have more control over pattern matches, we recommend regular expressions (valuetype = "regex").
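
For instance, the minimal sketch below uses an anchored regular expression of our own to keep “president” and “presidential” while excluding “president’s”:

# regex matching: keep "president" and "presidential" only
tokens_keep(toks_president, pattern = "^president(ial)?$",
            valuetype = "regex")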

11.3.3 N-Grams and Skip-Grams

You can create n-grams and skip-grams of various lengths using tokens_ngrams() and tokens_skipgrams(). While using these functions is fairly straightforward, users need to decide whether to remove patterns before concatenating tokens and need to determine the size of the n-grams and/or skips. We describe these options below. First, we create n-grams and skip-grams of various sizes. Then, we combine skip-grams and n-grams in the same function, and finally we show how the output changes if we process a tokens object before constructing n-grams.

# tokenise a sentence
toks_social <- tokens("We should consider increasing social welfare payments.")

# form n-grams of size 2
tokens_ngrams(toks_social, n = 2)
Tokens consisting of 1 document.
text1 :
[1] "We_should"           "should_consider"     "consider_increasing"
[4] "increasing_social"   "social_welfare"      "welfare_payments"   
[7] "payments_."         
# form n-grams of size 3
tokens_ngrams(toks_social, n = 3)
Tokens consisting of 1 document.
text1 :
[1] "We_should_consider"         "should_consider_increasing"
[3] "consider_increasing_social" "increasing_social_welfare" 
[5] "social_welfare_payments"    "welfare_payments_."        
# form n-grams of size 2 and 3
tokens_ngrams(toks_social, n = 2:3)
Tokens consisting of 1 document.
text1 :
 [1] "We_should"                  "should_consider"           
 [3] "consider_increasing"        "increasing_social"         
 [5] "social_welfare"             "welfare_payments"          
 [7] "payments_."                 "We_should_consider"        
 [9] "should_consider_increasing" "consider_increasing_social"
[11] "increasing_social_welfare"  "social_welfare_payments"   
[ ... and 1 more ]
# form skip-grams of size 2 and skip 1 token
tokens_skipgrams(toks_social, n = 2, skip = 1)
Tokens consisting of 1 document.
text1 :
[1] "We_consider"        "should_increasing"  "consider_social"   
[4] "increasing_welfare" "social_payments"    "welfare_."         
# form skip-grams of size 2 and skip 1 and 2 tokens
tokens_skipgrams(toks_social, n = 2, skip = 1:2)
Tokens consisting of 1 document.
text1 :
 [1] "We_consider"         "We_increasing"       "should_increasing"  
 [4] "should_social"       "consider_social"     "consider_welfare"   
 [7] "increasing_welfare"  "increasing_payments" "social_payments"    
[10] "social_."            "welfare_."          
# form n-grams of size 2 with a skip of 1 token (same result as tokens_skipgrams() above)
tokens_ngrams(toks_social, n = 2, skip = 1)
Tokens consisting of 1 document.
text1 :
[1] "We_consider"        "should_increasing"  "consider_social"   
[4] "increasing_welfare" "social_payments"    "welfare_."         
# remove stopwords and punctuation before creating n-grams
toks_social |> 
    tokens(remove_punct = TRUE) |> 
    tokens_remove(pattern = stopwords("en")) |> 
    tokens_ngrams(n = 2)
Tokens consisting of 1 document.
text1 :
[1] "consider_increasing" "increasing_social"   "social_welfare"     
[4] "welfare_payments"   

The examples above underscore that several combinations do not add much value to the context in which words appear. Many types are simply combinations of tokens and stopwords. In our experience, creating skip-grams or n-grams for all documents without any prior processing decisions rarely improves an analysis or its results.

It is worth keeping in mind that n-grams applied to larger corpora inflate the number of types. We showcase the increase in types based on our corpus of 59 US inaugural speeches.

# number of types with uni-grams and no processing
data_corpus_inaugural |> 
    tokens() |> 
    ntype() |> 
    sum()
[1] 47494
# number of types with n-grams of size 2 
# after removing stopwords and punctuation characters
data_corpus_inaugural |> 
    tokens(remove_punct = TRUE) |> 
    tokens_remove(pattern = stopwords("en")) |> 
    tokens_ngrams(n = 2) |> 
    ntype() |> 
    sum()
[1] 63971
# number of types with n-grams of size 2 and no processing
data_corpus_inaugural |> 
    tokens() |> 
    tokens_ngrams(n = 2) |> 
    ntype() |> 
    sum()
[1] 118939
# number of types with n-grams of size 3 and no processing
data_corpus_inaugural |> 
    tokens() |> 
    tokens_ngrams(n = 3) |> 
    ntype() |> 
    sum()
[1] 144825
When to pay attention to very sparse objects

An increase in types through n-grams increases the sparsity of a document-feature matrix, i.e., the proportion of cells that have zero counts (Chapter 12). The sparsity of the US inaugural speeches (data_corpus_inaugural) increases from 92% to 96.9% when using bi-grams instead of uni-grams. While quanteda handles sparse document-feature matrices very efficiently, a very high sparsity might result in convergence issues for unsupervised scaling models (Chapter 22) or topic models (Chapter 24). Therefore, n-grams or skip-grams may be counterproductive for some research questions.
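
You can check this yourself with quanteda’s sparsity() function; a minimal sketch:

# compare dfm sparsity for uni-grams and bi-grams
data_corpus_inaugural |> 
    tokens() |> 
    dfm() |> 
    sparsity()

data_corpus_inaugural |> 
    tokens() |> 
    tokens_ngrams(n = 2) |> 
    dfm() |> 
    sparsity()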

11.3.4 Compounding Tokens

Before transforming a tokens object into a document-feature matrix (Chapter 12), we often want or need to compound multi-word expressions. Compounded phrases will be treated as a single feature in subsequent analyses. Let’s explore how to compound the multi-word expressions “social welfare” and “social security”. As mentioned above, we need to explicitly declare multi-word expressions with the phrase() function.

# create tokens object for examples
toks_social <- tokens("We need to increase social welfare payments 
                      and improve social security.")

# compound the pattern "social welfare"
toks_social |> 
    tokens_compound(pattern = phrase("social welfare"))
Tokens consisting of 1 document.
text1 :
 [1] "We"             "need"           "to"             "increase"      
 [5] "social_welfare" "payments"       "and"            "improve"       
 [9] "social"         "security"       "."             
# note: compounding does not work without phrase()
toks_social |> 
    tokens_compound(pattern = "social welfare")
Tokens consisting of 1 document.
text1 :
 [1] "We"       "need"     "to"       "increase" "social"   "welfare" 
 [7] "payments" "and"      "improve"  "social"   "security" "."       

We can include several patterns in our character vector containing multi-word expressions.

# compound two patterns
tokens_compound(toks_social,
                pattern = phrase(c("social welfare",
                                   "social security")))
Tokens consisting of 1 document.
text1 :
 [1] "We"              "need"            "to"              "increase"       
 [5] "social_welfare"  "payments"        "and"             "improve"        
 [9] "social_security" "."              
Setting the concatenator

By default, compounded tokens are concatenated using an underscore (_). The default is recommended since underscores will not be removed during normal cleaning and tokenisation. Using an underscore as a separator also allows you to check whether compounding worked as expected.
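
If you do need a different separator, tokens_compound() accepts a concatenator argument; a minimal sketch (we keep the default underscore throughout the rest of the chapter):

# compound with a custom concatenator (here: "+")
tokens_compound(toks_social, 
                pattern = phrase("social welfare"),
                concatenator = "+")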

# check whether compounding worked as expected 
# by extracting patterns containing underscores
toks_social |> 
    tokens_compound(pattern = phrase(c("social welfare", 
                                       "social security"))) |> 
    tokens_keep(pattern = "*_*") # keep patterns with underscores
Tokens consisting of 1 document.
text1 :
[1] "social_welfare"  "social_security"

You can also compound terms based on regular expressions (Chapter 4 and Appendix C) or wildcard pattern matches. Below, we use the glob-style wildcard expression * to compound all multi-word expressions starting with “social” in US State of the Union speeches.

# tokenise SOTU speeches, remove punctuation and numbers
# before removing stopwords
toks_sotu <- TAUR::data_corpus_sotu |> 
    tokens(remove_punct = TRUE,
           remove_numbers = TRUE) |> 
    tokens_remove(pattern = stopwords("en"),
                  padding = TRUE)

# compound all phrases starting with "social"
toks_sotu_comp <- tokens_compound(toks_sotu, pattern = phrase("social *"))

# spot-check results by keeping all tokens starting 
# with social using "glob"-style wildcard pattern match
# and create dfm to check compounded terms
tokens_keep(toks_sotu_comp, pattern = "social_*") |> 
    dfm() |> 
    topfeatures(n = 15) # get 15 most frequent compounded tokens
    social_security     social_problems     social_services        social_needs 
                229                  16                  16                  13 
        social_life        social_order     social_progress      social_welfare 
                 10                   9                   9                   8 
    social_programs       social_system  social_intercourse    social_condition 
                  8                   7                   6                   6 
      social_change    social_insurance social_institutions 
                  6                   6                   5 

11.3.5 Selecting Tokens within Windows

Isolating specific tokens within a defined window of words can refine many research questions. For example, we could keep the term room and a context of ±4 tokens in the corpus of hotel reviews. This approach might provide first descriptive insights into aspects the customers really (dis-)liked about their hotel room.

# tokenise and process the corpus of hotel reviews
toks_hotels <- tokens(TAUR::data_corpus_TAhotels,
                      remove_punct = TRUE,
                      remove_numbers = TRUE,
                      padding = TRUE)

# keep "room*" and its context of ±3 tokens
toks_room <- toks_hotels |> 
    tokens_remove(pattern = stopwords("en"),
                  padding = TRUE) |> 
    tokens_keep(pattern = "room*", 
                window = 4, padding = TRUE)

# inspect the first four hotel reviews
print(toks_room, max_ndoc = 4)
Tokens consisting of 20,491 documents and 1 docvar.
text1 :
 [1] "" "" "" "" "" "" "" "" "" "" "" ""
[ ... and 86 more ]

text2 :
 [1] "" "" "" "" "" "" "" "" "" "" "" ""
[ ... and 260 more ]

text3 :
 [1] "nice"       "rooms"      ""           ""           ""          
 [6] "experience" ""           ""           ""           ""          
[11] ""           ""          
[ ... and 226 more ]

text4 :
 [1] "" "" "" "" "" "" "" "" "" "" "" ""
[ ... and 92 more ]

[ reached max_ndoc ... 20,487 more documents ]
# transform tokens object into a document-feature matrix (dfm) and 
# get 30 most frequent words surrounding "room*" using topfeatures()
toks_room |> 
    dfm() |>
    dfm_remove(pattern = "") |> # remove padding placeholder
    topfeatures(n = 30)
       room       rooms       hotel       clean        nice       small 
      34404       12061        5520        4766        2908        2694 
      great       staff         n't     service       floor        view 
       2632        2474        2414        2343        2181        2153 
       good comfortable         bed    bathroom       large   breakfast 
       2107        1961        1675        1596        1595        1593 
      night      stayed    spacious        stay    location         got 
       1426        1415        1333        1314        1273        1267 
        day        just        size        beds      booked    friendly 
       1216        1180        1160        1050        1002         995 
To pad or not to pad?

Padding means leaving an empty string where removed tokens previously existed. Padding can be useful when we remove certain patterns but (1) still want to know the positions of the tokens that remain in the corpus, or (2) when we select tokens and their context window. The examples below highlight the differences.

toks <- tokens("We're having a great time at the
               pool and lovely food in the restaurant.")

# keep great, lovely and a context window of ±1 tokens
# without padding
tokens_keep(toks, pattern = c("great", "lovely"),
            window = 1,
            padding = FALSE)
Tokens consisting of 1 document.
text1 :
[1] "a"      "great"  "time"   "and"    "lovely" "food"  
# keep great, lovely and a context window of ±1 tokens
# with padding
tokens_keep(toks, pattern = c("great", "lovely"),
            window = 1,
            padding = TRUE)
Tokens consisting of 1 document.
text1 :
 [1] ""       ""       "a"      "great"  "time"   ""       ""       ""      
 [9] "and"    "lovely" "food"   ""      
[ ... and 3 more ]

11.4 Advanced

Next, we provide an overview of advanced tokens operations: splitting and chunking tokens. Both can be useful in some contexts, but tend to be used less frequently than the operations discussed so far.

11.4.1 Splitting

Splitting tokens means dividing one token into multiple tokens. The function tokens_split() splits a tokens object by a separator pattern, effectively reversing the operation of tokens_compound(). The example below shows how to undo a compounding operation.

toks <- tokens("Value added tax is a multi-word expression.")

# compound value added tax
toks_vat <- tokens_compound(toks, 
                            pattern = phrase("value added tax*"), 
                            concatenator = "_")
toks_vat
Tokens consisting of 1 document.
text1 :
[1] "Value_added_tax" "is"              "a"               "multi-word"     
[5] "expression"      "."              
# reverse compounding using "_" as the separator 
# for splitting tokens
tokens_split(toks_vat, separator = "_")
Tokens consisting of 1 document.
text1 :
[1] "Value"      "added"      "tax"        "is"         "a"         
[6] "multi-word" "expression" "."         

11.4.2 Chunking

In some applications, we may be interested in dividing our texts into equally sized segments or chunks. You might be working with a set of very long documents which cannot be segmented into smaller units, such as paragraphs or sentences, due to missing delimiters (see Chapter 9 on using corpus_reshape() or corpus_segment()). Some methods, such as topic models (Chapter 24), work better when the documents have similar lengths. The function tokens_chunk() segments a tokens object into chunks of a given size. The overlap argument specifies how many tokens to carry over from the preceding chunk. We use the first hotel review of data_corpus_TAhotels and divide it into smaller chunks with and without overlaps.

# tokenise the first hotel review
toks_r1 <- tokens(TAUR::data_corpus_TAhotels[1])

# print first 15 tokens
print(toks_r1, max_ntoken = 15)
Tokens consisting of 1 document and 1 docvar.
text1 :
 [1] "nice"        "hotel"       "expensive"   "parking"     "got"        
 [6] "good"        "deal"        "stay"        "hotel"       "anniversary"
[11] ","           "arrived"     "late"        "evening"     "took"       
[ ... and 83 more ]
# chunk into chunks of 5 tokens without overlap
toks_r1_chunk <- tokens_chunk(toks_r1, size = 5)

# inspect chunked tokens object
print(toks_r1_chunk, max_ndoc = 3)
Tokens consisting of 20 documents and 1 docvar.
text1.1 :
[1] "nice"      "hotel"     "expensive" "parking"   "got"      

text1.2 :
[1] "good"        "deal"        "stay"        "hotel"       "anniversary"

text1.3 :
[1] ","       "arrived" "late"    "evening" "took"   

[ reached max_ndoc ... 17 more documents ]
# chunk into chunks of 5 tokens with overlap and inspect result
tokens_chunk(toks_r1, size = 5, overlap = 2) |> 
    print(max_ndoc = 3)
Tokens consisting of 33 documents and 1 docvar.
text1.1 :
[1] "nice"      "hotel"     "expensive" "parking"   "got"      

text1.2 :
[1] "parking" "got"     "good"    "deal"    "stay"   

text1.3 :
[1] "deal"        "stay"        "hotel"       "anniversary" ","          

[ reached max_ndoc ... 30 more documents ]

As always, this example serves only for illustration purposes. Usually, the selected chunks would be larger than five tokens in order to mirror the length of “typical” documents, such as sentences or paragraphs.

11.5 Further Reading

  • Examples of targeted analyses
  • Part-of-speech tagging
  • spacy

11.6 Exercises

Add some here.


  1. tokens_lookup(), which handles phrases internally, is an exception to the rule.↩︎