12  Building Document-Feature Matrices

12.1 Objectives

Collecting documents in a text corpus, changing the unit of analysis in this corpus (if necessary), and tokenising and processing texts are central aspects of every text analysis project. Once the tokens object has been processed, many statistical analyses of texts require a document-feature matrix (dfm), a mathematical matrix that describes the frequency of terms (e.g., words or phrases) occurring in a collection of documents.

In this chapter, we introduce you to the assumptions we make when creating and analysing texts through document-feature matrices. We describe the nature of a dfm object and the importance of sparsity. You will learn how to create a dfm, how to subset or group a dfm, how to select features, and how to weight, trim, and smooth a dfm. We also show how to match dfms for machine learning approaches, and how to convert dfm objects for further use in other R packages. Finally, we explain the intuition behind feature co-occurrence matrices (fcm), when and why to use them, and how to construct an fcm from a quanteda tokens object. At the end of the chapter, readers will have a solid understanding of creating dfms and fcms, the workhorses for most statistical analyses of textual data.

12.2 Methods

The document-feature matrix is the core object of many analyses of texts. It is a structured representation of textual data in the form of a matrix. After tokenising texts, we treat “text as data” by tabulating the counts of features by documents. Each row in a dfm represents a document, while each column represents a unique feature from across all the documents. The cells of the matrix indicate the frequency of a feature in each document. By converting raw texts into a matrix, we move textual data into the same format as many other forms of quantitative data. This format allows us to apply various statistical techniques and machine learning tools. Many of these methods have well-understood properties that allow us to generate probability statements and calculate measures of uncertainty (Benoit 2020).

Why document-feature matrix rather than document-term matrix?

Many textbooks and software packages speak of “document-term matrices”. We prefer the name “document-feature matrix” since a dfm contains not only terms but can also include features such as punctuation characters, numbers, symbols, or emojis. “Document-term matrix” implies that such features are uninformative. However, as we describe in Chapter 10, punctuation characters, numbers, or emojis can provide relevant information about texts and should remain in the matrix representation of the text corpus.

Document-feature matrices allow us to analyse texts systematically. Somewhat ironically, we first need to destroy the structure of texts and make it impossible to “read” the documents. While we still know the order and context of words in tokens objects, document-feature matrices only provide information about the frequencies of features in a document. We no longer know at which position in a given document a certain feature appeared. Yet, only this oversimplification of texts allows us to apply statistical methods and analyse “text as data.”

Note: Maybe add a few sentences on numerical representations of real-world phenomena, such as survey responses, economic indicators, or conflict data (see Benoit 2020: 464-465)?

Let us explain the structure of a dfm with a simple example. Consider two short documents:

  • Document 1: “I love quanteda. I love text analysis.”
  • Document 2: “Text analysis with quanteda is fun.”

A matrix representation of these documents would look as follows:

Table 12.1: A document-feature matrix without tokens removal, but lower-casing of all features
        i  love  quanteda  .  text  analysis  with  is  fun
Doc 1   2     2         1  2     1         1     0   0    0
Doc 2   0     0         1  1     1         1     1   1    1

Table 12.1 shows a dfm consisting of two rows (one per document), and each feature has its own column. The numbers report how often a feature appeared in a document. We can treat this dfm like any other quantitative data set. We can no longer tell at which positions “love” appeared in Document 1, but we can immediately see that this feature appeared twice in Document 1 and that “love” was not included in Document 2. Since we only know feature frequencies across documents but not their precise positions within a document, we call this process of transforming texts into a matrix the “bag-of-words” approach.

Maybe add a few more general sentences about dfms here.

Let’s clarify a few key terms by inspecting Table 12.1. First, the dfm consists of 2 documents and 9 features. Document 1 contains 6 types (i.e., unique tokens): i, love, quanteda, ., text, and analysis. Document 1 consists of 9 tokens since i, love, and . each appear twice, while quanteda, text, and analysis appear only once. In Document 2, the number of types is identical to the number of tokens: the processed document contains 7 tokens, and each token appears only once. The dfm has a sparsity of 27.8% since 5 of the 18 cells contain zero counts.

The number of types, tokens, and features, and a dfm’s sparsity, depend on the processing steps we implemented during tokenisation (Chapter 10). For example, if we conduct the common steps of removing stopwords and punctuation characters, the document-feature matrix will be considerably smaller. Table 12.2 shows the dfm after removing punctuation characters and the words “i”, “with”, and “is”, which are included in most stopword lists. Grimmer, Roberts, and Stewart (2022, 57–59) provide more examples of why the “default” processing steps might be problematic.

Table 12.2: A document-feature matrix after removing punctuation characters and stopwords
        love  quanteda  text  analysis  fun
Doc 1      2         1     1         1    0
Doc 2      0         1     1         1    1

The sparsity was reduced from 28% to 20%, and the number of columns decreased from nine to five. At the same time, removing these stopwords and punctuation characters did not result in a considerable loss of information about the content of each document. It is important to repeat, however, that stopword removal is not always recommended and that you should inspect the features included in a stopword list. Here, we removed the pronoun “I”. In many cases, the removal of pronouns will not be an issue, but in other cases pronouns can tell us a lot about a text. For instance, Crisp et al. (2021) show that Japanese candidates mention first-person pronouns more often in their personal manifestos if they face stronger intra-party competition. The usage of pronouns thus signals candidates’ campaign strategies and their degree of personalisation.

Selecting features in your tokens object rather than the dfm

We strongly recommend conducting all processing steps that involve the removal of features during the tokenisation step rather than applying the same functions to a dfm. Multi-word expressions can easily be detected in a tokens object, but this information is no longer available in a dfm. Similarly, we strongly recommend applying dictionaries (Chapter 16) to the tokens object to identify possible multi-word expressions correctly.

There are, however, a few processing steps that rely on feature frequencies. Users might want to remove words appearing in fewer or more than a certain number or percentage of documents. These so-called trimming operations, covered in detail in the Applications section, only work on dfm objects since we need the word counts to filter by frequencies.

The dfm in Table 12.1 reports counts of features. This is the default when a dfm is created. Oftentimes, we might prefer weighting feature frequencies. For example, many users might want to normalise a dfm by calculating the proportion of feature counts within each document. This process is called document normalisation as it homogenises the counts for each document. As such, it addresses the (potential) issue of varying document lengths in a corpus, allowing us to compare relative frequencies instead of raw counts. For example, two mentions of “terrible” in a short hotel review consisting of 100 words carry a larger weight than three mentions of “terrible” in a review of 500 words. After normalising the documents, the relative frequency of “terrible” would be 0.02 (or 2%) in the short review and 0.006 (0.6%) in the longer review, highlighting the much higher prevalence of this negative term in the first review.
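
A minimal sketch of document normalisation using a made-up two-review example (dfm_weight() is covered in more detail in the Applications section):

# toy example: relative frequencies of features within each document
toks_reviews <- tokens(c(review1 = "terrible room terrible service",
                         review2 = "terrible location but friendly staff"))

dfm_weight(dfm(toks_reviews), scheme = "prop")
# "terrible": 2/4 = 0.5 in review1 and 1/5 = 0.2 in review2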

Boolean weighting is another type of weighting. Applying boolean weights implies that all non-zero counts are recoded as 1. This weighting scheme is useful when users simply want to know whether or not a feature appeared in a document, without being interested in the absolute frequency of the term.

A popular, and slightly more complex, weighting scheme is called tf-idf, or term frequency-inverse document frequency. Tf-idf is a very popular approach in information retrieval and is often used in machine learning (Chapter 23). Tf-idf up-weights terms that appear more often in a given document and down-weights terms that are common across documents. As a term appears in more documents, its weight approaches zero. While tf-idf is a very useful approach for identifying “unique” words that define a given document, it may be problematic when working with a topic-specific text corpus. Tf-idf will assign low weights to words appearing across documents. When we apply tf-idf to a corpus of debates on climate protection, it will eliminate climate-related features even if they appear with different frequencies across documents. To sum up, while tf-idf can be useful for certain classification tasks, users should proceed with great caution when applying tf-idf weighting.

Let’s go through a simple example to understand how to calculate tf-idf scores. Similar to the hotel review example above, suppose Document 1 contains the word “horrible” 2 times and consists of 100 tokens, while Document 2 contains “horrible” 3 times but consists of 500 tokens. We use quanteda’s default calculation in this example.

To calculate tf-idf, we need to know the term frequencies (tf) and the inverse document frequencies (idf):

\[ \text{tf-idf} = \text{tf} \times \text{idf} \]

When using raw counts as term frequencies, Document 1 has \(tf = 2\) for “horrible”, while the \(tf\) of “horrible” in Document 2 equals \(3\).

Note that we can also calculate tf with normalised frequencies, i.e.,

\[ \text{tf} = \frac{\text{Number of times term appears in the document}}{\text{Total number of terms in the document}} \]

In this case, the \(tf\) for Document 1 would be \(\frac{2}{100}\) and \(\frac{3}{500}\) for Document 2. The scheme_tf argument in quanteda’s dfm_tfidf() allows you to select count or prop.

Next we calculate the inverse document frequency (idf) as:

\[ \text{idf} = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents with the term}}\right) \]

Since “horrible” appears in both Document 1 and Document 2:

\[ \text{idf for "horrible"} = \log\left(\frac{2}{2}\right) = \log(1) = 0 \]

In the last step, we multiply each document’s \(tf\) for “horrible” with the \(idf\) of the same term.

\[ \text{tf-idf} = \text{tf} \times \text{idf} \]

For Document 1: \(\text{tf-idf for "horrible" in Document 1} = 2 \times 0 = 0\)

For Document 2: \(\text{tf-idf for "horrible" in Document 2} = 3 \times 0 = 0\)

If Document 1 contained “horrible” 2 times, but Document 2 did not mention “horrible” at all, the \(idf\) would change to \(\text{idf for "horrible"} = \log\left(\frac{2}{1}\right)\) because “horrible” is now included in only one of our two documents. The tf-idf score for Document 2 would remain 0, but the tf-idf score for “horrible” in Document 1 changes to \(2 \times \log_{10}(2) \approx 0.602\) under quanteda’s default base-10 logarithm (with the natural logarithm it would be \(2 \times \ln(2) \approx 1.386\)). The example highlights that terms appearing in all documents always receive a tf-idf score of 0, no matter how (in)frequent they are. Keyness analysis (Chapter 17) is another approach to identifying words that are more “unique” to one group of documents; in contrast to tf-idf, features appearing across all documents are not down-weighted to 0 if their expected frequencies differ.
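
The following sketch verifies this logic with quanteda’s defaults (scheme_tf = "count" and a base-10 logarithm), using a made-up two-document example in which “horrible” appears only in the first document:

# toy example: "horrible" appears twice in doc1 and never in doc2
toks_horrible <- tokens(c(doc1 = "horrible horrible room",
                          doc2 = "lovely room"))

dfm_tfidf(dfm(toks_horrible))
# "horrible": 2 * log10(2/1) = 0.602 in doc1 and 0 in doc2
# "room" appears in both documents, so its tf-idf score is 0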

Trimming a document-feature matrix is another important step to reduce the number of features and the sparsity of the dfm. The dfm is reduced in size based on document frequencies or term frequencies. Usually, we trim features based on minimum frequencies, but we can also exclude features based on maximum frequencies. Combining trimming operations for minimum and maximum frequencies will retain features that fall within this range. Let’s consider the following example: if we have a large text corpus and want to remove very infrequent terms, we could trim based on minimum term frequencies and document frequencies. We could, for instance, only keep terms that occur at least 10 times across all documents and appear in at least 2 documents. In this case, the minimum term frequency equals \(10\) and the minimum document frequency equals \(2\). Of course, we can also trim features based on relative frequencies. The Applications section below provides several examples of trimming dfms.

Smoothing is another frequently used form of weighting a dfm. It adds a constant to all cells so that zero counts are replaced by a value other than zero. If we set a smoothing parameter of 1, the constant of 1 is added to all cells in the dfm: zero counts become 1, cells with the value 1 become 2, etc.

Smoothing a dfm is required for statistical models that struggle with zero probabilities. For example, a zero probability of a feature in a Naive Bayes classifier (Chapter 23) would change the entire document’s probability to 0, even if other terms indicate it belongs to that class.
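
As a stylised illustration (generic Naive Bayes notation, not tied to a specific implementation), the classifier scores a document by multiplying per-feature likelihoods:

\[ P(c \mid d) \propto P(c) \prod_{j} P(w_j \mid c)^{n_{jd}} \]

If a single feature \(w_j\) never occurs in the training documents of class \(c\), then \(P(w_j \mid c) = 0\) and the entire product collapses to zero, regardless of all other features. Adding a small constant to every cell (add-one or Laplace smoothing) removes these zero probabilities.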

Next, we move to the feature co-occurrence matrix (fcm), which measures the co-occurrence of features. Feature co-occurrence matrices are the input for many word embedding models, which will be covered extensively in Chapter 26. The intuition behind fcms is simple: instead of counting how often a feature appears in a document (the intuition behind dfms), we count which features appear together within a user-defined context window.
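
As a brief preview, and as a minimal sketch only, an fcm can be constructed from a tokens object with fcm(), where the context and window arguments control how co-occurrences are counted:

# count co-occurrences within a window of two tokens
toks_fcm <- tokens("I love quanteda and I love text analysis")

fcmat <- fcm(toks_fcm, context = "window", window = 2)
print(fcmat)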

Add content on fcm here.

12.3 Applications

In this section, you will learn how to create a dfm, how to identify the number of types and tokens, and how to assess its sparsity. You will also learn how to weight, trim, group, and subset a document-feature matrix, and how to create and modify a feature co-occurrence matrix.

12.3.1 Creating a Document-Feature Matrix from a Tokens Object

We start by replicating the example mentioned above. Suppose we have two documents in a text corpus. We first tokenise the corpus and then create a dfm with dfm().

# create example corpus
corp <- corpus(
    c(doc1 = "I love quanteda. I love text analysis.",
      doc2 = "Text analysis with quanteda is fun."))

# tokenise corpus without any further processing
toks <- tokens(corp)

# inspect tokens object
print(toks)
Tokens consisting of 2 documents.
doc1 :
[1] "I"        "love"     "quanteda" "."        "I"        "love"     "text"    
[8] "analysis" "."       

doc2 :
[1] "Text"     "analysis" "with"     "quanteda" "is"       "fun"      "."       
# create a document-feature matrix
dfmat <- dfm(toks)

# inspect dfm: first three documents, first four features
print(dfmat, max_ndoc = 3, max_nfeat = 4)
Document-feature matrix of: 2 documents, 9 features (27.78% sparse) and 0 docvars.
      features
docs   i love quanteda .
  doc1 2    2        1 2
  doc2 0    0        1 1
[ reached max_nfeat ... 5 more features ]

12.3.2 Types, Tokens, Features, and Sparsity

We can access the number of features, tokens, and types, as well as the sparsity of a dfm, with built-in quanteda functions. We convert the corpus of US inaugural speeches to a document-feature matrix to retrieve these statistics.

dfmat_inaug <- data_corpus_inaugural |> 
    tokens() |> 
    dfm()

# sparsity() and nfeat() are reported on the dfm-level
sparsity(dfmat_inaug)
[1] 0.9183668
nfeat(dfmat_inaug)
[1] 9437
# ntype() and ntoken() are reported on the document-level
# show the number of types for first five documents
ntype(dfmat_inaug) |> head(n = 5)
1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson 
            603              95             801             687             781 
# show the number of tokens for first five documents
ntoken(dfmat_inaug) |> head(n = 5)
1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson 
           1537             147            2577            1923            2380 
dfm()’s default behaviour

When converting a tokens object into a document-feature matrix, dfm() considers all features included in the tokens object. It does not remove any features, but, by default, all tokens are transformed to lower case, even if you have not used tokens_tolower().

toks_example <- tokens("Transforming a tokens object to a dfm.")

# by default, dfm() transforms all tokens to lowercase
dfm(toks_example)
Document-feature matrix of: 1 document, 7 features (0.00% sparse) and 0 docvars.
       features
docs    transforming a tokens object to dfm .
  text1            1 2      1      1  1   1 1
# can change this behaviour by using dfm(x, tolower = FALSE)
dfm(toks_example, tolower = FALSE)
Document-feature matrix of: 1 document, 7 features (0.00% sparse) and 0 docvars.
       features
docs    Transforming a tokens object to dfm .
  text1            1 2      1      1  1   1 1

In almost all applications we can think of, lower-casing tokens is a reasonable and recommended processing choice. The sparsity of a dfm and the number of types (unique tokens) increase drastically when you treat upper- and lower-case tokens separately. If you want to make sure certain tokens (e.g., Labour vs. labour) are treated separately, we suggest replacing these tokens during the tokenisation process (Chapter 11) rather than preserving the original case of all tokens.

# transform all features to lowercase
dfmat_lower <- data_corpus_inaugural |> 
    tokens() |> 
    dfm(tolower = TRUE) # default

# sum of features
nfeat(dfmat_lower)
[1] 9437
# do not change tokens to lowercase
dfmat_unchanged <- data_corpus_inaugural |> 
    tokens() |> 
    dfm(tolower = FALSE) # keep original case 

# sum of features
nfeat(dfmat_unchanged)
[1] 10162

The number of unique tokens increased from 9,437 to 10,162 when not transforming tokens to their lower-cased form. In the second case, an upper-case token is treated as a different feature than its lower-case equivalent.

12.3.3 Feature Weighting

Once you have created a document-feature matrix, you can apply weights using dfm_weight() and its scheme argument. Let’s weight the document-feature matrix of inaugural speeches by proportions to normalise the documents.

dfmat_inaug <- data_corpus_inaugural |> 
    tokens() |> 
    dfm()

# document normalisation using dfm_weight(x, scheme = "prop")
dfmat_inaug_prop <- dfm_weight(dfmat_inaug, scheme = "prop")

# inspect dfm: first three documents, first four features
print(dfmat_inaug_prop, max_ndoc = 3, max_nfeat = 4)
Document-feature matrix of: 59 documents, 9,437 features (91.84% sparse) and 4 docvars.
                 features
docs              fellow-citizens         of        the       senate
  1789-Washington    0.0006506181 0.04619388 0.07547170 0.0006506181
  1793-Washington    0            0.07482993 0.08843537 0           
  1797-Adams         0.0011641444 0.05432674 0.06325184 0.0003880481
[ reached max_ndoc ... 56 more documents, reached max_nfeat ... 9,433 more features ]

We can transform non-zero counts to 1 with dfm_weight(x, scheme = "boolean").

# apply boolean weighting
dfmat_inaug_boolean <- dfm_weight(dfmat_inaug, scheme = "boolean")

# inspect dfm: first three documents, first four features
print(dfmat_inaug_boolean, max_ndoc = 3, max_nfeat = 4)
Document-feature matrix of: 59 documents, 9,437 features (91.84% sparse) and 4 docvars.
                 features
docs              fellow-citizens of the senate
  1789-Washington               1  1   1      1
  1793-Washington               0  1   1      0
  1797-Adams                    1  1   1      1
[ reached max_ndoc ... 56 more documents, reached max_nfeat ... 9,433 more features ]

After applying dfm_weight(x, scheme = "boolean"), the cells contain only 0 (document does not include feature) or 1 (document contains feature at least once) rather than absolute or relative frequencies.

You can apply tf-idf weighting with dfm_tfidf(). Let’s transform dfmat_inaug to tf-idf scores, first using counts rather than normalised frequencies for the term frequencies.

# apply tf-idf weighting; use count-based term frequencies
dfmat_tfidfcount <- dfm_tfidf(dfmat_inaug, scheme_tf = "count")

# inspect dfm: first four documents, first five features
print(dfmat_tfidfcount, max_nfeat = 5, max_ndoc = 4)
Document-feature matrix of: 59 documents, 9,437 features (91.84% sparse) and 4 docvars.
                 features
docs              fellow-citizens of the    senate and
  1789-Washington       0.4920984  0   0 0.8166095   0
  1793-Washington       0          0   0 0           0
  1797-Adams            1.4762952  0   0 0.8166095   0
  1801-Jefferson        0.9841968  0   0 0           0
[ reached max_ndoc ... 55 more documents, reached max_nfeat ... 9,432 more features ]

The output reveals that features appearing in all documents, such as of and the, receive a score of 0. This “feature” of tf-idf down-weights all terms appearing across documents, which might be problematic since these can include meaningful terms. For example, it is very likely that every politician will mention “climate” in parliamentary debates about environmental protection. Tf-idf, in turn, would assign the word “climate” a score of 0 for all speakers.

You can adjust the way quanteda calculates tf-idf, for example by using normalised term frequencies (dfm_tfidf(x, scheme_tf = "prop")).

# use normalised frequencies (scheme_tf = "prop") for term frequencies
dfmat_tfidfprop <- dfm_tfidf(dfmat_inaug, scheme_tf = "prop")

You can also change the base of the logarithm. quanteda’s default is base 10, which mirrors the tf-idf implementation in Manning, Raghavan, and Schütze (2008). Other R packages use different defaults: tidytext (Silge and Robinson 2016) applies the natural logarithm, whereas tm (Feinerer, Hornik, and Meyer 2008a) uses base 2. Below we show how to replicate other packages’ tf-idf scores.

# mirror tidytext's implementation
dfm_tfidf(dfmat_inaug, 
          scheme_tf = "prop",
          scheme_df = "inverse", 
          base = exp(1))

# mirror tm's implementation
dfm_tfidf(dfmat_inaug, 
          scheme_tf = "prop", 
          scheme_df = "inverse", 
          base = 2)

12.3.4 Trimming

As mentioned above, most dfms are very sparse. Sparsity above 90% is the norm in most text analysis applications, especially if documents are short: most features simply do not appear in a given document. Very high levels of sparsity can result in convergence issues for topic models or unsupervised scaling approaches such as Wordfish (Slapin and Proksch 2008). Conversely, if we keep the most frequent terms in our dfm, many models might be heavily influenced by these very frequent terms, and potential differences across documents will not be uncovered. Consider the following example: as we have seen above, a large proportion of the tokens in most documents are punctuation characters and stopwords. If we keep these features in our dfm, they will drive the results and, possibly, lead us to underestimate differences in topic prevalence or policy positions. Chapter 22 and Chapter 24 cover these issues in more detail.

The function dfm_trim() allows you to exclude very infrequent or very frequent terms from your dfm before proceeding with the analysis, either to avoid these issues or (at least) to assess the validity of the results when very frequent and infrequent terms are excluded. We can trim features based on absolute frequencies, proportions, ranks, or quantiles. In addition, we can reduce the size of the dfm based on document frequencies (“In how many documents does a feature appear?”) or term frequencies (“How often does a feature appear across all documents?”). We explain all of these arguments in the examples below, using the corpus of hotel reviews; the comparison after the trimming calls shows how much the number of features is reduced.

# tokenise corpus of hotel reviews without any tokens removal
toks_hotels <- tokens(TAUR::data_corpus_TAhotels)

# no processing apart from lower-casing terms (quanteda's default)
dfmat_hotels <- dfm(toks_hotels)

# keep only features occurring at least 10 times (min_termfreq = 10) 
# and in >= 10 documents (min_docfreq = 10)
dfm_trim(dfmat_hotels, min_termfreq = 10, min_docfreq = 10)

# keep only features occurring at least 10 times (min_termfreq = 10)
# and in at least 40% of the documents (min_docfreq = 0.4)
dfm_trim(dfmat_hotels, min_termfreq = 10, min_docfreq = 0.4)

# keep only features occurring at most 10 times (max_termfreq = 10) 
# and in at most 200 documents (max_docfreq = 200)
dfm_trim(dfmat_hotels, max_termfreq = 10, max_docfreq = 200)

# keep only features occurring at most 10 times and 
# in at most 3/4 of the documents
dfm_trim(dfmat_hotels, max_termfreq = 10, max_docfreq = 0.75)

# keep only words occurring frequently (top 20% of term frequencies 
# -> min_termfreq = 0.8 with termfreq_type = "quantile") 
# and in at most 2 documents
dfm_trim(dfmat_hotels, min_termfreq = 0.8, max_docfreq = 2, 
         termfreq_type = "quantile")
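
To get an overview of the reduction, we can compare nfeat() before and after trimming (a sketch; the exact numbers depend on the tokenisation choices above):

# number of features before trimming
nfeat(dfmat_hotels)

# number of features after keeping only terms occurring at least 10 times
# and in at least 10 documents
nfeat(dfm_trim(dfmat_hotels, min_termfreq = 10, min_docfreq = 10))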

12.3.5 Grouping and Subsetting

By default, a dfm contains as many documents as the input tokens object. Yet, in some cases you might want to combine the documents in a dfm by a grouping variable. Instead of analysing each speech delivered by a politician, we might want to aggregate speeches to the level of parties. If we want to understand differences between hotel reviews with low and high ratings, we could group our dfm by rating rather than analysing each review separately.

The function dfm_group() sums up cell frequencies within groups, determined by a document-level variable, and creates new “documents” with the group labels. Let’s introduce dfm_group() with an example: we transform the corpus of 20,491 hotel reviews (data_corpus_TAhotels) to a dfm, and then group this dfm by the “Rating” document-level variable, which indicates the reviewer’s satisfaction with the hotel (ranging from 1 to 5 stars). The grouped dfm should therefore consist of five “documents”, with each document summing up the cell frequencies of all reviews with the same rating.

# tokenise the corpus and create a document feature matrix in one go
dfmat_hotels <- TAUR::data_corpus_TAhotels |> 
    tokens(remove_punct = TRUE) |> # remove punctuation
    tokens_remove(pattern = stopwords("en")) |> # remove stopwords
    dfm() # create dfm

# inspect dfm: first five documents and first five features
print(dfmat_hotels, max_ndoc = 5, max_nfeat = 5)
Document-feature matrix of: 20,491 documents, 78,553 features (99.90% sparse) and 1 docvar.
       features
docs    nice hotel expensive parking got
  text1    5     2         1       3   1
  text2    1     5         0       0   3
  text3    3     3         0       0   2
  text4    2     4         0       0   0
  text5    0     2         0       1   1
[ reached max_ndoc ... 20,486 more documents, reached max_nfeat ... 78,548 more features ]
# now group this dfm consisting of 20,491 reviews
# based on Rating document-level variable
dfmat_hotels_grouped <- dfm_group(dfmat_hotels, groups = Rating)

# inspect the grouped dfm: all documents and first five features
print(dfmat_hotels_grouped, max_ndoc = 5, max_nfeat = 5)
Document-feature matrix of: 5 documents, 78,553 features (64.64% sparse) and 1 docvar.
    features
docs nice hotel expensive parking  got
   1  367  3604        97     180  630
   2 1055  4187       195     177  844
   3 1751  5078       269     232  812
   4 4839 14183       669     567 1900
   5 4424 21895       694     473 2003
[ reached max_nfeat ... 78,548 more features ]

The grouped dfm consists of only 5 documents, while the number of features is the same as in the review-level dfm (dfmat_hotels). The total frequencies do not change because dfm_group() simply sums up the cell counts. Inspecting the output of the first five features, we see that nice appears 367 times in 1-star reviews, while the term appears 4,424 times in 5-star reviews.
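
Because grouping only sums up cell counts, the total count of any feature is unchanged, which a quick sketch can verify:

# the total count of "nice" is identical before and after grouping
sum(dfmat_hotels[, "nice"])
sum(dfmat_hotels_grouped[, "nice"])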

In Chapter 9, we introduced corpus_subset() to filter documents based on one or more document-level variables. We can do the same with dfms by using dfm_subset(). For example, we can keep only 1-star reviews to investigate word frequencies in very negative reviews.

# get overview of reviews by rating
table(dfmat_hotels$Rating)

   1    2    3    4    5 
1421 1793 2184 6039 9054 
# subset dfmat_hotels by keeping only 1-star reviews
dfmat_hotels_1star <- dfm_subset(dfmat_hotels, Rating == "1")

# check that subsetting worked as expected by 
# creating a cross-table of the Rating docvars
table(dfmat_hotels_1star$Rating)

   1 
1421 
Subset your objects wisely

Tokenisation is the bottleneck operation when processing texts for quantitative analyses. If you know that you want to exclude certain documents from all analyses (for instance, all documents published prior to a specific date), we recommend filtering out these documents from the corpus object using corpus_subset() rather than excluding them from the tokens object or the dfm. Having said that, keep in mind that excluding documents may lead to selection biases and other problems (Grimmer, Roberts, and Stewart 2022: ch. 3).
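
For example, if an analysis only requires 5-star reviews, a sketch of this recommended order of operations, reusing the Rating docvar from the hotel reviews corpus, looks as follows:

# subset the corpus first, then tokenise only the documents you keep
dfmat_hotels_5star <- TAUR::data_corpus_TAhotels |> 
    corpus_subset(Rating == "5") |> 
    tokens(remove_punct = TRUE) |> 
    tokens_remove(pattern = stopwords("en")) |> 
    dfm()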

12.3.6 Converting a dfm Object for Further Use

Document-feature matrices are the input for many statistical analyses of textual data. The quanteda package infrastructure includes many text models that work directly with a quanteda dfm object. Other packages require a dfm in a slightly different format. The function convert() provides easy conversion from a dfm to the document-term representations used in other text analysis packages.

quanteda’s convert() function allows users to convert a dfm object to the following formats.

  • lda: a list with components “documents” and “vocab” as needed by the function lda.collapsed.gibbs.sampler from the lda package (Chang 2015).
  • tm: a DocumentTermMatrix from the tm package (Feinerer, Hornik, and Meyer 2008b).
  • stm: the format for the stm package (Roberts, Stewart, and Tingley 2019) for structural topic models. Note: the stm package also allows you to input a quanteda object directly, and no conversion is required.
  • topicmodels: the dtm format as used by the topicmodels package (Grün and Hornik 2011) for Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM)

In addition, a dfm can be converted to a data.frame or to a tripletlist consisting of documents, features, and frequencies, as the short sketch below illustrates.
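
A minimal sketch of convert(), reusing the dfm of inaugural speeches created earlier (the exact structure of the returned objects depends on the target package):

# convert the dfm to the input format required by the stm package
dat_stm <- convert(dfmat_inaug, to = "stm")

# convert the dfm to a data frame of documents and feature counts
dat_df <- convert(dfmat_inaug, to = "data.frame")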

Many recently developed R packages, such as keyATM (Eshima, Imai, and Sasaki 2023) for keyword-assisted topic models, LSX (Watanabe 2023) for latent semantic scaling, or conText (Rodriguez, Spirling, and Stewart 2023) for ‘a la carte’ text embedding regression, rely on quanteda and allow you to input processed tokens or dfm objects directly.

The popular tidytext package (Silge and Robinson 2016) includes the function cast_dfm() which allows you to cast a data frame to a quanteda dfm (see also Chapter 28).
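
A small sketch of this workflow, where the data frame and its column names are made up for illustration:

library(tidytext)

# hypothetical tidy data frame of per-document feature counts
tidy_counts <- data.frame(
    doc  = c("doc1", "doc1", "doc2"),
    term = c("love", "quanteda", "quanteda"),
    n    = c(2, 1, 1))

# cast the tidy counts into a quanteda dfm
dfmat_from_tidy <- cast_dfm(tidy_counts, doc, term, n)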

12.3.7 Creating and Processing a Feature Co-Occurrence Matrix

To be added. Also reference Chapter 26 which will explain how to create word embeddings based on an fcm object.

12.4 Advanced

We introduce two more operations for dfm objects, which are useful—and required—for many analyses, but slightly more advanced: smoothing and matching dfms.

12.4.1 Smoothing

As described in the Methods section above, smoothing a dfm adds a constant to all cells so that zero counts are replaced by a value other than zero. Smoothing is required for models that do not work with zero probabilities, for instance Naive Bayes classifiers. We return to the minimal dfm object created above to show how to smooth your dfm.

# print dfm object
print(dfmat)
Document-feature matrix of: 2 documents, 9 features (27.78% sparse) and 0 docvars.
      features
docs   i love quanteda . text analysis with is fun
  doc1 2    2        1 2    1        1    0  0   0
  doc2 0    0        1 1    1        1    1  1   1
# smooth dfm by adding a constant of 1 to zero cells
dfm_smooth(dfmat, smoothing = 1)
Document-feature matrix of: 2 documents, 9 features (0.00% sparse) and 0 docvars.
      features
docs   i love quanteda . text analysis with is fun
  doc1 3    3        2 3    2        2    1  1   1
  doc2 1    1        2 2    2        2    2  2   2
# equivalent to:
dfm_weight(dfmat, smoothing = 1)
# smooth dfm by adding a constant of 0.5 to zero cells
dfm_smooth(dfmat, smoothing = 0.5)
Document-feature matrix of: 2 documents, 9 features (0.00% sparse) and 0 docvars.
      features
docs     i love quanteda   . text analysis with  is fun
  doc1 2.5  2.5      1.5 2.5  1.5      1.5  0.5 0.5 0.5
  doc2 0.5  0.5      1.5 1.5  1.5      1.5  1.5 1.5 1.5

12.4.2 Matching Two Document-Feature Matrices

Finally, we show how to match the features of two dfms. Matching features is required for some machine learning classifiers, which can only use features occurring in the training set when making predictions for documents in a second dfm. More details are provided in Chapter 23.

Let’s split the corpus of US inaugural speeches into two objects: one corpus containing speeches delivered between 1945 and 1990, and another one including speeches delivered since 1990. Afterwards, we match the feature set of one dfm with the features existing in the other dfm.

# corpus consisting of speeches between 1945 and 1990
corp_pre1990 <- corpus_subset(data_corpus_inaugural, Year > 1945 & Year <= 1990)

# check that subsetting worked as expected
# by inspecting the Year document-level variable
docvars(corp_pre1990, "Year")
 [1] 1949 1953 1957 1961 1965 1969 1973 1977 1981 1985 1989
# corpus consisting of speeches since 1990
corp_post1990 <- corpus_subset(data_corpus_inaugural, Year > 1990)

# check that subsetting worked as expected
# by inspecting the Year document-level variable
docvars(corp_post1990, "Year")
[1] 1993 1997 2001 2005 2009 2013 2017 2021
# create two dfms
dfmat_pre1990 <- corp_pre1990 |> 
    tokens() |> 
    dfm()

dfmat_post1990 <- corp_post1990 |> 
    tokens() |> 
    dfm()

# match dfms: only keep features in dfmat_pre1990
# that appear in dfmat_post1990
dfmat_pre1990matched <- dfm_match(dfmat_pre1990, features = featnames(dfmat_post1990))

# inspect dfm
print(dfmat_pre1990matched)
Document-feature matrix of: 11 documents, 2,691 features (82.54% sparse) and 4 docvars.
                 features
docs              my fellow citizens   , today we celebrate the mystery  of
  1949-Truman      1      1        1 105     2 59         0 141       0  96
  1953-Eisenhower  4      2        3 126     1 66         0 171       0 142
  1957-Eisenhower  4      0        0 121     2 51         0 114       0  96
  1961-Kennedy     3      4        5  84     3 30         0  86       0  65
  1965-Johnson     3      4        1  95     2 34         0  77       0  57
  1969-Nixon       7      2        1 134     5 65         4 136       0  94
[ reached max_ndoc ... 5 more documents, reached max_nfeat ... 2,681 more features ]
# match dfms: only keep features in dfmat_post1990
# that appear in dfmat_pre1990
dfmat_post1990matched <- dfm_match(dfmat_post1990, features = featnames(dfmat_pre1990))

# inspect dfm
print(dfmat_post1990matched)
Document-feature matrix of: 8 documents, 3,331 features (84.20% sparse) and 4 docvars.
              features
docs           mr   . vice president   , chief justice and fellow citizens
  1993-Clinton  0  81    0         2 139     0       0  66      5        2
  1997-Clinton  0 108    0         1 131     0       1  94      7        7
  2001-Bush     0  96    1         3 110     0       3  82      1        9
  2005-Bush     1  98    1         4 120     1       6 108      3        6
  2009-Obama    0 118    0         1 130     0       0 111      1        1
  2013-Obama    1  89    1         2  99     1       2  89      3        6
[ reached max_ndoc ... 2 more documents, reached max_nfeat ... 3,321 more features ]

12.5 Further Reading

  • One of the first, and possibly most famous bag-of-words applications: Mosteller and Wallace (1963)
  • A deep dive into weighting document-feature matrices: Manning, Raghavan, and Schütze (2008, ch. 6)
  • Document-feature matrices, processing, and why the default options are often not ideal: Grimmer, Roberts, and Stewart (2022, ch. 5)

12.6 Exercises

Add some here.