Introduction
This vignette provides a brief introduction to the `quanteda.llm` package, which is designed to facilitate the use of large language models (LLMs) in text analysis workflows. The package integrates with the `quanteda` framework, allowing users to leverage LLMs for various text processing tasks. It relies on the `ellmer` package for LLM interactions, providing a seamless interface for working with different LLM providers. For more information on the `ellmer` package and the LLM providers it supports, please refer to its documentation.
Basic usage
To get started with `quanteda.llm`, you first need to install the package from GitHub. Then, you can load the package and begin using its functions.
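For example, the development version can be installed with the `remotes` package (assuming the repository lives at `quanteda/quanteda.llm`):

```r
# install.packages("remotes")  # if remotes is not already installed
remotes::install_github("quanteda/quanteda.llm")
```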
```r
library(quanteda.llm)
#> Loading required package: ellmer
```
Analysing texts
The `quanteda.llm` package provides functions for analysing large amounts of text using LLMs. This is similar to manual annotation, but automates the process using LLMs. The package includes functions for summarization, salience rating, scaling, and other text analysis tasks.
Structuring LLM responses
The package allows you to structure the responses from LLMs in a way that is compatible with `quanteda`'s corpus principles and useful for common text analysis tasks. This means you can easily integrate LLM-generated data into your text analysis workflows. For example, you can ask an LLM to summarize all documents in a corpus (`ai_summary()`) and store the summaries as document variables, or you can classify documents into topics (`ai_salience()`) or score them on predefined criteria (`ai_score()`) and store the results as document variables.
If you need more flexibility in how the LLM generates its output, you can use the `ai_text()` function to define custom prompts and response structures. With `ai_text()` and the help of the `type_object()` function from the `ellmer` package, you can define how the LLM should format its output, such as specifying the fields to include in the response or the format of the response itself. This flexibility enables you to tailor the LLM's output to your analysis requirements, making it easier to integrate LLM-generated data into your text analysis workflows.
Example uses
This vignette provides an overview of how to use the `quanteda.llm` package for analysing texts with LLMs by briefly describing the main functions and their purposes. For more detailed examples including code snippets, please refer to the Examples section.
Summarizing documents
The `ai_summary()` function allows you to summarize documents using an LLM. It generates a summary for each document in a character vector and returns the summaries as a character vector that can be added as a document variable to a `quanteda` corpus. The function uses a predefined `type_object()` from `ellmer` to structure the LLM's response, producing succinct summaries of each document. Users need to provide a character vector of documents to summarize and choose the LLM provider they want to use for summarization.
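As a minimal sketch (the argument names `chat_fn` and `model` are assumptions for illustration; check `?ai_summary` for the actual signature), summarizing two short documents with an OpenAI model via `ellmer`'s `chat_openai()` might look like this:

```r
library(quanteda.llm)

# a small character vector of documents
txts <- c(
  doc1 = "The government announced new climate measures today, including
          a carbon levy and subsidies for renewable energy.",
  doc2 = "Quarterly earnings exceeded analyst expectations, driven by
          strong demand in the services sector."
)

# chat_openai() is ellmer's constructor for an OpenAI-backed chat;
# the chat_fn/model argument names are assumed here for illustration
summaries <- ai_summary(txts, chat_fn = chat_openai, model = "gpt-4o")
```

The resulting summaries could then be attached to a corpus as a document variable with `quanteda::docvars()`.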
Salience rating of topics in documents
The `ai_salience()` function allows you to classify documents based on their relevance to predefined topics. The function uses a predefined `type_object()` from `ellmer` to structure the LLM's response, producing a list of topics and their salience scores for each document. This function is particularly useful for analysing large corpora where manual classification would be impractical. Users need to provide a character vector of documents and a list of topics to classify. The LLM then analyses each document and assigns a salience score to each topic, indicating how relevant the document is to that topic.
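A hedged sketch of a salience call (the `topics`, `chat_fn`, and `model` argument names are assumptions; consult `?ai_salience` for the real interface):

```r
library(quanteda.llm)

txts <- c(
  doc1 = "Parliament debated the new emissions trading scheme at length.",
  doc2 = "The central bank raised interest rates by 50 basis points."
)

# assumed argument names, for illustration only
salience <- ai_salience(
  txts,
  topics  = c("economy", "environment", "health"),
  chat_fn = chat_openai,
  model   = "gpt-4o"
)
```

The per-topic scores could then be stored as one document variable per topic for downstream analysis.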
Scoring documents on a predefined scale
The `ai_score()` function allows you to score documents based on a predefined scale. The function uses a predefined `type_object()` from `ellmer` to structure the LLM's response, producing a score for each document on the specified scale as well as a short justification for the score. This function is useful for evaluating documents against specific criteria or benchmarks. Users need to provide a character vector of documents and a scale to score against. The LLM then analyses each document and assigns a score based on the provided scale, along with a brief explanation of the reasoning behind the score.
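A sketch of a scoring call, with the scale described in plain language (the `scale`, `chat_fn`, and `model` argument names are assumptions for illustration):

```r
library(quanteda.llm)

txts <- c(
  doc1 = "We must drastically reduce immigration to protect jobs.",
  doc2 = "Immigration enriches our society and strengthens the economy."
)

# the scale is defined as an instruction in natural language
scale_prompt <- "Score each document from 1 (strongly anti-immigration)
to 5 (strongly pro-immigration)."

# assumed argument names, for illustration only
scores <- ai_score(txts, scale = scale_prompt,
                   chat_fn = chat_openai, model = "gpt-4o")
```

Scores and their justifications can then be added as document variables, and the justifications later reviewed with `ai_validate()`.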
Manually checking and validating LLM responses
The `ai_validate()` function allows users to manually check and validate the responses generated by the LLM in a user-friendly Shiny app. Such manual checks are essential for ensuring the quality and accuracy of the LLM's output. The function can be used to review the scores and justifications generated by the LLM, and users can also highlight and save examples from the original texts that support the validated text classifications. The saved examples can be used for further qualitative analyses or to build a labelled dataset for fine-tuning open-source LLMs to achieve improved performance on similar tasks.
Customizing the structure of LLM responses
The `quanteda.llm` package also allows you to customize the structure of LLM responses to fit your specific analysis needs. For such more advanced text analysis tasks, you can use the `ai_text()` function to define custom prompts and response structures. By passing a `type_object()` from the `ellmer` package, you can define how the LLM should format its output, such as specifying the fields to include in the response or the format of the response itself. This flexibility enables you to tailor the LLM's output to your analysis requirements, making it easier to integrate LLM-generated data into your text analysis workflows.
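As an illustration, the following sketch defines a custom response structure for sentiment extraction. `type_object()`, `type_string()`, and `type_integer()` are real `ellmer` functions; the `ai_text()` argument names (`chat_fn`, `model`, `type_object`) are assumptions, so check `?ai_text` for the actual interface:

```r
library(quanteda.llm)
library(ellmer)

# define the response structure with ellmer's type system
sentiment_type <- type_object(
  "Sentiment analysis of a document.",
  sentiment  = type_string("One of: positive, neutral, negative."),
  confidence = type_integer("Confidence from 1 (low) to 5 (high)."),
  key_quote  = type_string("A short quote supporting the sentiment label.")
)

txts <- c(doc1 = "The new policy is a disaster for small businesses.")

# assumed argument names, for illustration only
result <- ai_text(
  txts,
  chat_fn     = chat_openai,
  model       = "gpt-4o",
  type_object = sentiment_type
)
```

Each field declared in the `type_object()` becomes a separate element of the structured response, so the results map naturally onto document variables.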
Conclusion
The `quanteda.llm` package provides a powerful and flexible framework for integrating large language models into text analysis workflows. By leveraging LLMs, users can automate various text processing tasks, such as summarization, classification, and scoring, while maintaining compatibility with the `quanteda` framework. The package's ability to structure LLM responses and customize output formats makes it a valuable tool for researchers and analysts working with large text corpora.