Getting started with quanteda.llm

Introduction

This vignette briefly introduces the quanteda.llm package, which is designed to facilitate the use of large language models (LLMs) in text analysis workflows. The package integrates with the quanteda framework, allowing users to apply LLMs to text processing tasks such as summarization and scoring. It relies on the ellmer package for LLM interactions, which provides a common interface to different LLM providers. For more information on ellmer and the LLM interactions it supports, please refer to the ellmer documentation.

Basic usage

To get started with quanteda.llm, you first need to install the package from GitHub. Then, you can load the package and begin using its functions.
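The install command is not spelled out above. A minimal sketch, assuming the development version is hosted in the quanteda/quanteda.llm repository on GitHub and that the pak package is an acceptable installer (both are assumptions, so check the package README for the current instructions):

```r
# Assumption: the repository is "quanteda/quanteda.llm" on GitHub.
# pak is one of several installers; remotes::install_github() is an alternative.
if (!requireNamespace("pak", quietly = TRUE)) {
  install.packages("pak")
}
pak::pak("quanteda/quanteda.llm")
```

Once installation has finished, the package can be loaded as usual: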

library(quanteda.llm)
#> Loading required package: ellmer

Analysing texts

The quanteda.llm package provides functions for analysing large numbers of texts with LLMs. The task resembles manual annotation, except that an LLM rather than a human coder produces the annotations. The package includes functions for summarization, salience rating, scoring on a predefined scale, and other text analysis tasks.

Structuring LLM responses

The package lets you structure LLM responses so that they are compatible with quanteda’s corpus principles and useful for common text analysis tasks, which makes it easy to integrate LLM-generated data into your workflows. For example, you can ask an LLM to summarize all documents in a corpus (ai_summary()), rate how salient predefined topics are in each document (ai_salience()), or score documents on a predefined scale (ai_score()), and then store the results as document variables.
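As a rough illustration of what storing results as document variables looks like, the sketch below attaches one value per document to a small corpus using quanteda's docvars(). The texts and the summaries vector are placeholders standing in for real documents and real LLM output; no model is called here.

```r
library(quanteda)

# Two toy documents; the texts are placeholders, not real data.
corp <- corpus(c(
  doc1 = "First placeholder document about taxation.",
  doc2 = "Second placeholder document about schools."
))

# Stand-in for what an LLM-backed function might return:
# one string per document, named by document.
summaries <- c(
  doc1 = "Placeholder summary of doc1.",
  doc2 = "Placeholder summary of doc2."
)

# Align by document name before attaching, so a reordered result
# cannot end up on the wrong document.
docvars(corp, "summary") <- unname(summaries[docnames(corp)])

# Inspect the document variables now stored with the corpus.
docvars(corp)
```

Aligning on docnames() rather than on position is a defensive choice; it costs one line and protects against results that come back in a different order.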

If you need more flexibility in how the LLM generates its output, you can use the ai_text() function to define custom prompts and response structures. With ai_text() and the type_object() function from the ellmer package, you can specify which fields the response should contain and what type each field should have, so that the output matches your analysis requirements.

Example uses

This section briefly describes the main functions of quanteda.llm and their purposes. For more detailed examples including code snippets, please refer to the section Examples.

Summarizing documents

The ai_summary() function summarizes documents using an LLM. It generates a summary for each document in a character vector and returns the summaries as a new character vector, which can be added as a document variable to a quanteda corpus object. The function uses a predefined response structure built with ellmer’s type_object() to produce a succinct summary of each document. Users need to provide a character vector of documents and choose the LLM provider to use for summarization.
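A hedged sketch of what such a call might look like. The argument name chat_fn, the forwarding of model to the chat function, and the model name "gpt-4o" are assumptions not confirmed by this vignette, so check ?ai_summary before relying on them. The texts are placeholders, and the LLM call is guarded so that it only runs when an OpenAI API key is set.

```r
library(quanteda)
library(quanteda.llm)  # also attaches ellmer, as the startup message shows

# Placeholder documents, not real data.
corp <- corpus(c(
  doc1 = "First placeholder document about taxation.",
  doc2 = "Second placeholder document about schools."
))

# Guard: the call needs a reachable provider, so skip it without a key.
if (nzchar(Sys.getenv("OPENAI_API_KEY"))) {
  # Assumptions: `chat_fn` takes an ellmer chat function, and extra
  # arguments such as `model` are passed on to that function.
  summaries <- ai_summary(
    as.character(corp),
    chat_fn = ellmer::chat_openai,
    model = "gpt-4o"
  )
  # One summary per document, stored as a document variable.
  docvars(corp, "summary") <- unname(summaries)
}
```

Any other ellmer chat function, for example ellmer::chat_ollama for a locally hosted model, could presumably be swapped in the same way.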

Salience rating of topics in documents

The ai_salience() function rates how relevant each document is to a set of predefined topics. It uses a predefined response structure built with ellmer’s type_object() to produce a list of topics and their salience scores for each document. This is particularly useful for large corpora where manual classification would be impractical. Users need to provide a character vector of documents and a list of topics; the LLM then analyses each document and assigns a salience score to each topic, indicating how relevant the document is to that topic.
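The following sketch shows one way the inputs described above might be passed. The argument names chat_fn and topics, the model name, and the topic labels are illustrative assumptions; the shape of the returned object is not specified here, so the sketch only inspects it with str(). Consult ?ai_salience for the actual interface.

```r
library(quanteda.llm)  # also attaches ellmer

# Placeholder documents and hypothetical topic labels.
texts <- c(
  doc1 = "First placeholder document about taxation.",
  doc2 = "Second placeholder document about schools."
)
topics <- c("economy", "education", "environment")

# Guard: only contact the provider when an API key is available.
if (nzchar(Sys.getenv("OPENAI_API_KEY"))) {
  # Assumptions: argument names `chat_fn` and `topics`; `model` is
  # forwarded to the chat function.
  salience <- ai_salience(
    texts,
    chat_fn = ellmer::chat_openai,
    topics = topics,
    model = "gpt-4o"
  )
  str(salience)  # inspect the structure before using it further
}
```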

Scoring documents on a predefined scale

The ai_score() function scores documents on a predefined scale. It uses a predefined response structure built with ellmer’s type_object() to produce, for each document, a score on the specified scale and a short justification for that score. This is useful for evaluating documents against specific criteria or benchmarks. Users need to provide a character vector of documents and a scale to score against.
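A hedged sketch of a scoring call. It assumes that the scale is described in plain language and passed through an argument named prompt; that argument name, chat_fn, the model name, and the example scale are assumptions for illustration only, so check ?ai_score for the real signature.

```r
library(quanteda.llm)  # also attaches ellmer

# Placeholder documents, not real data.
texts <- c(
  doc1 = "First placeholder document about taxation.",
  doc2 = "Second placeholder document about schools."
)

# Hypothetical scale description; wording is illustrative only.
scale_prompt <- paste(
  "Score the document from 0 to 10, where 0 means it strongly opposes",
  "higher public spending and 10 means it strongly supports it."
)

# Guard: only contact the provider when an API key is available.
if (nzchar(Sys.getenv("OPENAI_API_KEY"))) {
  # Assumptions: the scale goes in `prompt`; `model` is forwarded
  # to the chat function named in `chat_fn`.
  scores <- ai_score(
    texts,
    prompt = scale_prompt,
    chat_fn = ellmer::chat_openai,
    model = "gpt-4o"
  )
  str(scores)  # expected to hold scores and justifications
}
```

Stating both endpoints of the scale explicitly in the prompt is a design choice that tends to make scores easier to interpret and to validate afterwards.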

Manually checking and validating LLM responses

The ai_validate() function lets users manually check and validate the responses generated by the LLM in a user-friendly Shiny app. Such manual checks are essential for ensuring the quality and accuracy of the LLM’s output. The function can be used to review the scores and justifications generated by the LLM, and users can also highlight and save examples from the original texts that support the validated text classifications. The saved examples can be used for further qualitative analyses or to build a labelled dataset for fine-tuning open-source LLMs to improve performance on similar tasks.

Customizing the structure of LLM responses

The quanteda.llm package allows you to customize the structure of LLM responses to fit your specific analysis needs, for example by specifying which fields the response should include and in what format.

For these more advanced text analysis tasks, you can use the ai_text() function to define custom prompts and response structures. Defining the structure with type_object() from the ellmer package asks the LLM to return the same fields for every document, which makes it easier to integrate LLM-generated data into your text analysis workflows.
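A minimal sketch of a custom response structure. The type_object(), type_enum(), and type_string() constructors come from ellmer; the field names, the stance categories, the model name, and the ai_text() argument names chat_fn and type_object are assumptions made for illustration, so verify them against ?ai_text.

```r
library(quanteda.llm)  # also attaches ellmer

# Placeholder documents, not real data.
texts <- c(
  doc1 = "First placeholder document about taxation.",
  doc2 = "Second placeholder document about schools."
)

# Hypothetical structure: a categorical stance plus supporting evidence.
stance_type <- ellmer::type_object(
  "Stance of a document on public spending",
  stance = ellmer::type_enum(
    values = c("supportive", "neutral", "critical"),
    description = "Overall stance of the document"
  ),
  evidence = ellmer::type_string(
    "A short quotation from the text that supports the stance"
  )
)

# Guard: only contact the provider when an API key is available.
if (nzchar(Sys.getenv("OPENAI_API_KEY"))) {
  # Assumptions: argument names `chat_fn` and `type_object`; `model`
  # is forwarded to the chat function.
  result <- ai_text(
    texts,
    chat_fn = ellmer::chat_openai,
    type_object = stance_type,
    model = "gpt-4o"
  )
  str(result)
}
```

Restricting stance to a fixed set of values with type_enum() is a deliberate choice: it keeps the output directly usable as a categorical document variable instead of free text that would need recoding.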

Conclusion

The quanteda.llm package provides a powerful and flexible framework for integrating large language models into text analysis workflows. By leveraging LLMs, users can automate various text processing tasks, such as summarization, classification, and scoring, while maintaining compatibility with the quanteda framework. The package’s ability to structure LLM responses and customize output formats makes it a valuable tool for researchers and analysts working with large text corpora.