---
title: "Tidy pipelines and structured output"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Tidy pipelines and structured output}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = identical(tolower(Sys.getenv("LLMR_RUN_VIGNETTES", "false")), "true")
)
```

We will show both unstructured and structured pipelines, using open models:

- deepseek-chat (DeepSeek)
- llama-3.1-8b-instant (Groq)
- openai/gpt-oss-20b (Groq)

You will need environment variables `DEEPSEEK_API_KEY` and `GROQ_API_KEY`.

```{r}
library(LLMR)
library(dplyr)

cfg_ds    <- llm_config("deepseek", "deepseek-chat")
cfg_groq1 <- llm_config("groq", "llama-3.1-8b-instant")
cfg_groq  <- llm_config("groq", "openai/gpt-oss-20b")
```

## llm_fn: unstructured (DeepSeek)

```{r}
words <- c("excellent", "awful", "fine")

out <- llm_fn(
  words,
  prompt = "Classify '{x}' as Positive, Negative, or Neutral.",
  .config = cfg_ds,
  .return = "columns"
)

out
```

## llm_fn: unstructured (Groq)

```{r}
out_groq <- llm_fn(
  words,
  prompt = "Classify '{x}' as Positive, Negative, or Neutral.",
  .config = cfg_groq1,
  .return = "columns"
)

out_groq
```

## llm_fn_structured: schema-first (DeepSeek)

```{r}
schema <- list(
  type = "object",
  properties = list(
    label = list(type = "string", description = "Sentiment label"),
    score = list(type = "number", description = "Confidence 0..1")
  ),
  required = list("label", "score"),
  additionalProperties = FALSE
)

out_s <- llm_fn_structured(
  x = words,
  prompt = "Classify '{x}' as Positive, Negative, or Neutral with confidence.",
  .config = cfg_ds,
  .schema = schema,
  .fields = c("label", "score")
)

out_s
```

## llm_mutate: unstructured (Groq)

```{r}
df <- tibble::tibble(
  id = 1:3,
  text = c("Cats are great pets", "The weather is bad", "I like tea")
)

df_u <- df |>
  llm_mutate(
    answer = "Give a short category for: {text}",
    .config = cfg_groq,
    .return = "columns"
  )

df_u
```

## llm_mutate: shorthand syntax

The
shorthand lets you combine the output column and the prompt in one argument:

```{r}
df |>
  llm_mutate(
    category = "Give a short category for: {text}",
    .config = cfg_groq
  )
# Equivalent to: llm_mutate(category, prompt = "Give...", .config = cfg_groq)
```

Or with multi-turn messages:

```{r}
df |>
  llm_mutate(
    classified = c(
      system = "You are a text classifier. One word only.",
      user = "Category for: {text}"
    ),
    .config = cfg_ds
  )
```

## llm_mutate with .structured flag

You can now enable structured output directly in `llm_mutate()` using `.structured = TRUE`:

```{r}
schema <- list(
  type = "object",
  properties = list(
    category = list(type = "string"),
    confidence = list(type = "number")
  ),
  required = list("category", "confidence")
)

# Using .structured = TRUE (equivalent to calling llm_mutate_structured)
df |>
  llm_mutate(
    structured_result = "{text}",
    .config = cfg_ds,
    .structured = TRUE,
    .schema = schema
  )
```

This is equivalent to calling `llm_mutate_structured()` and supports all the same shorthand syntax.

## Soft structured output with tags

When a strict JSON schema is unnecessary, request simple XML-like tags and let LLMR parse them into columns. In the ordinary one-row-per-call mode below, tags should be flat (not nested); the row-batching mode further down deliberately introduces one level of nesting and is documented there.

```{r}
cities <- tibble::tibble(city = c("Cairo", "Lima", "Seoul"))

cities |>
  llm_mutate(
    geo = "Where is {city}? Give country and continent in their own tags.",
    .config = cfg_groq1,
    .system_prompt = paste(
      "Use XML tags to specify different parts of the answer, but do not nest tags.",
      "Return ... and ...."
    ),
    .tags = c("country", "continent")
  )
```

The result includes `tags_ok`, `tags_data`, and one column per requested tag. Use `llm_parse_tags_col()` to parse an existing response column.

## Row batching: many rows per call

By default LLMR sends one request per row.
With `.batch_size > 1`, several rows are packed into a single request: each row's prompt is wrapped in a numbered tag (`...`, `...`, ...), the block is appended to the message, and the model is asked to answer each item inside a matching numbered tag. LLMR splits the reply back into the original rows. `.batch_size = Inf` sends the whole frame in one call.

```{r}
cities |>
  llm_mutate(
    geo = "Where is {city}? Give country and continent in their own tags.",
    .config = cfg_groq1,
    .tags = c("country", "continent"),
    .batch_size = 3
  )
```

A few points worth keeping in mind:

- **Two notions of "batch".** This generative row batching is unrelated to `get_batched_embeddings()`, which splits many texts across several *embedding* calls. The `.batch_size` argument applies only to generative calls.
- **One level of nesting in tag mode.** Inside each numbered row block the model emits the requested field tags, so batched tag output is intentionally nested one level. This is the opposite of the flat-tag guidance for single-row calls; LLMR adjusts the instruction automatically.
- **Structured output.** `.structured = TRUE` together with `.batch_size > 1` asks for a single JSON object `{"results":[{"row":i, ...}]}` and maps each element back by its integer `row`. It emits a one-time warning, because it relies on the model following the protocol and replaces strict provider-side schema validation with local parsing.
- **Fault tolerance.** Rows that the model drops, reorders, duplicates, or truncates are detected and re-issued according to `.batch_recovery` (by default the unresolved rows are retried at half the batch size, recursively, down to single rows). Unrecoverable rows are returned as `NA` with a diagnostic finish reason.
- **Cost.** Batching reduces the number of requests and the repeated system-prompt overhead, but it only pays off when the model reliably follows the wrapping protocol. Prefer capable models at `temperature = 0`, and modest batch sizes.
- **Diagnostics.** When batching actually groups rows, `llm_mutate()` adds `_batch`, `_bn`, and `_bi` columns identifying the batch, its size, and the row's position within it. Token counts and wall-clock duration are attributed once per batch (on its first resolved row) so that summing those columns is correct. One caveat: when a batch reply is entirely unusable and its rows succeed only through recovery calls, the failed call's spend has no successful row to land on, so sums can slightly undercount in heavy-recovery runs.

## Preview before you spend, summarize after

`llm_preview()` renders exactly what `llm_fn()` / `llm_mutate()` would send, without any API call and without reading or encoding files. It flags problems up front: missing files, a `"file"` role combined with `.batch_size > 1`, an embedding config with row batching, and so on. The batch plan columns show how rows would be grouped into calls.

```{r}
df <- data.frame(text = c("good", "bad", "fine"), stringsAsFactors = FALSE)
LLMR::llm_preview(df, prompt = "Sentiment of: {text}", .batch_size = 2)
```

After a run, `llm_usage()` summarizes outcomes and token totals, and `llm_failures()` lists the rows that failed or were truncated. Both read the diagnostic columns that `llm_mutate()` and `call_llm_par()` already produce. `llm_usage()` reports tokens, not dollars: multiply by your provider's current per-token prices yourself.

```{r eval=FALSE}
out <- df |>
  llm_mutate(sentiment = "One-word sentiment for: {text}", .config = cfg_groq)

llm_usage(out)     # counts + sent/received/total/reasoning tokens
llm_failures(out)  # which rows failed or were truncated, and why
```

For a `call_llm_par()` result you can re-run only the failures with `llm_par_resume()`.

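
Since `llm_usage()` stops at token totals, the conversion to money is a small piece of arithmetic you do yourself. The sketch below uses only base R and makes no LLMR calls; the token counts and the prices are made-up placeholders (not real provider prices), and the names `tokens` and `price_per_million` are hypothetical. Substitute the totals from your own run and your provider's current price sheet.

```r
# Hypothetical token totals, e.g. copied by hand from an llm_usage() summary.
tokens <- c(sent = 12500, received = 3400)

# Placeholder prices in USD per one million tokens; NOT real provider prices.
price_per_million <- c(sent = 0.50, received = 1.50)

# Cost per direction = tokens / 1e6 * price per million tokens.
# Indexing by name keeps the two vectors aligned even if their order differs.
cost <- tokens / 1e6 * price_per_million[names(tokens)]

cost       # cost of sent and received tokens separately
sum(cost)  # estimated total for the run
```

Input and output tokens are usually priced differently, which is why the two directions are kept separate until the final sum.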
## llm_mutate_structured: structured with shorthand (Groq)

```{r}
schema2 <- list(
  type = "object",
  properties = list(
    category = list(type = "string"),
    rationale = list(type = "string")
  ),
  required = list("category", "rationale"),
  additionalProperties = FALSE
)

# Traditional call
df_s <- df |>
  llm_mutate_structured(
    annot,
    prompt = "Extract category and a one-sentence rationale for: {text}",
    .config = cfg_groq,
    .schema = schema2
    # Because a schema is present, fields auto-hoist; you can also pass:
    # .fields = c("category", "rationale")
  )

df_s

# Or use shorthand
df |>
  llm_mutate_structured(
    annot = "Extract category and rationale for: {text}",
    .config = cfg_groq,
    .schema = schema2
  )
```
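
To make "hoisting" concrete, here is a rough illustration of the idea: parse one JSON reply per row, check the required fields locally, and lift them into ordinary columns. This is not LLMR's implementation. It assumes the `jsonlite` package is installed (it is not loaded elsewhere in this vignette), the JSON strings are invented for the example, and the helper names `parse_one` and `pull` are hypothetical.

```r
library(jsonlite)

# Made-up raw replies, one JSON string per row; the third is deliberately broken.
raw <- c(
  '{"category": "pets", "rationale": "The text is about cats."}',
  '{"category": "weather", "rationale": "The text describes the weather."}',
  'not valid json'
)

required <- c("category", "rationale")

# Parse one reply. Returns NULL when the text is not valid JSON,
# is not a JSON object, or lacks a required field.
parse_one <- function(txt) {
  obj <- tryCatch(jsonlite::fromJSON(txt), error = function(e) NULL)
  if (!is.list(obj) || !all(required %in% names(obj))) return(NULL)
  obj[required]
}

parsed <- lapply(raw, parse_one)

# Extract one field across all rows, using NA where parsing failed.
pull <- function(field) {
  vapply(
    parsed,
    function(p) if (is.null(p)) NA_character_ else as.character(p[[field]]),
    character(1)
  )
}

hoisted <- data.frame(
  ok        = !vapply(parsed, is.null, logical(1)),
  category  = pull("category"),
  rationale = pull("rationale"),
  stringsAsFactors = FALSE
)

hoisted
```

Rows that fail to parse stay in the frame with `NA` fields and `ok = FALSE`, so row order and count match the input.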