---
title: "Interactive calls: tools, streaming, and logprobs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Interactive calls: tools, streaming, and logprobs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = identical(tolower(Sys.getenv("LLMR_RUN_VIGNETTES", "false")), "true")
)
```

Three capabilities make a model call more than a question and an answer: **tools** let the model consult your R session while it reasons; **streaming** shows the reply as it is generated; **logprobs** report how confident the model was in each token it chose. This vignette covers all three. (For data-frame pipelines see `vignette("tidy-and-structured")`; for enforced JSON shapes, `vignette("about-schema")`; for logging, replication, caching, and batch jobs, `vignette("reproducibility-and-cost")`.)

One caveat up front: provider support varies. Tool calling works on the OpenAI-compatible providers and Anthropic; streaming works across the major providers; logprobs are the patchiest -- OpenAI and DeepSeek expose them, Anthropic does not, and several hosts reject the flag model by model. The chunks below use models that were tested when this vignette was written.

```{r setup}
library(LLMR)
```

## Tools: the model consults your R session

A tool is an R function the model may call. The division of labor matters: the model *proposes* a call with arguments; LLMR executes the registered function and feeds the result back; the model continues with real data it had no way of knowing. The classic use in research code is grounding: a classifier or assistant that must quote your data rather than guess.

`llm_tool()` wraps a function with the JSON-Schema description the model sees. Keep tools small, deterministic, and free of side effects -- the model decides when and how often to call them.

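The propose/execute/feed-back cycle described above can be pictured with a toy dispatcher in base R. This is only a sketch of the idea, not how LLMR is implemented: `registry`, `proposed`, and `tool_message` are hypothetical names invented for the illustration, and a real provider sends the proposal as JSON rather than an R list.

```r
# Toy illustration only: a hand-rolled dispatcher, not LLMR internals.
# `registry` and `proposed` are hypothetical names made up for this sketch.
registry <- list(
  add = function(a, b) a + b
)

# What a model's tool request boils down to: a name plus named arguments.
proposed <- list(name = "add", arguments = list(a = 2, b = 3))

# The host, not the model, looks up the function and runs it.
result <- do.call(registry[[proposed$name]], proposed$arguments)

# The result is handed back to the model as text for its next turn.
tool_message <- paste0("Result of ", proposed$name, ": ", result)
tool_message
```

The point of the sketch is that the model never runs code itself: it only names a function and supplies arguments. In LLMR the registration step is handled by `llm_tool()`, as in the chunk that follows.
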
```{r tool-def}
survey <- data.frame(
  group = rep(c("treatment", "control"), each = 4),
  support = c(6, 7, 5, 7, 4, 3, 5, 4)
)

group_stats <- llm_tool(
  function(group) {
    rows <- survey[survey$group == group, ]
    if (!nrow(rows)) return(paste0("No group called ", group))
    sprintf("n = %d, mean support = %.2f", nrow(rows), mean(rows$support))
  },
  name = "group_stats",
  description = "Sample size and mean support (1-7 scale) for one experimental group.",
  parameters = list(group = list(type = "string",
                                 description = "Group name: treatment or control"))
)
```

`call_llm_tools()` runs the whole loop: it sends the tool definitions, executes whatever the model calls, returns the results to the model, and repeats until the model answers in plain text.

```{r tool-loop}
cfg <- llm_config("groq", "openai/gpt-oss-20b", temperature = 0)

r <- call_llm_tools(
  cfg,
  "Which group reports higher support, and by how much? Use the tool.",
  tools = group_stats
)
r
```

The answer quotes numbers the model could only have obtained by calling the function. Every execution is on the record:

```{r tool-history}
attr(r, "tool_history")
```

Two accounting details deserve attention. First, a tool loop is *several* model calls, so `tokens(r)` -- which describes the final call only -- would undercount it. The aggregate is attached to the response:

```{r tool-loop-spend}
attr(r, "tool_loop")
```

Second, a loop can run away: a confused model may call tools again and again. `max_tool_calls` caps total executions; exceeding it raises a typed condition (`llmr_tool_limit`) instead of spending further. Together with `max_rounds` this bounds the worst case.

Note that `finish_reason(x)` equal to `"tool"` marks an *intermediate* state -- a response asking for tools, not a final answer; `call_llm_tools()` handles those for you, and you only meet them if you drive the loop yourself with `tool_calls()`.

A privacy note: `tool_history` (and the audit log, if enabled) records tool arguments and results verbatim.
If tools touch sensitive data, treat those records with the same care as the data.

## Streaming: watch the reply arrive

`call_llm_stream()` is `call_llm()` over a different transport: the same request shaping (messages, parameters, hooks), but the reply arrives in chunks. For long generations this keeps sessions responsive and avoids HTTP timeouts. By default each chunk is printed as it arrives:

```{r stream-basic}
r <- call_llm_stream(cfg, "In two sentences: why do surveys weight responses?")
tokens(r)
```

A custom `callback` receives each chunk; keep it fast, since it runs inside the receive loop. Collecting chunks is one line:

```{r stream-callback}
seen <- character(0)
r <- call_llm_stream(cfg, "Count from one to five, words only.",
                     callback = function(chunk) seen <<- c(seen, chunk))
length(seen)     # the reply arrived in this many pieces
as.character(r)  # and assembled into the usual llmr_response
```

## Logprobs: the model's confidence as data

When a provider exposes log-probabilities, each generated token comes with the probability the model assigned to it, and optionally the `top_logprobs` most likely alternatives at that position. For measurement work this turns a classification into a graded judgment: the probability of the answer token is a soft label you can threshold, calibrate, or carry into a downstream model.

Request them at config time; extract them tidily with `llm_logprobs()`. The demo uses `deepseek-chat`, which supports them.

```{r logprobs}
cfg_lp <- llm_config("deepseek", "deepseek-chat", temperature = 0,
                     logprobs = TRUE, top_logprobs = 5, max_tokens = 4)

r <- call_llm(cfg_lp, c(
  system = "Classify the sentiment of the review. Reply with exactly one word: positive or negative.",
  user = "The plot was predictable, but I cried at the end."))

lp <- llm_logprobs(r)
data.frame(token = lp$token, p = exp(lp$logprob))
```

The alternatives at the first position show what probability mass, if any, went to the competing label:

```{r logprobs-alts}
alts <- lp$top_logprobs[[1]]
transform(alts, p = exp(logprob))[, c("token", "p")]
```

Two cautions keep this honest. Logprobs are *token-level*, not semantic: the figure is the probability of that token in that position, which tracks "confidence in the label" only when the prompt constrains the answer to a single label token -- hence the one-word instruction above. For multi-token labels, multiply the per-token probabilities (or redesign the labels). And a high probability means the model was sure, not that it was right; on well-posed items the interesting observations are precisely the low-`p` cases, which are natural candidates for human review.

## Where the other machinery lives

These three features compose with everything else: a tool loop streams its final answer no differently, a logprobs request travels through `call_llm_par()` like any config, and all of it lands in the audit log when `llm_log_enable()` is on. For that log, replication helpers, cost accounting, prompt caching, and the half-price batch APIs, continue with `vignette("reproducibility-and-cost")`.
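
As a postscript to the multi-token caution in the logprobs section: multiplying per-token probabilities is the same as summing their log-probabilities and exponentiating once. The sketch below uses invented numbers rather than model output; `label_logprobs`, `p_label`, `needs_review`, and the 0.7 cutoff are all hypothetical choices made for illustration.

```r
# Hypothetical per-token logprobs for a label that spans two tokens.
# These values are invented for the example, not returned by any model.
label_logprobs <- c(-0.105, -0.357)

# Product of probabilities == exp(sum of log-probabilities).
# Summing in log space avoids underflow when labels are long.
p_label <- exp(sum(label_logprobs))
p_label

# Flag low-probability items for human review; 0.7 is an arbitrary cutoff.
needs_review <- p_label < 0.7
needs_review
```

In practice the vector would be taken from the `logprob` column for the tokens that make up the label, so the sketch assumes you have already identified which positions those are.
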