knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>",
  eval = identical(tolower(Sys.getenv("LLMR_RUN_VIGNETTES", "false")), "true")
)

Three capabilities make a model call more than a question and an
answer: tools let the model consult your R session
while it reasons; streaming shows the reply as it is
generated; logprobs report how confident the model was
in each token it chose. This vignette covers all three. (For data-frame
pipelines see vignette("tidy-and-structured"); for enforced
JSON shapes, vignette("about-schema"); for logging,
replication, caching, and batch jobs,
vignette("reproducibility-and-cost").)
One caveat up front: provider support varies. Tool calling works on the OpenAI-compatible providers and Anthropic; streaming works across the major providers; logprobs are the patchiest – OpenAI and DeepSeek expose them, Anthropic does not, and several hosts reject the flag model by model. The chunks below use models that were tested when this vignette was written.
A tool is an R function the model may call. The division of labor matters: the model proposes a call with arguments; LLMR executes the registered function and feeds the result back; the model continues with real data it had no way of knowing. The classic use in research code is grounding: a classifier or assistant that must quote your data rather than guess.
llm_tool() wraps a function with the JSON-Schema
description the model sees. Keep tools small, deterministic, and free of
side effects – the model decides when and how often to call them.
# Toy survey: two groups of four respondents, support rated on a 1-7 scale.
survey <- data.frame(
  group   = rep(c("treatment", "control"), each = 4),
  support = c(6, 7, 5, 7, 4, 3, 5, 4)
)

group_stats <- llm_tool(
  function(group) {
    rows <- survey[survey$group == group, ]
    # Return a readable message rather than an error for an unknown group.
    if (!nrow(rows)) return(paste0("No group called ", group))
    sprintf("n = %d, mean support = %.2f", nrow(rows), mean(rows$support))
  },
  name = "group_stats",
  description = "Sample size and mean support (1-7 scale) for one experimental group.",
  parameters = list(
    group = list(type = "string",
                 description = "Group name: treatment or control")
  )
)

call_llm_tools() runs the whole loop: it sends the tool
definitions, executes whatever the model calls, returns the results to
the model, and repeats until the model answers in plain text.
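Because the tool is an ordinary R function, the figures a grounded answer should contain can be computed directly before any model is involved. The sketch below repeats the toy data so it runs on its own; `group_stats_fn` is a name introduced here for the unwrapped function and is not part of LLMR.

```r
# Same toy data as in the tool definition, repeated so this sketch is
# self-contained.
survey <- data.frame(
  group   = rep(c("treatment", "control"), each = 4),
  support = c(6, 7, 5, 7, 4, 3, 5, 4)
)

# The plain function behind the tool, without the llm_tool() wrapper
# (group_stats_fn is a hypothetical name used only in this sketch).
group_stats_fn <- function(group) {
  rows <- survey[survey$group == group, ]
  if (!nrow(rows)) return(paste0("No group called ", group))
  sprintf("n = %d, mean support = %.2f", nrow(rows), mean(rows$support))
}

treatment_out <- group_stats_fn("treatment")  # "n = 4, mean support = 6.25"
control_out   <- group_stats_fn("control")    # "n = 4, mean support = 4.00"
missing_out   <- group_stats_fn("placebo")    # "No group called placebo"

# The gap a grounded answer should report: 6.25 - 4.00 = 2.25
gap <- mean(survey$support[survey$group == "treatment"]) -
  mean(survey$support[survey$group == "control"])
```

A model reply that reports a difference other than 2.25 points did not take its numbers from the tool.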
cfg <- llm_config("groq", "openai/gpt-oss-20b", temperature = 0)
r <- call_llm_tools(
  cfg,
  "Which group reports higher support, and by how much? Use the tool.",
  tools = group_stats
)
r

The answer quotes numbers the model could only have obtained by calling the function. Every execution is on the record:
Two accounting details deserve attention. First, a tool loop is
several model calls, so tokens(r) – which
describes the final call only – would undercount it. The aggregate is
attached to the response:
Second, a loop can run away: a confused model may call tools again
and again. max_tool_calls caps total executions; exceeding
it raises a typed condition (llmr_tool_limit) instead of
spending further. Together with max_rounds this bounds the
worst case. Note that finish_reason(x) equal to
"tool" marks an intermediate state – a response
asking for tools, not a final answer; call_llm_tools()
handles those for you, and you only meet them if you drive the loop
yourself with tool_calls().
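The value of a typed condition is that calling code can catch that one failure by class and let everything else propagate. The toy loop below is a sketch of the idea in base R, not LLMR's implementation: `run_capped_loop()`, `make_model()`, and the `type` field of the fake replies are all hypothetical, and only the class name `llmr_tool_limit` comes from the text above.

```r
# Toy stand-in for a capped tool loop. `ask_model` is a hypothetical function
# returning list(type = "tool") or list(type = "text", text = "...").
run_capped_loop <- function(ask_model, max_tool_calls = 3) {
  executed <- 0
  repeat {
    reply <- ask_model()
    if (identical(reply$type, "text")) {
      return(list(text = reply$text, tool_calls = executed))
    }
    if (executed >= max_tool_calls) {
      # Signal a classed error so callers can handle this case by name.
      stop(structure(
        list(
          message = sprintf("tool-call limit of %d reached", max_tool_calls),
          call = NULL
        ),
        class = c("llmr_tool_limit", "error", "condition")
      ))
    }
    executed <- executed + 1  # a real loop would execute the tool here
  }
}

# A fake model that asks for a tool a fixed number of times, then answers.
make_model <- function(n_tool_requests) {
  i <- 0
  function() {
    i <<- i + 1
    if (i <= n_tool_requests) list(type = "tool") else list(type = "text", text = "done")
  }
}

# Well-behaved case: two tool requests, then a text answer.
ok <- run_capped_loop(make_model(2), max_tool_calls = 3)

# Runaway case: the model never stops asking, so the cap fires.
outcome <- tryCatch(
  run_capped_loop(make_model(Inf), max_tool_calls = 3),
  llmr_tool_limit = function(e) paste("stopped:", conditionMessage(e))
)
```

The same `tryCatch(..., llmr_tool_limit = function(e) ...)` shape should apply to any call that signals a condition carrying that class, while unrelated errors pass through untouched.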
A privacy note: tool_history (and the audit log, if
enabled) records tool arguments and results verbatim. If tools touch
sensitive data, treat those records with the same care as the data.
call_llm_stream() is call_llm() over a
different transport: the same request shaping (messages, parameters,
hooks), but the reply arrives in chunks. For long generations this keeps
sessions responsive and avoids HTTP timeouts. By default each chunk is
printed as it arrives:
A custom callback receives each chunk; keep it fast,
since it runs inside the receive loop. Collecting chunks is one
line:
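A minimal sketch of a collecting callback, written without any network call. It assumes each chunk reaches the callback as a character string. `fake_stream()` and `make_collector()` are hypothetical helpers that only imitate a receive loop, and the exact callback signature expected by `call_llm_stream()` is not shown here.

```r
# Hypothetical stand-in for a streaming transport: hands each chunk to the
# callback in arrival order.
fake_stream <- function(chunks, callback) {
  for (chunk in chunks) callback(chunk)
  invisible(NULL)
}

# A collector is a closure: the callback appends to `pieces`, and text()
# joins whatever has arrived so far.
make_collector <- function() {
  pieces <- character()
  list(
    callback = function(chunk) pieces <<- c(pieces, chunk),
    text     = function() paste(pieces, collapse = "")
  )
}

collector <- make_collector()
fake_stream(c("Stream", "ing ", "works."), collector$callback)
reply <- collector$text()  # "Streaming works."
```

The callback does nothing but append, which keeps it cheap inside the receive loop; any parsing or printing can wait until the stream has finished.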
When a provider exposes log-probabilities, each generated token comes
with the probability the model assigned to it, and optionally the
top_logprobs most likely alternatives at that position. For
measurement work this turns a classification into a graded judgment: the
probability of the answer token is a soft label you can threshold,
calibrate, or carry into a downstream model.
Request them at config time; extract them tidily with
llm_logprobs(). The demo uses deepseek-chat,
which supports them.
cfg_lp <- llm_config("deepseek", "deepseek-chat", temperature = 0,
                     logprobs = TRUE, top_logprobs = 5, max_tokens = 4)
r <- call_llm(cfg_lp, c(
  system = "Classify the sentiment of the review. Reply with exactly one word: positive or negative.",
  user = "The plot was predictable, but I cried at the end."))
lp <- llm_logprobs(r)
data.frame(token = lp$token, p = exp(lp$logprob))

The alternatives at the first position show what probability mass, if any, went to the competing label:
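As an illustration of that comparison, the sketch below works on made-up numbers in a plain data frame rather than real model output. The column names, the token strings, and the assumption that each label is a single token are all hypothetical. The layout `llm_logprobs()` uses for alternatives may differ.

```r
# Hypothetical top alternatives at the first token position
# (illustrative values, not output from any model).
alts <- data.frame(
  token   = c("positive", "negative", "mixed"),
  logprob = c(-0.22, -1.70, -4.60)
)
alts$p <- exp(alts$logprob)  # log-probability -> probability

# Pick out the two labels the prompt allows.
labels  <- c("positive", "negative")
label_p <- alts$p[match(labels, alts$token)]
names(label_p) <- labels

# Renormalise over the two labels: the share of label mass on each one.
label_share <- label_p / sum(label_p)
```

Renormalising is optional. It answers "given that the model chose one of the two labels, how was the mass split?", and it discards whatever probability went to off-label tokens such as "mixed".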
Two cautions keep this honest. Logprobs are token-level, not
semantic: the figure is the probability of that token in that position,
which tracks “confidence in the label” only when the prompt constrains
the answer to a single label token – hence the one-word instruction
above. For multi-token labels, multiply the per-token probabilities (or
redesign the labels). And a high probability means the model was sure,
not that it was right; on well-posed items the interesting observations
are precisely the low-p cases, which are natural candidates
for human review.
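Both cautions reduce to a few lines of arithmetic. The sketch below uses invented values throughout: the two-token label, the item probabilities, and the 0.7 review threshold are placeholders chosen for illustration, not recommendations.

```r
# A label that the tokenizer splits into two tokens (hypothetical values).
label_tokens <- data.frame(
  token   = c("very", " positive"),
  logprob = c(log(0.9), log(0.8))
)

# Multiplying probabilities is the same as summing logprobs and
# exponentiating once; summing first is kinder to floating point.
label_p <- exp(sum(label_tokens$logprob))  # 0.9 * 0.8 = 0.72

# Item-level label probabilities (hypothetical) and a review threshold.
items <- data.frame(id = 1:4, p = c(0.99, 0.55, 0.93, 0.61))
threshold <- 0.7  # arbitrary cut-off for this sketch

# Low-p items are the candidates for human review.
needs_review <- items$id[items$p < threshold]  # items 2 and 4
```

A sensible threshold depends on the task and on how well calibrated the model's probabilities are for it.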
These three features compose with everything else: a tool loop
streams its final answer no differently, a logprobs request travels
through call_llm_par() like any config, and all of it lands
in the audit log when llm_log_enable() is on. For that log,
replication helpers, cost accounting, prompt caching, and the half-price
batch APIs, continue with
vignette("reproducibility-and-cost").