---
title: "Reproducibility and cost: logging, replication, caching, batch jobs"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Reproducibility and cost: logging, replication, caching, batch jobs}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = identical(tolower(Sys.getenv("LLMR_RUN_VIGNETTES", "false")), "true")
)
```

A language model is an instrument, and instruments demand calibration records. Reviewers now ask, reasonably: which model, at which settings, on which date, with what prompt, how stable are the labels, and what did it cost? This article shows how LLMR answers each question with one or two lines of code.

Examples run on the open-weight `gpt-oss-20b` served by Groq, so they are inexpensive to reproduce; set `GROQ_API_KEY` and `LLMR_RUN_VIGNETTES=true` to knit them live.

```{r setup}
library(LLMR)
cfg <- llm_config("groq", "openai/gpt-oss-20b", temperature = 0, seed = 110)
```

## 1. An audit trail for every call

`llm_log_enable()` starts a session-wide log: every call made through LLMR, whether by `call_llm()`, `llm_mutate()`, a parallel run, or a chat session, appends one JSON record with the full request, the reply, token usage, the served model version, the request id, and timing. The file is JSONL, so it can be archived as supplementary material and read back as a data frame.

```{r audit}
log_path <- tempfile(fileext = ".jsonl")
llm_log_enable(log_path)

r <- call_llm(cfg, "In one word, the capital of Senegal?")

llm_log_disable()
jsonlite::stream_in(file(log_path), verbose = FALSE)[
  , c("provider", "model", "model_version", "finish_reason", "status")]
```

Two details matter for sensitive projects. The log never contains API keys. And `llm_log_enable(path, include_messages = FALSE)` keeps only metadata, parameters, and usage, for prompts that must not leave the analysis machine in clear text.

## 2. Replication and reliability

A single model run is one draw from a stochastic process; treat repeated runs the way you treat multiple human coders. `llm_replicate()` collects the draws and `llm_agreement()` reports per-item majority labels along with the statistic reviewers ask for, Krippendorff's alpha.

```{r replicate}
reviews <- tibble::tibble(text = c(
  "The course changed how I think.",
  "Lectures were fine, assignments tedious.",
  "A complete waste of an afternoon."
))

cfg_warm <- llm_config("groq", "openai/gpt-oss-20b", temperature = 1)

reps <- llm_replicate(
  reviews, sentiment,
  prompt = "Sentiment of '{text}'. Answer with exactly one word: positive, negative, or neutral.",
  .config = cfg_warm,
  .times = 5
)

ag <- llm_agreement(reps, prefix = "sentiment")
ag
ag$by_row
```

Low alpha is itself a finding: the instrument disagrees with itself on this construct, and downstream estimates should carry that uncertainty rather than hide it.

## 3. A methods paragraph you can edit

`llm_methods_text()` drafts the transparency paragraph from what a result frame actually records, marking anything unknown as unknown.

```{r methods}
res <- call_llm_par(
  build_factorial_experiments(
    configs = cfg,
    user_prompts = c("Classify: 'great work'", "Classify: 'do better'")
  )
)
cat(llm_methods_text(res, task = "to classify short feedback messages"))
```

## 4. What did it cost?

`llm_usage()` reports tokens, and tokens only, because bundled price tables go stale and then mislead silently. If you want money, hand it your own current prices (per million tokens) and it applies them, billing cached prompt tokens at the cached rate when your table has one:

```{r usage}
llm_usage(res)

my_prices <- data.frame(
  model  = "openai/gpt-oss-20b",
  input  = 0.10,  # $ per million input tokens; check your provider's page
  output = 0.50,
  cached = 0.05
)
llm_usage(res, price_table = my_prices)$cost_estimate
```

## 5. Prompt caching

Annotation runs repeat a long instruction prefix thousands of times, which is exactly the shape prompt caches reward. OpenAI, DeepSeek, and Gemini cache long prefixes automatically; for Anthropic, add `cache = TRUE` to the config and LLMR marks the system prompt and tool definitions as cacheable. Either way, cached tokens show up in `tokens(x)$cached` and in the `cached_tokens` column that `llm_usage()` sums, so the saving is visible rather than assumed.

## 6. Batch jobs at half price

When results can wait (minutes to 24 hours), the provider batch APIs price tokens at roughly half the live rate. LLMR wraps them in three verbs, and the job object survives the R session:

```{r batch, eval = FALSE}
job <- llm_batch_submit(
  cfg,
  c("Classify: 'superb'", "Classify: 'awful'", "Classify: 'fine, I guess'"),
  state_path = "sentiment_batch.rds"
)
llm_batch_status(job)

# hours later, in a fresh session:
res_batch <- llm_batch_fetch("sentiment_batch.rds")
llm_usage(res_batch)
```

The fetched tibble has the same diagnostic columns as a live `call_llm_par()` run, so `llm_parse_structured_col()`, `llm_usage()`, and `llm_failures()` work on it unchanged. Batch jobs are supported for OpenAI, Groq, Anthropic, and Gemini.

## 7. Seeds, versions, and honest limits

Pass `seed` in `llm_config()` and LLMR forwards it where the provider supports one. Treat it as a stabilizer, not a guarantee: providers update serving stacks behind fixed model names. That is why every response records `model_version`, the identifier the server reports having served, and why the audit log stores it per call. Determinism you cannot have; attribution you can.
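
One way to act on the per-call `model_version` is to read the audit log back and check whether a single model name was served under more than one version. The sketch below is a minimal illustration, not LLMR functionality: it writes three synthetic records (the version strings are invented) to a temporary JSONL file, reads them with `jsonlite::stream_in()`, and counts distinct versions per model. It assumes the log exposes `model` and `model_version` as top-level fields, as the column selection in section 1 suggests.

```r
library(jsonlite)

# Synthetic stand-in for an audit log. The version identifiers are made up
# for illustration; a real log would carry whatever the server reported.
records <- data.frame(
  provider      = c("groq", "groq", "groq"),
  model         = rep("openai/gpt-oss-20b", 3),
  model_version = c("example-v1", "example-v1", "example-v2"),
  stringsAsFactors = FALSE
)

# Write the records as JSONL, one JSON object per line.
demo_log <- tempfile(fileext = ".jsonl")
con <- file(demo_log, open = "w")
jsonlite::stream_out(records, con, verbose = FALSE)
close(con)

# Read the log back as a data frame.
log_df <- jsonlite::stream_in(file(demo_log), verbose = FALSE)

# Count how many distinct versions were served under each model name.
versions_per_model <- tapply(
  log_df$model_version, log_df$model,
  function(v) length(unique(v))
)

# Model names that were served under more than one version.
drifted <- names(versions_per_model)[versions_per_model > 1]
drifted
```

With these synthetic records, `drifted` holds the single model name, because two distinct versions appear under it. On a real log, a non-empty `drifted` is a cue to report each served version alongside the results it produced.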