Schema-validated output in LLMR

knitr::opts_chunk$set(
  collapse = TRUE, comment = "#>",
  eval = identical(tolower(Sys.getenv("LLMR_RUN_VIGNETTES", "false")), "true") )

Overview

JSON mode asks the model for a JSON object; it is easy to request but gives weak guarantees about shape. Schema output supplies a JSON Schema and requests strict validation, which is more reliable when the provider enforces it. Enforcement and request shapes differ across providers, so this vignette relies on defensive parsing and local validation rather than trusting any one provider’s guarantee.

What the major providers actually support

OpenAI-compatible (OpenAI, Groq, Together, x.ai, DeepSeek, Alibaba/Qwen, Zhipu, Moonshot, Xiaomi)
Chat Completions accept a response_format (e.g., {"type":"json_object"} or a JSON-Schema payload). Enforcement varies by provider but the interface is OpenAI-shaped.
See OpenAI Structured Outputs cookbook, Groq API (OpenAI-compatible), Together: OpenAI compatibility, x.ai: OpenAI API schema, DeepSeek: OpenAI-compatible endpoint
Anthropic (Claude)
No global “JSON mode.” Instead, you define a tool with an input_schema (JSON Schema) and force it via tool_choice, so the model must return a JSON object that validates the schema.
See Anthropic Messages API: tools & input_schema
Google Gemini (REST)
Set responseMimeType = "application/json" in generationConfig to request JSON. Gemini 2.5+ models also accept responseJsonSchema (standard JSON Schema), which enable_structured_output() sends by default when you supply a schema; set gemini_enable_response_schema = FALSE in the config for an older model that rejects it.
See Google Gemini: Structured Output
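To make the three request shapes concrete, here is a minimal sketch that builds the relevant body fragments with jsonlite. It is not LLMR's internal code: the schema, the name "paper", and the variable names are illustrative assumptions, and each fragment would still need the provider's other required fields (model, messages, and so on).

```r
library(jsonlite)

# One illustrative schema, reused for all three request shapes.
schema <- list(
  type = "object",
  properties = list(
    title = list(type = "string"),
    year  = list(type = "integer")
  ),
  required = list("title", "year"),
  additionalProperties = FALSE
)

# OpenAI-compatible Chat Completions: the schema sits under response_format.
openai_part <- list(
  response_format = list(
    type = "json_schema",
    json_schema = list(name = "paper", strict = TRUE, schema = schema)
  )
)

# Anthropic Messages: the schema is a tool's input_schema, forced via tool_choice.
anthropic_part <- list(
  tools = list(list(
    name = "paper",
    description = "Record one paper.",
    input_schema = schema
  )),
  tool_choice = list(type = "tool", name = "paper")
)

# Gemini REST: the schema sits inside generationConfig.
gemini_part <- list(
  generationConfig = list(
    responseMimeType = "application/json",
    responseJsonSchema = schema
  )
)

# auto_unbox = TRUE keeps length-1 values as JSON scalars rather than arrays.
to_json <- function(x) as.character(toJSON(x, auto_unbox = TRUE, pretty = TRUE))
cat(to_json(openai_part), "\n")
```

The same R list serves all three providers; only the wrapper around it changes, which is the difference enable_structured_output() is described as hiding.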

Why prefer schema output?

Deterministic downstream code: predictable keys/types enable typed transforms.
Safer integrations: strict mode avoids extra keys, missing fields, or textual preambles.
Faster failure: invalid generations fail early, where retry/backoff is easy to manage.

Why JSON-only still matters

Broadest support across models/providers/proxies.
Low ceremony for exploration, labeling, and quick prototypes.

Quirks you will hit in practice

Models often wrap JSON in code fences or add pre/post text.
Arrays/objects appear where you expected scalars; ints vs doubles vary by provider/sample.
Safety/length caps can truncate output; detect and handle “finish_reason = length/filter.”
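As a rough illustration of the truncation check, the sketch below inspects a parsed OpenAI-style response body. The helper name was_cut_short() is hypothetical and not part of LLMR, the body is a hand-built mock, and the exact field name and values are assumptions that vary by provider.

```r
# Hypothetical helper (not part of LLMR): TRUE when the first choice of an
# OpenAI-style response body reports that output was cut short.
was_cut_short <- function(body) {
  choices <- body[["choices"]]
  if (length(choices) == 0) return(FALSE)
  reason <- choices[[1]][["finish_reason"]]
  !is.null(reason) && reason %in% c("length", "content_filter")
}

# Mock body: the JSON text stops mid-token because the length cap was hit.
truncated <- list(choices = list(list(
  finish_reason = "length",
  message = list(content = '{"ok": tr')
)))
complete <- list(choices = list(list(
  finish_reason = "stop",
  message = list(content = '{"ok": true}')
)))

was_cut_short(truncated)
was_cut_short(complete)
```

Checking this before parsing avoids spending a retry on JSON that was never going to be complete.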

LLMR helpers for common parsing failures

llm_parse_structured() strips fences and extracts the largest balanced {...} or [...] before parsing.
llm_parse_structured_col() hoists fields (supports dot/bracket paths and JSON Pointer) and keeps non-scalars as list-columns.
llm_validate_structured_col() validates locally via jsonvalidate (AJV).
enable_structured_output() flips the right provider switch (OpenAI-compat response_format, Anthropic tool + input_schema, Gemini responseMimeType + responseJsonSchema).
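The recovery idea behind the parser can be sketched in a few lines of base R plus jsonlite. This is not LLMR's implementation: extract_json() is a hypothetical helper, and it simplifies by returning the first balanced span that parses rather than the largest one.

```r
library(jsonlite)

# Hypothetical helper (not LLMR's code): strip markdown fences, then return the
# first balanced {...} or [...] span that parses as JSON, or NULL if none does.
extract_json <- function(txt) {
  txt <- gsub("`{3}[A-Za-z]*", "", txt)          # drop fence markers
  chars <- strsplit(txt, "")[[1]]
  for (start in which(chars %in% c("{", "["))) {
    depth <- 0L
    in_string <- FALSE
    escaped <- FALSE
    for (i in start:length(chars)) {
      ch <- chars[i]
      if (in_string) {
        # Inside a JSON string, brackets do not count; track escapes and the end quote.
        if (escaped) {
          escaped <- FALSE
        } else if (ch == "\\") {
          escaped <- TRUE
        } else if (ch == "\"") {
          in_string <- FALSE
        }
      } else if (ch == "\"") {
        in_string <- TRUE
      } else if (ch %in% c("{", "[")) {
        depth <- depth + 1L
      } else if (ch %in% c("}", "]")) {
        depth <- depth - 1L
        if (depth == 0L) {
          candidate <- paste(chars[start:i], collapse = "")
          parsed <- tryCatch(
            fromJSON(candidate, simplifyVector = FALSE),
            error = function(e) NULL
          )
          if (!is.null(parsed)) return(parsed)
          break                                   # try the next opening bracket
        }
      }
    }
  }
  NULL
}

fence <- strrep("`", 3)
fenced <- paste0(fence, "json\n{\"x\": 1, \"y\": [1, 2, 3]}\n", fence)
str(extract_json(fenced))
str(extract_json('Sure! Here is JSON: {"x":"1","y":"oops"} trailing words'))
```

Tracking string state matters: a brace inside a quoted value would otherwise throw off the depth count.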

Minimal patterns (guarded code)

All chunks use a tiny helper so your document knits even without API keys.

safe <- function(expr) tryCatch(expr, error = function(e) {message("ERROR: ", e$message); NULL})

1) JSON mode, no schema (works across OpenAI-compatible providers)

safe({
  library(LLMR)
  cfg <- llm_config(
    provider = "deepseek",              
    model    = "deepseek-chat",
    temperature = 0
  )

  # Flip JSON mode on (OpenAI-compat shape)
  cfg_json <- enable_structured_output(cfg, schema = NULL)

  res    <- call_llm(cfg_json, 'Give me a JSON object {"ok": true, "n": 3}.')
  parsed <- llm_parse_structured(res)

  cat("Raw text:\n", as.character(res), "\n\n")
  str(parsed)
})

What could still fail? Proxies labeled “OpenAI-compatible” sometimes accept response_format but don’t strictly enforce it; LLMR’s parser recovers from fences or pre/post text.

2) Schema mode that actually works (Groq, open model)

Groq serves openai/gpt-oss-20b with OpenAI-compatible APIs. Their Structured Outputs feature enforces JSON Schema and (notably) expects all properties to be listed under required.

safe({
  library(LLMR); library(dplyr)

  # Schema: make every property required to satisfy Groq's stricter check
  schema <- list(
    type = "object",
    additionalProperties = FALSE,
    properties = list(
      title = list(type = "string"),
      year  = list(type = "integer"),
      tags  = list(type = "array", items = list(type = "string"))
    ),
    required = list("title","year","tags")
  )

  cfg <- llm_config(
    provider = "groq",
    model    = "openai/gpt-oss-20b",
    temperature = 0
  )
  cfg_strict <- enable_structured_output(cfg, schema = schema, strict = TRUE)

  df  <- tibble(x = c("BERT paper", "Vision Transformers"))
  out <- llm_fn_structured(
    df,
    prompt   = "Return JSON about '{x}' with fields title, year, tags.",
    .config  = cfg_strict,
    .schema  = schema,          # send schema to provider
    .fields  = c("title","year","tags"),
    .validate_local = TRUE
  )

  out |>
    select(structured_ok, structured_valid, title, year, tags) |>
    print(n = Inf)
})

If your key is set, you should see structured_ok = TRUE, structured_valid = TRUE, plus parsed columns.

Common gotcha: If Groq returns a 400 error complaining about required, ensure all properties are listed in the required array. Groq’s structured output implementation is stricter than OpenAI’s.
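One way to catch this before sending a request is to walk the schema locally. The sketch below uses a hypothetical helper, missing_required(), that is not part of LLMR; it assumes schemas are written as nested R lists in the style used in this vignette.

```r
# Hypothetical helper (not part of LLMR): return the paths of properties that
# are declared but absent from `required`, recursing into nested objects and
# array items.
missing_required <- function(schema, path = "") {
  out <- character(0)
  props <- names(schema[["properties"]])
  absent <- setdiff(props, unlist(schema[["required"]]))
  if (length(absent) > 0) out <- paste0(path, "/", absent)
  for (p in props) {
    out <- c(out, missing_required(schema[["properties"]][[p]], paste0(path, "/", p)))
  }
  if (is.list(schema[["items"]])) {
    out <- c(out, missing_required(schema[["items"]], paste0(path, "/items")))
  }
  out
}

schema <- list(
  type = "object",
  additionalProperties = FALSE,
  properties = list(
    title = list(type = "string"),
    year  = list(type = "integer"),
    tags  = list(type = "array", items = list(type = "string"))
  ),
  required = list("title", "year")   # "tags" left out on purpose
)

missing_required(schema)
```

An empty result means every declared property is also listed under required at every level.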

3) DeepSeek: JSON-object mode with local validation

safe({
  library(LLMR)
  schema <- list(
    type="object",
    properties=list(answer=list(type="string"), confidence=list(type="number")),
    required=list("answer","confidence"),
    additionalProperties=FALSE
  )

  cfg <- llm_config("deepseek", "deepseek-chat")
  cfg <- enable_structured_output(cfg, schema = schema, name = "llmr_schema")

  res <- call_llm(cfg, c(
    system = "Return only the JSON object that matches the schema.",
    user   = "Answer: capital of Japan; include confidence in [0,1]."
  ))

  parsed <- llm_parse_structured(res)
  str(parsed)
})
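When a provider offers only JSON-object mode, the schema check has to run locally. The sketch below calls jsonvalidate directly with the AJV engine on two hand-written strings standing in for model output; it does not use LLMR's own validation helper, and the sample strings are invented for illustration.

```r
library(jsonlite)
library(jsonvalidate)

schema <- list(
  type = "object",
  properties = list(answer = list(type = "string"),
                    confidence = list(type = "number")),
  required = list("answer", "confidence"),
  additionalProperties = FALSE
)
# jsonvalidate takes the schema as a JSON string.
schema_json <- as.character(toJSON(schema, auto_unbox = TRUE))

# Invented stand-ins for model output.
good <- '{"answer": "Tokyo", "confidence": 0.98}'
bad  <- '{"answer": "Tokyo", "confidence": "high", "extra": 1}'

ok_good <- json_validate(good, schema_json, engine = "ajv")

# verbose = TRUE attaches an "errors" attribute; greedy = TRUE reports all errors.
ok_bad <- json_validate(bad, schema_json, engine = "ajv",
                        verbose = TRUE, greedy = TRUE)

as.logical(ok_good)
as.logical(ok_bad)
attr(ok_bad, "errors")
```

The second string should fail on two counts: confidence has the wrong type, and extra is not allowed by additionalProperties = FALSE.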

4) Groq: another structured output example

safe({
  library(LLMR)

  cfg <- llm_config(
    "groq", "openai/gpt-oss-20b"
  )

  schema <- list(
    type = "object",
    properties = list(name = list(type = "string"),
                      score = list(type = "number")),
    required = list("name", "score"),
    additionalProperties = FALSE
  )
  cfg_json <- enable_structured_output(cfg, schema = schema)

  res <- call_llm(cfg_json, c(
    system = "Reply as JSON only.",
    user   = "Produce fields name and score about 'MNIST'."
  ))
  str(llm_parse_structured(res))
})

Defensive patterns (no API calls)

safe({
  library(LLMR); library(tibble)

  messy <- c(
    '```json\n{"x": 1, "y": [1,2,3]}\n```',
    'Sure! Here is JSON: {"x":"1","y":"oops"} trailing words',
    '{"x":1, "y":[2,3,4]}'
  )

  tibble(response_text = messy) |>
    llm_parse_structured_col(
      fields = c(x = "x", y = "/y/0")   # dot/bracket or JSON Pointer
    ) |>
    print(n = Inf)
})

Why this helps: it works when outputs arrive fenced, with pre/post text, or when arrays sneak in. Non-scalars become list-columns (set allow_list = FALSE to force scalars only).
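For readers unfamiliar with JSON Pointer paths such as "/y/0", here is a minimal resolver over a parsed document. resolve_pointer() is a hypothetical helper, not LLMR's code; it follows the RFC 6901 escapes (~1 for "/", ~0 for "~"), uses 0-based array indices, and simplifies by not handling the pointer "/" (the empty-string key).

```r
library(jsonlite)

# Hypothetical helper (not LLMR's code): walk a JSON Pointer through a document
# parsed with simplifyVector = FALSE; return NULL when the path does not exist.
resolve_pointer <- function(x, pointer) {
  if (identical(pointer, "")) return(x)
  tokens <- strsplit(sub("^/", "", pointer), "/", fixed = TRUE)[[1]]
  for (tok in tokens) {
    if (!is.list(x)) return(NULL)               # scalars have no children
    tok <- gsub("~1", "/", tok, fixed = TRUE)   # unescape, ~1 before ~0
    tok <- gsub("~0", "~", tok, fixed = TRUE)
    if (is.null(names(x))) {
      # Unnamed list: a JSON array, indexed from 0 in the pointer.
      if (!grepl("^[0-9]+$", tok)) return(NULL)
      idx <- as.integer(tok) + 1L
      if (idx > length(x)) return(NULL)
      x <- x[[idx]]
    } else {
      # Named list: a JSON object.
      if (!tok %in% names(x)) return(NULL)
      x <- x[[tok]]
    }
  }
  x
}

clean <- fromJSON('{"x": 1, "y": [2, 3, 4]}', simplifyVector = FALSE)
messy <- fromJSON('{"x": "1", "y": "oops"}', simplifyVector = FALSE)

resolve_pointer(clean, "/y/0")
resolve_pointer(messy, "/y/0")
```

Returning NULL instead of raising an error keeps a single malformed row from stopping a whole batch.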

Parallel execution with schema validation

For production ETL workflows, combine schema validation with parallelization:

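library(LLMR); library(dplyr)

# Illustrative schema (assumed types) for the two fields hoisted below
schema <- list(
  type = "object",
  properties = list(label = list(type = "string"),
                    score = list(type = "number")),
  required = list("label", "score"),
  additionalProperties = FALSE
)

# Attach the schema to the config so the provider receives it
cfg_with_schema <- enable_structured_output(
  llm_config("deepseek", "deepseek-chat"),
  schema = schema
)

setup_llm_parallel(workers = 10)

# Assumes a large data frame `large_df` with a `text` column
large_df |>
  llm_mutate_structured(
    result,
    prompt = "Extract: {text}",
    .config = cfg_with_schema,
    .schema = schema,
    .fields = c("label", "score"),
    tries = 3  # auto-retry failures
  )

reset_llm_parallel()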

Rows are processed in parallel across the workers, with automatic retries and validation, so the same pattern scales to thousands of rows.
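The retry behaviour can be pictured with a generic wrapper. The sketch below is not how LLMR implements tries; with_retry() and flaky_call() are hypothetical names, and the exponential backoff schedule is an assumption chosen for illustration.

```r
# Hypothetical helper (not LLMR's code): call f() up to `tries` times, waiting
# base_wait * 2^(attempt - 1) seconds between failed attempts.
with_retry <- function(f, tries = 3, base_wait = 0.5) {
  stopifnot(tries >= 1)
  for (attempt in seq_len(tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    if (attempt < tries) Sys.sleep(base_wait * 2^(attempt - 1))
  }
  stop("all ", tries, " attempts failed: ", conditionMessage(result))
}

# Stand-in for a model call that returns invalid output twice, then succeeds.
calls <- 0
flaky_call <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("output did not match schema")
  '{"label": "spam", "score": 0.9}'
}

with_retry(flaky_call, tries = 3, base_wait = 0)
```

Because schema failures surface as errors at the call site, the retry wrapper needs no knowledge of the schema itself.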

Choosing the mode

Reporting / ETL / metrics: Schema mode; fail fast and retry.
Exploration / ad-hoc: JSON mode + recovery parser.
Lightweight field extraction: Tag mode with .tags; useful when strict schema support is unavailable or unnecessarily heavy.
Cross-provider code: Always wrap provider toggles with enable_structured_output() and run llm_parse_structured() + local validation.

References

OpenAI: Structured Outputs cookbook: https://cookbook.openai.com/examples/structured_outputs_intro
Groq: Structured Outputs: https://console.groq.com/docs/structured-outputs
Together: Structured Output: https://docs.together.ai/docs/json-mode
x.ai: Structured Output: https://docs.x.ai/docs/guides/structured-outputs
DeepSeek: JSON Mode: https://api-docs.deepseek.com/guides/json_mode
Anthropic: Messages API, tools & input_schema: https://platform.claude.com/docs/en/api/messages#body-tool-choice
Google Gemini: Structured Output: https://ai.google.dev/gemini-api/docs/structured-output