Architecture Chapter 10

Providers / Transports

Every LLM call goes through a ProviderHandle. The runtime stays wire-agnostic — request shapes are LlmRequest, response shapes are LlmResponse — and each provider crate owns the schema normalization, auth flow, transport, model policy, and cache-marker placement specific to its vendor.

Provider Boundary

The runtime hands every turn's LlmRequest to whichever ProviderHandle the session was opened with and expects back a normalized LlmResponse stream. Anything wire-specific stays inside the provider crate.

  • Schema normalization (Anthropic message blocks → unified content shape, OpenAI tool-call ids → canonical record).
  • Reasoning-detail replay (OpenAI Responses reasoning items, Anthropic thinking blocks).
  • Cache-marker placement (Anthropic cache_control, OpenAI prompt_cache_key).
  • Token-field interpretation (cached-read vs cache-write deltas, reasoning tokens, audio tokens).
  • Auth flow (API key, OAuth PKCE, device code).
  • Model policy (variant aliases, structured-output capability detection, thinking exposure).

Modes (standard, rlm) and canonical tool definitions never see provider-specific JSON. They operate on LlmRequest, LlmResponse, and ToolDefinition only.
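
A minimal sketch of that boundary follows. ProviderHandle, LlmRequest, and LlmResponse are the real names from this chapter; the method signature, stream type, and struct contents are illustrative stand-ins, not the actual trait surface.

// Hypothetical sketch of the boundary the runtime sees per turn. Everything
// wire-specific happens behind this call, inside the provider crate.
use futures::stream::BoxStream;

pub struct LlmRequest;   // normalized messages, tools, provider options
pub struct LlmResponse;  // normalized content blocks plus usage

pub struct ProviderHandle { /* state + auth + readiness + transport + model policy */ }

impl ProviderHandle {
    // The runtime hands over a normalized request and consumes a stream of
    // normalized response chunks; the shape here is illustrative only.
    pub async fn complete(
        &self,
        _request: LlmRequest,
    ) -> anyhow::Result<BoxStream<'static, anyhow::Result<LlmResponse>>> {
        unimplemented!("each provider crate supplies its own transport")
    }
}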

First-Party Providers

Four provider crates ship with the workspace, registering five provider kinds — lash-provider-openai contributes both the direct Responses path and the generic Chat Completions path. The CLI registers all of them at boot via lash-providers-builtin; app hosts pick the ones they need.

  • OpenAI API (direct), kind openai: Bearer API-key auth against https://api.openai.com/v1/responses. Responses-only path; owns Responses reasoning replay and prompt-cache fields.
  • OpenAI-compatible, kind openai-compatible: Bearer API-key auth against a caller-supplied base_url; posts Chat Completions to {base_url}/chat/completions. Used for OpenRouter, Together, vLLM, etc.
  • Codex subscription, kind codex: OAuth device-code flow against the ChatGPT Codex Responses backend with Codex-specific headers.
  • Anthropic, kind anthropic: API-key auth against /v1/messages with the Anthropic version header and beta flags.
  • Google Gemini / Code Assist, kind google_oauth: Google OAuth PKCE / manual-code flow against Code Assist generateContent / streamGenerateContent.

lash_providers_builtin::register_all() is the one-call aggregator the CLI and app hosts use to register all five factories with the global provider registry at process start.
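
Host-side boot looks roughly like this. Only register_all() is named in this chapter; the surrounding comments describe the intent rather than a specific registry API.

// Sketch of host startup. register_all() is the real aggregator; how a session
// later selects a registered provider kind is elided here.
fn init_providers() {
    // Registers all five built-in factories (openai, openai-compatible, codex,
    // anthropic, google_oauth) with the global provider registry.
    lash_providers_builtin::register_all();

    // App hosts that need only a subset register the individual provider
    // crates instead of calling the aggregator.
}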

OpenAI Provider Split

OpenAI ships as two distinct provider kinds because the Responses API and the Chat Completions API are different enough to deserve different code paths.

openai

Direct OpenAI Responses. Posts to https://api.openai.com/v1/responses, keeps Responses reasoning replay, and maps shared ProviderOptions.cache_retention to prompt_cache_key derived from the Lash session id. Long retention adds prompt_cache_retention where the API supports it. No base_url accepted.
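
Roughly, the cache mapping lands in the Responses request body as sketched below. prompt_cache_key and prompt_cache_retention are the OpenAI field names; the key derivation and the retention value are illustrative, since the chapter only says the key is derived from the Lash session id.

use serde_json::{json, Value};

// Illustrative sketch of the cache-related Responses fields.
fn responses_cache_fields(session_id: &str, long_retention: bool) -> Value {
    let mut body = json!({
        // Stable per-session key so repeated turns reuse the same prompt cache.
        "prompt_cache_key": format!("lash-session-{session_id}"),
    });
    if long_retention {
        // Longer-lived caching where the API supports it (value shown is illustrative).
        body["prompt_cache_retention"] = json!("24h");
    }
    body
}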

openai-compatible

Generic Chat Completions. Requires base_url. Converts LlmRequest to a messages array, emits Chat Completions tools, maps structured output to response_format, and preserves OpenRouter reasoning effort through the reasoning.effort request field. Used for OpenRouter, vLLM, Together, Groq, etc.
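
As a rough illustration of that mapping: messages, tools, response_format, and reasoning.effort are the Chat Completions / OpenRouter request fields, while the builder itself is a hypothetical sketch with payloads abbreviated.

use serde_json::{json, Value};

// Hypothetical sketch of the body posted to {base_url}/chat/completions.
// The real conversion lives in lash-provider-openai.
fn chat_completions_body(model: &str, messages: Value, tools: Value, effort: &str) -> Value {
    json!({
        "model": model,
        "messages": messages,   // LlmRequest history flattened into chat messages
        "tools": tools,         // canonical ToolDefinitions in Chat Completions form
        // Structured output maps to response_format (json_schema payload elided here).
        "response_format": { "type": "json_schema" },
        // OpenRouter reasoning-effort passthrough.
        "reasoning": { "effort": effort },
        "stream": true
    })
}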

Claude Cache Markers

For Anthropic and OpenRouter Claude on Chat Completions, shared ProviderOptions.cache_retention controls Anthropic-style cache_control markers in the request:

none

No markers emitted. Each request is treated as fresh.

short

Emits {"type":"ephemeral"} at the canonical breakpoints. Default 5-minute cache lifetime per Anthropic semantics.

long

Adds "ttl":"1h" to the ephemeral marker for longer-lived caching.

Breakpoints are placed at:

  1. The first system/developer text message.
  2. The last tool definition in the request.
  3. Any explicit LlmContentBlock::Text.cache_breakpoint the runtime asks for.

When no explicit breakpoint is set, the provider falls back to the last user/assistant text content so prompt caching still works for sessions without explicit cache instrumentation.
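
The marker itself is small. The sketch below shows what gets attached at a breakpoint for each retention setting; the cache_control payloads are the Anthropic shapes described above, the helper function is illustrative.

use serde_json::{json, Value};

// Illustrative: the marker emitted at a breakpoint for each retention mode.
// short => ephemeral (default ~5-minute lifetime); long => ephemeral with a 1-hour ttl.
fn cache_control(retention: &str) -> Option<Value> {
    match retention {
        "short" => Some(json!({ "type": "ephemeral" })),
        "long"  => Some(json!({ "type": "ephemeral", "ttl": "1h" })),
        _       => None, // "none": no markers, every request treated as fresh
    }
}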

RLM Prompt Caching

RLM projects chronological history as append-only chat-shaped messages: user inputs remain user messages, prior RLM steps become assistant messages, tool observations become user messages, and the mutable current-iteration/finalization prompt stays as the final user message. The rolling cache_breakpoint is placed on the last stable history text block, so OpenRouter Claude caches a real prefix instead of a rewritten history blob each turn.

The result: long RLM sessions get the same cache-hit rate as native multi-turn chats, even though Lashlang-driven reasoning regenerates a fresh prompt on every iteration.
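
A schematic of that projection, assuming hypothetical block and field names (the real shapes are the provider-facing LlmRequest types):

// Illustrative only: how an RLM turn is projected into chat-shaped messages.
// The rolling breakpoint sits on the last *stable* history block, so the
// mutable final prompt never invalidates the cached prefix.
struct ChatBlock {
    role: &'static str,      // "user" | "assistant"
    text: String,
    cache_breakpoint: bool,  // hypothetical flag mirroring LlmContentBlock::Text.cache_breakpoint
}

fn project_turn(history: Vec<ChatBlock>, current_prompt: String) -> Vec<ChatBlock> {
    let mut messages = history;                  // append-only: prior steps stay verbatim
    if let Some(last_stable) = messages.last_mut() {
        last_stable.cache_breakpoint = true;     // rolling breakpoint on the stable prefix
    }
    messages.push(ChatBlock {
        role: "user",
        text: current_prompt,                    // regenerated every iteration, never cached
        cache_breakpoint: false,
    });
    messages
}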

Usage And Cost Inputs

Every provider returns a normalized LlmUsage to the runtime usage ledger after each completion. Chat-parsing handles both streaming and non-streaming usage chunks, including OpenRouter cache fields. Cache-write tokens are not counted as cached reads, so downstream cost/export code doesn't overstate cache hits.

pub struct LlmUsage {
    pub input_tokens: u64,
    pub output_tokens: u64,
    pub cached_input_tokens: u64,  // Reads from cache
    pub reasoning_tokens: u64,
    // …provider-specific extras flow through the extended trace
}
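
A downstream cost consumer can then treat cached reads as a discount without double counting cache writes. A sketch, assuming input_tokens includes the cached reads (adjust if a provider reports them disjointly) and illustrative per-token rates:

// Sketch of a cost/export consumer reading LlmUsage.
fn estimate_cost(u: &LlmUsage, input_rate: f64, cached_rate: f64, output_rate: f64) -> f64 {
    let uncached_input = u.input_tokens.saturating_sub(u.cached_input_tokens);
    uncached_input as f64 * input_rate
        + u.cached_input_tokens as f64 * cached_rate
        + u.output_tokens as f64 * output_rate
}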

Adding A Provider

New providers implement five focused components and a factory:

State

Per-handle config: API key, base URL, default model, options like thinking exposure.

Auth

Bearer header, OAuth refresh, device-code flow — whatever the vendor requires. Auth state is opaque to the runtime.

Readiness

Optional pre-flight check (token refresh, capability probe) that runs once per session.

Transport

The actual HTTP call. Translates LlmRequest to the wire format, streams the response, normalizes back to LlmResponse chunks.

Model policy

Maps user-facing model + variant names to provider-native ids, declares structured-output / tool-call / thinking capabilities per model.

Factory

Registers with the global provider registry at process start; ProviderHandle::new(components) assembles the five pieces into a handle the runtime can use.
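
Put together, a new provider crate ends up shaped roughly like this. Only ProviderHandle::new(components) is taken from this chapter; the component names below are illustrative stand-ins.

// Illustrative skeleton of a new provider crate.
pub struct MyState;       // API key, base URL, default model, option flags
pub struct MyAuth;        // bearer header / OAuth refresh / device code
pub struct MyReadiness;   // optional once-per-session pre-flight
pub struct MyTransport;   // LlmRequest -> wire format -> LlmResponse chunks
pub struct MyModels;      // variant aliases plus per-model capability flags

pub fn register() {
    // Factory registered at process start. The registry call itself is elided;
    // ProviderHandle::new(components) is the assembly point named above.
}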

See lash-provider-openai/src/lib.rs as the most general template — it handles both the direct Responses path and the generic Chat Completions path in one crate.