CachingLLMClient

org.llm4s.llmconnect.caching.CachingLLMClient
class CachingLLMClient(baseClient: LLMClient, embeddingClient: EmbeddingClient, embeddingModel: EmbeddingModelConfig, config: CacheConfig, tracing: Tracing, clock: Clock) extends LLMClient

Semantic caching wrapper for LLMClient.

Caches LLM completions based on the semantic similarity of the prompt request. Useful for reducing costs and latency for repetitive or similar queries.

== Usage ==

import scala.concurrent.duration._ // provides the 1.hour syntax used for the TTL

val cachingClient = new CachingLLMClient(
  baseClient = openAIClient,          // any existing LLMClient to delegate to on a miss
  embeddingClient = embeddingClient,  // used to embed prompts for similarity lookup
  embeddingModel = EmbeddingModelConfig("text-embedding-3-small", 1536),
  config = CacheConfig(
    similarityThreshold = 0.95,
    ttl = 1.hour,
    maxSize = 1000
  ),
  tracing = tracing
  // clock is omitted and defaults to UTC
)
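
Once constructed, the caching client is used like any other LLMClient. The sketch below is illustrative only; the Conversation and UserMessage constructors and their import path are assumptions about the surrounding llm4s model types:

import org.llm4s.llmconnect.model.{ Conversation, UserMessage } // assumed import path

val question = Conversation(Seq(UserMessage("Summarise our refund policy in two sentences.")))

// First call: cache miss, delegated to the wrapped client and stored in the cache.
val first = cachingClient.complete(question)

// A semantically similar request within the TTL, using identical CompletionOptions,
// is answered from the cache without calling the underlying provider.
val rephrased = Conversation(Seq(UserMessage("Give me a two-sentence summary of the refund policy.")))
val second    = cachingClient.complete(rephrased)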

== Behavior ==

  • Computes an embedding for the user/system prompt.
  • Searches the cache for entries whose cosine similarity to the request is at or above similarityThreshold (as sketched below).
  • Validates additional constraints:
    • the entry must be within its TTL;
    • the entry's CompletionOptions must exactly match the request options.
  • On hit: returns the cached Completion, updates the LRU order, and emits a cache_hit trace event.
  • On miss: delegates to baseClient, caches the result, and emits a cache_miss trace event.
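
The hit/miss decision can be pictured with the simplified sketch below. It is illustrative only: CacheEntry, cosineSimilarity, and lookup are assumed names, not the actual internals of CachingLLMClient.

import java.time.{ Clock, Instant }
import scala.concurrent.duration.FiniteDuration

// Illustrative stand-ins; not the real cache entry type.
final case class CacheEntry(
  embedding: Vector[Double], // embedding of the cached prompt
  options: String,           // stand-in for the CompletionOptions used for the cached call
  completion: String,        // stand-in for the cached Completion
  storedAt: Instant
)

def cosineSimilarity(a: Vector[Double], b: Vector[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

def lookup(
  queryEmbedding: Vector[Double],
  requestOptions: String,
  entries: Seq[CacheEntry], // linear scan: O(n) in the number of cached entries
  threshold: Double,
  ttl: FiniteDuration,
  clock: Clock
): Option[CacheEntry] =
  entries
    .filter(e => clock.instant().toEpochMilli - e.storedAt.toEpochMilli <= ttl.toMillis) // within TTL
    .filter(_.options == requestOptions)                                                 // options must match exactly
    .map(e => e -> cosineSimilarity(queryEmbedding, e.embedding))
    .filter { case (_, sim) => sim >= threshold }                                        // within similarityThreshold
    .sortBy { case (_, sim) => -sim }                                                    // most similar entry wins
    .headOption
    .map { case (entry, _) => entry }                                                    // hit: caller refreshes LRU order

On a hit the real client also refreshes the entry's LRU position; once maxSize is exceeded, the least recently used entry is evicted.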

== Limitations ==

  • streamComplete requests bypass the cache entirely.
  • Cache is in-memory and lost on restart.
  • Cache lookup involves a linear scan (O(n)) of all entries to calculate cosine similarity. Performance may degrade with very large maxSize.

Value parameters

baseClient

The underlying LLM client to delegate to on cache miss.

clock

Clock for TTL verification (defaults to UTC).

config

Cache configuration (threshold, TTL, max size).

embeddingClient

Client to generate embeddings for prompts.

embeddingModel

Configuration for the embedding model used.

tracing

Tracing instance for observability.

Attributes

Supertypes
trait LLMClient
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

override def close(): Unit

Releases resources and closes connections to the LLM provider.

Call when the client is no longer needed. After calling close(), the client should not be used. Default implementation is a no-op; override if managing resources like connections or thread pools.

Attributes

Definition Classes
LLMClient
override def complete(conversation: Conversation, options: CompletionOptions): Result[Completion]

Executes a blocking completion request and returns the full response.

Sends the conversation to the LLM and waits for the complete response. Use when you need the entire response at once or when streaming is not required.

Value parameters

conversation

conversation history including system, user, assistant, and tool messages

options

configuration including temperature, max tokens, tools, etc. (default: CompletionOptions())

Attributes

Returns

Right(Completion) with the model's response, or Left(LLMError) on failure

Definition Classes
LLMClient
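
Because the result is documented as Right(Completion) or Left(LLMError), a call can be handled like an Either. A minimal sketch, reusing a conversation value as in the usage example above:

cachingClient.complete(conversation) match {
  case Right(completion) =>
    println(completion)                   // full model response (possibly served from the cache)
  case Left(error) =>
    println(s"completion failed: $error") // LLMError describing the failure
}
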
override def getContextWindow(): Int

Returns the maximum context window size supported by this model in tokens.

The context window is the total tokens (prompt + completion) the model can process in a single request, including all conversation messages and the generated response.

Attributes

Returns

total context window size in tokens (e.g., 4096, 8192, 128000)

Definition Classes
LLMClient
override def getReserveCompletion(): Int

Returns the number of tokens reserved for the model's completion response.

This value is subtracted from the context window when calculating available tokens for prompts. Corresponds to the max_tokens or completion token limit configured for the model.

Attributes

Returns

number of tokens reserved for completion

Definition Classes
LLMClient
override def streamComplete(conversation: Conversation, options: CompletionOptions, onChunk: StreamedChunk => Unit): Result[Completion]

Executes a streaming completion request, invoking a callback for each chunk as it arrives.

Streams the response incrementally, calling onChunk for each token/chunk received. Enables real-time display of responses. Returns the final accumulated completion on success.

Value parameters

conversation

conversation history including system, user, assistant, and tool messages

onChunk

callback invoked for each chunk; called synchronously, avoid blocking operations

options

configuration including temperature, max tokens, tools, etc. (default: CompletionOptions())

Attributes

Returns

Right(Completion) with the complete accumulated response, or Left(LLMError) on failure

Definition Classes
LLMClient
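
A minimal streaming sketch (it simply prints each chunk's default string form; note that on CachingLLMClient streaming bypasses the cache, as listed under Limitations):

// Streaming goes straight to the wrapped client; nothing is cached.
val streamed = cachingClient.streamComplete(
  conversation,
  CompletionOptions(),
  onChunk = chunk => print(chunk) // invoked synchronously per StreamedChunk; keep this cheap
)
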
override def validate(): Result[Unit]

Validates client configuration and connectivity to the LLM provider.

May perform checks such as verifying API credentials, testing connectivity, and validating configuration. Default implementation returns success; override for provider-specific validation.

Attributes

Returns

Right(()) if validation succeeds, Left(LLMError) with details on failure

Definition Classes
LLMClient
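
For example, a caller might validate up front and close the client when done (a sketch):

// Fail fast if the wrapped provider configuration is unusable.
cachingClient.validate() match {
  case Right(())   => println("client ready")
  case Left(error) => println(s"validation failed: $error")
}

// ... use the client ...

// Release underlying resources once the client is no longer needed.
cachingClient.close()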

Inherited methods

Calculates available token budget for prompts after accounting for completion reserve and headroom.

Formula: (contextWindow - reserveCompletion) * (1 - headroom)

Headroom provides a safety margin for tokenization variations and message formatting overhead.

Value parameters

headroom

safety margin as percentage of prompt budget (default: HeadroomPercent.Standard ~10%)

Attributes

Returns

maximum tokens available for prompt content

Inherited from:
LLMClient
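
A worked instance of the formula, with illustrative figures only (the real values come from getContextWindow() and getReserveCompletion(), and the library's own rounding may differ):

// Illustrative figures for a large-context model.
val contextWindow     = 128000 // from getContextWindow()
val reserveCompletion = 4096   // from getReserveCompletion()
val headroom          = 0.1    // ~10% safety margin
val promptBudget      = ((contextWindow - reserveCompletion) * (1 - headroom)).toInt
// promptBudget == 111513 tokens available for prompt content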