org.llm4s.llmconnect.caching

Members list

Type members

Classlikes

sealed abstract case class CacheConfig

Configuration for the CachingLLMClient semantic cache.

Uses the sealed-abstract-case-class pattern to prevent direct construction and disable the generated copy method; always construct via CacheConfig.create, which validates all fields and returns a typed error on invalid input.
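A minimal, self-contained sketch of the sealed-abstract-case-class pattern described above. The names here (`Config`, its fields, the `String` error type) are illustrative stand-ins, not the actual llm4s definitions: the abstract modifier suppresses the generated `apply` and `copy`, so the only way to obtain an instance is the validating smart constructor.

```scala
// Illustrative sketch of the sealed-abstract-case-class pattern.
// Names and the String error type are hypothetical, not llm4s's own.
sealed abstract case class Config(threshold: Double, maxSize: Int)

object Config {
  // Smart constructor: validates fields and returns a typed error on bad input.
  def create(threshold: Double, maxSize: Int): Either[String, Config] =
    if (threshold < 0.0 || threshold > 1.0) Left(s"threshold out of range: $threshold")
    else if (maxSize <= 0) Left(s"maxSize must be positive: $maxSize")
    else Right(new Config(threshold, maxSize) {}) // anonymous subclass: only construction path
}
```

Because the case class is abstract, `Config(0.95, 100)` and `cfg.copy(...)` do not compile, which is what forces all construction through `create`.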

Value parameters

maxSize

Maximum number of entries in the in-memory cache. When the limit is reached the least-recently-used entry is evicted automatically.

similarityThreshold

Minimum cosine similarity [0.0, 1.0] for a cache hit. A value of 1.0 requires near-identical queries; lower values allow semantically similar but textually different queries to share a cached response.

ttl

Maximum age of a CacheEntry before it is considered expired and the cache is bypassed. Must be positive.

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object CacheConfig

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
case class CacheEntry(embedding: Seq[Double], response: Completion, timestamp: Instant, options: CompletionOptions)

An entry in the CachingLLMClient semantic cache.

Stores the embedding vector of the original query alongside the cached org.llm4s.llmconnect.model.Completion, so that later queries can be matched by cosine similarity rather than exact string equality.

Value parameters

embedding

L2-normalised embedding vector of the query that produced this entry. Used for cosine-similarity lookup against new queries. Dimensionality matches the configured embedding model.

options

The org.llm4s.llmconnect.model.CompletionOptions used to produce response. A cache hit requires an exact match on options; mismatched options (e.g. a different temperature or tool set) result in an OptionsMismatch miss and bypass the cache.

response

The org.llm4s.llmconnect.model.Completion returned by the LLM for the original query. This is the value returned to the caller on a cache hit.

timestamp

Wall-clock time when this entry was inserted. Compared against CacheConfig.ttl to determine whether the entry has expired.
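Since stored embeddings are L2-normalised, cosine similarity against a (likewise normalised) query vector reduces to a plain dot product. A minimal sketch of that arithmetic, independent of any llm4s types:

```scala
// Dot product of two equal-length vectors.
def dot(a: Seq[Double], b: Seq[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Scale a vector to unit L2 norm (zero vectors are returned unchanged).
def l2Normalise(v: Seq[Double]): Seq[Double] = {
  val norm = math.sqrt(v.map(x => x * x).sum)
  if (norm == 0.0) v else v.map(_ / norm)
}

// Cosine similarity: for pre-normalised vectors this is just dot(a, b).
def cosine(a: Seq[Double], b: Seq[Double]): Double =
  dot(l2Normalise(a), l2Normalise(b))
```

Normalising once at insertion time (as CacheEntry does) means each lookup pays only for the dot product, not the square roots.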

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Utility object for generating cache keys using secure hashing. Keys should be deterministic (same inputs always produce same key) and collision-resistant (different inputs shouldn't produce same key).

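A sketch of what such a deterministic, collision-resistant key function can look like using java.security.MessageDigest. The object's actual name and signature are not shown in this listing, so `cacheKey` here is a hypothetical stand-in:

```scala
import java.security.MessageDigest

// Deterministic, collision-resistant key: SHA-256 over model name and text.
// The NUL separator prevents ambiguous concatenations ("ab"+"c" vs "a"+"bc").
def cacheKey(text: String, modelName: String): String = {
  val digest = MessageDigest.getInstance("SHA-256")
  val bytes  = digest.digest(s"$modelName\u0000$text".getBytes("UTF-8"))
  bytes.map("%02x".format(_)).mkString // 64-char lowercase hex string
}
```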
Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
case class CacheStats(size: Int, hits: Long, misses: Long, totalRequests: Long, hitRatePercent: Double)

Value parameters

hitRatePercent

Percentage of requests served from cache (0.0 to 100.0).

hits

Total number of successful cache lookups.

misses

Total number of lookups that found no cached entry.

size

Current number of entries in the cache.

totalRequests

Combined sum of hits and misses.
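The derived fields follow directly from hits and misses; a minimal sketch of the arithmetic (guarding the empty-cache case):

```scala
// hitRatePercent = hits / (hits + misses) * 100, defined as 0.0 when
// no requests have been made yet.
def hitRatePercent(hits: Long, misses: Long): Double = {
  val total = hits + misses
  if (total == 0L) 0.0 else hits.toDouble / total * 100.0
}
```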

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
class CachedEmbeddingClient(baseClient: EmbeddingClient, cache: EmbeddingCache[Seq[Double]], keyGenerator: (String, String) => String)

A caching decorator for EmbeddingClient that avoids redundant provider calls.

On each embed call, input texts are checked against the cache first. Only texts that are not cached are forwarded to the base client — in a single batched request. Results are then stored in the cache and merged with cached hits before being returned, preserving the original input order.

Errors returned by the base client are never cached, so transient failures are retried on the next call.

Note: EmbeddingClient is a concrete class rather than a trait, so this wrapper cannot be used as a drop-in substitute for EmbeddingClient in APIs that require that type. A follow-up issue should extract an embedding service interface to allow proper decorator substitution.

Value parameters

baseClient

The underlying client used to generate embeddings on cache misses.

cache

The storage backend for the embedding vectors.

keyGenerator

Function that maps (text, modelName) to a cache key (defaults to SHA-256).

Attributes

Supertypes
class Object
trait Matchable
class Any
class CachingLLMClient(baseClient: LLMClient, embeddingClient: EmbeddingClient, embeddingModel: EmbeddingModelConfig, config: CacheConfig, tracing: Tracing, clock: Clock) extends LLMClient

Semantic caching wrapper for LLMClient.

Caches LLM completions based on the semantic similarity of the prompt request. Useful for reducing costs and latency for repetitive or similar queries.

== Usage ==

 val cachingClient = CacheConfig.create(
   similarityThreshold = 0.95,
   ttl = 1.hour,
   maxSize = 1000
 ).map { config =>
   new CachingLLMClient(
     baseClient = openAIClient,
     embeddingClient = embeddingClient,
     embeddingModel = EmbeddingModelConfig("text-embedding-3-small", 1536),
     config = config,
     tracing = tracing
   )
 }

== Behavior ==

  • Computes an embedding for the user/system prompt.
  • Searches the cache for entries within similarityThreshold.
  • Validates additional constraints:
    • The entry must be within TTL.
    • The entry's CompletionOptions must strictly match the request options.
  • On Hit: Returns the cached Completion and updates the LRU order. Emits a cache_hit trace event.
  • On Miss: Delegates to baseClient, caches the result, and emits a cache_miss trace event.

== Limitations ==

  • streamComplete requests bypass the cache entirely.
  • Cache is in-memory and lost on restart.
  • Cache lookup involves a linear scan (O(n)) of all entries to calculate cosine similarity. Performance may degrade with very large maxSize.
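The O(n) lookup described above can be sketched as follows. The types are simplified stand-ins (not the actual llm4s internals): `Entry` collapses CacheEntry, and options matching is modelled as string equality. Embeddings are assumed L2-normalised, so cosine similarity is a plain dot product.

```scala
import java.time.{Duration, Instant}

// Simplified stand-in for CacheEntry; `options` models CompletionOptions.
final case class Entry(embedding: Seq[Double], response: String,
                       timestamp: Instant, options: String)

// Linear scan: keep entries with matching options and within TTL, score each
// by dot product against the query, and return the best score >= threshold.
def lookup(entries: Seq[Entry], query: Seq[Double], options: String,
           threshold: Double, ttl: Duration, now: Instant): Option[Entry] =
  entries
    .filter(e => e.options == options &&
                 Duration.between(e.timestamp, now).compareTo(ttl) <= 0)
    .map(e => e -> e.embedding.zip(query).map { case (x, y) => x * y }.sum)
    .filter { case (_, score) => score >= threshold }
    .sortBy { case (_, score) => -score }
    .headOption
    .map { case (entry, _) => entry }
```

Every entry is scored on every lookup, which is why performance degrades as maxSize grows; an approximate-nearest-neighbour index would avoid the full scan at the cost of extra complexity.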

Value parameters

baseClient

The underlying LLM client to delegate to on cache miss.

clock

Clock for TTL verification (defaults to UTC).

config

Cache configuration (threshold, TTL, max size).

embeddingClient

Client to generate embeddings for prompts.

embeddingModel

Configuration for the embedding model used.

tracing

Tracing instance for observability.

Attributes

Supertypes
trait LLMClient
trait AutoCloseable
class Object
trait Matchable
class Any
trait EmbeddingCache[Embedding]

Generic trait for embedding storage backends.

Type parameters

Embedding

The type of the embedding representation (usually Seq[Double]).

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
class InMemoryEmbeddingCache[Embedding]
class InMemoryEmbeddingCache[Embedding](maxSize: Int, ttl: Option[FiniteDuration], clock: () => Long) extends EmbeddingCache[Embedding]

Thread-safe in-memory implementation of EmbeddingCache with LRU eviction.

Type parameters

Embedding

The embedding type (usually Seq[Double]).

Value parameters

maxSize

The maximum number of embeddings to store before evicting the least-recently-used entry.

ttl

Optional Time-To-Live for cache entries. Expired entries are lazily evicted on access.
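A minimal sketch of the LRU-eviction mechanic using java.util.LinkedHashMap in access order. This is illustrative only (the actual implementation is not shown here, and synchronization for thread safety is omitted):

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// accessOrder = true makes iteration order least-recently-accessed first;
// removeEldestEntry evicts that LRU entry once the map exceeds maxSize.
def lruCache[K, V](maxSize: Int): JMap[K, V] =
  new JLinkedHashMap[K, V](16, 0.75f, true) {
    override def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > maxSize
  }
```

Reads count as "use" in access order, so a `get` refreshes an entry's position just as a `put` does.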

Attributes

Supertypes
trait EmbeddingCache[Embedding]
class Object
trait Matchable
class Any