org.llm4s.chunking.SemanticChunker
See the SemanticChunker companion object
class SemanticChunker(embeddingClient: EmbeddingClient, modelConfig: EmbeddingModelConfig, similarityThreshold: Double, batchSize: Int) extends DocumentChunker
Semantic document chunker using embeddings.
Splits text at topic boundaries by:
- Breaking text into sentences
- Computing embeddings for each sentence
- Calculating cosine similarity between consecutive sentences
- Splitting where similarity drops below threshold
This produces the highest-quality chunks because splits follow semantic meaning rather than fixed sizes, but it requires an embedding provider.
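The boundary-detection strategy above can be sketched in plain Scala. This is a hedged illustration, not the library's actual implementation: embeddings are passed in as precomputed vectors (standing in for `EmbeddingClient` calls), and `cosine`/`split` are hypothetical helper names.

```scala
// Sketch of semantic splitting: compare consecutive sentence embeddings by
// cosine similarity and start a new chunk where similarity drops below the
// threshold. Illustrative only; the real SemanticChunker also batches
// embedding calls and respects ChunkingConfig sizing.
object SemanticSplitSketch {

  // Cosine similarity between two equal-length vectors.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    dot / (normA * normB)
  }

  // Group consecutive sentences; cut wherever similarity < threshold.
  def split(
      sentences: Seq[String],
      embeddings: Seq[Array[Double]],
      threshold: Double
  ): Seq[Seq[String]] =
    if (sentences.isEmpty) Seq.empty
    else {
      // Similarity between each adjacent pair of sentence embeddings.
      val sims = embeddings.zip(embeddings.tail).map { case (a, b) => cosine(a, b) }
      sentences.tail.zip(sims).foldLeft(Vector(Vector(sentences.head))) {
        case (chunks, (sentence, sim)) =>
          if (sim < threshold) chunks :+ Vector(sentence)  // topic boundary: new chunk
          else chunks.init :+ (chunks.last :+ sentence)    // same topic: extend chunk
      }
    }
}
```

For example, two sentences about cats followed by one about economics would split into two chunks at a threshold of 0.5, because the cat/economics embedding pair scores low similarity while the cat/cat pair scores high.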
Usage:

  val (provider, embeddingProviderCfg) = /* load embedding provider config */
  val embeddingClient = EmbeddingClient.from(provider, embeddingProviderCfg).getOrElse(???)
  val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
  val chunker = SemanticChunker(embeddingClient, modelConfig, similarityThreshold = 0.5)
  val chunks = chunker.chunk(documentText, ChunkingConfig(targetSize = 800))

  chunks.foreach { c =>
    println(s"[${c.index}] ${c.content.take(50)}...")
  }
Value parameters
- embeddingClient: Client for generating embeddings
- modelConfig: Model configuration for embeddings
- similarityThreshold: Minimum cosine similarity for consecutive sentences to stay in the same chunk (0.0-1.0)
- batchSize: Number of sentences to embed at once