SemanticChunker

org.llm4s.chunking.SemanticChunker
See the SemanticChunker companion object
class SemanticChunker(embeddingClient: EmbeddingClient, modelConfig: EmbeddingModelConfig, similarityThreshold: Double, batchSize: Int) extends DocumentChunker

Semantic document chunker using embeddings.

Splits text at topic boundaries by:

  1. Breaking text into sentences
  2. Computing embeddings for each sentence
  3. Calculating cosine similarity between consecutive sentences
  4. Splitting where similarity drops below threshold

This tends to produce the highest-quality chunks because boundaries follow changes in semantic meaning rather than fixed sizes, but it requires an embedding provider, since every sentence must be embedded.
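
The four steps above can be sketched in plain Scala. This is a minimal illustration under stated assumptions, not the llm4s implementation: the object SemanticSplitSketch, its regex sentence splitter, and the two-dimensional toyEmbed function are all hypothetical stand-ins, and a real run would obtain vectors from the embedding client instead.

```scala
// Illustrative sketch only; names here are hypothetical and not part of llm4s.
object SemanticSplitSketch {

  /** Cosine similarity of two equal-length vectors; 0.0 if either has zero norm. */
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  /** Step 1 (naive): break after '.', '!' or '?' when followed by whitespace. */
  def sentences(text: String): Vector[String] =
    text.split("(?<=[.!?])\\s+").map(_.trim).filter(_.nonEmpty).toVector

  /** Toy embedding (assumption): "cat" sentences on one axis, everything else on the other. */
  val toyEmbed: String => Vector[Double] = s =>
    if (s.toLowerCase.contains("cat")) Vector(1.0, 0.0) else Vector(0.0, 1.0)

  /** Steps 2-4: embed each sentence, compare neighbours, and start a new
    * group whenever similarity to the previous sentence is below threshold.
    */
  def split(
    sents: Vector[String],
    embed: String => Vector[Double],
    threshold: Double
  ): Vector[Vector[String]] =
    if (sents.isEmpty) Vector.empty
    else {
      val embs = sents.map(embed)
      sents.indices.tail.foldLeft(Vector(Vector(sents.head))) { (groups, i) =>
        if (cosine(embs(i - 1), embs(i)) < threshold) groups :+ Vector(sents(i))
        else groups.init :+ (groups.last :+ sents(i))
      }
    }
}
```

With the toy embedding and a threshold of 0.5, the text "Cats purr. A cat naps. Stocks fell. Markets dipped." splits into two groups of two sentences each, because the only similarity drop falls between the second and third sentence.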

Usage:

val (provider, embeddingProviderCfg) = /* load embedding provider config */
val embeddingClient = EmbeddingClient.from(provider, embeddingProviderCfg).getOrElse(???)
val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
val chunker = SemanticChunker(embeddingClient, modelConfig, similarityThreshold = 0.5)
val chunks = chunker.chunk(documentText, ChunkingConfig(targetSize = 800))

chunks.foreach { c =>
  println(s"[${c.index}] ${c.content.take(50)}...")
}

Value parameters

batchSize

Number of sentences to embed at once

embeddingClient

Client for generating embeddings

modelConfig

Model configuration for embeddings

similarityThreshold

Minimum similarity to stay in same chunk (0.0-1.0)
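
To make the batchSize parameter concrete, the sketch below shows one way sentences could be embedded in groups rather than one request per sentence. It is an assumption about the general technique, not a description of how SemanticChunker batches internally: BatchEmbedSketch, embedAll, and the embedBatch callback are hypothetical names standing in for a call to the embedding client.

```scala
// Illustrative sketch only; names here are hypothetical and not part of llm4s.
object BatchEmbedSketch {

  /** Embeds sentences in groups of at most batchSize, preserving input order.
    * embedBatch (assumption) stands in for one request to an embedding provider
    * and must return one vector per input sentence.
    */
  def embedAll(
    sentences: Seq[String],
    batchSize: Int
  )(embedBatch: Seq[String] => Seq[Vector[Double]]): Seq[Vector[Double]] = {
    require(batchSize > 0, "batchSize must be positive")
    sentences.grouped(batchSize).flatMap(embedBatch).toVector
  }
}
```

Under this scheme, five sentences with batchSize = 2 result in three calls to embedBatch. A larger batch size means fewer round trips, at the cost of larger individual requests.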

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

override def chunk(text: String, config: ChunkingConfig): Seq[DocumentChunk]

Split text into chunks.

Value parameters

config

Chunking configuration

text

Input text to chunk

Attributes

Returns

Sequence of document chunks

Definition Classes

Inherited methods

def chunkWithSource(text: String, sourceFile: String, config: ChunkingConfig): Seq[DocumentChunk]

Split text into chunks with source file metadata.

Value parameters

config

Chunking configuration

sourceFile

Source file name for metadata

text

Input text to chunk

Attributes

Returns

Sequence of document chunks with source metadata

Inherited from:
DocumentChunker