org.llm4s.chunking
Members list
Type members
Classlikes
Metadata about a chunk's origin and structure.
Value parameters
- endOffset: End character offset
- headings: Heading hierarchy (e.g., Seq("Chapter 1", "Section 2"))
- isCodeBlock: Whether chunk is from a code block
- language: Code language if applicable
- sourceFile: Original file name
- startOffset: Character offset in source
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: ChunkMetadata.type
Factory for creating document chunkers.
Provides convenient factory methods for creating different chunking strategies. Each strategy has different trade-offs between quality and performance.
Usage:
// Simple character-based chunking (fastest)
val simple = ChunkerFactory.simple()
// Sentence-aware chunking (recommended for most use cases)
val sentence = ChunkerFactory.sentence()
// Markdown-aware chunking (preserves structure)
val markdown = ChunkerFactory.markdown()
// Semantic chunking (highest quality, requires embeddings)
val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
val semantic = ChunkerFactory.semantic(embeddingClient, modelConfig)
// Auto-detect based on content
val auto = ChunkerFactory.auto(text)
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Self type: ChunkerFactory.type
Configuration for document chunking.
Value parameters
- maxSize: Maximum chunk size (hard limit, will force split)
- minChunkSize: Minimum size for a chunk (smaller chunks are merged)
- overlap: Characters to overlap between chunks
- preserveCodeBlocks: Keep code blocks intact if possible
- preserveHeadings: Include heading context in metadata
- targetSize: Target chunk size in characters (soft limit)
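The interaction between the hard limit (maxSize) and the merge rule (minChunkSize) can be illustrated with a small standalone sketch. This is not the llm4s implementation: the object name SizeLimitsSketch, the normalize function, and the choice to merge an undersized piece into its predecessor are all assumptions made for illustration.

```scala
object SizeLimitsSketch {

  /** Hypothetical helper: apply a hard size limit and a minimum size to candidate pieces.
    * Pieces longer than maxSize are force-split; pieces shorter than minChunkSize are
    * merged into the preceding piece when the merged result still fits within maxSize.
    */
  def normalize(pieces: Seq[String], minChunkSize: Int, maxSize: Int): Vector[String] = {
    require(maxSize > 0, "maxSize must be positive")
    require(minChunkSize >= 0 && minChunkSize <= maxSize, "minChunkSize must be in [0, maxSize]")

    // Hard limit: cut every piece into runs of at most maxSize characters.
    val split = pieces.flatMap(_.grouped(maxSize))

    // Merge rule: fold undersized pieces into their predecessor when the hard limit allows it.
    split.foldLeft(Vector.empty[String]) { (acc, piece) =>
      acc.lastOption match {
        case Some(prev) if piece.length < minChunkSize && prev.length + piece.length <= maxSize =>
          acc.init :+ (prev + piece)
        case _ =>
          acc :+ piece
      }
    }
  }
}
```

With minChunkSize = 3 and maxSize = 5, the pieces "aaaaaaa", "bb", "cccc" become "aaaaa", "aabb", "cccc": the first piece is force-split, and the short "bb" is merged into the leftover "aa" because the result stays within the hard limit.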
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: ChunkingConfig.type
A chunk of a document with metadata.
Value parameters
- content: The chunk text
- index: Position in original document (0-indexed)
- metadata: Preserved structure information
Attributes
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Document chunking strategy.
Implementations split text into manageable chunks for embedding and retrieval. Different strategies optimize for different content types:
- SimpleChunker: Basic character-based splitting
- SentenceChunker: Respects sentence boundaries
- MarkdownChunker: Preserves markdown structure
- SemanticChunker: Splits at topic boundaries using embeddings
Usage:
val chunker = ChunkerFactory.sentence()
val config = ChunkingConfig(targetSize = 800, overlap = 150)
val chunks = chunker.chunk(documentText, config)
chunks.foreach { chunk =>
  println(s"Chunk ${chunk.index}: ${chunk.content.take(50)}...")
}
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Known subtypes
Markdown-aware document chunker.
Preserves markdown structure by:
- Respecting heading boundaries (# through ######)
- Keeping code blocks intact when possible
- Tracking heading hierarchy in chunk metadata
- Preserving list structure
This chunker produces higher-quality chunks for markdown content because it follows the document's structure instead of splitting at arbitrary character offsets.
Usage:
val chunker = MarkdownChunker()
val chunks = chunker.chunk(markdownText, ChunkingConfig(targetSize = 800))
chunks.foreach { c =>
  val headingPath = c.metadata.headings.mkString(" > ")
  println(s"[$headingPath] ${c.content.take(50)}...")
}
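To make the idea of a heading hierarchy concrete, here is a minimal standalone sketch of how a heading path could be tracked while scanning markdown line by line. It is an illustration only, not MarkdownChunker's actual logic: HeadingPathSketch and annotate are hypothetical names, and the sketch deliberately ignores fenced code blocks and setext-style headings.

```scala
object HeadingPathSketch {

  // ATX heading: one to six '#' characters, whitespace, then the title.
  private val Heading = """^(#{1,6})\s+(.*\S)\s*$""".r

  /** Hypothetical helper: pair each non-blank, non-heading line with the heading path
    * (outermost heading first) that is in effect at that line.
    */
  def annotate(markdown: String): Vector[(Vector[String], String)] = {
    // State: a stack of (level, title) for the open headings, plus the annotated lines so far.
    val start = (Vector.empty[(Int, String)], Vector.empty[(Vector[String], String)])

    val (_, annotated) = markdown.linesIterator.foldLeft(start) {
      case ((stack, out), Heading(hashes, title)) =>
        // A new heading closes every open heading at the same or a deeper level.
        (stack.filter(_._1 < hashes.length) :+ (hashes.length -> title), out)
      case ((stack, out), line) if line.trim.isEmpty =>
        (stack, out)
      case ((stack, out), line) =>
        (stack, out :+ (stack.map(_._2) -> line))
    }
    annotated
  }
}
```

For the input "# Chapter 1", "intro", "## Section 2", "body", "# Chapter 2", "end" (one item per line), the line "body" is paired with Vector("Chapter 1", "Section 2"), and "end" is paired with Vector("Chapter 2") because the second top-level heading closes both earlier headings.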
Attributes
- Companion: object
- Supertypes
Attributes
- Companion: class
- Supertypes: class Object, trait Matchable, class Any
- Self type: MarkdownChunker.type
Semantic document chunker using embeddings.
Splits text at topic boundaries by:
- Breaking text into sentences
- Computing embeddings for each sentence
- Calculating cosine similarity between consecutive sentences
- Splitting where similarity drops below threshold
This produces the highest-quality chunks because split points follow changes in meaning rather than surface punctuation, but it requires an embedding provider.
Usage:
val (provider, embeddingProviderCfg) = /* load embedding provider config */
val embeddingClient = EmbeddingClient.from(provider, embeddingProviderCfg).getOrElse(???)
val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
val chunker = SemanticChunker(embeddingClient, modelConfig, similarityThreshold = 0.5)
val chunks = chunker.chunk(documentText, ChunkingConfig(targetSize = 800))
chunks.foreach { c =>
  println(s"[${c.index}] ${c.content.take(50)}...")
}
Value parameters
- batchSize: Number of sentences to embed at once
- embeddingClient: Client for generating embeddings
- modelConfig: Model configuration for embeddings
- similarityThreshold: Minimum similarity to stay in same chunk (0.0-1.0)
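The role of similarityThreshold can be sketched without an embedding provider by supplying the sentence vectors directly. The following is a simplified illustration of the cosine-similarity split described above, not SemanticChunker's implementation: SimilaritySplitSketch, cosine, and split are hypothetical names, and the vectors would normally come from the embedding client.

```scala
object SimilaritySplitSketch {

  /** Cosine similarity of two equal-length vectors; defined as 0.0 if either has zero norm. */
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same dimension")
    val dot   = a.indices.map(i => a(i) * b(i)).sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  /** Hypothetical helper: group consecutive sentences, starting a new group whenever the
    * similarity between a sentence and the one before it drops below the threshold.
    */
  def split(
      sentences: Seq[String],
      vectors: Seq[Array[Double]],
      threshold: Double
  ): Vector[Vector[String]] = {
    require(sentences.length == vectors.length, "one vector per sentence is required")
    if (sentences.isEmpty) Vector.empty
    else {
      // Similarity between each sentence and its predecessor.
      val sims = vectors.zip(vectors.tail).map { case (prev, next) => cosine(prev, next) }
      sentences.tail.zip(sims).foldLeft(Vector(Vector(sentences.head))) {
        case (groups, (sentence, sim)) =>
          if (sim < threshold) groups :+ Vector(sentence)   // topic boundary: open a new group
          else groups.init :+ (groups.last :+ sentence)     // same topic: extend the last group
      }
    }
  }
}
```

With a threshold of 0.5 and the vectors (1, 0), (1, 0), (0, 1) for sentences A, B, C, the similarities are 1.0 and 0.0, so the result is two groups: A with B, then C alone. A higher threshold yields more, smaller groups.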
Attributes
- Companion: object
- Supertypes
Attributes
- Companion: class
- Supertypes: class Object, trait Matchable, class Any
- Self type: SemanticChunker.type
Sentence-aware document chunker.
Splits text at sentence boundaries to preserve semantic coherence. Uses pattern matching for sentence detection (periods, question marks, etc.) while handling edge cases like abbreviations and decimal numbers.
This chunker produces higher-quality chunks than simple character-based splitting because it never breaks in the middle of a sentence.
Usage:
val chunker = SentenceChunker()
val chunks = chunker.chunk(text, ChunkingConfig(targetSize = 800))
// Sentences are kept intact
chunks.foreach { c =>
  println(s"[${c.index}] ${c.content}")
}
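As a rough illustration of sentence detection that tolerates abbreviations and decimal numbers, the sketch below splits on whitespace-delimited tokens and then packs whole sentences into chunks. It is not SentenceChunker's actual pattern matching: SentenceSplitSketch, its abbreviation list, and the pack function are assumptions, and the sketch collapses runs of whitespace into single spaces.

```scala
object SentenceSplitSketch {

  // Hypothetical abbreviation list; a real chunker would need a broader one.
  private val Abbreviations = Set("Dr.", "Mr.", "Mrs.", "Ms.", "e.g.", "i.e.")

  /** Split text into sentences. A token ends a sentence when it finishes with '.', '!' or '?'
    * and is not a known abbreviation. Decimals such as 3.14 are safe because the period
    * sits inside the token rather than at its end.
    */
  def sentences(text: String): Vector[String] = {
    val tokens  = text.trim.split("\\s+").filter(_.nonEmpty)
    val out     = Vector.newBuilder[String]
    val current = new StringBuilder
    tokens.foreach { tok =>
      if (current.nonEmpty) current.append(' ')
      current.append(tok)
      val terminal = tok.endsWith(".") || tok.endsWith("!") || tok.endsWith("?")
      if (terminal && !Abbreviations.contains(tok)) {
        out += current.toString
        current.clear()
      }
    }
    if (current.nonEmpty) out += current.toString // trailing text without a terminator
    out.result()
  }

  /** Pack whole sentences into chunks of at most targetSize characters where possible.
    * A single sentence longer than targetSize is kept intact as its own chunk.
    */
  def pack(sents: Seq[String], targetSize: Int): Vector[String] =
    sents.foldLeft(Vector.empty[String]) { (acc, s) =>
      acc.lastOption match {
        case Some(prev) if prev.length + 1 + s.length <= targetSize => acc.init :+ s"$prev $s"
        case _                                                      => acc :+ s
      }
    }
}
```

For the text "Dr. Smith measured 3.14 volts. Was it stable? Yes!", sentences returns three sentences, and pack with targetSize = 20 keeps the 30-character first sentence whole while joining the last two into "Was it stable? Yes!".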
Attributes
- Companion: object
- Supertypes
Attributes
- Companion: class
- Supertypes: class Object, trait Matchable, class Any
- Self type: SentenceChunker.type
Simple character-based chunker wrapping legacy ChunkingUtils.
Provides compatibility with existing code while conforming to the new DocumentChunker interface. Splits text into fixed-size chunks without semantic awareness.
Use this chunker when:
- You need maximum compatibility with existing code
- Content has no clear sentence structure
- Performance is more important than quality
For better quality chunks, consider:
- SentenceChunker: Respects sentence boundaries
- MarkdownChunker: Preserves markdown structure
- SemanticChunker: Uses embedding similarity
Usage:
val chunker = SimpleChunker()
val chunks = chunker.chunk(text, ChunkingConfig(targetSize = 800, overlap = 150))
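Fixed-size splitting with overlap can be sketched in a few lines. The code below is a generic illustration of the technique and makes no claim about how ChunkingUtils or SimpleChunker implement it; OverlapSketch and windows are hypothetical names.

```scala
object OverlapSketch {

  /** Hypothetical helper: cut text into windows of at most targetSize characters, where each
    * window starts (targetSize - overlap) characters after the previous one, so consecutive
    * windows share `overlap` characters.
    */
  def windows(text: String, targetSize: Int, overlap: Int): Vector[String] = {
    require(targetSize > 0, "targetSize must be positive")
    require(overlap >= 0 && overlap < targetSize, "overlap must be in [0, targetSize)")

    val step  = targetSize - overlap
    val out   = Vector.newBuilder[String]
    var start = 0
    var done  = text.isEmpty
    while (!done) {
      val end = math.min(start + targetSize, text.length)
      out += text.substring(start, end)
      done = end == text.length // stop once a window reaches the end of the text
      start += step
    }
    out.result()
  }
}
```

Calling OverlapSketch.windows("abcdefghij", targetSize = 4, overlap = 1) yields "abcd", "defg", "ghij": each window repeats the last character of the one before it, which is the context that the overlap setting is meant to preserve across chunk boundaries.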
Attributes
- Companion: object
- Supertypes
Attributes
- Companion: class
- Supertypes: class Object, trait Matchable, class Any
- Self type: SimpleChunker.type