org.llm4s.chunking

Members list

Type members

Classlikes

final case class ChunkMetadata(sourceFile: Option[String], startOffset: Option[Int], endOffset: Option[Int], headings: Seq[String], isCodeBlock: Boolean, language: Option[String])

Metadata about a chunk's origin and structure.

Value parameters

endOffset

End character offset in source

headings

Heading hierarchy (e.g., Seq("Chapter 1", "Section 2"))

isCodeBlock

Whether chunk is from a code block

language

Code language if applicable

sourceFile

Original file name

startOffset

Start character offset in source
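
For illustration, a ChunkMetadata value can be constructed directly from these fields (the values below are made up):

val meta = ChunkMetadata(
  sourceFile = Some("guide.md"),
  startOffset = Some(0),
  endOffset = Some(812),
  headings = Seq("Chapter 1", "Section 2"),
  isCodeBlock = false,
  language = None
)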

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object ChunkMetadata

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type

object ChunkerFactory

Factory for creating document chunkers.

Provides convenient factory methods for creating different chunking strategies. Each strategy makes a different trade-off between quality and performance.

Usage:

// Simple character-based chunking (fastest)
val simple = ChunkerFactory.simple()

// Sentence-aware chunking (recommended for most use cases)
val sentence = ChunkerFactory.sentence()

// Markdown-aware chunking (preserves structure)
val markdown = ChunkerFactory.markdown()

// Semantic chunking (highest quality, requires embeddings)
val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
val semantic = ChunkerFactory.semantic(embeddingClient, modelConfig)

// Auto-detect based on content
val auto = ChunkerFactory.auto(text)
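
Whichever factory method is used, the resulting chunker is applied with a ChunkingConfig. A minimal sketch, assuming the chunk method shape shown in the DocumentChunker usage further down this page:

// Chunk a document with the sentence-aware strategy created above
val chunks = sentence.chunk(text, ChunkingConfig(targetSize = 800, overlap = 150))
println(s"Produced ${chunks.size} chunks")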

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
final case class ChunkingConfig(targetSize: Int, maxSize: Int, overlap: Int, minChunkSize: Int, preserveCodeBlocks: Boolean, preserveHeadings: Boolean)

Configuration for document chunking.

Value parameters

maxSize

Maximum chunk size in characters (hard limit; will force a split)

minChunkSize

Minimum chunk size in characters (smaller chunks are merged)

overlap

Characters to overlap between chunks

preserveCodeBlocks

Keep code blocks intact if possible

preserveHeadings

Include heading context in metadata

targetSize

Target chunk size in characters (soft limit)
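
For illustration, a fully explicit configuration built from these parameters (the values are arbitrary examples, not recommended defaults):

val config = ChunkingConfig(
  targetSize = 800,
  maxSize = 1200,
  overlap = 150,
  minChunkSize = 100,
  preserveCodeBlocks = true,
  preserveHeadings = true
)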

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

object ChunkingConfig

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
final case class DocumentChunk(content: String, index: Int, metadata: ChunkMetadata)

A chunk of a document with metadata.

Value parameters

content

The chunk text

index

Position in original document (0-indexed)

metadata

Preserved structure information
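
Chunks are normally produced by a DocumentChunker, but a value can be built by hand, e.g. for tests; an illustrative sketch:

val chunk = DocumentChunk(
  content = "Chunked text goes here.",
  index = 0,
  metadata = ChunkMetadata(None, None, None, Seq("Chapter 1"), isCodeBlock = false, language = None)
)
println(s"Chunk ${chunk.index} under '${chunk.metadata.headings.mkString(" > ")}'")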

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

trait DocumentChunker

Document chunking strategy.

Implementations split text into manageable chunks for embedding and retrieval. Different strategies optimize for different content types:

  • SimpleChunker: Basic character-based splitting
  • SentenceChunker: Respects sentence boundaries
  • MarkdownChunker: Preserves markdown structure
  • SemanticChunker: Splits at topic boundaries using embeddings

Usage:

val chunker = ChunkerFactory.sentence()
val config = ChunkingConfig(targetSize = 800, overlap = 150)
val chunks = chunker.chunk(documentText, config)

chunks.foreach { chunk =>
 println(s"Chunk $${chunk.index}: $${chunk.content.take(50)}...")
}
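
Because every strategy exposes the same interface, strategies can be swapped or compared on the same input. A small sketch, assuming the factory methods and chunk signature shown above:

val config = ChunkingConfig(targetSize = 800, overlap = 150)
val bySimple   = ChunkerFactory.simple().chunk(documentText, config)
val bySentence = ChunkerFactory.sentence().chunk(documentText, config)
println(s"simple: ${bySimple.size} chunks, sentence: ${bySentence.size} chunks")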

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
class MarkdownChunker
class SemanticChunker
class SentenceChunker
class SimpleChunker

class MarkdownChunker

Markdown-aware document chunker.

Preserves markdown structure by:

  • Respecting heading boundaries (# through ######)
  • Keeping code blocks intact when possible
  • Tracking heading hierarchy in chunk metadata
  • Preserving list structure

This chunker produces higher quality chunks for markdown content because it understands document structure.

Usage:

val chunker = MarkdownChunker()
val chunks = chunker.chunk(markdownText, ChunkingConfig(targetSize = 800))

chunks.foreach { c =>
  val headingPath = c.metadata.headings.mkString(" > ")
  println(s"[$headingPath] ${c.content.take(50)}...")
}
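
Because code blocks are tracked in the metadata (see ChunkMetadata above), they can be filtered out after chunking; an illustrative follow-up:

val codeChunks = chunks.filter(_.metadata.isCodeBlock)
codeChunks.foreach { c =>
  println(s"${c.metadata.language.getOrElse("unknown")}: ${c.content.take(40)}...")
}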

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any

object MarkdownChunker

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
class SemanticChunker(embeddingClient: EmbeddingClient, modelConfig: EmbeddingModelConfig, similarityThreshold: Double, batchSize: Int) extends DocumentChunker

Semantic document chunker using embeddings.

Splits text at topic boundaries by:

  1. Breaking text into sentences
  2. Computing embeddings for each sentence
  3. Calculating cosine similarity between consecutive sentences
  4. Splitting where similarity drops below threshold

This produces the highest quality chunks because it understands semantic meaning, but requires an embedding provider.
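
A minimal sketch of the similarity test behind steps 3 and 4 (illustrative only; the chunker's internal implementation is not shown here and may differ):

def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Start a new chunk when consecutive sentence embeddings fall below the threshold
val sentenceA = Array(0.12, 0.88, 0.31)
val sentenceB = Array(0.75, 0.10, 0.05)
val splitHere = cosineSimilarity(sentenceA, sentenceB) < 0.5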

Usage:

val (provider, embeddingProviderCfg) = /* load embedding provider config */
val embeddingClient = EmbeddingClient.from(provider, embeddingProviderCfg).getOrElse(???)
val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
val chunker = SemanticChunker(embeddingClient, modelConfig, similarityThreshold = 0.5)
val chunks = chunker.chunk(documentText, ChunkingConfig(targetSize = 800))

chunks.foreach { c =>
 println(s"[$${c.index}] $${c.content.take(50)}...")
}

Value parameters

batchSize

Number of sentences to embed at once

embeddingClient

Client for generating embeddings

modelConfig

Model configuration for embeddings

similarityThreshold

Minimum similarity required to stay in the same chunk (0.0-1.0)

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any

object SemanticChunker

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type

class SentenceChunker

Sentence-aware document chunker.

Splits text at sentence boundaries to preserve semantic coherence. Uses pattern matching for sentence detection (periods, question marks, etc.) while handling edge cases like abbreviations and decimal numbers.

This chunker produces higher quality chunks than simple character-based splitting because it never breaks in the middle of a sentence.
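
A hedged sketch of the kind of boundary pattern described above (the chunker's actual pattern is not shown on this page and may differ):

// Split after '.', '!' or '?' followed by whitespace and a capital letter.
// This avoids splitting inside decimals such as 3.14; real abbreviation
// handling needs additional rules.
val boundary  = raw"(?<=[.!?])\s+(?=[A-Z])".r
val sentences = boundary.split("It rained until 3.14 PM. Everyone waited! Then the sun appeared.")
sentences.foreach(println)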

Usage:

val chunker = SentenceChunker()
val chunks = chunker.chunk(text, ChunkingConfig(targetSize = 800))

// Sentences are kept intact
chunks.foreach { c =>
 println(s"[$${c.index}] $${c.content}")
}

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any

object SentenceChunker

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type

class SimpleChunker

Simple character-based chunker wrapping legacy ChunkingUtils.

Provides compatibility with existing code while conforming to the new DocumentChunker interface. Splits text into fixed-size chunks without semantic awareness.

Use this chunker when:

  • You need maximum compatibility with existing code
  • Content has no clear sentence structure
  • Performance is more important than quality

For better quality chunks, consider:

  • SentenceChunker: Respects sentence boundaries
  • MarkdownChunker: Preserves markdown structure
  • SemanticChunker: Uses embedding similarity

Usage:

val chunker = SimpleChunker()
val chunks = chunker.chunk(text, ChunkingConfig(targetSize = 800, overlap = 150))
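
For intuition, fixed-size splitting with character overlap can be sketched as follows (illustrative only; the exact behaviour of ChunkingUtils, including metadata handling, may differ):

def fixedSizeSplit(text: String, size: Int, overlap: Int): Seq[String] = {
  require(size > overlap, "chunk size must be larger than the overlap")
  val step = size - overlap // how far the window advances each time
  (0 until text.length by step).map(i => text.slice(i, i + size))
}

val pieces = fixedSizeSplit("a" * 2000, size = 800, overlap = 150)
println(pieces.map(_.length)) // Vector(800, 800, 700, 50)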

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
object SimpleChunker

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type