org.llm4s.chunking

Members list

Type members

Classlikes

final case class ChunkMetadata(sourceFile: Option[String], startOffset: Option[Int], endOffset: Option[Int], headings: Seq[String], isCodeBlock: Boolean, language: Option[String])

Metadata about a chunk's origin and structure.

Value parameters

endOffset

End character offset in source

headings

Heading hierarchy (e.g., Seq("Chapter 1", "Section 2"))

isCodeBlock

Whether chunk is from a code block

language

Code language if applicable

sourceFile

Original file name

startOffset

Start character offset in source
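
For illustration, a ChunkMetadata value can be constructed directly from these fields (the values below are made up):

val meta = ChunkMetadata(
  sourceFile = Some("guide.md"),
  startOffset = Some(0),
  endOffset = Some(812),
  headings = Seq("Chapter 1", "Section 2"),
  isCodeBlock = false,
  language = None
)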

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object ChunkMetadata

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type

object ChunkerFactory

Factory for creating document chunkers.

Provides convenient factory methods for creating different chunking strategies. Each strategy makes a different trade-off between quality and performance.

Usage:

// Simple character-based chunking (fastest)
val simple = ChunkerFactory.simple()

// Sentence-aware chunking (recommended for most use cases)
val sentence = ChunkerFactory.sentence()

// Markdown-aware chunking (preserves structure)
val markdown = ChunkerFactory.markdown()

// Semantic chunking (highest quality, requires embeddings)
val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
val semantic = ChunkerFactory.semantic(embeddingClient, modelConfig)

// Auto-detect based on content
val auto = ChunkerFactory.auto(text)
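
Whichever factory method is used, the resulting chunker is applied with a ChunkingConfig. A minimal sketch, assuming the chunk method shape shown in the DocumentChunker usage further down this page:

// Chunk a document with the sentence-aware strategy created above
val chunks = sentence.chunk(text, ChunkingConfig(targetSize = 800, overlap = 150))
println(s"Produced ${chunks.size} chunks")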

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
final case class ChunkingConfig(targetSize: Int, maxSize: Int, overlap: Int, minChunkSize: Int, preserveCodeBlocks: Boolean, preserveHeadings: Boolean)

Configuration for document chunking.

Value parameters

maxSize

Maximum chunk size in characters (hard limit; will force a split)

minChunkSize

Minimum chunk size in characters (smaller chunks are merged)

overlap

Characters to overlap between chunks

preserveCodeBlocks

Keep code blocks intact if possible

preserveHeadings

Include heading context in metadata

targetSize

Target chunk size in characters (soft limit)
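
For illustration, a fully explicit configuration built from these parameters (the values are arbitrary examples, not recommended defaults):

val config = ChunkingConfig(
  targetSize = 800,
  maxSize = 1200,
  overlap = 150,
  minChunkSize = 100,
  preserveCodeBlocks = true,
  preserveHeadings = true
)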

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

object ChunkingConfig

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
final case class DocumentChunk(content: String, index: Int, metadata: ChunkMetadata)

A chunk of a document with metadata.

Value parameters

content

The chunk text

index

Position in original document (0-indexed)

metadata

Preserved structure information
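
Chunks are normally produced by a DocumentChunker, but a value can be built by hand, e.g. for tests; an illustrative sketch:

val chunk = DocumentChunk(
  content = "Chunked text goes here.",
  index = 0,
  metadata = ChunkMetadata(None, None, None, Seq("Chapter 1"), isCodeBlock = false, language = None)
)
println(s"Chunk ${chunk.index} under '${chunk.metadata.headings.mkString(" > ")}'")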

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

trait DocumentChunker

Document chunking strategy.

Implementations split text into manageable chunks for embedding and retrieval. Different strategies optimize for different content types:

  • SimpleChunker: Basic character-based splitting
  • SentenceChunker: Respects sentence boundaries
  • MarkdownChunker: Preserves markdown structure
  • SemanticChunker: Splits at topic boundaries using embeddings

Usage:

val chunker = ChunkerFactory.sentence()
val config = ChunkingConfig(targetSize = 800, overlap = 150)
val chunks = chunker.chunk(documentText, config)

chunks.foreach { chunk =>
 println(s"Chunk $${chunk.index}: $${chunk.content.take(50)}...")
}
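
Because every strategy exposes the same interface, strategies can be swapped or compared on the same input. A small sketch, assuming the factory methods and chunk signature shown above:

val config = ChunkingConfig(targetSize = 800, overlap = 150)
val bySimple   = ChunkerFactory.simple().chunk(documentText, config)
val bySentence = ChunkerFactory.sentence().chunk(documentText, config)
println(s"simple: ${bySimple.size} chunks, sentence: ${bySentence.size} chunks")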

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
class MarkdownChunker
class SemanticChunker
class SentenceChunker
class SimpleChunker

class MarkdownChunker

Markdown-aware document chunker.

Preserves markdown structure by:

  • Respecting heading boundaries (# through ######)
  • Keeping code blocks intact when possible
  • Tracking heading hierarchy in chunk metadata
  • Preserving list structure

This chunker produces higher quality chunks for markdown content because it understands document structure.

Usage:

val chunker = MarkdownChunker()
val chunks = chunker.chunk(markdownText, ChunkingConfig(targetSize = 800))

chunks.foreach { c =>
  val headingPath = c.metadata.headings.mkString(" > ")
  println(s"[$headingPath] ${c.content.take(50)}...")
}
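
Because code blocks are tracked in the metadata (see ChunkMetadata above), they can be filtered out after chunking; an illustrative follow-up:

val codeChunks = chunks.filter(_.metadata.isCodeBlock)
codeChunks.foreach { c =>
  println(s"${c.metadata.language.getOrElse("unknown")}: ${c.content.take(40)}...")
}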

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any

object MarkdownChunker

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
class SemanticChunker(embeddingClient: EmbeddingClient, modelConfig: EmbeddingModelConfig, similarityThreshold: Double, batchSize: Int) extends DocumentChunker

Semantic document chunker using embeddings.

Splits text at topic boundaries by:

  1. Breaking text into sentences
  2. Computing embeddings for each sentence
  3. Calculating cosine similarity between consecutive sentences
  4. Splitting where similarity drops below threshold

This produces the highest quality chunks because it understands semantic meaning, but requires an embedding provider.
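
A minimal sketch of the similarity test behind steps 3 and 4 (illustrative only; the chunker's internal implementation is not shown here and may differ):

def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Start a new chunk when consecutive sentence embeddings fall below the threshold
val sentenceA = Array(0.12, 0.88, 0.31)
val sentenceB = Array(0.75, 0.10, 0.05)
val splitHere = cosineSimilarity(sentenceA, sentenceB) < 0.5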

Usage:

val (provider, embeddingProviderCfg) = /* load embedding provider config */
val embeddingClient = EmbeddingClient.from(provider, embeddingProviderCfg).getOrElse(???)
val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
val chunker = SemanticChunker(embeddingClient, modelConfig, similarityThreshold = 0.5)
val chunks = chunker.chunk(documentText, ChunkingConfig(targetSize = 800))

chunks.foreach { c =>
 println(s"[$${c.index}] $${c.content.take(50)}...")
}

Value parameters

batchSize

Number of sentences to embed at once

embeddingClient

Client for generating embeddings

modelConfig

Model configuration for embeddings

similarityThreshold

Minimum similarity required to stay in the same chunk (0.0-1.0)

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any

object SemanticChunker

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type

class SentenceChunker

Sentence-aware document chunker.

Splits text at sentence boundaries to preserve semantic coherence. Uses pattern matching for sentence detection (periods, question marks, etc.) while handling edge cases like abbreviations and decimal numbers.

This chunker produces higher quality chunks than simple character-based splitting because it never breaks in the middle of a sentence.
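
A hedged sketch of the kind of boundary pattern described above (the chunker's actual pattern is not shown on this page and may differ):

// Split after '.', '!' or '?' followed by whitespace and a capital letter.
// This avoids splitting inside decimals such as 3.14; real abbreviation
// handling needs additional rules.
val boundary  = raw"(?<=[.!?])\s+(?=[A-Z])".r
val sentences = boundary.split("It rained until 3.14 PM. Everyone waited! Then the sun appeared.")
sentences.foreach(println)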

Usage:

val chunker = SentenceChunker()
val chunks = chunker.chunk(text, ChunkingConfig(targetSize = 800))

// Sentences are kept intact
chunks.foreach { c =>
 println(s"[$${c.index}] $${c.content}")
}

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any

object SentenceChunker

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type

class SimpleChunker

Simple character-based chunker wrapping legacy ChunkingUtils.

Provides compatibility with existing code while conforming to the new DocumentChunker interface. Splits text into fixed-size chunks without semantic awareness.

Use this chunker when:

  • You need maximum compatibility with existing code
  • Content has no clear sentence structure
  • Performance is more important than quality

For better quality chunks, consider:

  • SentenceChunker: Respects sentence boundaries
  • MarkdownChunker: Preserves markdown structure
  • SemanticChunker: Uses embedding similarity

Usage:

val chunker = SimpleChunker()
val chunks = chunker.chunk(text, ChunkingConfig(targetSize = 800, overlap = 150))
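
For intuition, fixed-size splitting with character overlap can be sketched as follows (illustrative only; the exact behaviour of ChunkingUtils, including metadata handling, may differ):

def fixedSizeSplit(text: String, size: Int, overlap: Int): Seq[String] = {
  require(size > overlap, "chunk size must be larger than the overlap")
  val step = size - overlap // how far the window advances each time
  (0 until text.length by step).map(i => text.slice(i, i + size))
}

val pieces = fixedSizeSplit("a" * 2000, size = 800, overlap = 150)
println(pieces.map(_.length)) // Vector(800, 800, 700, 50)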

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
object SimpleChunker

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type