DocumentChunker

org.llm4s.chunking.DocumentChunker

Document chunking strategy.

Implementations split text into manageable chunks for embedding and retrieval. Different strategies optimize for different content types:

  • SimpleChunker: Basic character-based splitting
  • SentenceChunker: Respects sentence boundaries
  • MarkdownChunker: Preserves markdown structure
  • SemanticChunker: Splits at topic boundaries using embeddings

Usage:

val chunker = ChunkerFactory.sentence()
val config = ChunkingConfig(targetSize = 800, overlap = 150)
val chunks = chunker.chunk(documentText, config)

chunks.foreach { chunk =>
  println(s"Chunk ${chunk.index}: ${chunk.content.take(50)}...")
}

Attributes

Supertypes
class Object
trait Matchable
class Any

Members list

Value members

Abstract methods

def chunk(text: String, config: ChunkingConfig): Seq[DocumentChunk]

Split text into chunks.

Value parameters

config

Chunking configuration

text

Input text to chunk

Attributes

Returns

Sequence of document chunks
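
To illustrate the contract of chunk, here is a minimal sketch of a fixed-window implementation. It is not one of the library's chunkers. ChunkingConfig and DocumentChunk are reduced stand-ins that carry only the fields visible in the usage example above (targetSize, overlap, index, content), and FixedWindowChunker is a hypothetical name. The sketch assumes overlap means the number of characters shared by consecutive chunks.

```scala
// Stand-in types: the real llm4s ChunkingConfig and DocumentChunk may have more fields.
final case class ChunkingConfig(targetSize: Int, overlap: Int)
final case class DocumentChunk(index: Int, content: String)

trait DocumentChunker {
  def chunk(text: String, config: ChunkingConfig): Seq[DocumentChunk]
}

// Hypothetical chunker: each window holds up to targetSize characters and
// starts (targetSize - overlap) characters after the previous window.
object FixedWindowChunker extends DocumentChunker {
  def chunk(text: String, config: ChunkingConfig): Seq[DocumentChunk] = {
    require(config.targetSize > 0, "targetSize must be positive")
    require(config.overlap >= 0 && config.overlap < config.targetSize,
      "overlap must be in [0, targetSize)")
    val step = config.targetSize - config.overlap
    if (text.isEmpty) Seq.empty
    else
      Iterator
        .iterate(0)(_ + step)
        // Stop once the remaining text is already covered by the previous window.
        .takeWhile(start => start == 0 || start + config.overlap < text.length)
        .zipWithIndex
        .map { case (start, i) =>
          DocumentChunk(i, text.slice(start, start + config.targetSize))
        }
        .toSeq
  }
}

val windowed = FixedWindowChunker.chunk("abcdefghij", ChunkingConfig(targetSize = 4, overlap = 2))
```

With these inputs the sketch yields four chunks, "abcd", "cdef", "efgh" and "ghij", indexed 0 to 3. Each chunk repeats the last two characters of the one before it.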

Concrete methods

def chunkWithSource(text: String, sourceFile: String, config: ChunkingConfig): Seq[DocumentChunk]

Split text into chunks with source file metadata.

Value parameters

config

Chunking configuration

sourceFile

Source file name for metadata

text

Input text to chunk

Attributes

Returns

Sequence of document chunks with source metadata
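
Because chunkWithSource is a concrete method on the trait, one plausible reading is that it delegates to chunk and then attaches the source name to every result. The sketch below shows that pattern under stated assumptions: the sourceFile field on DocumentChunk and the ParagraphChunker object are hypothetical, and the library's actual metadata representation may differ.

```scala
// Stand-in types: the real llm4s definitions may differ.
final case class ChunkingConfig(targetSize: Int, overlap: Int)
// `sourceFile` is a hypothetical field used here to carry the source metadata.
final case class DocumentChunk(index: Int, content: String, sourceFile: Option[String] = None)

trait DocumentChunker {
  def chunk(text: String, config: ChunkingConfig): Seq[DocumentChunk]

  // Assumed behavior: reuse chunk, then tag each chunk with the source file name.
  def chunkWithSource(text: String, sourceFile: String, config: ChunkingConfig): Seq[DocumentChunk] =
    chunk(text, config).map(_.copy(sourceFile = Some(sourceFile)))
}

// Hypothetical implementation that splits on blank lines and ignores the size settings.
object ParagraphChunker extends DocumentChunker {
  def chunk(text: String, config: ChunkingConfig): Seq[DocumentChunk] =
    text
      .split("\\n\\s*\\n")
      .toSeq
      .map(_.trim)
      .filter(_.nonEmpty)
      .zipWithIndex
      .map { case (content, i) => DocumentChunk(i, content) }
}

val tagged = ParagraphChunker.chunkWithSource(
  "Intro.\n\nDetails.",
  "guide.md",
  ChunkingConfig(targetSize = 800, overlap = 150)
)
```

Under this design an implementation only needs to define chunk. Source tagging comes from the trait, so every strategy reports the originating file in the same way.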