ChunkerFactory

org.llm4s.chunking.ChunkerFactory

Factory for creating document chunkers.

Provides factory methods for creating chunkers that use different chunking strategies. Each strategy makes a different trade-off between quality and performance.

Usage:

// Simple character-based chunking (fastest)
val simple = ChunkerFactory.simple()

// Sentence-aware chunking (recommended for most use cases)
val sentence = ChunkerFactory.sentence()

// Markdown-aware chunking (preserves structure)
val markdown = ChunkerFactory.markdown()

// Semantic chunking (highest quality, requires embeddings)
val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
val semantic = ChunkerFactory.semantic(embeddingClient, modelConfig)

// Auto-detect based on content
val auto = ChunkerFactory.auto(text)

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type

Members list

Type members

Classlikes

object Strategy

Attributes

Companion
trait
Supertypes
trait Sum
trait Mirror
class Object
trait Matchable
class Any
Self type
Strategy.type
sealed trait Strategy

Chunking strategy type

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
Known subtypes
object Markdown
object Semantic
object Sentence
object Simple

Value members

Concrete methods

def auto(text: String): DocumentChunker

Auto-detect the best chunker based on content.

Analyzes the text to determine if it's markdown or plain text, then returns an appropriate chunker.

Value parameters

text

Content to analyze

Attributes

Returns

Appropriate DocumentChunker
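The page does not say how auto decides between markdown and plain text. As a rough illustration only, a detection heuristic could look like the sketch below. The object ContentSniffer, the method looksLikeMarkdown, and the specific signals (ATX headings, code fences, at least two list items) are all assumptions made for this example and are not taken from the library.

```scala
object ContentSniffer {
  // Hypothetical heuristic; NOT the library's actual detection logic.
  private val headingLine = "^#{1,6}\\s+\\S".r
  private val listLine    = "^\\s*([-*+]|\\d+\\.)\\s+\\S".r
  private val fence       = "`" * 3 // three backticks, the markdown code fence

  /** Returns true when the text shows common markdown signals. */
  def looksLikeMarkdown(text: String): Boolean = {
    val lines      = text.linesIterator.toList
    val hasFence   = lines.exists(_.trim.startsWith(fence))
    val hasHeading = lines.exists(l => headingLine.findFirstIn(l).isDefined)
    val listItems  = lines.count(l => listLine.findFirstIn(l).isDefined)
    // A single list-like line is weak evidence, so require at least two.
    hasFence || hasHeading || listItems >= 2
  }
}
```

Under this sketch a caller would pick the markdown chunker when looksLikeMarkdown returns true and the sentence chunker otherwise. The real selection rule may differ.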

def create(strategy: String): Option[DocumentChunker]

Create a chunker by strategy name.

Value parameters

strategy

Strategy name: "simple", "sentence", "markdown", "semantic"

Attributes

Returns

DocumentChunker, or None if the strategy is unknown.

Note: the "semantic" strategy requires an EmbeddingProvider, so this method returns a SentenceChunker as a fallback. Use the semantic() method for proper semantic chunking.
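The lookup-by-name behaviour described above (None for unknown names, and "semantic" falling back to sentence chunking) can be modelled with a small self-contained sketch. The types below are stand-ins written for this example, not the llm4s definitions. The object StrategyLookup, the methods fromName and resolve, and the case-insensitive matching are all assumptions.

```scala
object StrategyLookup {
  // Stand-in types for illustration only; NOT the llm4s definitions.
  sealed trait Strategy
  object Strategy {
    case object Simple   extends Strategy
    case object Sentence extends Strategy
    case object Markdown extends Strategy
    case object Semantic extends Strategy

    // Case-insensitive parsing is an assumption of this sketch.
    def fromName(name: String): Option[Strategy] =
      name.trim.toLowerCase match {
        case "simple"   => Some(Simple)
        case "sentence" => Some(Sentence)
        case "markdown" => Some(Markdown)
        case "semantic" => Some(Semantic)
        case _          => None
      }
  }

  /** Models the documented fallback: a name alone cannot supply an
    * embedding client, so "semantic" resolves to sentence chunking.
    */
  def resolve(name: String): Option[Strategy] =
    Strategy.fromName(name).map {
      case Strategy.Semantic => Strategy.Sentence
      case other             => other
    }
}
```

Returning an Option forces callers to handle unknown names explicitly, for example with getOrElse and a default strategy.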

def create(strategy: Strategy): DocumentChunker

Create a chunker based on strategy enum.

Value parameters

strategy

Chunking strategy

Attributes

Returns

DocumentChunker

Create a markdown-aware chunker.

Preserves markdown structure including:

  • Heading boundaries and hierarchy
  • Code blocks (keeps them intact)
  • List structure

Best for markdown documentation and README files.
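To make "heading boundaries" and "keeps code blocks intact" concrete, here is a minimal sketch that splits a document at ATX headings while ignoring heading-like lines inside fenced code blocks. It is an illustration under simplifying assumptions (only backtick fences, no heading hierarchy, no size limits), not the library's implementation. MarkdownSplitSketch and sections are hypothetical names.

```scala
object MarkdownSplitSketch {
  private val fence = "`" * 3 // three backticks, the markdown code fence

  /** Splits markdown into sections that each start at a heading.
    * Lines inside a fenced code block never start a new section.
    */
  def sections(markdown: String): Vector[String] = {
    val start = (Vector.empty[String], Vector.empty[String], false)
    val (finished, pending, _) =
      markdown.linesIterator.foldLeft(start) {
        case ((done, current, inFence), line) =>
          val togglesFence = line.trim.startsWith(fence)
          val isHeading =
            !inFence && !togglesFence && line.matches("#{1,6}\\s+.*")
          if (isHeading && current.nonEmpty)
            // Close the running section and begin a new one at the heading.
            (done :+ current.mkString("\n"), Vector(line), inFence)
          else
            (done, current :+ line, if (togglesFence) !inFence else inFence)
      }
    if (pending.isEmpty) finished else finished :+ pending.mkString("\n")
  }
}
```

A real chunker would additionally enforce a maximum chunk size and carry the heading hierarchy along as metadata. This sketch only shows the boundary rule.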


def semantic(embeddingClient: EmbeddingClient, modelConfig: EmbeddingModelConfig, similarityThreshold: Double, batchSize: Int): DocumentChunker

Create a semantic chunker using embeddings.

Splits text at topic boundaries by analyzing semantic similarity between consecutive sentences. Produces the highest quality chunks but requires an embedding client.

Value parameters

batchSize

Number of sentences to embed at once (default: 50)

embeddingClient

Client for generating embeddings

modelConfig

Model configuration for embeddings

similarityThreshold

Minimum similarity to stay in same chunk (0.0-1.0, default: 0.5)
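The role of similarityThreshold can be illustrated with a self-contained sketch that groups consecutive sentences by cosine similarity. The embeddings are passed in as plain vectors, so no embedding client is involved. SemanticSplitSketch, cosine, and split are hypothetical names, and comparing only adjacent sentences is an assumption about how such a chunker might work, not a description of the library's algorithm.

```scala
object SemanticSplitSketch {
  /** Cosine similarity of two equal-length vectors; 0.0 if either is all zeros. */
  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  /** Starts a new chunk whenever the similarity between a sentence and
    * its predecessor drops below the threshold.
    */
  def split(
      sentences: Vector[String],
      embeddings: Vector[Vector[Double]],
      threshold: Double = 0.5
  ): Vector[Vector[String]] = {
    require(sentences.length == embeddings.length, "one embedding per sentence")
    if (sentences.isEmpty) Vector.empty
    else
      sentences.indices.tail.foldLeft(Vector(Vector(sentences.head))) { (chunks, i) =>
        if (cosine(embeddings(i - 1), embeddings(i)) >= threshold)
          chunks.init :+ (chunks.last :+ sentences(i)) // same topic: extend chunk
        else
          chunks :+ Vector(sentences(i))               // topic shift: new chunk
      }
  }
}
```

In this sketch, a higher threshold produces more and smaller chunks, because more adjacent pairs fall below it.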


Create a sentence-aware chunker.

Respects sentence boundaries for better quality chunks. Recommended for most text content.
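As an illustration of what respecting sentence boundaries means, the sketch below packs whole sentences into chunks up to a character limit. The sentence splitter is deliberately naive (it breaks after '.', '!' or '?' followed by whitespace) and will mis-split abbreviations. SentenceChunkSketch, sentences, chunk, and maxChars are hypothetical names, not part of the documented API.

```scala
object SentenceChunkSketch {
  /** Naive splitter: breaks after '.', '!' or '?' followed by whitespace. */
  def sentences(text: String): Vector[String] =
    text.split("(?<=[.!?])\\s+").toVector.map(_.trim).filter(_.nonEmpty)

  /** Greedily packs whole sentences into chunks of at most maxChars.
    * A single sentence longer than maxChars is kept whole, not cut.
    */
  def chunk(text: String, maxChars: Int): Vector[String] =
    sentences(text).foldLeft(Vector.empty[String]) { (chunks, s) =>
      chunks.lastOption match {
        case Some(last) if last.length + 1 + s.length <= maxChars =>
          chunks.init :+ (last + " " + s) // fits: append to current chunk
        case _ =>
          chunks :+ s                     // does not fit: start a new chunk
      }
    }
}
```

Because chunks only ever end at a sentence boundary, no sentence is cut in half, at the cost of chunk sizes that vary with sentence length.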


Create a simple character-based chunker.

Fast but doesn't respect semantic boundaries. Use for content without clear sentence structure.
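The trade-off is easy to see in a minimal sketch of fixed-size character chunking. CharChunkSketch and chunk are hypothetical names, and the absence of any overlap between chunks is a simplification of this example, not a statement about the library's chunker.

```scala
object CharChunkSketch {
  /** Cuts the text into consecutive pieces of at most `size` characters,
    * without looking at word or sentence boundaries.
    */
  def chunk(text: String, size: Int): Vector[String] = {
    require(size > 0, "size must be positive")
    text.grouped(size).toVector
  }
}

// The word "quick" is cut in the middle:
// CharChunkSketch.chunk("The quick brown fox", 8)
//   == Vector("The quic", "k brown ", "fox")
```

The work is a single pass over the string, which is why this approach is fast, but cuts can land in the middle of a word.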


Concrete fields

Get the default chunker (sentence-aware).
