ChunkerFactory

org.llm4s.chunking.ChunkerFactory

Factory for creating document chunkers.

Provides convenient factory methods for creating different chunking strategies. Each strategy has different trade-offs between quality and performance.

Usage:

// Simple character-based chunking (fastest)
val simple = ChunkerFactory.simple()

// Sentence-aware chunking (recommended for most use cases)
val sentence = ChunkerFactory.sentence()

// Markdown-aware chunking (preserves structure)
val markdown = ChunkerFactory.markdown()

// Semantic chunking (highest quality, requires embeddings)
val modelConfig = EmbeddingModelConfig("text-embedding-3-small", 1536)
val semantic = ChunkerFactory.semantic(embeddingClient, modelConfig)

// Auto-detect based on content
val auto = ChunkerFactory.auto(text)

Attributes

Supertypes
class Object
trait Matchable
class Any

Members list

Type members

Classlikes

object Strategy

Attributes

Companion
trait
Supertypes
trait Sum
trait Mirror
class Object
trait Matchable
class Any
Self type
Strategy.type
sealed trait Strategy

Chunking strategy type

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
Known subtypes
object Markdown
object Semantic
object Sentence
object Simple

Value members

Concrete methods

def auto(text: String): DocumentChunker

Auto-detect the best chunker based on content.

Analyzes the text to determine if it's markdown or plain text, then returns an appropriate chunker.

Value parameters

text

Content to analyze

Attributes

Returns

Appropriate DocumentChunker
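The exact detection rules are internal to the library, but a heuristic along these lines (a sketch, not the actual implementation) illustrates the idea:

```scala
// Hypothetical markdown detector -- the real auto() may use different rules.
def looksLikeMarkdown(text: String): Boolean =
  text.linesIterator.exists { line =>
    line.startsWith("#") ||       // headings
    line.startsWith("```") ||     // fenced code blocks
    line.matches("^\\s*[-*+] .*") // bullet lists
  }
```

A caller would then select a markdown-aware or sentence-aware chunker based on the result.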

def create(strategy: String): Option[DocumentChunker]

Create a chunker by strategy name.

Value parameters

strategy

Strategy name: "simple", "sentence", "markdown", "semantic"

Attributes

Returns

The chunker wrapped in Some, or None if the strategy name is unknown.

Note: the "semantic" strategy has special fallback behavior:

  • When requested via create("semantic"), a SentenceChunker is returned as a fallback
  • Semantic chunking requires an embedding client, which is not available through this factory method
  • To use true semantic chunking with embeddings, call semantic(embeddingClient, modelConfig)
  • The fallback ensures graceful degradation instead of failure when embeddings aren't configured

This design lets applications specify "semantic" as a strategy preference without requiring embedding setup at construction time, while still receiving a usable chunker.

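The name-based dispatch with its fallback can be sketched with stand-in types (the case objects below are placeholders for illustration, not the library's actual chunker classes):

```scala
// Placeholder chunker types for illustration only.
sealed trait Chunker
case object SimpleChunker   extends Chunker
case object SentenceChunker extends Chunker
case object MarkdownChunker extends Chunker

def createByName(name: String): Option[Chunker] = name.toLowerCase match {
  case "simple"   => Some(SimpleChunker)
  case "sentence" => Some(SentenceChunker)
  case "markdown" => Some(MarkdownChunker)
  case "semantic" => Some(SentenceChunker) // fallback: no embedding client here
  case _          => None
}
```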
def create(strategy: Strategy): DocumentChunker

Create a chunker based on strategy enum.

Value parameters

strategy

Chunking strategy

Attributes

Returns

A DocumentChunker.

Note on the Semantic strategy: when Strategy.Semantic is passed, this method returns a SentenceChunker as a fallback. This is intentional and provides several benefits:

  1. Graceful degradation: applications can safely request semantic chunking without embedding configuration, falling back to sentence-based chunking.
  2. Type safety: unlike create(String), this method always returns a DocumentChunker and never fails, which matters for code paths that must always produce a chunker.
  3. Ease of testing: tests can specify Strategy.Semantic without mocking embedding clients, while still verifying that a chunker is returned.

For actual semantic chunking with embeddings, use semantic(embeddingClient, modelConfig) directly, which provides full semantic chunking capabilities.
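Because the match on the sealed Strategy trait is exhaustive, this overload is total. A sketch with placeholder types (returning a strategy name rather than a real DocumentChunker) shows the shape:

```scala
sealed trait Strategy
object Strategy {
  case object Simple   extends Strategy
  case object Sentence extends Strategy
  case object Markdown extends Strategy
  case object Semantic extends Strategy
}

// Illustrative only: the real method returns a DocumentChunker, not a String.
def createByStrategy(s: Strategy): String = s match {
  case Strategy.Simple   => "simple"
  case Strategy.Sentence => "sentence"
  case Strategy.Markdown => "markdown"
  case Strategy.Semantic => "sentence" // intentional fallback, see note above
}
```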

Create a markdown-aware chunker.

Preserves markdown structure including:

  • Heading boundaries and hierarchy
  • Code blocks (keeps them intact)
  • List structure

Best for markdown documentation and README files.
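A minimal sketch of heading-boundary splitting, the first of the rules above (assumed behavior; the real chunker also protects code blocks and list structure):

```scala
// Split markdown so each chunk starts at a heading line.
// Simplified: does not skip '#' lines inside fenced code blocks.
def splitAtHeadings(md: String): Vector[String] =
  md.linesIterator.toVector
    .foldLeft(Vector(Vector.empty[String])) { (acc, line) =>
      if (line.startsWith("#") && acc.last.nonEmpty) acc :+ Vector(line)
      else acc.init :+ (acc.last :+ line)
    }
    .map(_.mkString("\n"))
    .filter(_.nonEmpty)
```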

Attributes

def semantic(embeddingClient: EmbeddingClient, modelConfig: EmbeddingModelConfig, similarityThreshold: Double, batchSize: Int): DocumentChunker

Create a semantic chunker using embeddings.

Splits text at topic boundaries by analyzing semantic similarity between consecutive sentences. Produces the highest quality chunks but requires an embedding client.

Value parameters

batchSize

Number of sentences to embed at once (default: 50)

embeddingClient

Client for generating embeddings

modelConfig

Model configuration for embeddings

similarityThreshold

Minimum similarity to stay in same chunk (0.0-1.0, default: 0.5)

Attributes
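The role of similarityThreshold can be illustrated with a small sketch: embed each sentence, then start a new chunk whenever the cosine similarity of consecutive sentence embeddings drops below the threshold. This is illustrative only; the library's actual algorithm may differ.

```scala
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot  = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = (v: Array[Double]) => math.sqrt(v.map(x => x * x).sum)
  dot / (norm(a) * norm(b))
}

// Group sentence indices: a new group starts when similarity drops below threshold.
def groupByThreshold(embs: Vector[Array[Double]], threshold: Double): Vector[Vector[Int]] =
  embs.indices.foldLeft(Vector(Vector.empty[Int])) { (acc, i) =>
    if (i == 0 || cosine(embs(i - 1), embs(i)) >= threshold)
      acc.init :+ (acc.last :+ i)
    else
      acc :+ Vector(i)
  }
```

With a higher threshold, more boundaries are introduced and chunks become smaller and more topically uniform.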

Create a sentence-aware chunker.

Respects sentence boundaries for better quality chunks. Recommended for most text content.
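Sentence-boundary detection can be approximated with a lookbehind split (a simplification; real sentence segmentation also handles abbreviations and other edge cases):

```scala
// Split after '.', '!' or '?' followed by whitespace.
def splitSentences(text: String): Vector[String] =
  text.split("(?<=[.!?])\\s+").toVector
```

Sentences are then packed into chunks up to a size limit without breaking a sentence in the middle.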

Attributes

Create a simple character-based chunker.

Fast but doesn't respect semantic boundaries. Use for content without clear sentence structure.
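Character-based chunking is essentially a sliding window over the text. A sketch with hypothetical size and overlap parameters (the real chunker's parameters may differ):

```scala
// Fixed-size windows with overlap; parameter names are illustrative.
def chunkChars(text: String, size: Int, overlap: Int): Vector[String] = {
  require(size > overlap && overlap >= 0)
  val step = size - overlap
  (0 until text.length by step).toVector
    .map(i => text.substring(i, math.min(i + size, text.length)))
}
```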

Attributes

Concrete fields

Get the default chunker (sentence-aware).

Attributes