org.llm4s.rag.loader

Members list

Type members

Classlikes

final case class DirectoryLoader(path: Path, extensions: Set[String], recursive: Boolean, metadata: Map[String, String], maxDepth: Int) extends DocumentLoader

Load all documents from a directory.

Supports recursive directory traversal and file filtering by extension. Each file is loaded using FileLoader, inheriting its version and hint detection.

Value parameters

extensions

File extensions to include (without leading dot)

maxDepth

Maximum recursion depth (0 = current directory only)

metadata

Additional metadata to attach to all documents

path

Path to the directory

recursive

Whether to recurse into subdirectories
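
How extensions, recursive, and maxDepth might interact can be pictured with a standalone sketch built on java.nio.file.Files.walk. The object DirectoryWalkSketch and the helper matchingFiles are hypothetical and not part of llm4s. Mapping maxDepth = 0 to a walk depth of 1, and matching extensions case-insensitively, are assumptions drawn from the parameter descriptions rather than confirmed library behavior.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._
import scala.util.Using

object DirectoryWalkSketch {
  // Hypothetical helper: lists regular files under `root` whose extension
  // (without the leading dot) is in `extensions`.
  def matchingFiles(root: Path, extensions: Set[String], recursive: Boolean, maxDepth: Int): List[Path] = {
    // Files.walk counts the root itself as depth 0, so "current directory
    // only" (maxDepth = 0) corresponds to a walk depth of 1.
    val walkDepth =
      if (!recursive) 1
      else if (maxDepth >= Int.MaxValue - 1) Int.MaxValue
      else math.max(maxDepth, 0) + 1
    val wanted = extensions.map(_.toLowerCase)

    // The stream holds open directory handles, so close it when done.
    Using.resource(Files.walk(root, walkDepth)) { stream =>
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .filter { p =>
          val name = p.getFileName.toString
          val dot = name.lastIndexOf('.')
          dot >= 0 && wanted.contains(name.substring(dot + 1).toLowerCase)
        }
        .toList
        .sortBy(_.toString)
    }
  }
}
```

With recursive = false the sketch ignores maxDepth entirely; whether the real loader does the same is not stated in this page.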

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
final case class Document(id: String, content: String, metadata: Map[String, String], hints: Option[DocumentHints], version: Option[DocumentVersion])

A document ready for RAG ingestion.

Documents represent content from any source (files, URLs, databases, APIs) in a normalized form ready for chunking and embedding.

Value parameters

content

The text content of the document

hints

Optional processing hints suggested by the loader

id

Unique identifier for this document

metadata

Key-value metadata (source, author, timestamp, etc.)

version

Optional version for change detection

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object Document

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
Document.type
final case class DocumentHints(chunkingStrategy: Option[Strategy], chunkingConfig: Option[ChunkingConfig], batchSize: Option[Int], priority: Int, skipReason: Option[String], customHints: Map[String, String])

Processing hints that loaders can suggest to the RAG pipeline.

Hints are optional suggestions: the pipeline may ignore them because of global configuration or other factors. They let loaders pass domain-specific optimization recommendations to the pipeline.

Value parameters

batchSize

Suggested batch size for embedding (for rate limiting)

chunkingConfig

Suggested chunking configuration

chunkingStrategy

Suggested chunking strategy for this document type

customHints

Additional loader-specific hints

priority

Processing priority (higher = process first)

skipReason

If set, suggests that this document be skipped and gives the reason
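
One way a pipeline could act on priority and skipReason is sketched below. Hints and Doc are simplified stand-ins defined only for this example, not the llm4s classes, and plan is a hypothetical function. Treating a missing hint as priority 0 is an assumption.

```scala
object HintOrderingSketch {
  // Stand-ins carrying only the two hint fields used here.
  final case class Hints(priority: Int = 0, skipReason: Option[String] = None)
  final case class Doc(id: String, hints: Option[Hints])

  /** Returns (documents to process in order, skipped ids paired with their reasons). */
  def plan(docs: Seq[Doc], useHints: Boolean): (Seq[Doc], Seq[(String, String)]) =
    if (!useHints) (docs, Seq.empty) // hints are suggestions, so they can be ignored
    else {
      val skipped = docs.flatMap(d => d.hints.flatMap(_.skipReason).map(reason => d.id -> reason))
      val skippedIds = skipped.map(_._1).toSet
      val kept = docs.filterNot(d => skippedIds.contains(d.id))
      // Higher priority first; the sort is stable, so ties keep input order.
      val ordered = kept.sortBy(d => d.hints.map(_.priority).getOrElse(0))(Ordering[Int].reverse)
      (ordered, skipped)
    }
}
```

Passing useHints = false returns the input unchanged, which mirrors the idea that global configuration can override loader suggestions.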

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object DocumentHints

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
DocumentHints.type

Abstraction for loading documents from any source into the RAG pipeline.

DocumentLoader provides a unified interface for various document sources:

  • Files and directories
  • URLs and web content
  • Cloud storage (S3, GCS, Azure Blob)
  • Databases and APIs
  • Custom sources

Key design principles:

  • Streaming support via Iterator for large document sets
  • Graceful error handling with LoadResult for partial failures
  • Optional hints for processing optimization
  • Composability through the ++ operator

Usage:

// At build time - pre-ingest documents
val rag = RAG.builder()
  .withDocuments(DirectoryLoader("./docs"))
  .build()

// At ingest time - add documents later
// (urls is a Seq[String] of HTTP/HTTPS URLs defined elsewhere)
rag.ingest(UrlLoader(urls))

// Combine loaders
val combined = DirectoryLoader("./docs") ++ UrlLoader(urls)
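
The streaming and composability principles can be illustrated with a self-contained sketch. Loader, Result, Loaded, Failed, fromPairs, and failing are stand-ins invented for this example; the real DocumentLoader and LoadResult signatures are not shown on this page and may differ.

```scala
object LoaderCompositionSketch {
  // Simplified result type: a loaded item or a per-item failure.
  sealed trait Result
  final case class Loaded(id: String, content: String) extends Result
  final case class Failed(source: String, message: String) extends Result

  trait Loader {
    /** Lazily yields one result per source item. */
    def load(): Iterator[Result]

    /** Concatenates two loaders; the second is not started until the first is exhausted. */
    def ++(other: Loader): Loader = {
      val first = this
      new Loader {
        // Iterator.++ takes its argument by name, so other.load() runs lazily.
        def load(): Iterator[Result] = first.load() ++ other.load()
      }
    }
  }

  /** Loader over in-memory (id, text) pairs. */
  def fromPairs(pairs: (String, String)*): Loader = new Loader {
    def load(): Iterator[Result] = pairs.iterator.map { case (id, text) => Loaded(id, text) }
  }

  /** Loader that reports a single failure instead of throwing. */
  def failing(source: String, message: String): Loader = new Loader {
    def load(): Iterator[Result] = Iterator.single(Failed(source, message))
  }
}
```

Because a failure is an ordinary element of the iterator, one bad source does not stop the items that follow it.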

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes

Factory and combinators for DocumentLoaders.

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type

Registry for tracking indexed documents.

Used by sync operations to determine which documents have been indexed, their versions, and to detect changes (adds, updates, deletes).
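
The change detection described above can be sketched as a comparison of two id-to-hash maps. SyncDiffSketch, Diff, and diff are hypothetical names for this illustration, not the registry's API, and the sketch assumes each document is identified by a stable id and compared by content hash only.

```scala
object SyncDiffSketch {
  final case class Diff(
    added: Set[String],
    updated: Set[String],
    deleted: Set[String],
    unchanged: Set[String]
  )

  /** Compares id -> contentHash maps for the indexed state and the current source. */
  def diff(indexed: Map[String, String], current: Map[String, String]): Diff = {
    val added = current.keySet -- indexed.keySet     // present now, never indexed
    val deleted = indexed.keySet -- current.keySet   // indexed, but gone from the source
    val common = current.keySet.intersect(indexed.keySet)
    val (unchanged, updated) = common.partition(id => indexed(id) == current(id))
    Diff(added, updated, deleted, unchanged)
  }
}
```

The sizes of the four sets correspond to the added, updated, deleted, and unchanged counters that SyncStats reports.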

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
final case class DocumentVersion(contentHash: String, timestamp: Option[Long], etag: Option[String])

Version information for change detection.

Used by sync operations to determine if a document has changed since it was last indexed.

Value parameters

contentHash

SHA-256 hash of the content

etag

Optional HTTP ETag for URL sources

timestamp

Optional last-modified timestamp (milliseconds since the Unix epoch)
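
A contentHash of this kind can be computed with the JDK's MessageDigest. The sketch below is an illustration, not the library's implementation: ContentHashSketch, sha256Hex, and hasChanged are hypothetical names, and hashing the UTF-8 bytes and rendering the digest as lowercase hex are assumptions about encoding that this page does not specify.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object ContentHashSketch {
  /** Lowercase hex SHA-256 digest of the UTF-8 bytes of `content`. */
  def sha256Hex(content: String): String = {
    val digest = MessageDigest.getInstance("SHA-256")
      .digest(content.getBytes(StandardCharsets.UTF_8))
    // Mask each signed byte to 0..255 before formatting as two hex digits.
    digest.map(b => "%02x".format(b & 0xff)).mkString
  }

  /** True when the content no longer matches the hash recorded at index time. */
  def hasChanged(previousHash: String, currentContent: String): Boolean =
    previousHash != sha256Hex(currentContent)
}
```

A hash comparison detects any byte-level change, whereas the optional timestamp and etag fields can only suggest that a change may have happened.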

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
final case class FileLoader(path: Path, metadata: Map[String, String]) extends DocumentLoader

Load a single file as a document.

Supports all file types handled by UniversalExtractor:

  • Text files (.txt, .md, .json, .xml, .html)
  • PDF documents
  • Word documents (.docx)

Automatically detects appropriate chunking hints based on file extension. Includes version information (content hash + file timestamp) for sync operations.

Value parameters

metadata

Additional metadata to attach

path

Path to the file

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object FileLoader

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
FileLoader.type

In-memory implementation of DocumentRegistry.

Suitable for development and testing. Data is lost on restart.

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
sealed trait LoadResult

Result of loading a single document.

Represents either a successfully loaded document, a loading error, or an intentionally skipped document.

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
Known subtypes
class Failure
class Skipped
class Success
object LoadResult

Attributes

Companion
trait
Supertypes
trait Sum
trait Mirror
class Object
trait Matchable
class Any
Self type
LoadResult.type
final case class LoadStats(totalAttempted: Int, successful: Int, failed: Int, skipped: Int, errors: Seq[(String, LLMError)])

Aggregated loading statistics.

Value parameters

errors

List of error details for debugging

failed

Number of documents that failed to load

skipped

Number of documents intentionally skipped

successful

Number of documents loaded successfully

totalAttempted

Total number of documents for which a load was attempted
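
How per-document results could roll up into these counters is sketched below. Result, Success, Failure, Skipped, and Stats are local stand-ins that mirror the names on this page but are not the llm4s types; in particular, errors uses String where the real field holds LLMError. Counting skipped documents toward totalAttempted is an assumption.

```scala
object LoadStatsSketch {
  sealed trait Result
  final case class Success(id: String) extends Result
  final case class Failure(source: String, error: String) extends Result
  final case class Skipped(source: String, reason: String) extends Result

  final case class Stats(
    totalAttempted: Int = 0,
    successful: Int = 0,
    failed: Int = 0,
    skipped: Int = 0,
    errors: Vector[(String, String)] = Vector.empty
  )

  /** Folds a stream of results into aggregate counters in a single pass. */
  def summarize(results: Iterator[Result]): Stats =
    results.foldLeft(Stats()) { (stats, result) =>
      val next = stats.copy(totalAttempted = stats.totalAttempted + 1)
      result match {
        case Success(_)          => next.copy(successful = next.successful + 1)
        case Failure(src, error) => next.copy(failed = next.failed + 1, errors = next.errors :+ (src -> error))
        case Skipped(_, _)       => next.copy(skipped = next.skipped + 1)
      }
    }
}
```

Folding over an Iterator keeps memory use flat, since only the counters and the error list are retained.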

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object LoadStats

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
LoadStats.type
final case class LoadingConfig(failFast: Boolean, useHints: Boolean, skipEmptyDocuments: Boolean, enableVersioning: Boolean, parallelism: Int, batchSize: Int)

Configuration for document loading behavior.

Controls how documents are loaded, processed, and tracked by the RAG pipeline.

Value parameters

batchSize

Documents per embedding batch

enableVersioning

Track versions for sync operations

failFast

If true, stop on the first error; otherwise continue and collect all errors

parallelism

Maximum number of documents processed concurrently

skipEmptyDocuments

Whether to skip documents with empty content

useHints

Whether to use loader hints for processing
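
The difference between the two failFast modes can be shown with a small standalone sketch. FailFastSketch and loadAll are hypothetical, the load function is a placeholder for whatever actually reads a document, and errors are modeled as plain strings; none of this is the pipeline's real control flow.

```scala
object FailFastSketch {
  /** Applies `load` to each source and returns (loaded values, collected errors). */
  def loadAll[A](sources: Seq[String], failFast: Boolean)(
    load: String => Either[String, A]
  ): (Vector[A], Vector[String]) = {
    val loaded = Vector.newBuilder[A]
    val errors = Vector.newBuilder[String]
    val it = sources.iterator
    var stop = false
    while (it.hasNext && !stop) {
      load(it.next()) match {
        case Right(value) => loaded += value
        case Left(error) =>
          errors += error
          stop = failFast // with failFast, remaining sources are never attempted
      }
    }
    (loaded.result(), errors.result())
  }
}
```

With failFast = true the result holds at most one error; with failFast = false every source is attempted and all errors are returned together.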

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object LoadingConfig

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
LoadingConfig.type
final case class SyncStats(added: Int, updated: Int, deleted: Int, unchanged: Int)

Statistics for sync operations.

Value parameters

added

New documents added

deleted

Documents removed

unchanged

Documents with no changes

updated

Existing documents updated

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object SyncStats

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
SyncStats.type
final case class TextLoader(documents: Seq[Document]) extends DocumentLoader

Load documents from raw text content.

Useful for:

  • Programmatically created content
  • Database records
  • API responses
  • Testing

Value parameters

documents

Documents to load

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object TextLoader

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
TextLoader.type

Builder for constructing TextLoader fluently.

Attributes

Supertypes
class Object
trait Matchable
class Any
final case class UrlLoader(urls: Seq[String], headers: Map[String, String], timeoutMs: Int, metadata: Map[String, String], retryCount: Int) extends DocumentLoader

Load documents from URLs.

Supports HTTP/HTTPS URLs with configurable timeouts and headers. Includes ETag-based version detection for efficient sync operations.

Value parameters

headers

HTTP headers to send with requests

metadata

Additional metadata to attach

retryCount

Number of retry attempts for failed requests

timeoutMs

Connection and read timeout in milliseconds

urls

URLs to load
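
How these parameters could map onto a request is sketched below using the JDK's java.net.http API. UrlFetchSketch, withRetries, and buildRequest are hypothetical helpers, not UrlLoader's internals. Two assumptions are made: retryCount counts extra attempts after the first, and ETag-based detection is done with a conditional If-None-Match header. The sketch sets only a per-request timeout, and timeoutMs must be positive.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.time.Duration
import scala.annotation.tailrec
import scala.util.{Failure, Try}

object UrlFetchSketch {
  /** Runs `op` once, then up to `retryCount` more times while it keeps failing. */
  @tailrec
  def withRetries[A](retryCount: Int)(op: () => Try[A]): Try[A] =
    op() match {
      case Failure(_) if retryCount > 0 => withRetries(retryCount - 1)(op)
      case result                       => result
    }

  /** Builds a GET request, adding If-None-Match when a previous ETag is known. */
  def buildRequest(
    url: String,
    headers: Map[String, String],
    timeoutMs: Int,
    previousEtag: Option[String]
  ): HttpRequest = {
    val base = HttpRequest.newBuilder(URI.create(url))
      .timeout(Duration.ofMillis(timeoutMs.toLong))
      .GET()
    // A server that honors the conditional header can answer 304 Not Modified.
    val allHeaders = headers ++ previousEtag.map(etag => "If-None-Match" -> etag)
    allHeaders.foldLeft(base) { case (builder, (name, value)) => builder.header(name, value) }.build()
  }
}
```

Building the request is separate from sending it, so the retry helper can wrap any send call; a 304 response would then indicate that the stored version is still current.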

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object UrlLoader

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
UrlLoader.type