org.llm4s.rag.loader

Members list

Type members

Classlikes

final case class CrawlerConfig(maxDepth: Int, maxPages: Int, followPatterns: Seq[String], excludePatterns: Seq[String], respectRobotsTxt: Boolean, delayMs: Int, timeoutMs: Int, userAgent: String, maxQueueSize: Int, includeQueryParams: Boolean, sameDomainOnly: Boolean, acceptContentTypes: Set[String])

Configuration for web crawling.

Controls how the WebCrawlerLoader discovers and fetches web pages.

Value parameters

acceptContentTypes

Content types to process (others are skipped)

delayMs

Delay between requests in milliseconds (rate limiting)

excludePatterns

URL patterns to exclude (glob syntax)

followPatterns

URL patterns to follow (glob syntax with asterisk wildcards)

includeQueryParams

Whether to treat URLs with different query params as distinct pages

maxDepth

Maximum link depth to follow from seed URLs (0 = seed URLs only)

maxPages

Maximum total pages to crawl

maxQueueSize

Maximum number of URLs to queue (prevents unbounded memory usage)

respectRobotsTxt

Whether to respect robots.txt directives

sameDomainOnly

Whether to restrict crawling to the same domain as seed URLs

timeoutMs

HTTP request timeout in milliseconds

userAgent

User agent string for HTTP requests
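
The exact matching rules behind followPatterns and excludePatterns are not documented on this page. As a minimal sketch, assuming an asterisk matches any run of characters, an empty followPatterns list means "follow everything", and exclusions win over inclusions, a URL filter could look like this. GlobFilter, globToRegex and shouldFollow are hypothetical names, not part of llm4s.

```scala
import java.util.regex.Pattern

object GlobFilter {
  // Translate a glob with '*' wildcards into a regex.
  // Everything except '*' is quoted so that '.', '?' etc. match literally.
  def globToRegex(glob: String): Pattern = {
    val parts = glob.split("\\*", -1).map(Pattern.quote)
    Pattern.compile(parts.mkString(".*"))
  }

  // Assumption: empty followPatterns means "follow everything",
  // and any matching exclude pattern rejects the URL.
  def shouldFollow(url: String, followPatterns: Seq[String], excludePatterns: Seq[String]): Boolean = {
    def matchesAny(patterns: Seq[String]): Boolean =
      patterns.exists(p => globToRegex(p).matcher(url).matches())
    val included = followPatterns.isEmpty || matchesAny(followPatterns)
    included && !matchesAny(excludePatterns)
  }
}
```

For example, `GlobFilter.shouldFollow(url, Seq("https://example.com/docs/*"), Seq("*/private/*"))` accepts pages under /docs/ unless their path contains /private/.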

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object CrawlerConfig

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
CrawlerConfig.type
final case class DirectoryLoader(path: Path, extensions: Set[String], recursive: Boolean, metadata: Map[String, String], maxDepth: Int) extends DocumentLoader

Load all documents from a directory.

Supports recursive directory traversal and file filtering by extension. Each file is loaded using FileLoader, inheriting its version and hint detection.

Value parameters

extensions

File extensions to include (without leading dot)

maxDepth

Maximum recursion depth (0 = current directory only)

metadata

Additional metadata to attach to all documents

path

Path to the directory

recursive

Whether to recurse into subdirectories
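
How the extensions, recursive and maxDepth parameters might interact can be sketched with plain java.nio. This is an illustration under stated assumptions, not the DirectoryLoader implementation: DirectoryScan and listFiles are hypothetical names, extension matching is assumed to be case-insensitive, and maxDepth = 0 is mapped to a Files.walk depth of 1 (the files directly inside the root).

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object DirectoryScan {
  // List regular files under `root` whose extension (without the dot) is in `extensions`.
  def listFiles(root: Path, extensions: Set[String], recursive: Boolean, maxDepth: Int): Seq[Path] = {
    // Files.walk counts the root itself as depth 0, so "current directory only" is depth 1.
    val walkDepth =
      if (!recursive) 1
      else if (maxDepth == Int.MaxValue) Int.MaxValue
      else maxDepth + 1
    val stream = Files.walk(root, walkDepth)
    try {
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .filter { p =>
          val name = p.getFileName.toString
          val dot = name.lastIndexOf('.')
          dot >= 0 && extensions.contains(name.substring(dot + 1).toLowerCase)
        }
        .toVector // materialize before the stream is closed
    } finally stream.close()
  }
}
```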

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object DirectoryLoader

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
DirectoryLoader.type
final case class Document(id: String, content: String, metadata: Map[String, String], hints: Option[DocumentHints], version: Option[DocumentVersion])

A document ready for RAG ingestion.

Documents represent content from any source (files, URLs, databases, APIs) in a normalized form ready for chunking and embedding.

Value parameters

content

The text content of the document

hints

Optional processing hints suggested by the loader

id

Unique identifier for this document

metadata

Key-value metadata (source, author, timestamp, etc.)

version

Optional version for change detection

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object Document

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
Document.type
final case class DocumentHints(chunkingStrategy: Option[Strategy], chunkingConfig: Option[ChunkingConfig], batchSize: Option[Int], priority: Int, skipReason: Option[String], customHints: Map[String, String])

Processing hints that loaders can suggest to the RAG pipeline.

Hints are optional suggestions - the pipeline may ignore them based on global configuration or other factors. They allow loaders to provide domain-specific optimization recommendations.

Value parameters

batchSize

Suggested batch size for embedding (for rate limiting)

chunkingConfig

Suggested chunking configuration

chunkingStrategy

Suggested chunking strategy for this document type

customHints

Additional loader-specific hints

priority

Processing priority (higher = process first)

skipReason

If set, suggests that this document be skipped and gives the reason
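
To illustrate how a pipeline might act on priority and skipReason, here is a small sketch. It does not use the real DocumentHints type: Hints and Doc are hypothetical, trimmed-down stand-ins, and treating a missing hint as priority 0 is an assumption.

```scala
object HintOrdering {
  // Hypothetical stand-ins carrying only the two fields used in this sketch.
  final case class Hints(priority: Int = 0, skipReason: Option[String] = None)
  final case class Doc(id: String, hints: Option[Hints])

  // Returns (documents to process, skipped ids with their reasons).
  // Documents to process are ordered so that higher priority comes first;
  // the sort is stable, so ties keep their original order.
  def plan(docs: Seq[Doc]): (Seq[Doc], Seq[(String, String)]) = {
    val skipped = docs.flatMap(d => d.hints.flatMap(_.skipReason).map(reason => d.id -> reason))
    val skippedIds = skipped.map(_._1).toSet
    val ordered = docs
      .filterNot(d => skippedIds.contains(d.id))
      .sortBy(d => d.hints.map(_.priority).getOrElse(0))(Ordering[Int].reverse)
    (ordered, skipped)
  }
}
```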

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object DocumentHints

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
DocumentHints.type

Abstraction for loading documents from any source into the RAG pipeline.

DocumentLoader provides a unified interface for various document sources:

  • Files and directories
  • URLs and web content
  • Cloud storage (S3, GCS, Azure Blob)
  • Databases and APIs
  • Custom sources

Key design principles:

  • Streaming support via Iterator for large document sets
  • Graceful error handling with LoadResult for partial failures
  • Optional hints for processing optimization
  • Composability through the ++ operator

Usage:

// At build time - pre-ingest documents
val rag = RAG.builder()
 .withDocuments(DirectoryLoader("./docs"))
 .build()

// At ingest time - add documents later
rag.ingest(UrlLoader(urls))

// Combine loaders
val combined = DirectoryLoader("./docs") ++ UrlLoader(urls)
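
The streaming and composability principles can be modeled without the library. In the sketch below, SimpleLoader and fromSeq are hypothetical names, and each loaded item is a plain String rather than a LoadResult, so it only illustrates the shape of a lazy ++ combinator and is not the DocumentLoader implementation.

```scala
object LoaderComposition {
  // Hypothetical, simplified loader that streams items through an Iterator.
  trait SimpleLoader { self =>
    def load(): Iterator[String]

    // Concatenate two loaders lazily: the second loader is not asked for
    // its iterator until the first one has been exhausted.
    def ++(other: SimpleLoader): SimpleLoader = new SimpleLoader {
      def load(): Iterator[String] = self.load() ++ other.load()
    }
  }

  def fromSeq(items: Seq[String]): SimpleLoader = new SimpleLoader {
    def load(): Iterator[String] = items.iterator
  }
}
```

Because the combined loader only wraps iterators, large document sets are never held in memory all at once.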

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
class FileLoader
class TextLoader
class UrlLoader

Factory and combinators for DocumentLoaders.

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type
final case class DocumentRef(id: String, path: String, metadata: Map[String, String], contentLength: Option[Long], lastModified: Option[Long], etag: Option[String])

Reference to a document in a source (S3, filesystem, database, etc.)

DocumentRef contains the document's identity and metadata from the source, but not the document content itself. Use a DocumentSource to read the content.

Value parameters

contentLength

Size of the document in bytes, if known

etag

Content hash or version identifier, if available (e.g., S3 ETag)

id

Unique identifier within the source (e.g., S3 object key)

lastModified

Last modification timestamp (epoch ms), if available

metadata

Source-specific metadata (bucket, region, content-type, etc.)

path

Human-readable path or location in the source

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Registry for tracking indexed documents.

Used by sync operations to determine which documents have been indexed, their versions, and to detect changes (adds, updates, deletes).
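
Change detection of this kind reduces to comparing two id-to-version maps. The sketch below is an assumption-laden illustration, not the registry API: ChangeDetection, Changes and diff are hypothetical names, and versions are represented as plain content-hash strings.

```scala
object ChangeDetection {
  final case class Changes(added: Set[String], updated: Set[String], deleted: Set[String], unchanged: Set[String])

  // `indexed` is what the registry recorded; `current` is what the source reports now.
  // Both map document id -> content hash.
  def diff(indexed: Map[String, String], current: Map[String, String]): Changes = {
    val added = current.keySet -- indexed.keySet
    val deleted = indexed.keySet -- current.keySet
    val common = current.keySet.intersect(indexed.keySet)
    val (unchanged, updated) = common.partition(id => indexed(id) == current(id))
    Changes(added, updated, deleted, unchanged)
  }
}
```

The sizes of the four sets correspond to the added, updated, deleted and unchanged counters reported by SyncStats.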

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
trait DocumentSource

Abstraction for document sources (S3, GCS, Azure Blob, filesystem, etc.)

DocumentSource separates "where documents come from" from "how documents are processed". This enables:

  • Using the same extraction logic for documents from any source
  • Easy addition of new sources without modifying extraction code
  • Source-specific optimizations (e.g., S3 pagination, streaming)

DocumentSource provides raw document bytes; text extraction is handled by DocumentExtractor.

To use a DocumentSource with the RAG system, convert it to a DocumentLoader using SourceBackedLoader:

val s3Source = S3DocumentSource(bucket, prefix)
val loader = SourceBackedLoader(s3Source)
rag.sync(loader)

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
Known subtypes
object DocumentSource

Attributes

Companion
trait
Supertypes
class Object
trait Matchable
class Any
Self type
DocumentSource.type
final case class DocumentVersion(contentHash: String, timestamp: Option[Long], etag: Option[String])

Version information for change detection.

Used by sync operations to determine if a document has changed since it was last indexed.

Value parameters

contentHash

SHA-256 hash of the content

etag

Optional HTTP ETag for URL sources

timestamp

Optional last modified timestamp (epoch ms)
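
A contentHash value can be produced with the JDK alone. This sketch assumes the hash is taken over the UTF-8 bytes of the content and rendered as lowercase hex; this page does not state which encoding or text format llm4s uses. ContentHash and sha256Hex are hypothetical names.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object ContentHash {
  // Lowercase hex-encoded SHA-256 of the UTF-8 bytes of `content`.
  def sha256Hex(content: String): String = {
    val digest = MessageDigest.getInstance("SHA-256").digest(content.getBytes(StandardCharsets.UTF_8))
    digest.map(b => f"${b & 0xff}%02x").mkString
  }
}
```

Two documents with equal hashes can then be treated as unchanged without comparing their full content.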

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object DocumentVersion

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
DocumentVersion.type
final case class FileLoader(path: Path, metadata: Map[String, String]) extends DocumentLoader

Load a single file as a document.

Supports all file types handled by UniversalExtractor:

  • Text files (.txt, .md, .json, .xml, .html)
  • PDF documents
  • Word documents (.docx)

Automatically detects appropriate chunking hints based on file extension. Includes version information (content hash + file timestamp) for sync operations.

Value parameters

metadata

Additional metadata to attach

path

Path to the file

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object FileLoader

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
FileLoader.type

In-memory implementation of DocumentRegistry.

Suitable for development and testing. Data is lost on restart.

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
sealed trait LoadResult

Result of loading a single document.

Represents either a successfully loaded document, a loading error, or an intentionally skipped document.
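
A three-way result type like this is typically consumed by pattern matching. The sketch below mirrors the Success, Failure and Skipped cases with hypothetical fields (the real cases carry Document and LLMError values) and folds a stream of results into counters shaped like LoadStats. Counting skipped documents toward totalAttempted is an assumption.

```scala
object LoadSummary {
  // Hypothetical stand-ins for the LoadResult cases; field names are illustrative.
  sealed trait Result
  final case class Success(docId: String) extends Result
  final case class Failure(source: String, error: String) extends Result
  final case class Skipped(source: String, reason: String) extends Result

  final case class Stats(totalAttempted: Int, successful: Int, failed: Int, skipped: Int, errors: Seq[(String, String)])

  // Fold a stream of results into counters in a single pass.
  def summarize(results: Iterator[Result]): Stats =
    results.foldLeft(Stats(0, 0, 0, 0, Vector.empty)) { (s, r) =>
      val next = s.copy(totalAttempted = s.totalAttempted + 1)
      r match {
        case Success(_)      => next.copy(successful = next.successful + 1)
        case Failure(src, e) => next.copy(failed = next.failed + 1, errors = next.errors :+ (src -> e))
        case Skipped(_, _)   => next.copy(skipped = next.skipped + 1)
      }
    }
}
```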

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
Known subtypes
class Failure
class Skipped
class Success
object LoadResult

Attributes

Companion
trait
Supertypes
trait Sum
trait Mirror
class Object
trait Matchable
class Any
Self type
LoadResult.type
final case class LoadStats(totalAttempted: Int, successful: Int, failed: Int, skipped: Int, errors: Seq[(String, LLMError)])

Aggregated loading statistics.

Value parameters

errors

List of error details for debugging

failed

Number of documents that failed to load

skipped

Number of documents intentionally skipped

successful

Number of documents successfully loaded

totalAttempted

Total documents attempted to load

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object LoadStats

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
LoadStats.type
final case class LoadingConfig(failFast: Boolean, useHints: Boolean, skipEmptyDocuments: Boolean, enableVersioning: Boolean, parallelism: Int, batchSize: Int)

Configuration for document loading behavior.

Controls how documents are loaded, processed, and tracked by the RAG pipeline.

Value parameters

batchSize

Documents per embedding batch

enableVersioning

Track versions for sync operations

failFast

If true, stop on the first error; if false, continue and collect all errors

parallelism

Maximum concurrent document processing

skipEmptyDocuments

Whether to skip documents with empty content

useHints

Whether to use loader hints for processing
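
The difference between the two failFast modes can be shown with a small generic helper. This is a sketch of the general technique only: FailFastDemo and runAll are hypothetical names, errors are plain strings, and processing is sequential (the parallelism setting is not modeled).

```scala
object FailFastDemo {
  // Apply `f` to each item; Left is an error message, Right a result.
  // With failFast = true processing stops at the first error; otherwise
  // every item is attempted and all errors are collected.
  def runAll[A, B](items: Seq[A], failFast: Boolean)(f: A => Either[String, B]): (Seq[B], Seq[String]) = {
    val ok = Vector.newBuilder[B]
    val errors = Vector.newBuilder[String]
    val it = items.iterator
    var stop = false
    while (it.hasNext && !stop) {
      f(it.next()) match {
        case Right(b) => ok += b
        case Left(e) =>
          errors += e
          if (failFast) stop = true
      }
    }
    (ok.result(), errors.result())
  }
}
```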

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object LoadingConfig

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
LoadingConfig.type
final case class RawDocument(ref: DocumentRef, content: Array[Byte])

Raw document content retrieved from a source.

Value parameters

content

Raw bytes of the document

ref

The document reference (identity and metadata)

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
final case class SourceBackedLoader(source: DocumentSource, extractor: DocumentExtractor, additionalMetadata: Map[String, String], defaultHints: Option[DocumentHints]) extends DocumentLoader

Bridge between DocumentSource and DocumentLoader.

SourceBackedLoader converts any DocumentSource into a DocumentLoader, enabling documents from S3, GCS, databases, or any custom source to be used with the RAG pipeline.

The loader:

  1. Lists documents from the source
  2. Reads document content (bytes)
  3. Extracts text using DocumentExtractor
  4. Creates Document objects with appropriate metadata and hints
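
The four steps above can be sketched as a small pipeline. All types here (Ref, Doc, Source, Extractor) are hypothetical, simplified stand-ins with string errors; the real DocumentSource, DocumentExtractor and Document carry more fields, and their method names are not shown on this page.

```scala
import java.nio.charset.StandardCharsets

object SourcePipeline {
  final case class Ref(id: String, path: String)
  final case class Doc(id: String, content: String, metadata: Map[String, String])

  trait Source {
    def list(): Iterator[Ref]
    def read(ref: Ref): Either[String, Array[Byte]]
  }

  type Extractor = Array[Byte] => Either[String, String]

  // Steps 1-4: list refs, read bytes, extract text, build documents.
  // Each document yields its own Either, so one failure does not stop the rest.
  def load(source: Source, extract: Extractor, extra: Map[String, String]): Iterator[Either[String, Doc]] =
    source.list().map { ref =>
      for {
        bytes <- source.read(ref)
        text  <- extract(bytes)
      } yield Doc(ref.id, text, extra + ("path" -> ref.path))
    }

  // Trivial extractor that treats the bytes as UTF-8 text.
  val utf8Extractor: Extractor = bytes => Right(new String(bytes, StandardCharsets.UTF_8))
}
```

Keeping the source and the extractor as separate arguments is what lets the same extraction logic serve any storage backend.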

Usage:

// From S3
val s3Source = S3DocumentSource("my-bucket", "docs/")
val loader = SourceBackedLoader(s3Source)
rag.sync(loader)

// With a custom extractor (source and customExtractor are defined elsewhere)
val customLoader = SourceBackedLoader(source, customExtractor)

Value parameters

additionalMetadata

Extra metadata to add to all documents

defaultHints

Default processing hints for documents

extractor

Document extractor for text extraction (default: DefaultDocumentExtractor)

source

The document source to load from

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object SourceBackedLoader

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
SourceBackedLoader.type
final case class SyncStats(added: Int, updated: Int, deleted: Int, unchanged: Int)

Statistics for sync operations.

Value parameters

added

New documents added

deleted

Documents removed

unchanged

Documents with no changes

updated

Existing documents updated

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object SyncStats

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
SyncStats.type

DocumentSource that supports change detection for incremental sync.

SyncableSource extends DocumentSource with version information, enabling RAG.sync() to detect which documents have changed.

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
final case class TextLoader(documents: Seq[Document]) extends DocumentLoader

Load documents from raw text content.

Useful for:

  • Programmatically created content
  • Database records
  • API responses
  • Testing

Value parameters

documents

Documents to load

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object TextLoader

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
TextLoader.type

Builder for constructing TextLoader fluently.

Attributes

Supertypes
class Object
trait Matchable
class Any
final case class UrlLoader(urls: Seq[String], headers: Map[String, String], timeoutMs: Int, metadata: Map[String, String], retryCount: Int) extends DocumentLoader

Load documents from URLs.

Supports HTTP/HTTPS URLs with configurable timeouts and headers. Includes ETag-based version detection for efficient sync operations.

Value parameters

headers

HTTP headers to send with requests

metadata

Additional metadata to attach

retryCount

Number of retry attempts for failed requests

timeoutMs

Connection and read timeout in milliseconds

urls

URLs to load
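
ETag-based change detection usually relies on standard HTTP conditional requests. The sketch below builds such a request with the JDK HTTP client types. It is not taken from UrlLoader: ConditionalFetch and buildRequest are hypothetical names, and sending the stored ETag as If-None-Match is an assumption about how the check could work.

```scala
import java.net.URI
import java.net.http.HttpRequest
import java.time.Duration

object ConditionalFetch {
  // Build a GET request. When a previously seen ETag is known, send it as
  // If-None-Match so the server can answer 304 Not Modified for unchanged content.
  // timeoutMs must be positive, as required by HttpRequest.Builder.timeout.
  def buildRequest(url: String, headers: Map[String, String], timeoutMs: Int, knownEtag: Option[String]): HttpRequest = {
    val base = HttpRequest.newBuilder(URI.create(url)).timeout(Duration.ofMillis(timeoutMs.toLong)).GET()
    val withHeaders = headers.foldLeft(base) { case (b, (k, v)) => b.header(k, v) }
    knownEtag.fold(withHeaders)(tag => withHeaders.header("If-None-Match", tag)).build()
  }
}
```

A 304 response then means the document can be counted as unchanged without downloading or re-hashing its body.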

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object UrlLoader

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
UrlLoader.type
final case class WebCrawlerLoader(seedUrls: Seq[String], config: CrawlerConfig, metadata: Map[String, String]) extends DocumentLoader

Load documents by crawling from seed URLs.

Features:

  • Breadth-first link discovery
  • Domain/pattern restrictions
  • robots.txt support
  • Rate limiting
  • HTML to text conversion
  • Deduplication by URL
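
Breadth-first discovery with a depth limit, a page budget and URL deduplication can be sketched over an in-memory link graph. This is an illustration of the technique, not the crawler itself: BfsCrawl and crawl are hypothetical names, and the `links` function stands in for fetching a page and extracting its links (robots.txt checks, rate limiting and pattern filters are left out).

```scala
import scala.collection.mutable

object BfsCrawl {
  // Visit URLs breadth-first from the seeds, never visiting a URL twice.
  // Seeds are at depth 0; links found on a page at depth d are at depth d + 1.
  def crawl(seeds: Seq[String], maxDepth: Int, maxPages: Int)(links: String => Seq[String]): Vector[String] = {
    val seen = mutable.HashSet.empty[String]
    val queue = mutable.Queue.empty[(String, Int)]
    seeds.foreach { s => if (seen.add(s)) queue.enqueue(s -> 0) }
    val visited = Vector.newBuilder[String]
    var count = 0
    while (queue.nonEmpty && count < maxPages) {
      val (url, depth) = queue.dequeue()
      visited += url
      count += 1
      if (depth < maxDepth) {
        links(url).foreach { next => if (seen.add(next)) queue.enqueue(next -> (depth + 1)) }
      }
    }
    visited.result()
  }
}
```

With maxDepth = 0 only the seed URLs are visited, which matches the meaning given for CrawlerConfig.maxDepth.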

Value parameters

config

Crawler configuration

metadata

Additional metadata for all documents

seedUrls

Starting URLs to crawl from

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object WebCrawlerLoader

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
WebCrawlerLoader.type