org.llm4s.rag.loader
Members list
Type members
Classlikes
Configuration for web crawling.
Controls how the WebCrawlerLoader discovers and fetches web pages.
Value parameters
- acceptContentTypes: Content types to process (others are skipped)
- delayMs: Delay between requests in milliseconds (rate limiting)
- excludePatterns: URL patterns to exclude (glob syntax)
- followPatterns: URL patterns to follow (glob syntax with asterisk wildcards)
- includeQueryParams: Whether to treat URLs with different query params as distinct pages
- maxDepth: Maximum link depth to follow from seed URLs (0 = seed URLs only)
- maxPages: Maximum total pages to crawl
- maxQueueSize: Maximum number of URLs to queue (prevents unbounded memory usage)
- respectRobotsTxt: Whether to respect robots.txt directives
- sameDomainOnly: Whether to restrict crawling to the same domain as seed URLs
- timeoutMs: HTTP request timeout in milliseconds
- userAgent: User agent string for HTTP requests
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: CrawlerConfig.type
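The followPatterns and excludePatterns parameters are described as glob patterns with asterisk wildcards, but this page does not say how they are matched. The following is a minimal sketch, not the library's implementation. It assumes that `*` matches any run of characters, that all other characters are literal, that an empty follow list means "follow everything", and that an exclude match overrides a follow match. `GlobSketch` and its methods are hypothetical names.

```scala
import java.util.regex.Pattern

// Hypothetical helper, not part of llm4s: one possible way to apply
// glob patterns with asterisk wildcards to URLs.
object GlobSketch {
  // Translate a glob into a regex: '*' becomes ".*"; every other
  // character is quoted so that it matches literally.
  def globToRegex(glob: String): Pattern = {
    val body = glob
      .split("\\*", -1) // limit -1 keeps trailing empty segments
      .map(part => Pattern.quote(part))
      .mkString(".*")
    Pattern.compile(body)
  }

  def matchesGlob(glob: String, url: String): Boolean =
    globToRegex(glob).matcher(url).matches()

  // Assumption: an empty follow list means "follow everything", and an
  // exclude match always wins over a follow match.
  def shouldFollow(url: String, follow: Seq[String], exclude: Seq[String]): Boolean = {
    val followed = follow.isEmpty || follow.exists(matchesGlob(_, url))
    val excluded = exclude.exists(matchesGlob(_, url))
    followed && !excluded
  }
}
```

Under these assumptions, `GlobSketch.shouldFollow("https://example.com/docs/intro", Seq("https://example.com/docs/*"), Seq("*/private/*"))` accepts the URL, while any URL containing `/private/` is rejected.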
Load all documents from a directory.
Supports recursive directory traversal and file filtering by extension. Each file is loaded using FileLoader, inheriting its version and hint detection.
Value parameters
- extensions: File extensions to include (without leading dot)
- maxDepth: Maximum recursion depth (0 = current directory only)
- metadata: Additional metadata to attach to all documents
- path: Path to the directory
- recursive: Whether to recurse into subdirectories
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, trait DocumentLoader, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: DirectoryLoader.type
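The traversal rules above (extension filter without a leading dot, maxDepth where 0 means the current directory only) can be illustrated with the JDK alone. This is a sketch under stated assumptions, not DirectoryLoader's actual code: `DirectoryWalkSketch` is a hypothetical name, and the case-insensitive extension match is an assumption. Note that `java.nio.file.Files.walk` counts the start directory itself as depth 0, so the documented depth is shifted by one.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._
import scala.util.Using

// Hypothetical helper, not part of llm4s.
object DirectoryWalkSketch {
  def listFiles(root: Path, extensions: Set[String], recursive: Boolean, maxDepth: Int): List[Path] = {
    // Files.walk treats the start directory as depth 0 and its direct
    // children as depth 1, so "0 = current directory only" maps to 1.
    val walkDepth =
      if (!recursive) 1
      else if (maxDepth == Int.MaxValue) Int.MaxValue // avoid overflow
      else maxDepth + 1
    // The stream holds open directory handles, so close it when done.
    Using.resource(Files.walk(root, walkDepth)) { stream =>
      stream.iterator().asScala
        .filter(p => Files.isRegularFile(p))
        .filter(p => hasExtension(p, extensions))
        .toList
        .sortBy(_.toString)
    }
  }

  // Assumption: extensions are compared case-insensitively, without the dot.
  private def hasExtension(p: Path, extensions: Set[String]): Boolean = {
    val name = p.getFileName.toString
    val dot  = name.lastIndexOf('.')
    dot >= 0 && extensions.contains(name.substring(dot + 1).toLowerCase)
  }
}
```

Each returned path would then be handed to a per-file loader; the result is sorted only to make the sketch deterministic.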
A document ready for RAG ingestion.
Documents represent content from any source (files, URLs, databases, APIs) in a normalized form ready for chunking and embedding.
Value parameters
- content: The text content of the document
- hints: Optional processing hints suggested by the loader
- id: Unique identifier for this document
- metadata: Key-value metadata (source, author, timestamp, etc.)
- version: Optional version for change detection
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Processing hints that loaders can suggest to the RAG pipeline.
Hints are optional suggestions: the pipeline may ignore them based on global configuration or other factors. They let loaders provide domain-specific optimization recommendations.
Value parameters
- batchSize: Suggested batch size for embedding (for rate limiting)
- chunkingConfig: Suggested chunking configuration
- chunkingStrategy: Suggested chunking strategy for this document type
- customHints: Additional loader-specific hints
- priority: Processing priority (higher = process first)
- skipReason: If set, suggests that this document be skipped and gives the reason
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: DocumentHints.type
Abstraction for loading documents from any source into the RAG pipeline.
DocumentLoader provides a unified interface for various document sources:
- Files and directories
- URLs and web content
- Cloud storage (S3, GCS, Azure Blob)
- Databases and APIs
- Custom sources
Key design principles:
- Streaming support via Iterator for large document sets
- Graceful error handling with LoadResult for partial failures
- Optional hints for processing optimization
- Composability through the ++ operator
Usage:
// At build time - pre-ingest documents
val rag = RAG.builder()
  .withDocuments(DirectoryLoader("./docs"))
  .build()
// At ingest time - add documents later
rag.ingest(UrlLoader(urls))
// Combine loaders
val combined = DirectoryLoader("./docs") ++ UrlLoader(urls)
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Known subtypes: class DirectoryLoader, class FileLoader, class SourceBackedLoader, class TextLoader, class UrlLoader, class WebCrawlerLoader
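The design principles listed for DocumentLoader (streaming through an Iterator, per-document results for partial failures, composition with `++`) can be shown with simplified stand-ins. The real DocumentLoader and LoadResult signatures are not given on this page, so `SketchLoader`, `SketchResult`, and `fromTexts` below are hypothetical and exist only to illustrate the shape of the idea.

```scala
// Hypothetical, simplified stand-ins; the real DocumentLoader and
// LoadResult signatures are not shown in this reference.
sealed trait SketchResult
object SketchResult {
  final case class Success(id: String, content: String) extends SketchResult
  final case class Failure(source: String, error: String) extends SketchResult
}

trait SketchLoader { self =>
  // Results are produced lazily, one document at a time.
  def load(): Iterator[SketchResult]

  // Combine two loaders. Iterator.++ takes its argument by name, so the
  // second loader is not started until the first one is exhausted.
  def ++(other: SketchLoader): SketchLoader = new SketchLoader {
    def load(): Iterator[SketchResult] = self.load() ++ other.load()
  }
}

object SketchLoader {
  // Builds a loader from (id, content) pairs. An empty content string is
  // reported as a Failure instead of aborting the whole load.
  def fromTexts(texts: (String, String)*): SketchLoader = new SketchLoader {
    def load(): Iterator[SketchResult] =
      texts.iterator.map { case (id, content) =>
        if (content.isEmpty) SketchResult.Failure(id, "empty content")
        else SketchResult.Success(id, content)
      }
  }
}
```

Because a failure is just another element of the stream, a consumer can keep the successful documents and report the failed ones separately.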
Factory and combinators for DocumentLoaders.
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Self type: DocumentLoaders.type
Reference to a document in a source (S3, filesystem, database, etc.)
DocumentRef contains the document's identity and metadata from the source, but not the document content itself. Use a DocumentSource to read the content.
Value parameters
- contentLength: Size of the document in bytes, if known
- etag: Content hash or version identifier, if available (e.g., S3 ETag)
- id: Unique identifier within the source (e.g., S3 object key)
- lastModified: Last modification timestamp (epoch ms), if available
- metadata: Source-specific metadata (bucket, region, content-type, etc.)
- path: Human-readable path or location in the source
Attributes
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Registry for tracking indexed documents.
Used by sync operations to determine which documents have been indexed, their versions, and to detect changes (adds, updates, deletes).
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Known subtypes: class InMemoryDocumentRegistry
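The change detection that a registry supports (adds, updates, deletes) comes down to comparing what was indexed with what the source currently holds. The sketch below is an assumption about how such a comparison could work, not the DocumentRegistry API: it represents both sides as a map from document id to a version string such as a content hash, and `SyncDiffSketch` is a hypothetical name.

```scala
// Hypothetical helper, not part of llm4s.
object SyncDiffSketch {
  final case class Diff(
    added: Set[String],
    updated: Set[String],
    deleted: Set[String],
    unchanged: Set[String]
  )

  // indexed: id -> version recorded at the last sync
  // current: id -> version reported by the source now
  def diff(indexed: Map[String, String], current: Map[String, String]): Diff = {
    val added   = current.keySet -- indexed.keySet   // new in the source
    val deleted = indexed.keySet -- current.keySet   // gone from the source
    val common  = current.keySet.intersect(indexed.keySet)
    // Among documents present on both sides, an equal version means unchanged.
    val (unchanged, updated) = common.partition(id => indexed(id) == current(id))
    Diff(added, updated, deleted, unchanged)
  }
}
```

The four sets line up with the added, updated, deleted, and unchanged counters described for sync statistics elsewhere in this package.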
Abstraction for document sources (S3, GCS, Azure Blob, filesystem, etc.)
DocumentSource separates "where documents come from" from "how documents are processed". This enables:
- Using the same extraction logic for documents from any source
- Easy addition of new sources without modifying extraction code
- Source-specific optimizations (e.g., S3 pagination, streaming)
DocumentSource provides raw document bytes; text extraction is handled by DocumentExtractor.
To use a DocumentSource with the RAG system, convert it to a DocumentLoader using SourceBackedLoader:
val s3Source = S3DocumentSource(bucket, prefix)
val loader = SourceBackedLoader(s3Source)
rag.sync(loader)
Attributes
- Companion: object
- Supertypes: class Object, trait Matchable, class Any
- Known subtypes: trait SyncableSource, class S3DocumentSource
Attributes
- Companion: trait
- Supertypes: class Object, trait Matchable, class Any
- Self type: DocumentSource.type
Version information for change detection.
Used by sync operations to determine if a document has changed since it was last indexed.
Value parameters
- contentHash: SHA-256 hash of the content
- etag: Optional HTTP ETag for URL sources
- timestamp: Optional last modified timestamp (epoch ms)
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: DocumentVersion.type
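The contentHash field is documented as a SHA-256 hash of the content. A hash like this can be computed with the JDK's `MessageDigest`. The sketch below assumes the text is hashed as UTF-8 bytes and rendered as lowercase hex; this page does not state either detail, and `ContentHashSketch` is a hypothetical name.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper, not part of llm4s.
object ContentHashSketch {
  // Assumption: content is hashed as UTF-8 and encoded as lowercase hex.
  def sha256Hex(content: String): String = {
    val digest = MessageDigest
      .getInstance("SHA-256")
      .digest(content.getBytes(StandardCharsets.UTF_8))
    // Mask each signed byte to 0..255 before formatting as two hex digits.
    digest.map(b => f"${b & 0xff}%02x").mkString
  }

  // A document counts as changed when its current hash differs from the
  // hash recorded at the last sync.
  def hasChanged(previousHash: String, currentContent: String): Boolean =
    previousHash != sha256Hex(currentContent)
}
```

Comparing hashes avoids storing or re-reading the previously indexed content; only the 64-character digest needs to be kept.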
Load a single file as a document.
Supports all file types handled by UniversalExtractor:
- Text files (.txt, .md, .json, .xml, .html)
- PDF documents
- Word documents (.docx)
Automatically detects appropriate chunking hints based on file extension. Includes version information (content hash + file timestamp) for sync operations.
Value parameters
- metadata: Additional metadata to attach
- path: Path to the file
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, trait DocumentLoader, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: FileLoader.type
In-memory implementation of DocumentRegistry.
Suitable for development and testing. Data is lost on restart.
Attributes
- Companion: object
- Supertypes
Attributes
- Companion: class
- Supertypes: class Object, trait Matchable, class Any
- Self type
Result of loading a single document.
Attributes
- Companion: trait
- Supertypes: trait Sum, trait Mirror, class Object, trait Matchable, class Any
- Self type: LoadResult.type
Aggregated loading statistics.
Value parameters
- errors: List of error details for debugging
- failed: Number that failed
- skipped: Number intentionally skipped
- successful: Number successfully loaded
- totalAttempted: Total documents attempted to load
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Configuration for document loading behavior.
Controls how documents are loaded, processed, and tracked by the RAG pipeline.
Value parameters
- batchSize: Documents per embedding batch
- enableVersioning: Whether to track versions for sync operations
- failFast: Whether to stop on the first error rather than continue and collect all errors
- parallelism: Maximum concurrent document processing
- skipEmptyDocuments: Whether to skip documents with empty content
- useHints: Whether to use loader hints for processing
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: LoadingConfig.type
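The failFast option describes two ways of handling errors during a load: stop at the first failure, or keep going and collect every error. The sketch below illustrates that difference with plain `Either` values. It is an assumption about the behavior, not the pipeline's code, and `FailFastSketch` and `loadAll` are hypothetical names.

```scala
// Hypothetical helper, not part of llm4s.
object FailFastSketch {
  // Consumes per-document results and returns (loaded values, errors).
  // With failFast = true, processing stops right after the first error.
  def loadAll[A](items: Iterator[Either[String, A]], failFast: Boolean): (List[A], List[String]) = {
    val loaded = List.newBuilder[A]
    val errors = List.newBuilder[String]
    var stop   = false
    while (!stop && items.hasNext) {
      items.next() match {
        case Right(value) => loaded += value
        case Left(error) =>
          errors += error
          if (failFast) stop = true
      }
    }
    (loaded.result(), errors.result())
  }
}
```

With the input `Right(1), Left("boom"), Right(2), Left("bang")`, fail-fast mode yields one loaded value and one error, while the collecting mode yields both values and both errors.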
Raw document content retrieved from a source.
Value parameters
- content: Raw bytes of the document
- ref: The document reference (identity and metadata)
Attributes
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
Bridge between DocumentSource and DocumentLoader.
SourceBackedLoader converts any DocumentSource into a DocumentLoader, enabling documents from S3, GCS, databases, or any custom source to be used with the RAG pipeline.
The loader:
- Lists documents from the source
- Reads document content (bytes)
- Extracts text using DocumentExtractor
- Creates Document objects with appropriate metadata and hints
Usage:
// From S3
val s3Source = S3DocumentSource("my-bucket", "docs/")
val loader = SourceBackedLoader(s3Source)
rag.sync(loader)
// With custom extractor (source and customExtractor are defined elsewhere)
val customLoader = SourceBackedLoader(source, customExtractor)
Value parameters
- additionalMetadata: Extra metadata to add to all documents
- defaultHints: Default processing hints for documents
- extractor: Document extractor for text extraction (default: DefaultDocumentExtractor)
- source: The document source to load from
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, trait DocumentLoader, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: SourceBackedLoader.type
Statistics for sync operations.
Value parameters
- added: New documents added
- deleted: Documents removed
- unchanged: Documents with no changes
- updated: Existing documents updated
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
DocumentSource that supports change detection for incremental sync.
SyncableSource extends DocumentSource with version information, enabling RAG.sync() to detect which documents have changed.
Attributes
- Supertypes
- Known subtypes: class S3DocumentSource
Load documents from raw text content.
Useful for:
- Programmatically created content
- Database records
- API responses
- Testing
Value parameters
- documents: Documents to load
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, trait DocumentLoader, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: TextLoader.type
Builder for constructing TextLoader fluently.
Attributes
- Supertypes: class Object, trait Matchable, class Any
Load documents from URLs.
Supports HTTP/HTTPS URLs with configurable timeouts and headers. Includes ETag-based version detection for efficient sync operations.
Value parameters
- headers: HTTP headers to send with requests
- metadata: Additional metadata to attach
- retryCount: Number of retry attempts for failed requests
- timeoutMs: Connection and read timeout in milliseconds
- urls: URLs to load
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, trait DocumentLoader, class Object, trait Matchable, class Any
Load documents by crawling from seed URLs.
Features:
- Breadth-first link discovery
- Domain/pattern restrictions
- robots.txt support
- Rate limiting
- HTML to text conversion
- Deduplication by URL
Value parameters
- config: Crawler configuration
- metadata: Additional metadata for all documents
- seedUrls: Starting URLs to crawl from
Attributes
- Companion: object
- Supertypes: trait Serializable, trait Product, trait Equals, trait DocumentLoader, class Object, trait Matchable, class Any
Attributes
- Companion: class
- Supertypes: trait Product, trait Mirror, class Object, trait Matchable, class Any
- Self type: WebCrawlerLoader.type
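Breadth-first link discovery with URL deduplication, a depth limit, and a page limit can be sketched without any networking by passing link extraction in as a function. This is an illustration under stated assumptions, not WebCrawlerLoader's implementation: `CrawlSketch` and the `links` function are hypothetical, and robots.txt handling, rate limiting, domain and pattern filters, and HTML conversion are left out.

```scala
import scala.collection.mutable

// Hypothetical helper, not part of llm4s.
object CrawlSketch {
  // links: returns the outgoing links of a page (stands in for fetch + parse).
  // Returns the visited URLs in breadth-first order.
  def crawl(seeds: Seq[String], links: String => Seq[String], maxDepth: Int, maxPages: Int): List[String] = {
    val visited = mutable.ListBuffer.empty[String]
    val seen    = mutable.Set.empty[String]        // deduplication by URL
    val queue   = mutable.Queue.empty[(String, Int)]

    // Seeds start at depth 0; Set.add returns false for duplicates.
    seeds.foreach(s => if (seen.add(s)) queue.enqueue((s, 0)))

    while (queue.nonEmpty && visited.size < maxPages) {
      val (url, depth) = queue.dequeue()
      visited += url
      // Only expand links while below the depth limit (0 = seeds only).
      if (depth < maxDepth) {
        links(url).foreach(next => if (seen.add(next)) queue.enqueue((next, depth + 1)))
      }
    }
    visited.toList
  }
}
```

A FIFO queue is what makes the traversal breadth-first: every page at depth n is visited before any page at depth n + 1, so a page limit cuts off the deepest pages first.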