DocumentSource

org.llm4s.rag.loader.DocumentSource

See the DocumentSource companion object

Abstraction for document sources (S3, GCS, Azure Blob, filesystem, etc.).

DocumentSource separates "where documents come from" from "how documents are processed".

DocumentSource provides raw document bytes; text extraction is handled by DocumentExtractor.
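The split between fetching bytes and extracting text can be illustrated with a small, self-contained sketch. `ByteSource`, `TextExtractor`, `InMemorySource`, and `Utf8Extractor` are hypothetical stand-ins invented for this example, not the llm4s `DocumentSource` or `DocumentExtractor` definitions, and the use of `Either` for errors is an assumption.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical stand-ins for illustration only; not the llm4s API.
trait ByteSource {
  def read(id: String): Either[String, Array[Byte]]
}

trait TextExtractor {
  def extract(bytes: Array[Byte]): String
}

// An in-memory source: knows where bytes live, nothing about their format.
final class InMemorySource(docs: Map[String, Array[Byte]]) extends ByteSource {
  def read(id: String): Either[String, Array[Byte]] =
    docs.get(id).toRight(s"not found: $id")
}

// A plain-text extractor: knows how to decode bytes, nothing about storage.
object Utf8Extractor extends TextExtractor {
  def extract(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)
}

val source = new InMemorySource(
  Map("a.txt" -> "hello".getBytes(StandardCharsets.UTF_8))
)

// The two concerns are composed only at the call site.
val text: Either[String, String] = source.read("a.txt").map(Utf8Extractor.extract)
```

Because neither side depends on the other, swapping the storage backend or the extraction logic in this sketch would leave the other component untouched.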

To use a DocumentSource with the RAG system, convert it to a DocumentLoader using SourceBackedLoader:

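// bucket, prefix, and rag are assumed to be defined elsewhere
val s3Source = S3DocumentSource(bucket, prefix)
// Wrap the source so the RAG system can consume it as a DocumentLoader
val loader = SourceBackedLoader(s3Source)
rag.sync(loader)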

Attributes

Human-readable description of this source.

Used for logging and debugging (e.g., "S3(my-bucket/docs/)").

List all document references in this source.

Returns an iterator for streaming large document sets. Each element is either a successful DocumentRef or an error.
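A minimal sketch of consuming such a listing follows. `Ref`, `ListingError`, and `listing` are hypothetical names made up for this example, and modelling "success or error" as `Either` is an assumption rather than the documented llm4s element type.

```scala
// Hypothetical stand-ins for illustration only; not the llm4s types.
final case class Ref(id: String)
final case class ListingError(message: String)

// A lazily evaluated listing: elements are produced only as they are consumed.
def listing(): Iterator[Either[ListingError, Ref]] =
  Iterator(
    Right(Ref("docs/a.pdf")),
    Left(ListingError("access denied: docs/b.pdf")),
    Right(Ref("docs/c.md"))
  )

// Walk the iterator once; a failed entry is recorded and the walk continues.
// Results are accumulated here only to keep the example small.
val result = listing().foldLeft((Vector.empty[ListingError], Vector.empty[Ref])) {
  case ((errs, ok), Left(e))  => (errs :+ e, ok)
  case ((errs, ok), Right(r)) => (errs, ok :+ r)
}
val errors: Vector[ListingError] = result._1
val refs: Vector[Ref] = result._2
```

For a genuinely large source, each reference would more likely be processed as it is read instead of being collected into a vector.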

Read document content into memory.

Estimated number of documents in this source, if known.

Used for progress reporting. Return None if unknown or expensive to compute.
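As a rough illustration of how an optional estimate might feed progress reporting, the sketch below formats a progress label with and without a known total. `progressLabel` is a hypothetical helper, and the `Option[Int]` parameter type is an assumption about the shape of the estimate.

```scala
// Hypothetical helper for illustration; not part of llm4s.
def progressLabel(processed: Int, estimated: Option[Int]): String =
  estimated match {
    case Some(total) if total > 0 =>
      // The total is only an estimate, so clamp the percentage at 100.
      val pct = math.min(100L, processed.toLong * 100 / total)
      s"$processed/$total ($pct%)"
    case _ =>
      // No usable estimate: report the running count only.
      s"$processed processed"
  }
```

Falling back to a plain count keeps the reporting code working when a source returns None.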

Read document content as a stream.

Use this for large documents to avoid loading everything into memory. The caller is responsible for closing the returned stream.

Default implementation wraps readDocument; override for true streaming.
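The default-wrapping behaviour and the caller's duty to close the stream can be sketched as follows. `SimpleSource`, `readBytes`, `readStream`, and `OneDocSource` are hypothetical names for this example and do not reproduce the llm4s signatures; only the pattern is the point.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import scala.util.Using

// Hypothetical simplified trait for illustration; not the llm4s definition.
trait SimpleSource {
  def readBytes(id: String): Either[String, Array[Byte]]

  // Default: load the whole document, then expose it as a stream.
  // A source capable of true streaming would override this to avoid buffering.
  def readStream(id: String): Either[String, InputStream] =
    readBytes(id).map(bytes => new ByteArrayInputStream(bytes))
}

object OneDocSource extends SimpleSource {
  def readBytes(id: String): Either[String, Array[Byte]] =
    if (id == "a.txt") Right("hello".getBytes("UTF-8"))
    else Left(s"not found: $id")
}

// The caller owns the stream; Using.resource closes it even if reading throws.
val size: Either[String, Int] =
  OneDocSource.readStream("a.txt").map { in =>
    Using.resource(in)(_.readAllBytes().length)
  }
```

With the default in place, the stream offers no memory savings, which is why overriding it matters for large documents.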
