DocumentSource

org.llm4s.rag.loader.DocumentSource
See the DocumentSource companion object

Abstraction for document sources (S3, GCS, Azure Blob, filesystem, etc.)

DocumentSource separates "where documents come from" from "how documents are processed". This enables:

  • Using the same extraction logic for documents from any source
  • Easy addition of new sources without modifying extraction code
  • Source-specific optimizations (e.g., S3 pagination, streaming)

DocumentSource provides raw document bytes; text extraction is handled by DocumentExtractor.

To use a DocumentSource with the RAG system, convert it to a DocumentLoader using SourceBackedLoader:

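// bucket and prefix are assumed to be defined by the caller
val s3Source = S3DocumentSource(bucket, prefix)
// Wrap the source so the RAG system can consume it as a DocumentLoader
val loader = SourceBackedLoader(s3Source)
rag.sync(loader)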
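
Adding a new source comes down to implementing the abstract members listed below. The following is a minimal sketch of that idea, not the library's code: `Result`, `DocumentRef`, and the trait itself are simplified stand-ins defined locally so the example compiles on its own, the `Result[Array[Byte]]` return type of `readDocument` is an assumption, and `InMemoryDocumentSource` is a hypothetical name.

```scala
import java.io.{ByteArrayInputStream, InputStream}

object DocumentSourceSketch {
  // Stand-ins for the llm4s types (assumptions, not the library's definitions).
  type Result[A] = Either[String, A]
  final case class DocumentRef(id: String)

  // Mirrors the members documented on this page.
  // The return type of readDocument is assumed to be raw bytes.
  trait DocumentSource {
    def description: String
    def listDocuments(): Iterator[Result[DocumentRef]]
    def readDocument(ref: DocumentRef): Result[Array[Byte]]
    def estimatedCount: Option[Int] = None
    // Default: wrap the in-memory bytes; real sources can override for true streaming.
    def readDocumentStream(ref: DocumentRef): Result[InputStream] =
      readDocument(ref).map(bytes => new ByteArrayInputStream(bytes))
  }

  // Hypothetical source backed by an in-memory map, handy for tests.
  final class InMemoryDocumentSource(docs: Map[String, Array[Byte]]) extends DocumentSource {
    def description: String = s"InMemory(${docs.size} docs)"
    def listDocuments(): Iterator[Result[DocumentRef]] =
      docs.keysIterator.map(id => Right(DocumentRef(id)))
    def readDocument(ref: DocumentRef): Result[Array[Byte]] =
      docs.get(ref.id).toRight(s"No such document: ${ref.id}")
    override def estimatedCount: Option[Int] = Some(docs.size)
  }
}
```

Because extraction code only sees the trait, a source like this can be swapped for an S3 or filesystem source without touching the processing side.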

Attributes

Companion: object
Supertypes: class Object, trait Matchable, class Any

Members list

Value members

Abstract methods

def description: String

Human-readable description of this source.

Used for logging and debugging (e.g., "S3(my-bucket/docs/)").

def listDocuments(): Iterator[Result[DocumentRef]]

List all document references in this source.

Returns an iterator for streaming large document sets. Each element is either a successful DocumentRef or an error.
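
Since each element may be an error, callers usually handle both cases while walking the iterator once. The sketch below shows one way to do that without materializing the whole listing. `Result` and `DocumentRef` are local stand-ins (assumed here to be `Either[String, A]` and a one-field case class), and `processAll` is a hypothetical helper, not part of the library.

```scala
object ListDocumentsSketch {
  // Stand-ins for the llm4s types (assumptions, not the library's definitions).
  type Result[A] = Either[String, A]
  final case class DocumentRef(id: String)

  // Handles each successful reference as it arrives and collects the errors,
  // so one failed listing entry does not abort the whole run.
  def processAll(listing: Iterator[Result[DocumentRef]])(handle: DocumentRef => Unit): Vector[String] = {
    val errors = Vector.newBuilder[String]
    listing.foreach {
      case Right(ref) => handle(ref)
      case Left(err)  => errors += err
    }
    errors.result()
  }
}
```

Only the errors are accumulated; the references themselves are processed one at a time, which keeps memory use flat for large listings.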

def readDocument(ref: DocumentRef)

Read document content into memory.

Value parameters

ref: Document reference from listDocuments()

Attributes

Returns: Raw document bytes or an error

Concrete methods

def estimatedCount: Option[Int]

Estimated number of documents in this source, if known.

Used for progress reporting. Return None if unknown or expensive to compute.
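
A progress reporter might consume the estimate roughly as follows. This is an illustrative sketch only: `progressLabel` is a hypothetical helper, and the fallback wording for the unknown case is an arbitrary choice.

```scala
object ProgressSketch {
  // Shows a percentage when an estimate is available,
  // and falls back to a plain counter when it is None.
  def progressLabel(processed: Int, estimatedCount: Option[Int]): String =
    estimatedCount match {
      case Some(total) if total > 0 =>
        val pct = processed * 100 / total // integer percentage, rounded down
        s"$processed/$total ($pct%)"
      case _ =>
        s"$processed processed"
    }
}
```

Since the value is only an estimate, the processed count can exceed it; a reporter should tolerate percentages above 100 rather than treat them as an error.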

def readDocumentStream(ref: DocumentRef): Result[InputStream]

Read document content as a stream.

Use this for large documents to avoid loading everything into memory. The caller is responsible for closing the returned stream.

Default implementation wraps readDocument; override for true streaming.

Value parameters

ref: Document reference from listDocuments()

Attributes

Returns: InputStream for the document content or an error
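
Because the caller owns the returned stream, it helps to scope its lifetime explicitly. The sketch below uses `scala.util.Using` from the standard library (Scala 2.13 and later) to read a stream in chunks and close it afterwards. `Result` is a local stand-in assumed to be `Either[String, A]`, and `countBytes` is a hypothetical helper that takes the value returned by `readDocumentStream`.

```scala
import java.io.InputStream
import scala.util.Using

object StreamReadSketch {
  // Stand-in for the llm4s Result type (an assumption, not the library's definition).
  type Result[A] = Either[String, A]

  // Counts the bytes of an opened document in 8 KiB chunks.
  // Using.resource closes the stream when the block finishes, even if reading throws;
  // an IOException from read still propagates to the caller.
  def countBytes(opened: Result[InputStream]): Result[Long] =
    opened.map { in =>
      Using.resource(in) { stream =>
        val buffer = new Array[Byte](8192)
        Iterator
          .continually(stream.read(buffer))
          .takeWhile(_ != -1)
          .map(_.toLong)
          .sum
      }
    }
}
```

With a real source the call would look like `countBytes(source.readDocumentStream(ref))`; an error from the source passes through unchanged and no stream is opened.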