org.llm4s.rag.loader.DocumentSource
See the DocumentSource companion object
Abstraction for document sources (S3, GCS, Azure Blob, filesystem, etc.)
DocumentSource separates "where documents come from" from "how documents are processed". This enables:
- Using the same extraction logic for documents from any source
- Easy addition of new sources without modifying extraction code
- Source-specific optimizations (e.g., S3 pagination, streaming)
DocumentSource provides raw document bytes; text extraction is handled by DocumentExtractor.
To use a DocumentSource with the RAG system, convert it to a DocumentLoader using SourceBackedLoader:
val s3Source = S3DocumentSource(bucket, prefix)
val loader = SourceBackedLoader(s3Source)
rag.sync(loader)
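The separation between "where documents come from" and "how documents are processed" can be illustrated with a minimal sketch. It does not use the llm4s types: `Ref`, `Outcome`, `ByteSource`, `InMemorySource`, and `extractAll` are hypothetical stand-ins, and the error type is assumed to be a plain `String` inside an `Either`. The actual trait signatures may differ.

```scala
import java.nio.charset.StandardCharsets

// Simplified stand-ins for illustration only; NOT the llm4s definitions.
object SourceSketch {
  final case class Ref(id: String)     // hypothetical stand-in for DocumentRef
  type Outcome[A] = Either[String, A]  // hypothetical error representation

  // Hypothetical, reduced version of the members described on this page.
  trait ByteSource {
    def description: String
    def listDocuments(): Iterator[Outcome[Ref]]
    def readDocument(ref: Ref): Outcome[Array[Byte]]
  }

  // A source backed by an in-memory map: it only knows where bytes come from.
  final class InMemorySource(docs: Map[String, Array[Byte]]) extends ByteSource {
    val description: String = s"InMemory(${docs.size} docs)"

    def listDocuments(): Iterator[Outcome[Ref]] =
      docs.keysIterator.map(id => Right(Ref(id)))

    def readDocument(ref: Ref): Outcome[Array[Byte]] =
      docs.get(ref.id).toRight(s"not found: ${ref.id}")
  }

  // Source-agnostic processing: decodes every listed document as UTF-8 text.
  // It works unchanged for any ByteSource implementation.
  def extractAll(source: ByteSource): List[Outcome[String]] =
    source.listDocuments().map { item =>
      for {
        ref   <- item
        bytes <- source.readDocument(ref)
      } yield new String(bytes, StandardCharsets.UTF_8)
    }.toList
}
```

Adding a new backend in this sketch means writing another `ByteSource`; `extractAll` stays untouched.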
Attributes
Companion: object
Supertypes: class Object, trait Matchable, class Any
Members list
Human-readable description of this source.
Used for logging and debugging (e.g., "S3(my-bucket/docs/)")
List all document references in this source.
Returns an iterator for streaming large document sets. Each element is either a successful DocumentRef or an error.
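Because each element of the listing is either a reference or an error, a caller can handle items one at a time without materializing the whole listing. The sketch below assumes errors are modeled as `Left[String]`; `Ref`, `Outcome`, and `processEach` are hypothetical names, not part of llm4s.

```scala
// Illustrative only: stand-in types, not the llm4s definitions.
object ListingSketch {
  final case class Ref(id: String)     // hypothetical stand-in for DocumentRef
  type Outcome[A] = Either[String, A]  // hypothetical error representation

  // Walks the listing once, handling each reference as it arrives.
  // A failed element is counted and skipped; it does not abort the run.
  def processEach(listing: Iterator[Outcome[Ref]])(handle: Ref => Unit): Int = {
    var failures = 0
    listing.foreach {
      case Right(ref) => handle(ref)
      case Left(_)    => failures += 1 // a real caller would also log the error
    }
    failures
  }
}
```

The return value is the number of failed listing entries, which a caller could report at the end of a sync.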
Read document content into memory.
Value parameters
ref: Document reference from listDocuments()
Returns
Raw document bytes or an error
Estimated number of documents in this source, if known.
Used for progress reporting. Return None if unknown or expensive to compute.
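One way a caller might use an optional estimate for progress reporting is sketched below. The helper `progressLine` is hypothetical, and the estimate is assumed to be an `Option[Int]`; the actual numeric type may differ.

```scala
// Illustrative only: a hypothetical progress formatter, not part of llm4s.
object ProgressSketch {
  // Formats progress with a percentage when an estimate is available,
  // and falls back to a plain count when it is None (or not positive).
  def progressLine(done: Int, estimated: Option[Int]): String =
    estimated match {
      case Some(total) if total > 0 =>
        // The estimate may be too low, so cap the percentage at 100.
        val pct = math.min(100, done * 100 / total)
        s"$done/$total documents ($pct%)"
      case _ =>
        s"$done documents processed"
    }
}
```

Treating the value as an estimate rather than an exact total is why the percentage is capped instead of assumed to end at exactly 100.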
Read document content as a stream.
Use this for large documents to avoid loading everything into memory. The caller is responsible for closing the returned stream.
Default implementation wraps readDocument; override for true streaming.
Value parameters
ref: Document reference from listDocuments()
Returns
InputStream for the document content or an error
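A default that wraps an in-memory read, together with a caller that closes the stream, might look like the following sketch. `Outcome`, `streamFromBytes`, and `countBytes` are hypothetical names, and the `Either[String, _]` error type is an assumption; only `java.io` and `scala.util.Using` (Scala 2.13+) are real library APIs here.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import scala.util.Using

// Illustrative only: stand-in helpers, not the llm4s implementation.
object StreamSketch {
  type Outcome[A] = Either[String, A] // hypothetical error representation

  // Non-streaming default: read all bytes, then expose them as a stream.
  // A source with true streaming support would override this behavior.
  def streamFromBytes(read: => Outcome[Array[Byte]]): Outcome[InputStream] =
    read.map(bytes => new ByteArrayInputStream(bytes))

  // Caller side: Using.resource closes the stream when the body finishes,
  // including when the body throws.
  def countBytes(stream: Outcome[InputStream]): Outcome[Long] =
    stream.map { in =>
      Using.resource(in) { s =>
        val buf   = new Array[Byte](8192)
        var total = 0L
        var n     = s.read(buf)
        while (n != -1) {
          total += n
          n = s.read(buf)
        }
        total
      }
    }
}
```

With the wrapping default the whole document is still held in memory, so the memory benefit only appears once a source provides a real streaming implementation.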