S3DocumentSource

org.llm4s.rag.loader.s3.S3DocumentSource
See the S3DocumentSource companion object
final case class S3DocumentSource(bucket: String, prefix: String, region: String, extensions: Set[String], credentials: Option[AwsCredentialsProvider], metadata: Map[String, String], endpointOverride: Option[String]) extends SyncableSource

Document source for AWS S3.

Reads documents from an S3 bucket with support for:

  • Prefix filtering (e.g., "docs/", "reports/2024/")
  • File extension filtering
  • Automatic change detection via ETags
  • Pagination for large buckets
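
The filtering and change-detection behavior listed above can be pictured with a small, self-contained sketch. It does not use the llm4s or AWS SDK types: `ObjectSummary`, `matches`, and `changed` are hypothetical names, and the choice to compare extensions without a leading dot and case-insensitively is an assumption, not confirmed behavior of S3DocumentSource.

```scala
object S3FilterSketch {
  // Hypothetical stand-in for one entry of an S3 object listing.
  final case class ObjectSummary(key: String, etag: String)

  // True when the key sits under the prefix and its extension is allowed.
  // An empty extension set means "all files", as in the parameter docs.
  def matches(key: String, prefix: String, extensions: Set[String]): Boolean = {
    val ext = key.lastIndexOf('.') match {
      case -1 => ""
      case i  => key.substring(i + 1).toLowerCase
    }
    key.startsWith(prefix) && (extensions.isEmpty || extensions.contains(ext))
  }

  // Keys that are new, or whose ETag differs from the one recorded last time.
  def changed(previous: Map[String, String], current: Seq[ObjectSummary]): Seq[String] =
    current.collect { case o if !previous.get(o.key).contains(o.etag) => o.key }
}
```

Comparing ETags only requires the listing metadata, so a sync pass built this way never has to download unchanged objects.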

Authentication uses the AWS credential chain by default:

  1. Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
  2. System properties
  3. AWS credentials file (~/.aws/credentials)
  4. IAM role (for EC2/Lambda)

Usage:

val source = S3DocumentSource("my-bucket", prefix = "docs/")
val loader = SourceBackedLoader(source)
rag.sync(loader)

Value parameters

bucket

S3 bucket name

credentials

Optional credentials provider (default: AWS credential chain)

endpointOverride

Optional endpoint override (for LocalStack, MinIO, etc.)

extensions

File extensions to include (empty = all files)

metadata

Additional metadata to attach to all documents

prefix

Key prefix to filter objects (e.g., "docs/", "reports/")

region

AWS region (default: us-east-1)

Attributes

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

override def description: String

Human-readable description of this source.

Used for logging and debugging (e.g., "S3(my-bucket/docs/)")

override def estimatedCount: Option[Int]

Estimated number of documents in this source, if known.

Used for progress reporting. Return None if unknown or expensive to compute.

Get version information for change detection.

This should return quickly without reading the full document content. For S3, use the ETag; for filesystems, use content hash + mtime.

Value parameters

ref

Document reference

Attributes

Returns

Version info for comparison, or error if unavailable

override def listDocuments(): Iterator[Result[DocumentRef]]

List all document references in this source.

Returns an iterator for streaming large document sets. Each element is either a successful DocumentRef or an error.
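
Because each element can be either a reference or an error, a caller has to decide what to do with failures while walking the iterator. The sketch below shows one way to do that in a single pass. It is self-contained and uses stand-ins: `Result` is modeled here as `Either[String, A]` and `Ref` replaces `DocumentRef`; both are assumptions for illustration, and the real llm4s types may differ.

```scala
object ListingSketch {
  // Stand-ins for the library types (assumed shapes, not the real definitions).
  type Result[A] = Either[String, A]
  final case class Ref(key: String)

  // Walk the iterator once, collecting references and error messages separately.
  def drain(it: Iterator[Result[Ref]]): (Vector[Ref], Vector[String]) =
    it.foldLeft((Vector.empty[Ref], Vector.empty[String])) {
      case ((ok, errs), Right(ref)) => (ok :+ ref, errs)
      case ((ok, errs), Left(err))  => (ok, errs :+ err)
    }
}
```

Draining into vectors gives up the streaming benefit for very large buckets; a caller that wants to stay lazy could instead process each element inside a `foreach` and log errors as they appear.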

Read document content into memory.

Value parameters

ref

Document reference from listDocuments()

Attributes

Returns

Raw document bytes or an error

override def readDocumentStream(ref: DocumentRef): Result[InputStream]

Read document content as a stream.

Use this for large documents to avoid loading everything into memory. The caller is responsible for closing the returned stream.

Default implementation wraps readDocument; override for true streaming.

Value parameters

ref

Document reference from listDocuments()

Attributes

Returns

InputStream for the document content or an error
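
Since the caller owns the returned stream, it helps to tie its lifetime to a scope. The following is a minimal sketch using `scala.util.Using` from the standard library (Scala 2.13+). `countBytes` is a hypothetical helper, and the stream is passed in by name so the example does not depend on llm4s; in real use it would come from a successful readDocumentStream result.

```scala
import java.io.InputStream
import scala.util.{Try, Using}

object StreamSketch {
  // Reads the stream in fixed-size chunks and returns the total byte count.
  // Using closes the stream when the block finishes, even if read throws.
  def countBytes(open: => InputStream): Try[Long] =
    Using(open) { in =>
      val buf = new Array[Byte](8192)
      Iterator.continually(in.read(buf)).takeWhile(_ != -1).map(_.toLong).sum
    }
}
```

Only one 8 KiB buffer is held at a time, which is the point of preferring the stream over reading the whole document into memory.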

def withEndpoint(endpoint: String): S3DocumentSource

Create a copy with endpoint override (for LocalStack, MinIO).

def withExtensions(newExtensions: Set[String]): S3DocumentSource

Create a copy with different extensions filter.

def withMetadata(additionalMetadata: Map[String, String]): S3DocumentSource

Create a copy with additional metadata.

def withPrefix(newPrefix: String): S3DocumentSource

Create a copy with different prefix.

Inherited methods

def productElementNames: Iterator[String]

Inherited from: Product

def productIterator: Iterator[Any]

Inherited from: Product