DocumentExtractor

org.llm4s.rag.extract.DocumentExtractor

Service for extracting text content from documents.

DocumentExtractor is source-agnostic: it works with raw bytes from any source (filesystem, S3, HTTP, database, and so on), so the same extraction logic applies regardless of where the document is stored.

Supported formats:

Plain text files (.txt, .md, .json, .xml, .csv, .html)
PDF documents (.pdf)
Word documents (.docx, .doc)

Usage:

val extractor = DefaultDocumentExtractor

// Extract from bytes (common for S3, HTTP responses)
val bytesResult = extractor.extract(bytes, "report.pdf")

// Extract from stream (for large files); the caller closes inputStream afterwards
val streamResult = extractor.extractFromStream(inputStream, "report.pdf")

Attributes

Supertypes: class Object, trait Matchable, class Any
Known subtypes: object DefaultDocumentExtractor

Members list

Value members

Abstract methods

Check if a MIME type can be extracted to text.

Value parameters

mimeType: The MIME type to check

Attributes

Returns: true if this extractor can handle the MIME type
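
As an illustration of what such a check might involve, here is a minimal sketch that assumes a fixed allow-list of MIME types. The object name `MimeSupportSketch`, the method name `canHandle`, and the exact MIME strings are assumptions made for this example; they are not taken from the library.

```scala
import java.util.Locale

// Hypothetical stand-in; not part of llm4s.
object MimeSupportSketch {
  // Assumed MIME types for the formats in the "Supported formats" list.
  private val supported: Set[String] = Set(
    "text/plain",
    "text/markdown",
    "application/json",
    "application/xml",
    "text/csv",
    "text/html",
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/msword"
  )

  // Drop parameters such as "; charset=utf-8" and compare case-insensitively.
  def canHandle(mimeType: String): Boolean = {
    val base = mimeType.takeWhile(_ != ';').trim.toLowerCase(Locale.ROOT)
    supported.contains(base)
  }
}
```

Normalizing the input first means a header value like "text/plain; charset=utf-8" is treated the same as "text/plain".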

Detect the MIME type of document content.

Value parameters

content: Raw document bytes (only the first few KB are needed)
filename: Filename hint for detection

Attributes

Returns: Detected MIME type string
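
One way to combine content bytes with a filename hint is sketched below. It assumes a simple strategy: check the leading bytes for the PDF signature, then fall back to the file extension. The object `MimeDetectSketch`, its `detect` method, and the extension table are hypothetical; the actual detection strategy used by the library is not described on this page.

```scala
import java.util.Locale

// Hypothetical stand-in; not part of llm4s.
object MimeDetectSketch {
  // "%PDF" as unsigned byte values.
  private val pdfMagic: Array[Int] = Array(0x25, 0x50, 0x44, 0x46)

  // True when `content` begins with the given magic bytes.
  private def startsWith(content: Array[Byte], magic: Array[Int]): Boolean =
    content.length >= magic.length &&
      magic.indices.forall(i => (content(i) & 0xff) == magic(i))

  // Assumed extension-to-MIME mapping used as the fallback.
  private val byExtension: Map[String, String] = Map(
    "txt"  -> "text/plain",
    "md"   -> "text/markdown",
    "json" -> "application/json",
    "xml"  -> "application/xml",
    "csv"  -> "text/csv",
    "html" -> "text/html",
    "pdf"  -> "application/pdf",
    "doc"  -> "application/msword",
    "docx" -> "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
  )

  def detect(content: Array[Byte], filename: String): String = {
    val ext = filename.lastIndexOf('.') match {
      case -1 => ""
      case i  => filename.substring(i + 1).toLowerCase(Locale.ROOT)
    }
    // Content evidence wins over the filename; the extension is only a hint.
    if (startsWith(content, pdfMagic)) "application/pdf"
    else byExtension.getOrElse(ext, "application/octet-stream")
  }
}
```

Only the first four bytes are inspected here, which is consistent with the note that just the start of the content is needed.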

Extract text content from document bytes.

Value parameters

content: Raw document bytes
filename: Filename for type detection (e.g., "report.pdf"). Used to detect format if mimeType is not provided.
mimeType: Optional MIME type override. If provided, skips detection.

Attributes

Returns: Extracted document with text, metadata, and detected format
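
The interplay between the filename and the optional MIME type override can be sketched as follows. Everything in this block is a stand-in: `ExtractedSketch`, `ExtractSketch`, the `Either` return type, and the extension-only detection are assumptions for illustration, since this page does not show the real signature or result type.

```scala
import java.nio.charset.StandardCharsets
import java.util.Locale

// Hypothetical result type; the real one also carries metadata.
final case class ExtractedSketch(text: String, format: String)

// Hypothetical stand-in; not part of llm4s.
object ExtractSketch {
  // Placeholder detection that looks at the extension only.
  def detect(content: Array[Byte], filename: String): String =
    if (filename.toLowerCase(Locale.ROOT).endsWith(".txt")) "text/plain"
    else "application/octet-stream"

  def extract(
    content: Array[Byte],
    filename: String,
    mimeType: Option[String] = None
  ): Either[String, ExtractedSketch] = {
    // An explicit MIME type wins; detection runs only when none is given.
    val format = mimeType.getOrElse(detect(content, filename))
    format match {
      case "text/plain" =>
        Right(ExtractedSketch(new String(content, StandardCharsets.UTF_8), format))
      case other =>
        Left(s"unsupported format: $other")
    }
  }
}
```

With this sketch, `ExtractSketch.extract(bytes, "a.bin")` fails because detection cannot identify the format, while passing `Some("text/plain")` as the third argument bypasses detection and succeeds.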

Extract text content from an InputStream.

Use this for large files to avoid loading the entire content into memory. The caller is responsible for closing the stream after this method returns.

Value parameters

filename: Filename for type detection
input: InputStream to read from
mimeType: Optional MIME type override

Attributes

Returns: Extracted document with text, metadata, and detected format
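
Because the caller owns the stream, a resource-management helper such as `scala.util.Using` (Scala 2.13+) is one way to make sure it is closed. The sketch below uses a hypothetical stand-in for the extractor (`StreamSketch.extractFromStream`, which simply decodes the stream as UTF-8) so that it runs on its own; with the real extractor only the call inside the `Using.resource` block would change.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets
import scala.util.Using

// Hypothetical stand-in; not part of llm4s.
object StreamSketch {
  // Reads the stream in fixed-size chunks and decodes it as UTF-8.
  // Like the documented method, it does not close the stream.
  def extractFromStream(input: InputStream, filename: String): String = {
    val out = new ByteArrayOutputStream()
    val buf = new Array[Byte](8192)
    var n   = input.read(buf)
    while (n != -1) {
      out.write(buf, 0, n)
      n = input.read(buf)
    }
    out.toString(StandardCharsets.UTF_8.name())
  }

  // Using.resource closes the stream when the block finishes, even on failure.
  def run(): String = {
    val source = new ByteArrayInputStream("streamed text".getBytes(StandardCharsets.UTF_8))
    Using.resource(source) { in =>
      extractFromStream(in, "notes.txt")
    }
  }
}
```

The in-memory stream is only there to keep the example runnable; a file or network stream would be wrapped in the same way.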
