DocumentExtractor

org.llm4s.rag.extract.DocumentExtractor

Service for extracting text content from documents.

DocumentExtractor is source-agnostic: it works with raw bytes from any source (filesystem, S3, HTTP, database, and so on), so the same extraction logic applies regardless of where the document is stored.

Supported formats:

  • Plain text files (.txt, .md, .json, .xml, .csv, .html)
  • PDF documents (.pdf)
  • Word documents (.docx, .doc)

Usage:

val extractor = DefaultDocumentExtractor

// Extract from bytes (common for S3, HTTP responses)
val result = extractor.extract(bytes, "report.pdf")

// Extract from stream (for large files)
val streamResult = extractor.extractFromStream(inputStream, "report.pdf")
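
The calls above return a Result rather than a plain value, so callers need to handle both outcomes. The following is a minimal sketch, assuming Result behaves like an Either with the error on the left. The names ExtractionError, the ExtractedDocument fields (text, metadata, format), and the toy extract function are hypothetical stand-ins, not the library's definitions.

```scala
import java.nio.charset.StandardCharsets

object ResultHandlingSketch {
  // Hypothetical stand-ins: the real Result and ExtractedDocument are defined
  // in llm4s and their exact shapes are not shown on this page.
  final case class ExtractionError(message: String)
  type Result[A] = Either[ExtractionError, A]
  final case class ExtractedDocument(text: String, metadata: Map[String, String], format: String)

  // Toy extractor with the same two-argument call shape as the usage above.
  // It only understands .txt files and rejects everything else.
  def extract(content: Array[Byte], filename: String): Result[ExtractedDocument] =
    if (filename.endsWith(".txt"))
      Right(ExtractedDocument(new String(content, StandardCharsets.UTF_8), Map("filename" -> filename), "text/plain"))
    else
      Left(ExtractionError(s"Unsupported format: $filename"))

  // Collapse success and failure into a single human-readable message.
  def describe(result: Result[ExtractedDocument]): String =
    result match {
      case Right(doc) => s"Extracted ${doc.text.length} characters (${doc.format})"
      case Left(err)  => s"Extraction failed: ${err.message}"
    }
}
```

Pattern matching keeps the failure path explicit; map, flatMap, or fold work equally well when the result feeds into a longer pipeline.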

Attributes

Supertypes
class Object
trait Matchable
class Any

Members list

Value members

Abstract methods

def canExtract(mimeType: String): Boolean

Check if a MIME type can be extracted to text.

Value parameters

mimeType

The MIME type to check

Attributes

Returns

true if this extractor can handle the MIME type
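
One plausible use of canExtract is as a cheap guard that filters a batch before any bytes are read. The sketch below assumes a stand-in trait exposing only canExtract; ToyExtractor and its supported MIME set are hypothetical and may differ from what the real implementation accepts.

```scala
object CanExtractSketch {
  // Stand-in exposing only the method documented above.
  trait MimeCheck {
    def canExtract(mimeType: String): Boolean
  }

  // Hypothetical implementation: accepts any text/* type plus PDF and Word.
  object ToyExtractor extends MimeCheck {
    private val supported = Set(
      "application/pdf",
      "application/msword",
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    )
    def canExtract(mimeType: String): Boolean =
      mimeType.startsWith("text/") || supported.contains(mimeType)
  }

  // Split (filename, mimeType) pairs into filenames to extract and filenames to skip.
  def partitionByExtractable(
    extractor: MimeCheck,
    files: Seq[(String, String)]
  ): (Seq[String], Seq[String]) = {
    val (ok, skipped) = files.partition { case (_, mime) => extractor.canExtract(mime) }
    (ok.map(_._1), skipped.map(_._1))
  }
}
```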

def detectMimeType(content: Array[Byte], filename: String): String

Detect the MIME type of document content.

Value parameters

content

Raw document bytes (only the first few KB are needed)

filename

Filename hint for detection

Attributes

Returns

Detected MIME type string
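
Because detection needs only the start of the content, a caller holding a large payload can pass a prefix instead of the whole array. The sketch below illustrates the idea with a toy detector that checks leading magic bytes and then falls back to the filename extension. It is an assumption about how detection could work, not a description of the library's internals, and the 8192-byte prefix size is an arbitrary illustrative choice.

```scala
import java.nio.charset.StandardCharsets

object DetectMimeSketch {
  private val PdfMagic: Array[Byte] = "%PDF-".getBytes(StandardCharsets.US_ASCII)
  private val ZipMagic: Array[Byte] = Array[Byte](0x50, 0x4b, 0x03, 0x04) // "PK" header used by .docx

  // True when content begins with the given magic bytes.
  private def hasPrefix(content: Array[Byte], magic: Array[Byte]): Boolean =
    content.length >= magic.length &&
      java.util.Arrays.equals(content.take(magic.length), magic)

  // Toy detector: magic bytes first, then the filename hint.
  def detectMimeType(content: Array[Byte], filename: String): String = {
    val name = filename.toLowerCase
    if (hasPrefix(content, PdfMagic)) "application/pdf"
    else if (hasPrefix(content, ZipMagic) && name.endsWith(".docx"))
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    else if (name.endsWith(".md")) "text/markdown"
    else if (name.endsWith(".txt")) "text/plain"
    else "application/octet-stream"
  }

  // Hand the detector only the first prefixBytes bytes of the payload.
  def detectFromPrefix(content: Array[Byte], filename: String, prefixBytes: Int = 8192): String =
    detectMimeType(content.take(prefixBytes), filename)
}
```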

def extract(content: Array[Byte], filename: String, mimeType: Option[String]): Result[ExtractedDocument]

Extract text content from document bytes.

Value parameters

content

Raw document bytes

filename

Filename for type detection (e.g., "report.pdf"). Used to detect format if mimeType is not provided.

mimeType

Optional MIME type override. If provided, skips detection.

Attributes

Returns

Extracted document with text, metadata, and detected format
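
The interaction between filename and mimeType described above amounts to "use the override if present, otherwise detect". A minimal sketch of that rule follows. The detect function and the detectCalls counter are hypothetical and exist only to make the skipped detection observable.

```scala
object MimeOverrideSketch {
  // Counts how often detection actually runs (illustration only).
  var detectCalls: Int = 0

  // Hypothetical filename-based detection.
  def detect(filename: String): String = {
    detectCalls += 1
    if (filename.endsWith(".pdf")) "application/pdf" else "text/plain"
  }

  // getOrElse takes its argument by name, so detect runs only when mimeType is None.
  def resolveMimeType(filename: String, mimeType: Option[String]): String =
    mimeType.getOrElse(detect(filename))
}
```

An explicit override is useful when the filename is missing or misleading, for example an object-store key without an extension.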

def extractFromStream(input: InputStream, filename: String, mimeType: Option[String]): Result[ExtractedDocument]

Extract text content from an InputStream.

Use this for large files to avoid loading the entire content into memory. The caller is responsible for closing the stream after this method returns.

Value parameters

filename

Filename for type detection

input

InputStream to read from

mimeType

Optional MIME type override

Attributes

Returns

Extracted document with text, metadata, and detected format
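
Since the caller owns the stream, one way to guarantee it is closed is to wrap the call in scala.util.Using. The sketch below assumes a stand-in extractFromStream that simply reads the stream as UTF-8 text and returns an Either; the real method returns Result[ExtractedDocument]. The helper extractAndClose is hypothetical.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets
import scala.util.{Try, Using}

object StreamUsageSketch {
  // Stand-in for extractFromStream: reads the whole stream as UTF-8 text and,
  // like the documented method, leaves closing to the caller.
  // The filename is ignored here; the real method uses it for type detection.
  def extractFromStream(input: InputStream, filename: String): Either[String, String] =
    Try(new String(input.readAllBytes(), StandardCharsets.UTF_8)).toEither.left.map(_.getMessage)

  // Using.resource closes the stream once the body finishes, even if it throws.
  def extractAndClose(open: () => InputStream, filename: String): Either[String, String] =
    Using.resource(open())(in => extractFromStream(in, filename))
}
```

Taking a function that opens the stream, rather than an already open stream, keeps acquisition and release in the same place.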