DocumentExtractor
org.llm4s.rag.extract.DocumentExtractor
trait DocumentExtractor
Service for extracting text content from documents.
DocumentExtractor is source-agnostic: it works with raw bytes from any source (filesystem, S3, HTTP, database, etc.), so the same extraction logic applies regardless of where the document is stored.
Supported formats:
- Plain text files (.txt, .md, .json, .xml, .csv, .html)
- PDF documents (.pdf)
- Word documents (.docx, .doc)
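The format list above can be read as a mapping from file extension to format family. The sketch below illustrates that reading only: `FormatGuess`, `Family`, and `familyOf` are hypothetical names that are not part of llm4s, and the real extractor may detect formats differently (for example by MIME type or content sniffing).

```scala
import java.util.Locale

// Hypothetical helper, not part of llm4s: maps a filename to one of the
// format families listed above, using only the file extension.
object FormatGuess {
  sealed trait Family
  case object PlainText extends Family
  case object Pdf extends Family
  case object Word extends Family

  private val plainTextExts = Set("txt", "md", "json", "xml", "csv", "html")
  private val wordExts = Set("docx", "doc")

  def familyOf(filename: String): Option[Family] = {
    val dot = filename.lastIndexOf('.')
    // No extension, or a trailing dot: nothing to classify.
    if (dot < 0 || dot == filename.length - 1) None
    else {
      val ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT)
      if (plainTextExts.contains(ext)) Some(PlainText)
      else if (ext == "pdf") Some(Pdf)
      else if (wordExts.contains(ext)) Some(Word)
      else None
    }
  }
}
```

A caller could use such a check to reject unsupported files before fetching their bytes; whether that pre-check is worthwhile depends on how the actual extractor reports unsupported formats.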
Usage:
val extractor = DefaultDocumentExtractor
// Extract from bytes (common for S3, HTTP responses)
val result = extractor.extract(bytes, "report.pdf")
// Extract from a stream (for large files)
val streamResult = extractor.extractFromStream(inputStream, "report.pdf")
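To make the source-agnostic idea concrete, here is a minimal, self-contained sketch of the same pattern. `SketchExtractor`, `PlainTextSketch`, and `SketchDemo` are stand-ins invented for illustration, and the `Either[String, String]` return type is an assumption: the real `DocumentExtractor` signatures and result types may differ.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.util.Locale

// Stand-in trait for illustration only; not the llm4s API.
trait SketchExtractor {
  def extract(bytes: Array[Byte], filename: String): Either[String, String]

  // Simplification: drain the stream and reuse the byte-based path.
  // A real implementation aimed at large files would likely avoid this.
  def extractFromStream(in: InputStream, filename: String): Either[String, String] =
    extract(in.readAllBytes(), filename)
}

// Toy implementation that only handles UTF-8 .txt content.
object PlainTextSketch extends SketchExtractor {
  def extract(bytes: Array[Byte], filename: String): Either[String, String] =
    if (filename.toLowerCase(Locale.ROOT).endsWith(".txt"))
      Right(new String(bytes, StandardCharsets.UTF_8))
    else
      Left(s"unsupported in this sketch: $filename")
}

object SketchDemo {
  // The bytes could equally come from S3, an HTTP response, or a database.
  val bytes: Array[Byte] = "hello".getBytes(StandardCharsets.UTF_8)

  val fromBytes: Either[String, String] =
    PlainTextSketch.extract(bytes, "note.txt")

  val fromStream: Either[String, String] =
    PlainTextSketch.extractFromStream(new ByteArrayInputStream(bytes), "note.txt")
}
```

Because the extractor sees only bytes and a filename, both calls go through the same logic, which is the property the description above attributes to `DocumentExtractor`.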
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Known subtypes: object DefaultDocumentExtractor