org.llm4s.rag.extract
Members list
Type members
Classlikes
Default implementation of DocumentExtractor using Apache Tika, PDFBox, and POI.
Supports:
- PDF documents via Apache PDFBox
- Word documents (.docx) via Apache POI
- Plain text files with UTF-8 encoding
- HTML, XML, JSON via Apache Tika
- Other formats via Tika fallback
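The routing in the list above can be pictured as a lookup from file extension to backend. This is a minimal sketch under stated assumptions, not the library's implementation: `BackendSketch`, `backendFor`, and the extension table are hypothetical names chosen for illustration, and the real extractor may well detect formats by content or MIME type rather than by extension.

```scala
// Hypothetical sketch: none of these names come from llm4s.
object BackendSketch {
  // Extension-to-backend table mirroring the list above (an assumption).
  private val byExtension: Map[String, String] = Map(
    "pdf"  -> "PDFBox",
    "docx" -> "POI",
    "txt"  -> "UTF-8 text",
    "html" -> "Tika",
    "xml"  -> "Tika",
    "json" -> "Tika"
  )

  // Returns the backend label for a filename. Unknown or missing
  // extensions fall through to Tika.
  def backendFor(filename: String): String = {
    val dot = filename.lastIndexOf('.')
    val ext = if (dot >= 0) filename.substring(dot + 1).toLowerCase else ""
    byExtension.getOrElse(ext, "Tika")
  }
}
```

For example, `BackendSketch.backendFor("report.pdf")` selects the PDFBox entry, while a file with an unlisted extension takes the Tika fallback.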
Service for extracting text content from documents.
DocumentExtractor is source-agnostic: it works with raw bytes from any source (filesystem, S3, HTTP, database, etc.), so the same extraction logic applies regardless of where the document is stored.
Supported formats:
- Plain text files (.txt, .md, .json, .xml, .csv, .html)
- PDF documents (.pdf)
- Word documents (.docx, .doc)
Usage:
val extractor = DefaultDocumentExtractor
// Extract from bytes (common for S3, HTTP responses)
val result = extractor.extract(bytes, "report.pdf")
// Extract from a stream (for large files)
val streamResult = extractor.extractFromStream(inputStream, "report.pdf")
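To illustrate what source-agnostic means in practice, here is a self-contained sketch in which bytes from a file and bytes from an in-memory stream pass through the same extraction path. `SourceAgnosticSketch` is a hypothetical stand-in that only decodes UTF-8 text; its `Either[String, String]` return type is an assumption, since this page does not show the real result type. It needs Java 9 or later for `InputStream.readAllBytes`.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical stand-in, not the llm4s API: handles UTF-8 text only.
object SourceAgnosticSketch {
  // The filename is only a hint here; a real extractor would use it
  // to choose a parser.
  def extract(bytes: Array[Byte], filename: String): Either[String, String] =
    if (bytes.isEmpty) Left(s"$filename is empty")
    else Right(new String(bytes, StandardCharsets.UTF_8))

  // Stream variant: drains the stream, then reuses the byte-based path.
  def extractFromStream(in: InputStream, filename: String): Either[String, String] =
    extract(in.readAllBytes(), filename)

  def main(args: Array[String]): Unit = {
    val payload = "hello".getBytes(StandardCharsets.UTF_8)

    // Source 1: a file on disk.
    val tmp = Files.createTempFile("sketch", ".txt")
    Files.write(tmp, payload)
    val fromFile = extract(Files.readAllBytes(tmp), tmp.getFileName.toString)
    Files.delete(tmp)

    // Source 2: an in-memory stream, standing in for an HTTP or S3 body.
    val fromStream = extractFromStream(new ByteArrayInputStream(payload), "remote.txt")

    // Both sources yield the same result because only the bytes matter.
    println(fromFile == fromStream)
  }
}
```

The sketch drains the whole stream into memory for simplicity; an extractor aimed at large files would presumably process the stream incrementally instead.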
Attributes
- Supertypes: class Object, trait Matchable, class Any
- Known subtypes: object DefaultDocumentExtractor
Supported document formats for extraction.
Attributes
- Companion: trait
- Supertypes: trait Sum, trait Mirror, class Object, trait Matchable, class Any
- Self type: DocumentFormat.type
Extracted document content with metadata.
Represents the result of extracting text from a document, including any metadata that could be extracted (title, author, etc.).
Value parameters
- format: The detected document format
- metadata: Document metadata (title, author, pageCount, etc.)
- text: The extracted text content
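The three value parameters can be modelled roughly as follows. This is a sketch under assumptions: only the parameter names `text`, `metadata`, and `format` come from this page, while `ExtractedSketch`, `Extracted`, `Format`, the field types, and the `describe` helper are hypothetical.

```scala
// Hypothetical model: the type names and field types are assumptions.
// Only the parameter names text, metadata, and format come from the docs.
object ExtractedSketch {
  sealed trait Format
  case object Pdf extends Format
  case object Docx extends Format
  case object PlainText extends Format

  final case class Extracted(
    text: String,
    metadata: Map[String, String],
    format: Format
  )

  // Builds a short human-readable summary from the three fields.
  def describe(doc: Extracted): String = {
    val title = doc.metadata.getOrElse("title", "(untitled)")
    s"$title [${doc.format}], ${doc.text.length} chars"
  }
}
```

A string-keyed map is used for the metadata because the set of available keys (title, author, pageCount, and so on) plausibly varies by format; the real type may differ.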
Attributes
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any