org.llm4s.rag.extract

Default implementation of DocumentExtractor using Apache Tika, PDFBox, and POI.

Supports:

PDF documents via Apache PDFBox
Word documents (.docx) via Apache POI
Plain text files with UTF-8 encoding
HTML, XML, JSON via Apache Tika
Other formats via Tika fallback

Attributes

Supertypes: trait DocumentExtractor, class Object, trait Matchable, class Any
Self type: DefaultDocumentExtractor.type

Service for extracting text content from documents.

DocumentExtractor is source-agnostic: it works with raw bytes from any source (filesystem, S3, HTTP, database, etc.), so the same extraction logic can be used regardless of where the document is stored.
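
To illustrate the source-agnostic idea, here is a minimal, dependency-free sketch. `BytesExtractor`, `PlainTextOnlyExtractor`, and the `Either[String, String]` result type are hypothetical stand-ins, not the llm4s API, and the stream variant simply drains the stream, which a real implementation aimed at large files may well avoid.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical stand-in for an extractor; this is NOT the llm4s trait.
trait BytesExtractor {
  // Core operation: raw bytes plus a filename hint, wherever the bytes came from.
  def extract(bytes: Array[Byte], filename: String): Either[String, String]

  // Stream variant (assumption): drain the stream, then reuse the byte-based path.
  def extractFromStream(in: InputStream, filename: String): Either[String, String] =
    extract(in.readAllBytes(), filename)
}

// Handles only UTF-8 plain text, to keep the sketch free of external libraries.
object PlainTextOnlyExtractor extends BytesExtractor {
  def extract(bytes: Array[Byte], filename: String): Either[String, String] =
    if (filename.toLowerCase.endsWith(".txt")) Right(new String(bytes, StandardCharsets.UTF_8))
    else Left(s"unsupported in this sketch: $filename")
}

object SourceAgnosticDemo {
  // Source 1: bytes already in memory, as from an HTTP or S3 response body.
  val inMemory: Array[Byte] = "hello from memory".getBytes(StandardCharsets.UTF_8)

  // Source 2: the same content written to and read back from the filesystem.
  private val tmp = Files.createTempFile("demo", ".txt")
  Files.write(tmp, inMemory)
  val fromDisk: Array[Byte] = Files.readAllBytes(tmp)
  Files.delete(tmp)

  // All three calls go through the same extraction logic.
  val a = PlainTextOnlyExtractor.extract(inMemory, "note.txt")
  val b = PlainTextOnlyExtractor.extract(fromDisk, "note.txt")
  val c = PlainTextOnlyExtractor.extractFromStream(new ByteArrayInputStream(inMemory), "note.txt")
}
```

Because the extractor only ever sees bytes and a filename, the caller decides how to fetch the document, and the extraction code needs no knowledge of storage backends.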

Supported formats:

Plain text files (.txt, .md, .json, .xml, .csv, .html)
PDF documents (.pdf)
Word documents (.docx, .doc)

Usage:

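import org.llm4s.rag.extract.DefaultDocumentExtractor

val extractor = DefaultDocumentExtractor

// Extract from bytes (common for S3, HTTP responses); `bytes` is assumed to be in scope
val fromBytes = extractor.extract(bytes, "report.pdf")

// Extract from stream (for large files); `inputStream` is assumed to be in scope
val fromStream = extractor.extractFromStream(inputStream, "report.pdf")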

Attributes

Supertypes: class Object, trait Matchable, class Any
Known subtypes: object DefaultDocumentExtractor

Supported document formats for extraction.

Attributes

Companion: object
Supertypes: class Object, trait Matchable, class Any
Known subtypes: object CSV, object DOC, object DOCX, object HTML, object JSON, object Markdown, object PDF, object PlainText, object Unknown, object XML
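
The listed subtypes suggest a sealed hierarchy of singleton formats. The sketch below mirrors that shape under a hypothetical name, `SketchFormat`; the `fromFilename` helper and its extension mapping are assumptions based on the extensions mentioned in this package's documentation, not a description of how the library detects formats.

```scala
// Hypothetical mirror of the format hierarchy; `SketchFormat` is not an llm4s identifier.
sealed trait SketchFormat

object SketchFormat {
  case object PDF       extends SketchFormat
  case object DOCX      extends SketchFormat
  case object DOC       extends SketchFormat
  case object PlainText extends SketchFormat
  case object Markdown  extends SketchFormat
  case object HTML      extends SketchFormat
  case object JSON      extends SketchFormat
  case object XML       extends SketchFormat
  case object CSV       extends SketchFormat
  case object Unknown   extends SketchFormat

  // Assumed helper: choose a format from the filename extension alone.
  def fromFilename(filename: String): SketchFormat = {
    val dot = filename.lastIndexOf('.')
    val ext = if (dot >= 0) filename.substring(dot + 1).toLowerCase else ""
    ext match {
      case "pdf"  => PDF
      case "docx" => DOCX
      case "doc"  => DOC
      case "txt"  => PlainText
      case "md"   => Markdown
      case "html" => HTML
      case "json" => JSON
      case "xml"  => XML
      case "csv"  => CSV
      case _      => Unknown // no extension, or one this sketch does not know
    }
  }
}
```

Sealing the trait lets the compiler check pattern matches over formats for exhaustiveness, and an explicit `Unknown` case gives callers a value to handle instead of an exception.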

Attributes

Companion: trait
Supertypes: trait Sum, trait Mirror, class Object, trait Matchable, class Any
Self type: DocumentFormat.type

Extracted document content with metadata.

Represents the result of extracting text from a document, together with whatever metadata could be recovered (title, author, etc.).

Value parameters

format: The detected document format
metadata: Document metadata (title, author, pageCount, etc.)
text: The extracted text content
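
As a rough picture of how such a result might be consumed, the sketch below defines a hypothetical case class with the three documented parameters. The name `ExtractionResultSketch`, the field types (`Map[String, String]` for metadata, `String` for format), and the accessor methods are all assumptions made for illustration; the page does not state the real types.

```scala
// Hypothetical result type: field names follow the documented value parameters,
// but the name and all field types are assumptions.
final case class ExtractionResultSketch(
  text: String,
  metadata: Map[String, String],
  format: String
) {
  // Convenience accessors (assumed); a missing or malformed key yields None.
  def title: Option[String]  = metadata.get("title")
  def pageCount: Option[Int] = metadata.get("pageCount").flatMap(_.toIntOption)
}

object ExtractionResultDemo {
  val doc = ExtractionResultSketch(
    text = "Revenue grew in Q3.",
    metadata = Map("title" -> "Q3 Report", "author" -> "Finance", "pageCount" -> "12"),
    format = "PDF"
  )

  // Metadata is best-effort, so callers should cope with absent keys.
  val untitled = doc.copy(metadata = doc.metadata - "title")
}
```

Treating every metadata field as optional matches the documentation's wording that metadata is included only where it could be extracted.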

Attributes

Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
