org.llm4s.rag.extract

Members list

Type members

Classlikes

Default implementation of DocumentExtractor using Apache Tika, PDFBox, and POI.

Supports:

  • PDF documents via Apache PDFBox
  • Word documents (.docx) via Apache POI
  • Plain text files with UTF-8 encoding
  • HTML, XML, JSON via Apache Tika
  • Other formats via Tika fallback

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type

Service for extracting text content from documents.

DocumentExtractor is source-agnostic: it works with raw bytes from any source (filesystem, S3, HTTP, database, etc.), so the same extraction logic applies regardless of where the document is stored.

Supported formats:

  • Plain text files (.txt, .md, .json, .xml, .csv, .html)
  • PDF documents (.pdf)
  • Word documents (.docx, .doc)

Usage:

val extractor = DefaultDocumentExtractor

// Extract from bytes (common for S3, HTTP responses)
val result = extractor.extract(bytes, "report.pdf")

// Extract from stream (for large files)
val streamResult = extractor.extractFromStream(inputStream, "report.pdf")
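
To illustrate the source-agnostic idea without depending on the library, here is a minimal sketch. `TextExtractor` and `Utf8TextExtractor` are hypothetical stand-ins, and the `Either[String, String]` return type is an assumption, since this page does not show the real method signatures. The point is only that the caller obtains bytes or a stream from anywhere and hands them over together with a filename.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
import java.nio.file.Files

object SourceAgnosticSketch {
  // Hypothetical stand-in for the extractor API (assumed shape, not the library's).
  trait TextExtractor {
    def extract(bytes: Array[Byte], filename: String): Either[String, String]

    // This stand-in simply buffers the whole stream and delegates;
    // a real implementation may process the stream incrementally.
    def extractFromStream(in: InputStream, filename: String): Either[String, String] =
      extract(in.readAllBytes(), filename)
  }

  // Toy implementation that only understands UTF-8 .txt files.
  object Utf8TextExtractor extends TextExtractor {
    def extract(bytes: Array[Byte], filename: String): Either[String, String] =
      if (filename.toLowerCase.endsWith(".txt")) Right(new String(bytes, StandardCharsets.UTF_8))
      else Left(s"unsupported file: $filename")
  }

  def main(args: Array[String]): Unit = {
    // Source 1: the filesystem.
    val path = Files.createTempFile("notes", ".txt")
    Files.write(path, "hello from disk".getBytes(StandardCharsets.UTF_8))
    val fromDisk = Utf8TextExtractor.extract(Files.readAllBytes(path), path.getFileName.toString)
    Files.delete(path)

    // Source 2: any InputStream (an HTTP body or S3 object would be handled the same way).
    val stream = new ByteArrayInputStream("hello from a stream".getBytes(StandardCharsets.UTF_8))
    val fromStream = Utf8TextExtractor.extractFromStream(stream, "notes.txt")

    println(fromDisk)   // Right(hello from disk)
    println(fromStream) // Right(hello from a stream)
  }
}
```

The filename is passed alongside the data because raw bytes carry no name of their own; in this sketch it is the only hint used to pick a decoding strategy.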

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
sealed trait DocumentFormat

Supported document formats for extraction.

Attributes

Companion
object
Supertypes
class Object
trait Matchable
class Any
Known subtypes
object CSV
object DOC
object DOCX
object HTML
object JSON
object Markdown
object PDF
object PlainText
object Unknown
object XML
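
A sealed trait with one object per format lends itself to exhaustive pattern matching. The sketch below redeclares the hierarchy locally, using the subtype names listed above, so that it compiles on its own; it is not the library's definition. `fromFilename` and `isBinary` are hypothetical helpers added for illustration, and the extension mapping is an assumption based on the extensions named in the DocumentExtractor description.

```scala
object FormatSketch {
  // Local stand-in mirroring the listed subtypes (not the library's own code).
  sealed trait DocumentFormat

  object DocumentFormat {
    case object CSV       extends DocumentFormat
    case object DOC       extends DocumentFormat
    case object DOCX      extends DocumentFormat
    case object HTML      extends DocumentFormat
    case object JSON      extends DocumentFormat
    case object Markdown  extends DocumentFormat
    case object PDF       extends DocumentFormat
    case object PlainText extends DocumentFormat
    case object Unknown   extends DocumentFormat
    case object XML       extends DocumentFormat

    // Hypothetical helper: guess a format from the filename's extension.
    def fromFilename(name: String): DocumentFormat = {
      val dot = name.lastIndexOf('.')
      val ext = if (dot >= 0) name.substring(dot + 1).toLowerCase else ""
      ext match {
        case "pdf"  => PDF
        case "docx" => DOCX
        case "doc"  => DOC
        case "txt"  => PlainText
        case "md"   => Markdown
        case "html" => HTML
        case "json" => JSON
        case "xml"  => XML
        case "csv"  => CSV
        case _      => Unknown
      }
    }
  }

  import DocumentFormat._

  // Because the trait is sealed, the compiler can warn if a case is left out here.
  def isBinary(format: DocumentFormat): Boolean = format match {
    case PDF | DOCX | DOC                              => true
    case PlainText | Markdown | HTML | JSON | XML | CSV => false
    case Unknown                                       => false
  }
}
```

Keeping an explicit `Unknown` case, rather than returning an `Option`, lets callers treat "unrecognised" as just another format to route to a fallback.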

Attributes

Companion
trait
Supertypes
trait Sum
trait Mirror
class Object
trait Matchable
class Any
Self type
final case class ExtractedDocument(text: String, metadata: Map[String, String], format: DocumentFormat)

Extracted document content with metadata.

Represents the result of extracting text from a document, including any metadata that could be extracted (title, author, etc.).

Value parameters

format

The detected document format

metadata

Document metadata (title, author, pageCount, etc.)

text

The extracted text content
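
As a rough sketch of how these three fields might be consumed, the example below redeclares a case class with the same shape so that it runs without the library. The metadata keys `"title"` and `"pageCount"` are assumptions taken from the parameter description; the exact key strings the extractor emits are not shown on this page. `pageCount` and `describe` are hypothetical helpers.

```scala
object ExtractedDocumentSketch {
  // Minimal local stand-ins so the sketch compiles on its own.
  sealed trait DocumentFormat
  case object PDF extends DocumentFormat

  // Same shape as the signature above, redeclared locally.
  final case class ExtractedDocument(text: String, metadata: Map[String, String], format: DocumentFormat)

  // Metadata values are strings, so numeric fields need parsing;
  // a missing or malformed value yields None instead of throwing.
  def pageCount(doc: ExtractedDocument): Option[Int] =
    doc.metadata.get("pageCount").flatMap(_.toIntOption)

  // Builds a one-line summary, falling back when metadata is absent.
  def describe(doc: ExtractedDocument): String = {
    val title = doc.metadata.getOrElse("title", "(untitled)")
    val pages = pageCount(doc).map(n => s"$n pages").getOrElse("unknown length")
    s"$title [${doc.format}], $pages, ${doc.text.length} chars"
  }

  def main(args: Array[String]): Unit = {
    val doc = ExtractedDocument(
      text = "Quarterly results",
      metadata = Map("title" -> "Q3 Report", "pageCount" -> "12"),
      format = PDF
    )
    println(describe(doc)) // Q3 Report [PDF], 12 pages, 17 chars
  }
}
```

Since `metadata` is a plain `Map[String, String]`, every lookup should be treated as optional; not every format carries a title or page count.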

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any