DefaultDocumentExtractor

org.llm4s.rag.extract.DefaultDocumentExtractor

Default implementation of DocumentExtractor using Apache Tika, PDFBox, and POI.

Supports:

  • PDF documents via Apache PDFBox
  • Word documents (.docx) via Apache POI
  • Plain text files with UTF-8 encoding
  • HTML, XML, JSON via Apache Tika
  • Other formats via Tika fallback

Supertypes
class Object
trait Matchable
class Any

Members list

Value members

Concrete methods

override def canExtract(mimeType: String): Boolean

Check if a MIME type can be extracted to text.

Value parameters

  • mimeType: The MIME type to check

Returns

true if this extractor can handle the MIME type

override def detectMimeType(content: Array[Byte], filename: String): String

Detect the MIME type of document content.

Value parameters

  • content: Raw document bytes (only the first few KB are needed)
  • filename: Filename hint for detection

Returns

Detected MIME type string
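
A minimal sketch of how detectMimeType and canExtract might be combined: detect first, then check support before attempting extraction. The MimeAware trait and ToyExtractor below are hypothetical stand-ins that only mirror the two documented signatures; they are not the llm4s types, and the toy detection looks at the filename extension rather than the bytes. The 8 KB cut-off is an assumption based on the note that only the first few KB are needed.

```scala
object MimeCheckSketch {
  // Hypothetical stand-in mirroring the two documented signatures.
  trait MimeAware {
    def canExtract(mimeType: String): Boolean
    def detectMimeType(content: Array[Byte], filename: String): String
  }

  // Toy implementation for illustration only: it maps filename extensions
  // to MIME types and ignores the bytes entirely.
  object ToyExtractor extends MimeAware {
    private val byExtension = Map(
      "pdf"  -> "application/pdf",
      "txt"  -> "text/plain",
      "html" -> "text/html"
    )

    def detectMimeType(content: Array[Byte], filename: String): String = {
      val ext = filename.toLowerCase.split('.').lastOption.getOrElse("")
      byExtension.getOrElse(ext, "application/octet-stream")
    }

    def canExtract(mimeType: String): Boolean =
      byExtension.values.exists(_ == mimeType)
  }

  // Detect the type from a small prefix of the content (8 KB is an assumed
  // size), then return it only if the extractor reports it as supported.
  def supportedMimeType(
      extractor: MimeAware,
      content: Array[Byte],
      filename: String
  ): Option[String] = {
    val detected = extractor.detectMimeType(content.take(8 * 1024), filename)
    if (extractor.canExtract(detected)) Some(detected) else None
  }
}
```

With the real extractor, the same helper shape should apply as long as it exposes these two methods; only the stand-in trait would be replaced.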

override def extract(content: Array[Byte], filename: String, mimeType: Option[String]): Result[ExtractedDocument]

Extract text content from document bytes.

Value parameters

  • content: Raw document bytes
  • filename: Filename for type detection (e.g., "report.pdf"). Used to detect format if mimeType is not provided.
  • mimeType: Optional MIME type override. If provided, skips detection.

Returns

Extracted document with text, metadata, and detected format
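
A rough sketch of calling extract with and without the mimeType override, and of handling the returned Result. Everything defined here is a hypothetical stand-in: Result is assumed to be Either-like with the error type simplified to String, the ExtractedDocument field names (text, metadata, format) are guesses from the description above, and ToyTextExtractor only handles plain text.

```scala
import java.nio.charset.StandardCharsets

object ExtractSketch {
  // Assumption: Result behaves like Either; the error side is simplified to String.
  type Result[A] = Either[String, A]

  // Assumption: field names are illustrative, not the real llm4s case class.
  final case class ExtractedDocument(
      text: String,
      metadata: Map[String, String],
      format: String
  )

  // Hypothetical stand-in mirroring the documented extract signature.
  trait Extractor {
    def extract(
        content: Array[Byte],
        filename: String,
        mimeType: Option[String]
    ): Result[ExtractedDocument]
  }

  // Toy implementation: uses the override when given, otherwise falls back
  // to a filename check, and only supports UTF-8 plain text.
  object ToyTextExtractor extends Extractor {
    def extract(
        content: Array[Byte],
        filename: String,
        mimeType: Option[String]
    ): Result[ExtractedDocument] = {
      val resolved = mimeType.getOrElse(
        if (filename.toLowerCase.endsWith(".txt")) "text/plain"
        else "application/octet-stream"
      )
      if (resolved == "text/plain")
        Right(ExtractedDocument(
          new String(content, StandardCharsets.UTF_8),
          Map("filename" -> filename),
          resolved
        ))
      else Left(s"Unsupported MIME type: $resolved")
    }
  }

  // Handle both outcomes of the Result explicitly.
  def textOrMessage(
      extractor: Extractor,
      content: Array[Byte],
      filename: String,
      mimeType: Option[String] = None
  ): String =
    extractor.extract(content, filename, mimeType) match {
      case Right(doc) => doc.text
      case Left(err)  => s"extraction failed: $err"
    }
}
```

Passing Some("text/plain") for a file with an unhelpful name shows the override path: the toy extractor skips its filename check, in the same spirit as the documented behaviour.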

override def extractFromStream(input: InputStream, filename: String, mimeType: Option[String]): Result[ExtractedDocument]

Extract text content from an InputStream.

Use this for large files to avoid loading the entire content into memory. The caller is responsible for closing the stream after this method returns.

Value parameters

  • input: InputStream to read from
  • filename: Filename for type detection
  • mimeType: Optional MIME type override

Returns

Extracted document with text, metadata, and detected format
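
Because the caller owns the stream, one way to make sure it is closed is scala.util.Using, sketched below. The StreamExtractor trait, ToyStreamExtractor, the Either-based Result, and the ExtractedDocument fields are all hypothetical stand-ins that mirror the documented signature; the toy implementation reads the whole stream into memory, which a real streaming extractor would presumably avoid.

```scala
import java.io.InputStream
import java.nio.charset.StandardCharsets
import scala.util.Using

object StreamSketch {
  // Assumption: Result behaves like Either; the error side is simplified to String.
  type Result[A] = Either[String, A]

  // Assumption: field names are illustrative only.
  final case class ExtractedDocument(
      text: String,
      metadata: Map[String, String],
      format: String
  )

  // Hypothetical stand-in mirroring the documented extractFromStream signature.
  trait StreamExtractor {
    def extractFromStream(
        input: InputStream,
        filename: String,
        mimeType: Option[String]
    ): Result[ExtractedDocument]
  }

  // Toy implementation: decodes the stream as UTF-8 text and, like the
  // documented method, does not close the stream itself.
  object ToyStreamExtractor extends StreamExtractor {
    def extractFromStream(
        input: InputStream,
        filename: String,
        mimeType: Option[String]
    ): Result[ExtractedDocument] = {
      val text = new String(input.readAllBytes(), StandardCharsets.UTF_8)
      Right(ExtractedDocument(text, Map.empty, mimeType.getOrElse("text/plain")))
    }
  }

  // The caller opens the stream and Using.resource closes it once
  // extraction returns, whether it succeeded or threw.
  def extractAndClose(
      extractor: StreamExtractor,
      open: () => InputStream,
      filename: String,
      mimeType: Option[String] = None
  ): Result[ExtractedDocument] =
    Using.resource(open()) { in =>
      extractor.extractFromStream(in, filename, mimeType)
    }
}
```

In practice the open function would typically wrap something like java.nio.file.Files.newInputStream for a file on disk.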
