DefaultDocumentExtractor

org.llm4s.rag.extract.DefaultDocumentExtractor

object DefaultDocumentExtractor extends DocumentExtractor

Default implementation of DocumentExtractor using Apache Tika, PDFBox, and POI.

Supports:

PDF documents via Apache PDFBox
Word documents (.docx) via Apache POI
Plain text files with UTF-8 encoding
HTML, XML, JSON via Apache Tika
Other formats via Tika fallback
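
The format list above amounts to a dispatch on MIME type. A minimal sketch of that idea follows. The `Backend` type, its cases, and `backendFor` are hypothetical names invented for illustration; this is not the actual llm4s implementation, only one way such routing could look.

```scala
// Hypothetical routing table; names are illustrative, not llm4s internals.
object BackendRoutingSketch {
  sealed trait Backend
  case object PdfBox    extends Backend
  case object Poi       extends Backend
  case object PlainText extends Backend
  case object Tika      extends Backend

  // Standard MIME type for .docx files.
  private val Docx =
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

  // Pick a backend for a MIME type, mirroring the "Supports" list.
  def backendFor(mimeType: String): Backend = mimeType match {
    case "application/pdf" => PdfBox
    case Docx              => Poi
    case "text/plain"      => PlainText
    case _                 => Tika // HTML, XML, JSON, and any other format
  }
}
```

Keeping the fallback as the last case means unknown types are still handed to a general-purpose parser instead of being rejected up front.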

Attributes

Supertypes: trait DocumentExtractor, class Object, trait Matchable, class Any
Self type: DefaultDocumentExtractor.type

Members list

Value members

Concrete methods

Check if a MIME type can be extracted to text.

Value parameters

mimeType: The MIME type to check

Attributes

Returns: true if this extractor can handle the MIME type
Definition Classes: DocumentExtractor
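
A MIME support check of this kind usually has to cope with parameters and mixed case in the input string. The sketch below shows one way to do that. `MimeSupportSketch`, `normalise`, `isSupported`, and the contents of the `supported` set are assumptions made for illustration; the real extractor may accept more types through its Tika fallback.

```scala
import java.util.Locale

object MimeSupportSketch {
  // Assumed set of types, taken from the format list on this page.
  private val supported: Set[String] = Set(
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "text/plain",
    "text/html",
    "application/xml",
    "application/json"
  )

  // Drop parameters such as "; charset=utf-8" and lower-case the rest.
  def normalise(mimeType: String): String =
    mimeType.takeWhile(_ != ';').trim.toLowerCase(Locale.ROOT)

  // True if the normalised type is in the assumed set.
  def isSupported(mimeType: String): Boolean =
    supported.contains(normalise(mimeType))
}
```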

Detect the MIME type of document content.

Value parameters

content: Raw document bytes (only the first few KB are needed)
filename: Filename hint for detection

Attributes

Returns: Detected MIME type string
Definition Classes: DocumentExtractor
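
To illustrate why only the first bytes and a filename hint are needed, here is a minimal, standard-library-only sketch of content sniffing with an extension fallback. It is not how Tika or this extractor performs detection; `MimeSniffSketch`, `sniff`, and the extension table are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.util.Locale

object MimeSniffSketch {
  private val DocxMime =
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document"

  // Assumed extension table, used only when the bytes are inconclusive.
  private val byExtension: Map[String, String] = Map(
    "pdf"  -> "application/pdf",
    "docx" -> DocxMime,
    "txt"  -> "text/plain",
    "html" -> "text/html",
    "xml"  -> "application/xml",
    "json" -> "application/json"
  )

  private def startsWith(content: Array[Byte], magic: Array[Byte]): Boolean =
    content.length >= magic.length && content.take(magic.length).sameElements(magic)

  def sniff(content: Array[Byte], filename: String): String = {
    val dot = filename.lastIndexOf('.')
    val ext =
      if (dot >= 0) filename.substring(dot + 1).toLowerCase(Locale.ROOT) else ""

    // PDF files begin with the ASCII marker "%PDF-".
    if (startsWith(content, "%PDF-".getBytes(StandardCharsets.US_ASCII))) "application/pdf"
    // .docx is a ZIP container ("PK\u0003\u0004"); the extension disambiguates it.
    else if (startsWith(content, Array[Byte](0x50, 0x4b, 0x03, 0x04)) && ext == "docx") DocxMime
    // Otherwise trust the filename, defaulting to a generic binary type.
    else byExtension.getOrElse(ext, "application/octet-stream")
  }
}
```

Checking magic bytes before the extension means a PDF that was saved under a misleading name is still recognised.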

Extract text content from document bytes.

Value parameters

content: Raw document bytes
filename: Filename for type detection (e.g., "report.pdf"). Used to detect format if mimeType is not provided.
mimeType: Optional MIME type override. If provided, skips detection.

Attributes

Returns: Extracted document with text, metadata, and detected format
Definition Classes: DocumentExtractor
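
The interaction between the filename and the optional MIME type override can be sketched as follows. Everything here is a stand-in: `ExtractSketch`, `Extracted`, `extractText`, the extension-only `detect`, and the `Either` return type are assumptions, since the real method signature and result type are not shown on this page. The sketch handles plain text only.

```scala
import java.nio.charset.StandardCharsets
import java.util.Locale

object ExtractSketch {
  // Hypothetical result shape, not the library's extracted-document type.
  final case class Extracted(text: String, format: String)

  // Stand-in detection based on the file extension alone.
  private def detect(filename: String): String =
    if (filename.toLowerCase(Locale.ROOT).endsWith(".txt")) "text/plain"
    else "application/octet-stream"

  def extractText(
      content: Array[Byte],
      filename: String,
      mimeType: Option[String] = None
  ): Either[String, Extracted] = {
    // getOrElse takes its argument by name, so detect runs only when
    // no override was supplied.
    val format = mimeType.getOrElse(detect(filename))
    format match {
      case "text/plain" =>
        Right(Extracted(new String(content, StandardCharsets.UTF_8), format))
      case other =>
        Left(s"unsupported format: $other")
    }
  }
}
```

Under these assumptions, passing `Some("text/plain")` lets a caller extract bytes whose filename gives no usable hint, while omitting it falls back to detection.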

Extract text content from an InputStream.

Use this for large files to avoid loading the entire content into memory. The caller is responsible for closing the stream after this method returns.

Value parameters

filename: Filename for type detection
input: InputStream to read from
mimeType: Optional MIME type override

Attributes

Returns: Extracted document with text, metadata, and detected format
Definition Classes: DocumentExtractor
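
Because the caller owns the stream, a resource-management helper such as `scala.util.Using` (Scala 2.13 and later) is a natural fit. In the sketch below, `extractFromStream` is a hypothetical stand-in that reads UTF-8 text in fixed-size chunks and deliberately leaves the stream open; `readAndClose` is likewise an illustrative name. Substitute the real extractor call where the stand-in is used.

```scala
import java.io.{InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import scala.util.Using

object StreamSketch {
  // Hypothetical stand-in for the extractor: reads the stream as UTF-8
  // through a small buffer and does NOT close it.
  def extractFromStream(input: InputStream, filename: String): String = {
    val reader = new InputStreamReader(input, StandardCharsets.UTF_8)
    val buf    = new Array[Char](8192)
    val sb     = new java.lang.StringBuilder
    var n      = reader.read(buf)
    while (n != -1) {
      sb.append(buf, 0, n)
      n = reader.read(buf)
    }
    sb.toString
  }

  // Caller side: Using.resource closes the stream once the body returns
  // or throws, which fulfils the "caller closes" contract.
  def readAndClose(open: () => InputStream, filename: String): String =
    Using.resource(open()) { in =>
      extractFromStream(in, filename)
    }
}
```

A typical call would pass something like `() => java.nio.file.Files.newInputStream(path)` as `open`, so the file handle is released even if extraction fails.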
