DefaultDocumentExtractor
Default implementation of DocumentExtractor using Apache Tika, PDFBox, and POI.
Supports:
- PDF documents via Apache PDFBox
- Word documents (.docx) via Apache POI
- Plain text files with UTF-8 encoding
- HTML, XML, JSON via Apache Tika
- Other formats via Tika fallback
Members list
Value members
Concrete methods
Check if a MIME type can be extracted to text.
Value parameters
- mimeType: The MIME type to check
Attributes
- Returns: true if this extractor can handle the MIME type
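To illustrate how a MIME type check of this kind might behave, here is a minimal sketch. The object `MimeSupport`, the method `canExtractSketch`, and the exact set of MIME types are assumptions derived from the supported-format list on this page. They are not the library's actual implementation, which may accept further types through the Tika fallback.

```scala
import java.util.Locale

// Hypothetical sketch; names and the MIME type set are assumptions, not the library's code.
object MimeSupport {
  // MIME types assumed from the formats this page lists as supported.
  private val supported: Set[String] = Set(
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "text/plain",
    "text/html",
    "application/xml",
    "application/json"
  )

  // Drop parameters such as "; charset=utf-8" and normalise case before the lookup.
  def canExtractSketch(mimeType: String): Boolean = {
    val base = mimeType.takeWhile(_ != ';').trim.toLowerCase(Locale.ROOT)
    supported.contains(base) || base.startsWith("text/")
  }
}
```

Normalising the input first means that a header value such as `Text/HTML; charset=utf-8` is treated the same as `text/html`.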
Detect the MIME type of document content.
Value parameters
- content: Raw document bytes (only first few KB are needed)
- filename: Filename hint for detection
Attributes
- Returns: Detected MIME type string
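The general idea of combining leading bytes with a filename hint can be sketched without any external library. The example below is a simplified stand-in that does not use Apache Tika; `MimeSniffer`, `detectSketch`, and the handful of signatures and extensions it checks are hypothetical and far less complete than a real detector.

```scala
import java.nio.charset.StandardCharsets
import java.util.Locale

// Hypothetical stand-in for content-based detection; not the library's code.
object MimeSniffer {
  private def startsWith(content: Array[Byte], magic: Array[Byte]): Boolean =
    content.length >= magic.length && content.take(magic.length).sameElements(magic)

  private val pdfMagic = "%PDF-".getBytes(StandardCharsets.US_ASCII)
  // ZIP local-file header ("PK\u0003\u0004"); .docx files are ZIP containers.
  private val zipMagic = Array[Byte](0x50, 0x4b, 0x03, 0x04)

  def detectSketch(content: Array[Byte], filename: String): String = {
    val name = filename.toLowerCase(Locale.ROOT)
    // Content signatures are checked first; the filename is only a hint.
    if (startsWith(content, pdfMagic)) "application/pdf"
    else if (startsWith(content, zipMagic) && name.endsWith(".docx"))
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    else if (name.endsWith(".html") || name.endsWith(".htm")) "text/html"
    else if (name.endsWith(".json")) "application/json"
    else if (name.endsWith(".xml")) "application/xml"
    else if (name.endsWith(".txt")) "text/plain"
    else "application/octet-stream"
  }
}
```

Because only the first bytes are inspected, a caller can pass a small prefix of a large file rather than the whole content.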
Extract text content from document bytes.
Value parameters
- content: Raw document bytes
- filename: Filename for type detection (e.g., "report.pdf"). Used to detect format if mimeType is not provided.
- mimeType: Optional MIME type override. If provided, skips detection.
Attributes
- Returns: Extracted document with text, metadata, and detected format
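The interaction between `filename` and the optional `mimeType` override can be shown with a small sketch. Everything here is hypothetical: `ExtractedSketch`, `BytesExtractor`, `extractSketch`, and the `Either` return type are illustrative assumptions, and only plain UTF-8 text is handled.

```scala
import java.nio.charset.StandardCharsets
import java.util.Locale

// Hypothetical result type mirroring "text, metadata, and detected format".
final case class ExtractedSketch(text: String, metadata: Map[String, String], format: String)

object BytesExtractor {
  // Placeholder detection by extension only; a real detector would also inspect the bytes.
  private def detectByName(filename: String): String =
    if (filename.toLowerCase(Locale.ROOT).endsWith(".txt")) "text/plain"
    else "application/octet-stream"

  def extractSketch(
      content: Array[Byte],
      filename: String,
      mimeType: Option[String] = None
  ): Either[String, ExtractedSketch] = {
    // An explicit MIME type takes precedence, so detection is skipped.
    val format = mimeType.getOrElse(detectByName(filename))
    format match {
      case "text/plain" =>
        val text = new String(content, StandardCharsets.UTF_8)
        Right(ExtractedSketch(text, Map("filename" -> filename), format))
      case other =>
        Left(s"Unsupported format in this sketch: $other")
    }
  }
}
```

With this shape, `extractSketch(bytes, "note.bin")` fails because the extension is unknown, while `extractSketch(bytes, "note.bin", Some("text/plain"))` succeeds because the override bypasses detection.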
Extract text content from an InputStream.
Use this for large files to avoid loading the entire content into memory. The caller is responsible for closing the stream after this method returns.
Value parameters
- filename: Filename for type detection
- input: InputStream to read from
- mimeType: Optional MIME type override
Attributes
- Returns: Extracted document with text, metadata, and detected format
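Since the caller owns the stream, one way to guarantee it is closed is to wrap the call in `scala.util.Using` (Scala 2.13+). The sketch below is an assumption-laden illustration: `StreamExtractor`, `extractFromStreamSketch`, and `extractAndClose` are hypothetical names, and the stand-in extractor only decodes UTF-8 text in chunks rather than delegating to a real parser.

```scala
import java.io.{InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import scala.util.{Try, Using}

object StreamExtractor {
  // Hypothetical stand-in: decodes UTF-8 text chunk by chunk and does NOT close `input`.
  // `filename` is unused here and kept only for parity with the documented parameters.
  def extractFromStreamSketch(input: InputStream, filename: String): String = {
    val reader = new InputStreamReader(input, StandardCharsets.UTF_8)
    val buffer = new Array[Char](8192)
    val out = new StringBuilder
    var n = reader.read(buffer)
    while (n != -1) {
      out.appendAll(buffer, 0, n)
      n = reader.read(buffer)
    }
    out.toString
  }

  // The caller owns the stream; Using closes it after extraction, even if extraction throws.
  def extractAndClose(open: () => InputStream, filename: String): Try[String] =
    Using(open())(in => extractFromStreamSketch(in, filename))
}
```

A call such as `StreamExtractor.extractAndClose(() => new java.io.FileInputStream("big.txt"), "big.txt")` returns a `Try[String]` and releases the file handle in both the success and failure cases.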