UniversalExtractor
Extracts text and multimedia content from files of various formats.
MIME type detection is performed by Apache Tika. Supported formats include:
- PDF (via PDFBox)
- DOCX (via Apache POI)
- Plain text and other text types
- Images (via ImageIO)
- Unknown types (Tika fallback)
Three entry points are provided:
- extract -- text-only extraction from a file path
- extractAny -- multimedia-aware extraction returning an Extracted ADT
- extractFromBytes / extractFromStream -- source-agnostic text extraction from raw bytes or an InputStream
Attributes
- Supertypes
-
class Object, trait Matchable, class Any
- Self type
-
UniversalExtractor.type
Members list
Type members
Classlikes
Extracted audio content as mono PCM samples with a sample rate.
Attributes
- Supertypes
-
trait Serializable, trait Product, trait Equals, trait Extracted, class Object, trait Matchable, class Any
ADT representing extracted content from a file, discriminated by media type.
Attributes
- Supertypes
-
class Object, trait Matchable, class Any
- Known subtypes
Extracted image content.
Attributes
- Supertypes
-
trait Serializable, trait Product, trait Equals, trait Extracted, class Object, trait Matchable, class Any
Extracted text content (from PDF, DOCX, plain text, etc.).
Attributes
- Supertypes
-
trait Serializable, trait Product, trait Equals, trait Extracted, class Object, trait Matchable, class Any
Extracted video content as a sequence of frames at a given frame rate.
Attributes
- Supertypes
-
trait Serializable, trait Product, trait Equals, trait Extracted, class Object, trait Matchable, class Any
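The four class-likes above form a closed family under Extracted, so callers would typically consume a result with an exhaustive pattern match. The sketch below is illustrative only: the case names (Text, Image, Audio, Video) and their fields are hypothetical stand-ins, since the real names and field types are not shown on this page.

```scala
// Hypothetical stand-ins for the documented ADT; the real case names and
// field types are not shown on this page and may differ.
object ExtractedSketch {
  sealed trait Extracted
  final case class Text(content: String) extends Extracted
  final case class Image(width: Int, height: Int) extends Extracted
  final case class Audio(samples: Array[Float], sampleRate: Int) extends Extracted
  final case class Video(frameCount: Int, fps: Double) extends Extracted

  // Exhaustive match: because the trait is sealed, the compiler warns
  // if a media type is left unhandled.
  def describe(e: Extracted): String = e match {
    case Text(content)        => s"text, ${content.length} chars"
    case Image(w, h)          => s"image, ${w}x${h}"
    case Audio(samples, rate) => s"audio, ${samples.length} samples at $rate Hz"
    case Video(frames, fps)   => s"video, $frames frames at $fps fps"
  }
}
```

Sealing the trait is what makes the match checkable at compile time; adding a fifth media type would then surface every call site that needs updating.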
Value members
Concrete methods
Detect MIME type from bytes and filename.
Value parameters
- content
-
Raw document bytes (first few KB are sufficient)
- filename
-
Filename hint for detection
Attributes
- Returns
-
Detected MIME type string
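To illustrate why a short byte prefix plus a filename hint is usually enough, here is a minimal, dependency-free sketch of signature-based sniffing. It is a simplified stand-in and not Tika's algorithm; the object name MimeSniffSketch and the method sniff are hypothetical, and only three formats are recognized.

```scala
object MimeSniffSketch {
  // True when `content` begins with the given ASCII signature.
  private def startsWith(content: Array[Byte], magic: String): Boolean = {
    val sig = magic.getBytes("US-ASCII")
    content.length >= sig.length && content.take(sig.length).sameElements(sig)
  }

  // Checks leading magic bytes first, then falls back to the filename hint.
  def sniff(content: Array[Byte], filename: String): String = {
    val lower = filename.toLowerCase
    if (startsWith(content, "%PDF-")) "application/pdf"
    // DOCX is a ZIP container ("PK" signature); the extension disambiguates it
    // from other ZIP-based formats in this simplified sketch.
    else if (startsWith(content, "PK") && lower.endsWith(".docx"))
      "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    else if (lower.endsWith(".txt")) "text/plain"
    else "application/octet-stream"
  }
}
```

Only the first few bytes are inspected here, which is why passing a truncated prefix of a large document does not change the result.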
Extract text content from a file at the given path.
Supports PDF, DOCX, plain text, and Tika-parseable formats. Returns an error for files that cannot be found or whose MIME type is unsupported.
Value parameters
- inputPath
-
path to the file (quotes and whitespace are trimmed)
Attributes
- Returns
-
extracted text or an ExtractorError
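The path handling described above (trim quotes and whitespace, then report a missing file as an error value) might look roughly like the following. This is a sketch under assumptions: the helper names are hypothetical, a plain String stands in for ExtractorError, and the exact trimming rules of the real method are not documented here.

```scala
import java.nio.file.{Files, Path, Paths}

object PathInputSketch {
  // Trim surrounding whitespace, then one leading/trailing double or single quote.
  def normalize(inputPath: String): String =
    inputPath.trim
      .stripPrefix("\"").stripSuffix("\"")
      .stripPrefix("'").stripSuffix("'")
      .trim

  // Left carries an error message (stand-in for ExtractorError);
  // Right carries the resolved path of an existing regular file.
  def resolve(inputPath: String): Either[String, Path] = {
    val path = Paths.get(normalize(inputPath))
    if (Files.isRegularFile(path)) Right(path)
    else Left(s"File not found: $path")
  }
}
```

Trimming quotes up front is convenient for paths pasted from a shell or dragged into a terminal, where they often arrive wrapped in quotes.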
Extract content from a file, returning the appropriate Extracted subtype based on MIME type (text, image, audio, or video).
Extract text from raw bytes.
This method enables source-agnostic document extraction: the same extraction logic can be used for documents from S3, HTTP responses, databases, and so on.
Value parameters
- content
-
Raw document bytes
- filename
-
Filename for MIME type detection (e.g., "report.pdf")
- mimeType
-
Optional explicit MIME type (skips detection if provided)
Attributes
- Returns
-
Extracted text content or an error
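The "skips detection if provided" behaviour of the mimeType parameter can be sketched as a by-name fallback. The names below are hypothetical, and it is an assumption that the optional parameter is modelled as Option[String].

```scala
object MimeResolutionSketch {
  // `detect` is passed by name, so it is evaluated only when no explicit
  // MIME type was supplied by the caller.
  def resolveMime(mimeType: Option[String])(detect: => String): String =
    mimeType.getOrElse(detect)
}
```

Supplying the type explicitly is useful when the source already knows it, for example from an HTTP Content-Type header or object-store metadata.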
Extract text from an InputStream.
Note: This reads the entire stream into memory for processing. The caller is responsible for closing the stream after this method returns.
Value parameters
- filename
-
Filename for MIME type detection
- input
-
InputStream to read from
- mimeType
-
Optional explicit MIME type
Attributes
- Returns
-
Extracted text content or an error
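The ownership rule in the note above (the method buffers the whole stream, the caller closes it) can be sketched as follows. The helper names are hypothetical, and InputStream.readAllBytes requires Java 9 or later; the real method may buffer differently.

```scala
import java.io.InputStream

object StreamReadSketch {
  // Buffers the whole stream in memory; deliberately does not close `input`.
  def readFully(input: InputStream): Array[Byte] = input.readAllBytes()

  // Caller-side pattern: open the stream, hand it over, and always close it.
  def withStream[A](open: => InputStream)(use: InputStream => A): A = {
    val in = open
    try use(in) finally in.close()
  }
}
```

Because the full content is held in memory, very large inputs may be better handled by the file-path entry point or by streaming them to a temporary file first.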
Check whether a MIME type represents text-extractable content (PDF, DOCX, text types, JSON, XML).
Value parameters
- mime
-
the MIME type string to check
Attributes
- Returns
-
true if the MIME type can be processed as text
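A predicate matching the description above could be sketched like this. The object and method names are hypothetical, and the exact set of MIME strings accepted by the real check is an assumption based on the formats listed (PDF, DOCX, text types, JSON, XML).

```scala
object TextMimeSketch {
  // Non-"text/" types that are still treated as text-extractable.
  private val textLike = Set(
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/json",
    "application/xml"
  )

  def isTextLike(mime: String): Boolean = {
    // Drop parameters such as "; charset=utf-8" before comparing.
    val base = mime.takeWhile(_ != ';').trim.toLowerCase
    base.startsWith("text/") || textLike.contains(base)
  }
}
```

A caller could use such a check to decide between the text-only entry points and the multimedia-aware one before reading a large file.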