UniversalExtractor

org.llm4s.llmconnect.extractors.UniversalExtractor

Extracts text and multimedia content from files of various formats.

MIME type detection is performed by Apache Tika. Supported formats include:

  • PDF (via PDFBox)
  • DOCX (via Apache POI)
  • Plain text and other text types
  • Images (via ImageIO)
  • Unknown types (Tika fallback)

Three entry points are provided:

  • extract -- text-only extraction from a file path
  • extractAny -- multimedia-aware extraction returning an Extracted ADT
  • extractFromBytes / extractFromStream -- source-agnostic text extraction from raw bytes or an InputStream

Attributes

Supertypes
class Object
trait Matchable
class Any

Members list

Type members

Classlikes

final case class AudioContent(samples: Array[Float], sampleRate: Int) extends Extracted

Extracted audio content as mono PCM samples with a sample rate.

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
trait Extracted
class Object
trait Matchable
class Any
sealed trait Extracted

ADT representing extracted content from a file, discriminated by media type.

Attributes

Supertypes
class Object
trait Matchable
class Any
Known subtypes
class AudioContent
class ImageContent
class TextContent
class VideoContent
final case class ImageContent(image: BufferedImage) extends Extracted

Extracted image content.

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
trait Extracted
class Object
trait Matchable
class Any
final case class TextContent(text: String) extends Extracted

Extracted text content (from PDF, DOCX, plain text, etc.).

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
trait Extracted
class Object
trait Matchable
class Any
final case class VideoContent(frames: Seq[BufferedImage], fps: Int) extends Extracted

Extracted video content as a sequence of frames at a given frame rate.

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
trait Extracted
class Object
trait Matchable
class Any

Value members

Concrete methods

def detectMimeType(content: Array[Byte], filename: String): String

Detect MIME type from bytes and filename.

Value parameters

content

Raw document bytes (first few KB are sufficient)

filename

Filename hint for detection

Attributes

Returns

Detected MIME type string
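
Because the first few KB are sufficient for detection, a caller can avoid loading a large file in full. The following is a minimal sketch, assuming llm4s is on the classpath and that UniversalExtractor is an object whose methods can be called directly; readPrefix, the 8 KB limit, and the default path are hypothetical choices made for illustration.

```scala
import java.nio.file.{Files, Paths}
import org.llm4s.llmconnect.extractors.UniversalExtractor

object DetectMimeExample {
  // Hypothetical helper: read at most `limit` bytes from the start of a file.
  def readPrefix(path: String, limit: Int): Array[Byte] = {
    val in = Files.newInputStream(Paths.get(path))
    try in.readNBytes(limit) // java.io.InputStream.readNBytes requires Java 11+
    finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val path = args.headOption.getOrElse("report.pdf") // hypothetical default path
    val head = readPrefix(path, 8 * 1024)               // 8 KB is an assumed prefix size
    val name = Paths.get(path).getFileName.toString     // filename hint for detection
    val mime = UniversalExtractor.detectMimeType(head, name)
    println(s"$path -> $mime")
  }
}
```

The detected string can then be passed as the mimeType argument of extractFromBytes so that detection is not repeated.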

def extract(inputPath: String): Either[ExtractorError, String]

Extract text content from a file at the given path.

Supports PDF, DOCX, plain text, and Tika-parseable formats. Returns an error for files that cannot be found or whose MIME type is unsupported.

Value parameters

inputPath

path to the file (quotes and whitespace are trimmed)

Attributes

Returns

extracted text or an ExtractorError
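
A typical call site handles both sides of the returned Either. The sketch below assumes llm4s is on the classpath; summarize is a hypothetical helper, and the error is rendered with string interpolation because the fields of ExtractorError are not described here.

```scala
import org.llm4s.llmconnect.extractors.UniversalExtractor

object ExtractTextExample {
  // Hypothetical helper: summarise the outcome of a text extraction in one line.
  def summarize(path: String): String =
    UniversalExtractor.extract(path) match {
      case Right(text) => s"extracted ${text.length} characters from $path"
      case Left(err)   => s"extraction failed for $path: $err"
    }

  def main(args: Array[String]): Unit =
    // Each command-line argument is treated as a file path.
    args.foreach(p => println(summarize(p)))
}
```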

def extractAny(inputPath: String): Either[ExtractorError, Extracted]

Extract content from a file, returning the appropriate Extracted subtype based on MIME type (text, image, audio, or video).

Value parameters

inputPath

path to the file (quotes and whitespace are trimmed)

Attributes

Returns

extracted content as an Extracted ADT variant, or an ExtractorError
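
Since Extracted is a sealed trait, the result can be handled with an exhaustive pattern match over its four case classes. This is a sketch under the assumption that the case classes are nested in the UniversalExtractor object, as their placement on this page suggests; describe is a hypothetical helper.

```scala
import org.llm4s.llmconnect.extractors.UniversalExtractor
import org.llm4s.llmconnect.extractors.UniversalExtractor._

object ExtractAnyExample {
  // Hypothetical helper: one-line description of each Extracted variant.
  def describe(content: Extracted): String = content match {
    case TextContent(text)           => s"text, ${text.length} characters"
    case ImageContent(image)         => s"image, ${image.getWidth}x${image.getHeight} pixels"
    case AudioContent(samples, rate) => s"audio, ${samples.length} samples at $rate Hz"
    case VideoContent(frames, fps)   => s"video, ${frames.size} frames at $fps fps"
  }

  def main(args: Array[String]): Unit =
    args.foreach { path =>
      UniversalExtractor.extractAny(path) match {
        case Right(content) => println(s"$path: ${describe(content)}")
        case Left(err)      => println(s"$path: failed ($err)")
      }
    }
}
```

If a new subtype of Extracted is added later, the compiler's exhaustiveness warning on describe points at the match that needs updating.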

def extractFromBytes(content: Array[Byte], filename: String, mimeType: Option[String]): Either[ExtractorError, String]

Extract text from raw bytes.

This method enables source-agnostic document extraction: the same extraction logic can be applied to documents fetched from S3, HTTP responses, databases, and other sources.

Value parameters

content

Raw document bytes

filename

Filename for MIME type detection (e.g., "report.pdf")

mimeType

Optional explicit MIME type (skips detection if provided)

Attributes

Returns

Extracted text content or an error
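
The sketch below illustrates the two ways of calling this method: with an explicit MIME type, and with None so that detection runs on the bytes and filename. It assumes llm4s is on the classpath; the sample payload, the filename notes.txt, and the render helper are hypothetical.

```scala
import java.nio.charset.StandardCharsets
import org.llm4s.llmconnect.extractors.UniversalExtractor

object ExtractFromBytesExample {
  // Hypothetical helper: render either outcome as a single line.
  // Either is covariant, so any error type is accepted as Any here.
  def render(result: Either[Any, String]): String = result match {
    case Right(text) => s"ok: ${text.length} characters"
    case Left(err)   => s"error: $err"
  }

  def main(args: Array[String]): Unit = {
    // In practice these bytes might come from S3, an HTTP response, or a database.
    val bytes = "Quarterly summary: revenue increased.".getBytes(StandardCharsets.UTF_8)

    // Explicit MIME type: detection is skipped.
    val explicit = UniversalExtractor.extractFromBytes(bytes, "notes.txt", Some("text/plain"))

    // No MIME type: the extractor detects it from the bytes and filename.
    val detected = UniversalExtractor.extractFromBytes(bytes, "notes.txt", None)

    println(render(explicit))
    println(render(detected))
  }
}
```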

def extractFromStream(input: InputStream, filename: String, mimeType: Option[String]): Either[ExtractorError, String]

Extract text from an InputStream.

Note: This reads the entire stream into memory for processing. The caller is responsible for closing the stream after this method returns.

Value parameters

filename

Filename for MIME type detection

input

InputStream to read from

mimeType

Optional explicit MIME type

Attributes

Returns

Extracted text content or an error
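
Because the caller owns the stream, one way to guarantee it is closed is to wrap the call in scala.util.Using (Scala 2.13 or later). This is a minimal sketch assuming llm4s is on the classpath; extractAndClose is a hypothetical helper, not part of the library.

```scala
import java.io.InputStream
import java.nio.file.{Files, Paths}
import scala.util.Using
import org.llm4s.llmconnect.extractors.UniversalExtractor

object ExtractFromStreamExample {
  // Hypothetical helper: run the extraction, then close the stream even if
  // the extraction throws. MIME type detection is left to the extractor (None).
  def extractAndClose(in: InputStream, filename: String) =
    Using.resource(in) { stream =>
      UniversalExtractor.extractFromStream(stream, filename, None)
    }

  def main(args: Array[String]): Unit =
    args.foreach { p =>
      val path = Paths.get(p)
      extractAndClose(Files.newInputStream(path), path.getFileName.toString) match {
        case Right(text) => println(s"$p: ${text.length} characters")
        case Left(err)   => println(s"$p: failed ($err)")
      }
    }
}
```

Since the whole stream is buffered in memory, very large inputs may be better handled by checking their size before calling this method.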

def isTextLike(mime: String): Boolean

Check whether a MIME type represents text-extractable content (PDF, DOCX, text types, JSON, XML).

Value parameters

mime

the MIME type string to check

Attributes

Returns

true if the MIME type can be processed as text
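
One possible use of isTextLike is as a guard before text extraction, so that images, audio, and video are skipped rather than sent through the text path. The sketch below assumes llm4s is on the classpath; extractIfTextLike and its String error type are hypothetical simplifications.

```scala
import org.llm4s.llmconnect.extractors.UniversalExtractor

object RouteByMimeExample {
  // Hypothetical helper: detect the MIME type once, extract text only when the
  // type is text-like, and flatten errors to strings for simplicity.
  def extractIfTextLike(bytes: Array[Byte], filename: String): Either[String, String] = {
    val mime = UniversalExtractor.detectMimeType(bytes, filename)
    if (UniversalExtractor.isTextLike(mime))
      // Pass the detected type along so detection is not repeated.
      UniversalExtractor.extractFromBytes(bytes, filename, Some(mime)).left.map(_.toString)
    else
      Left(s"skipping $filename: $mime is not text-like")
  }

  def main(args: Array[String]): Unit = {
    val bytes = "plain text sample".getBytes("UTF-8")
    println(extractIfTextLike(bytes, "sample.txt"))
  }
}
```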