UniversalEncoder

org.llm4s.llmconnect.encoding.UniversalEncoder

UniversalEncoder extracts content from files of various types and passes it to the appropriate embedding model.

Encodes files of arbitrary MIME types into embedding vector sequences.

MIME type is detected automatically via Apache Tika. Dispatch then depends on the media type:

- Text-like files (plain text, HTML, PDF, source code, …): text is extracted by UniversalExtractor, optionally chunked, then embedded via the supplied EmbeddingClient. Real embeddings are always produced.
- Image / Audio / Video: behaviour depends on experimentalStubsEnabled. When false, the file bytes are read (bounded by maxMediaFileSize) and forwarded to client.embedMultimodal() to obtain real provider embeddings. When true, a deterministic L2-normalised stub vector is returned instead; the vector is seeded from the file name, size, and last-modified time, so the same file always produces the same stub vector.
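
The dispatch step described above can be pictured with a small sketch. It assumes the MIME type has already been detected and is available as a string; `Modality`, `classify`, and the list of text-like `application/*` types are hypothetical names chosen for illustration, not part of the documented API.

```scala
// Illustrative sketch only: Modality and classify are hypothetical names,
// and the set of text-like application types is an assumption.
object DispatchSketch {
  sealed trait Modality
  case object Text  extends Modality
  case object Image extends Modality
  case object Audio extends Modality
  case object Video extends Modality

  // MIME types outside text/* that still carry extractable text (assumed list).
  private val textLikeApplicationTypes = Set(
    "application/pdf",
    "application/json",
    "application/xml"
  )

  /** Maps a detected MIME type to a modality, or None when it is unsupported. */
  def classify(mime: String): Option[Modality] = {
    // Drop parameters such as "; charset=UTF-8" before matching.
    val normalized = mime.toLowerCase.takeWhile(_ != ';').trim
    if (normalized.startsWith("text/") || textLikeApplicationTypes.contains(normalized)) Some(Text)
    else if (normalized.startsWith("image/")) Some(Image)
    else if (normalized.startsWith("audio/")) Some(Audio)
    else if (normalized.startsWith("video/")) Some(Video)
    else None
  }
}
```

In this sketch an unsupported type yields `None`, which a caller would translate into the `Left(EmbeddingError)` case described for the encoding method.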

== Stub dimensions ==

Stub vectors are capped at MAX_STUB_DIMENSION (8 192) regardless of the configured model dimension, to prevent OOM errors during testing.

== Modality disambiguation ==

Each modality (image, audio, video) uses a different XOR seed constant when generating stub vectors, so stubs for the same file differ across modalities.
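
To make the stub behaviour concrete, here is a minimal sketch of how a deterministic, L2-normalised, dimension-capped stub vector could be produced. The seed-mixing formula, the three XOR constants, and the use of `scala.util.Random` are assumptions made for illustration; only the 8 192 cap, the seed inputs (file name, size, last-modified time), and the per-modality XOR idea come from the description above.

```scala
import scala.util.Random

// Illustrative sketch only: the XOR constants and the seed-mixing formula
// are hypothetical, not the values used by UniversalEncoder.
object StubSketch {
  val MaxStubDimension = 8192 // mirrors the documented MAX_STUB_DIMENSION cap

  val ImageSeed = 0x1111111111111111L
  val AudioSeed = 0x2222222222222222L
  val VideoSeed = 0x3333333333333333L

  def stubVector(
      fileName: String,
      size: Long,
      lastModified: Long,
      modalitySeed: Long,
      modelDimension: Int
  ): Vector[Float] = {
    // Cap the dimension so a large configured model size cannot exhaust memory.
    val dim = math.min(modelDimension, MaxStubDimension)
    // Fold the file metadata into one seed, then XOR in the modality constant.
    val seed = (fileName.hashCode.toLong * 31 + size) * 31 + lastModified
    val rng  = new Random(seed ^ modalitySeed)
    val raw  = Vector.fill(dim)(rng.nextGaussian())
    val norm = math.sqrt(raw.map(x => x * x).sum)
    // Divide by the Euclidean norm; skip the division if the norm is zero.
    if (norm == 0.0) raw.map(_.toFloat) else raw.map(x => (x / norm).toFloat)
  }
}
```

Because the random generator is seeded only from the file metadata and the modality constant, repeated calls with the same inputs return the same vector, while switching the constant changes the output for the same file.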

Attributes

Supertypes: class Object, trait Matchable, class Any
Self type: UniversalEncoder.type

Members list

Type members

Classlikes

Controls how extracted text is split before embedding.

Value parameters

enabled: When false, the full extracted text is embedded as a single unit.
overlap: Number of characters shared between adjacent chunks, to preserve context at chunk boundaries.
size: Target chunk size in characters.
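
As a rough illustration of how the three parameters interact, the sketch below splits a string into fixed-size character windows that share `overlap` characters with their neighbour. `ChunkSettings` and `chunk` are hypothetical stand-ins; the actual splitting strategy used by the library may differ (for example, it might respect word or sentence boundaries).

```scala
// Illustrative sketch only: ChunkSettings and chunk are hypothetical names,
// and the plain sliding-window strategy is an assumption.
object ChunkingSketch {
  final case class ChunkSettings(enabled: Boolean, size: Int, overlap: Int)

  def chunk(text: String, cfg: ChunkSettings): Vector[String] =
    if (!cfg.enabled || text.length <= cfg.size) Vector(text) // embed as a single unit
    else {
      require(cfg.size > 0, "size must be positive")
      require(cfg.overlap >= 0 && cfg.overlap < cfg.size, "overlap must be in [0, size)")
      val step = cfg.size - cfg.overlap
      // Start a new chunk every `step` characters and stop once the previous
      // chunk has already reached the end of the text.
      Iterator
        .iterate(0)(_ + step)
        .takeWhile(start => start + cfg.overlap < text.length)
        .map(start => text.substring(start, math.min(start + cfg.size, text.length)))
        .toVector
    }
}
```

With `size = 4` and `overlap = 2`, the input `"abcdefghij"` yields `"abcd"`, `"cdef"`, `"efgh"`, `"ghij"`: each chunk repeats the last two characters of the one before it.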

Attributes

Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any

Value members

Concrete methods

Encodes the file at path into one or more embedding vectors.

Value parameters

chunking: Text chunking settings; if enabled, the extracted text is split before embedding.
client: Embedding client used for text files and for multimodal files when experimentalStubsEnabled is false.
experimentalStubsEnabled: When false, image/audio/video files are read and forwarded to client.embedMultimodal() for real provider embeddings. When true, deterministic stub vectors are returned.
localModels: Model configurations for image, audio, and video.
maxMediaFileSize: Maximum allowed media file size in bytes. Files exceeding this limit are rejected with an error. Defaults to 50 MB.
path: Path to the file to encode; must exist and be a regular file.
textModel: Model configuration (name + dimensions) forwarded to client.

Attributes

Returns: Right(vectors) with one vector per text chunk, or one vector per non-text file. Left(EmbeddingError) when the file does not exist, the MIME type is unsupported, or the media file exceeds maxMediaFileSize.
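
The path and size preconditions above lend themselves to an `Either`-based check. The sketch below shows one way such validation might look for a media file; `SketchError` is a hypothetical stand-in for EmbeddingError, `validateMedia` is an invented helper, and interpreting "50 MB" as 50 × 1024 × 1024 bytes is an assumption.

```scala
import java.nio.file.{Files, Path}

// Illustrative sketch only: SketchError and validateMedia are hypothetical
// names and do not reflect the library's actual implementation.
object ValidationSketch {
  final case class SketchError(message: String)

  // 50 MB, assuming binary megabytes.
  val DefaultMaxMediaFileSize: Long = 50L * 1024 * 1024

  def validateMedia(
      path: Path,
      maxMediaFileSize: Long = DefaultMaxMediaFileSize
  ): Either[SketchError, Path] =
    if (!Files.exists(path)) Left(SketchError(s"File not found: $path"))
    else if (!Files.isRegularFile(path)) Left(SketchError(s"Not a regular file: $path"))
    else if (Files.size(path) > maxMediaFileSize)
      Left(SketchError(s"File exceeds $maxMediaFileSize bytes: $path"))
    else Right(path)
}
```

Returning `Left` instead of throwing lets a caller handle every failure case in a single pattern match or `fold` on the result.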

Concrete fields
