UniversalEncoder extracts content from files of various types and passes it to the appropriate embedding model.
Encodes files of arbitrary MIME types into embedding vector sequences.
The MIME type is detected automatically via Apache Tika. Dispatch then depends on the media type:
Text-like files (plain text, HTML, PDF, source code, …): text is extracted by UniversalExtractor, optionally chunked, then embedded via the supplied EmbeddingClient. Real embeddings are always produced.
Image / Audio / Video: behaviour depends on experimentalStubsEnabled. When false, the file bytes are read (bounded by maxMediaFileSize) and forwarded to client.embedMultimodal() to obtain real provider embeddings. When true, a deterministic L2-normalised stub vector is returned instead; the vector is seeded from the file name, size, and last-modified time, so the same file always produces the same stub vector.
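The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the library's implementation: `Route`, `TextRoute`, `MediaRoute`, `Unsupported`, `DispatchSketch`, and the set of text-like `application/*` types are all hypothetical names and assumptions, and the MIME type is taken as an already-detected string rather than obtained from Tika.

```scala
// Hypothetical routing types; the real encoder may model this differently.
sealed trait Route
case object TextRoute extends Route
final case class MediaRoute(modality: String, useStub: Boolean) extends Route
case object Unsupported extends Route

object DispatchSketch {
  // Assumption: a few non-"text/*" MIME types that are still handled as text-like.
  private val textLikeApplicationTypes: Set[String] =
    Set("application/pdf", "application/json", "application/xml")

  // Chooses a route from an already-detected MIME type string.
  def route(mimeType: String, experimentalStubsEnabled: Boolean): Route = {
    val mt = mimeType.toLowerCase
    if (mt.startsWith("text/") || textLikeApplicationTypes.contains(mt)) TextRoute
    else if (mt.startsWith("image/")) MediaRoute("image", experimentalStubsEnabled)
    else if (mt.startsWith("audio/")) MediaRoute("audio", experimentalStubsEnabled)
    else if (mt.startsWith("video/")) MediaRoute("video", experimentalStubsEnabled)
    else Unsupported
  }
}
```

In this sketch the text route ignores experimentalStubsEnabled entirely, mirroring the statement that real embeddings are always produced for text-like files.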
== Stub dimensions ==
Stub vectors are capped at MAX_STUB_DIMENSION (8192) regardless of the configured model dimension, to prevent out-of-memory errors during testing.
== Modality disambiguation ==
Each modality (image, audio, video) uses a different XOR seed constant when generating stub vectors, so stubs for the same file differ across modalities.
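One way to picture stub generation is the sketch below. It is illustrative only: the per-modality XOR constants, the way the file name, size, and last-modified time are combined into a seed, and the use of a Gaussian random source are all assumptions, and `StubSketch`, `stubVector`, and `modalitySeeds` are hypothetical names. Only the dimension cap, the XOR-per-modality idea, and the L2 normalisation come from the description above.

```scala
import scala.util.Random

object StubSketch {
  // Cap taken from the documentation (MAX_STUB_DIMENSION).
  val MaxStubDimension: Int = 8192

  // Hypothetical per-modality XOR constants; the real values are not documented here.
  private val modalitySeeds: Map[String, Long] = Map(
    "image" -> 0x1111111111111111L,
    "audio" -> 0x2222222222222222L,
    "video" -> 0x3333333333333333L
  )

  def stubVector(
      fileName: String,
      size: Long,
      lastModified: Long,
      modality: String,
      modelDim: Int
  ): Vector[Float] = {
    // Clamp the requested dimension to [1, MaxStubDimension].
    val dim = math.max(1, math.min(modelDim, MaxStubDimension))
    // Assumed seed mix of file name, size, and last-modified time.
    val baseSeed = (fileName.hashCode.toLong * 31L + size) * 31L + lastModified
    // XOR with the modality constant so each modality gets a different seed.
    val seed = baseSeed ^ modalitySeeds.getOrElse(modality, 0L)
    val rng = new Random(seed)
    val raw = Vector.fill(dim)(rng.nextGaussian())
    // L2-normalise, guarding against a zero norm.
    val norm = math.sqrt(raw.map(x => x * x).sum)
    val safeNorm = if (norm == 0.0) 1.0 else norm
    raw.map(x => (x / safeNorm).toFloat)
  }
}
```

Because the random generator is seeded only from file metadata and the modality constant, repeated calls with the same inputs return the same vector, and changing the modality changes the seed.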
Encodes the file at path into one or more embedding vectors.
Value parameters
chunking
Text chunking settings; if enabled, the extracted text is split before embedding.
client
Embedding client used for text files and for multimodal files when experimentalStubsEnabled is false.
experimentalStubsEnabled
When false, image/audio/video files are read and forwarded to client.embedMultimodal() for real provider embeddings. When true, deterministic stub vectors are returned.
localModels
Model configurations for image, audio, and video.
maxMediaFileSize
Maximum allowed media file size in bytes. Files exceeding this limit are rejected with an error. Defaults to 50 MB.
path
Path to the file to encode; must exist and be a regular file.
textModel
Model configuration (name + dimensions) forwarded to client.
Attributes
Returns
Right(vectors) with one vector per text chunk, or a single vector for a non-text file. Left(EmbeddingError) when the file does not exist, the MIME type is unsupported, or the media file exceeds maxMediaFileSize.
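The file-related error conditions can be illustrated with a small pre-check that returns an Either in the same style. This is a sketch under stated assumptions: the `EmbeddingError` case class here is a simplified stand-in for the library's type, `ValidationSketch` and `validateMedia` are hypothetical names, and the 50 MB default is interpreted as 50 × 1024 × 1024 bytes, which the documentation does not confirm.

```scala
import java.nio.file.{Files, Path}

// Simplified stand-in for the library's EmbeddingError (assumption).
final case class EmbeddingError(message: String)

object ValidationSketch {
  // Assumed interpretation of the documented 50 MB default.
  val DefaultMaxMediaFileSize: Long = 50L * 1024 * 1024

  // Returns Right(path) when the file may be read, otherwise a Left describing the problem.
  def validateMedia(
      path: Path,
      maxMediaFileSize: Long = DefaultMaxMediaFileSize
  ): Either[EmbeddingError, Path] =
    if (!Files.exists(path))
      Left(EmbeddingError(s"File not found: $path"))
    else if (!Files.isRegularFile(path))
      Left(EmbeddingError(s"Not a regular file: $path"))
    else if (Files.size(path) > maxMediaFileSize)
      Left(EmbeddingError(s"File exceeds $maxMediaFileSize bytes: $path"))
    else
      Right(path)
}
```

A caller would typically pattern match on the result, handling `Left(err)` by logging or propagating `err.message` and `Right(p)` by continuing with the read.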