UniversalEncoder

org.llm4s.llmconnect.encoding.UniversalEncoder

UniversalEncoder handles extracting content from various file types and passing it to the appropriate embedding models.

Encodes files of arbitrary MIME types into embedding vector sequences.

MIME type is detected automatically via Apache Tika. Dispatch then depends on the media type:

  • Text-like files (plain text, HTML, PDF, source code, …): text is extracted by UniversalExtractor, optionally chunked, then embedded via the supplied EmbeddingClient. Real embeddings are always produced.

  • Image / Audio / Video: behaviour depends on experimentalStubsEnabled. When false, the file bytes are read (bounded by maxMediaFileSize) and forwarded to client.embedMultimodal() to obtain real provider embeddings. When true, a deterministic L2-normalised stub vector is returned instead; the vector is seeded from the file name, size, and last-modified time, so the same file always produces the same stub vector.
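The dispatch described above can be pictured as a small classification step over the detected MIME string. The following is a minimal sketch, not the library's code: `Modality`, `MimeDispatchSketch`, and the set of text-like `application/*` types are hypothetical names and values chosen for illustration, and the MIME string is assumed to have already been produced by Tika.

```scala
import java.util.Locale

// Hypothetical modality model; the library's internal representation is not shown on this page.
sealed trait Modality
object Modality {
  case object Text  extends Modality
  case object Image extends Modality
  case object Audio extends Modality
  case object Video extends Modality
}

object MimeDispatchSketch {
  // Assumed examples of non-"text/*" types that carry extractable text; illustrative only.
  private val textLikeApplicationTypes =
    Set("application/pdf", "application/json", "application/xml")

  /** Maps a MIME type to a modality, or None when the type is not recognised. */
  def classify(mime: String): Option[Modality] = {
    val m = mime.trim.toLowerCase(Locale.ROOT)
    if (m.startsWith("image/")) Some(Modality.Image)
    else if (m.startsWith("audio/")) Some(Modality.Audio)
    else if (m.startsWith("video/")) Some(Modality.Video)
    else if (m.startsWith("text/") || textLikeApplicationTypes.contains(m)) Some(Modality.Text)
    else None // the caller would report an unsupported MIME type here
  }
}
```

Returning `Option` keeps the unsupported case explicit, which mirrors the documented behaviour of failing on unsupported MIME types.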

== Stub dimensions ==

Stub vectors are capped at MAX_STUB_DIMENSION (8192) regardless of the configured model dimension, to prevent OOM errors during testing.
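As an illustration of the cap, the sketch below generates a deterministic, L2-normalised vector whose length never exceeds 8192. It is an assumption-laden example: `StubVectorSketch`, `MaxStubDimension`, the use of `scala.util.Random` with Gaussian samples, and the `Array[Float]` element type are all hypothetical choices, not the library's documented implementation.

```scala
import scala.util.Random

object StubVectorSketch {
  // Mirrors the documented cap; the name here is local to this sketch.
  val MaxStubDimension: Int = 8192

  /** Builds a deterministic unit-length vector from a seed, capping the dimension. */
  def stubVector(seed: Long, configuredDim: Int): Array[Float] = {
    // Clamp to [1, MaxStubDimension] so a huge configured dimension cannot exhaust memory.
    val dim = math.max(1, math.min(configuredDim, MaxStubDimension))
    val rng = new Random(seed) // same seed, same sequence of samples
    val raw = Array.fill(dim)(rng.nextGaussian())
    val norm = math.sqrt(raw.map(x => x * x).sum)
    // Divide by the Euclidean norm so the result has length 1 (L2 normalisation).
    if (norm == 0.0) raw.map(_ => 0.0f) else raw.map(x => (x / norm).toFloat)
  }
}
```

Calling `StubVectorSketch.stubVector(seed, 1000000)` therefore yields 8192 elements rather than one million.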

== Modality disambiguation ==

Each modality (image, audio, video) uses a different XOR seed constant when generating stub vectors, so stubs for the same file differ across modalities.
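One way to picture the disambiguation is to derive a base seed from the file metadata and XOR it with a per-modality constant. The sketch below assumes this shape; the constant values, the hash combination, and all names (`StubSeedSketch`, `ImageXor`, `seedFor`, and so on) are hypothetical and are not taken from the library.

```scala
object StubSeedSketch {
  // Hypothetical per-modality constants; any three distinct values would do.
  val ImageXor: Long = 0x1111111111111111L
  val AudioXor: Long = 0x2222222222222222L
  val VideoXor: Long = 0x3333333333333333L

  /** Combines file name, size, and last-modified time into one base seed (assumed scheme). */
  def baseSeed(fileName: String, sizeBytes: Long, lastModifiedMillis: Long): Long =
    (fileName.hashCode.toLong * 31L + sizeBytes) * 31L + lastModifiedMillis

  /** XORs the base seed with a modality constant so each modality gets its own seed. */
  def seedFor(fileName: String, sizeBytes: Long, lastModifiedMillis: Long, modalityXor: Long): Long =
    baseSeed(fileName, sizeBytes, lastModifiedMillis) ^ modalityXor
}
```

Because XOR with two different constants always gives two different results for the same base seed, identical file metadata still produces distinct seeds per modality.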

Attributes

Supertypes
class Object
trait Matchable
class Any

Members list

Type members

Classlikes

final case class TextChunkingConfig(enabled: Boolean, size: Int, overlap: Int)

Controls how extracted text is split before embedding.

Value parameters

enabled

When false, the full extracted text is embedded as a single unit.

overlap

Number of characters shared between adjacent chunks, to preserve context at chunk boundaries.

size

Target chunk size in characters.
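To make the interaction of size and overlap concrete, here is a minimal character-based chunker. It is a sketch under assumptions: the case class is redeclared locally so the example is self-contained, `ChunkingSketch.chunk` is a hypothetical helper, and the library's actual splitting strategy (for example, whether it respects word boundaries) may differ.

```scala
object ChunkingSketch {
  // Local copy of the documented case class, so the sketch compiles on its own.
  final case class TextChunkingConfig(enabled: Boolean, size: Int, overlap: Int)

  /** Splits text into windows of `size` characters, each sharing `overlap` characters with the previous one. */
  def chunk(text: String, cfg: TextChunkingConfig): Seq[String] =
    if (!cfg.enabled) Seq(text) // chunking disabled: embed the full text as one unit
    else {
      require(cfg.size > 0, "size must be positive")
      require(cfg.overlap >= 0 && cfg.overlap < cfg.size, "overlap must be in [0, size)")
      val step = cfg.size - cfg.overlap
      Iterator
        .iterate(0)(_ + step)
        // Keep a start offset only while the previous window has not yet reached the end.
        .takeWhile(start => start == 0 || start + cfg.overlap < text.length)
        .map(start => text.substring(start, math.min(start + cfg.size, text.length)))
        .toVector
    }
}
```

With size 4 and overlap 1, the text "abcdefghij" splits into "abcd", "defg", "ghij": each chunk begins with the last character of the one before it.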

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any

Value members

Concrete methods

def encodeFromPath(path: Path, client: EmbeddingClient, textModel: EmbeddingModelConfig, chunking: TextChunkingConfig, experimentalStubsEnabled: Boolean, localModels: LocalEmbeddingModels, maxMediaFileSize: Long): Result[Seq[EmbeddingVector]]

Encodes the file at path into one or more embedding vectors.

Value parameters

chunking

Text chunking settings; if enabled, the extracted text is split before embedding.

client

Embedding client used for text files and for multimodal files when experimentalStubsEnabled is false.

experimentalStubsEnabled

When false, image/audio/video files are read and forwarded to client.embedMultimodal() for real provider embeddings. When true, deterministic stub vectors are returned.

localModels

Model configurations for image, audio, and video.

maxMediaFileSize

Maximum allowed media file size in bytes. Files exceeding this limit are rejected with an error. Defaults to 50 MB.

path

Path to the file to encode; must exist and be a regular file.

textModel

Model configuration (name + dimensions) forwarded to client.

Attributes

Returns

Right(vectors) containing one vector per text chunk for text-like files, or a single vector for an image, audio, or video file. Left(EmbeddingError) when the file does not exist, the MIME type is unsupported, or the media file exceeds maxMediaFileSize.
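The file-related failure cases can be illustrated with a small pre-flight check that returns an `Either`. This is only a sketch: `MediaSizeCheckSketch`, `SketchError`, and `validate` are hypothetical stand-ins (the library's `Result` and `EmbeddingError` types are not reproduced here), and the order of the checks is an assumption.

```scala
import java.nio.file.{Files, Path}

object MediaSizeCheckSketch {
  // Hypothetical error type standing in for the library's EmbeddingError.
  final case class SketchError(message: String)

  /** Returns Right(path) when the file exists, is regular, and fits within the size limit. */
  def validate(path: Path, maxMediaFileSize: Long): Either[SketchError, Path] =
    if (!Files.exists(path)) Left(SketchError(s"File not found: $path"))
    else if (!Files.isRegularFile(path)) Left(SketchError(s"Not a regular file: $path"))
    else if (Files.size(path) > maxMediaFileSize)
      Left(SketchError(s"File exceeds $maxMediaFileSize bytes: $path"))
    else Right(path)
}
```

A caller would typically pattern match on the result, handling `Left` as a reportable error and passing the `Right` value on to the embedding step.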

Concrete fields