org.llm4s.llmconnect.encoding

Members list

Type members

Classlikes

UniversalEncoder handles extracting content from various file types and passing it to the appropriate embedding models.

Encodes files of arbitrary MIME types into embedding vector sequences.

MIME type is detected automatically via Apache Tika. Dispatch then depends on the media type:

  • Text-like files (plain text, HTML, PDF, source code, …): text is extracted by UniversalExtractor, optionally chunked, then embedded via the supplied EmbeddingClient. Real embeddings are always produced.

  • Image / Audio / Video: behaviour depends on experimentalStubsEnabled. When false, the file bytes are read (bounded by maxMediaFileSize) and forwarded to client.embedMultimodal() to obtain real provider embeddings. When true, a deterministic L2-normalised stub vector is returned instead; the vector is seeded from the file name, size, and last-modified time, so the same file always produces the same stub vector.
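The dispatch described above can be pictured as a match on the top-level MIME type. The following is a minimal sketch, not the library's code: `DispatchSketch`, `Route`, `TextLike`, `Media`, and `route` are hypothetical names, and the rule that every non-media type is treated as text-like is an assumption made for illustration.

```scala
import java.util.Locale

object DispatchSketch {
  // Hypothetical routing result; these types are not part of llm4s.
  sealed trait Route
  case object TextLike extends Route
  final case class Media(kind: String) extends Route

  // Assumption: routing keys off the top-level MIME type only,
  // and anything that is not image/audio/video is treated as text-like.
  def route(mimeType: String): Route = {
    val topLevel = mimeType.takeWhile(_ != '/').toLowerCase(Locale.ROOT)
    topLevel match {
      case "image" | "audio" | "video" => Media(topLevel)
      case _                           => TextLike
    }
  }
}
```

Under these assumptions, `DispatchSketch.route("application/pdf")` yields `TextLike`, while `DispatchSketch.route("video/mp4")` yields `Media("video")`.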

Stub dimensions

Stub vectors are capped at MAX_STUB_DIMENSION (8192) regardless of the configured model dimension, to prevent out-of-memory (OOM) errors during testing.
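The cap amounts to taking the smaller of two values. A minimal sketch, in which `StubDimensionSketch` and `stubDimension` are hypothetical names and the constant is a local copy of the documented value rather than a reference to the library's field:

```scala
object StubDimensionSketch {
  // Local copy of the documented cap; the real constant lives in the library.
  val MAX_STUB_DIMENSION: Int = 8192

  // Hypothetical helper: the stub never exceeds the cap,
  // whatever dimension the model is configured with.
  def stubDimension(configuredDimension: Int): Int =
    math.min(configuredDimension, MAX_STUB_DIMENSION)
}
```

A model configured with 1536 dimensions would keep 1536 here, while one configured with 100000 would be reduced to 8192.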

Modality disambiguation

Each modality (image, audio, video) uses a different XOR seed constant when generating stub vectors, so stubs for the same file differ across modalities.
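One way to obtain deterministic, L2-normalised, modality-specific stubs is sketched below. Everything in it is an assumption for illustration: the seed constants are made up, the way file name, size, and last-modified time are combined is a guess, and the use of `scala.util.Random` with Gaussian samples is a stand-in for whatever generator the library actually uses.

```scala
import scala.util.Random

object ModalityStubSketch {
  // Hypothetical per-modality constants; the library's real values are not documented here.
  val seedConstants: Map[String, Long] = Map(
    "image" -> 0x1111111111111111L,
    "audio" -> 0x2222222222222222L,
    "video" -> 0x3333333333333333L
  )

  def stubVector(
    modality: String,
    fileName: String,
    size: Long,
    lastModified: Long,
    dim: Int
  ): Vector[Double] = {
    require(dim > 0, "dim must be positive")
    // Assumed way of combining the file metadata into a base seed.
    val baseSeed = fileName.hashCode.toLong ^ size ^ lastModified
    // XOR with the modality constant so each modality gets its own sequence.
    val rng  = new Random(baseSeed ^ seedConstants(modality))
    val raw  = Vector.fill(dim)(rng.nextGaussian())
    val norm = math.sqrt(raw.map(x => x * x).sum)
    // Scale to unit length (L2 normalisation).
    raw.map(_ / norm)
  }
}
```

Calling `stubVector` twice with the same arguments returns the same vector, while changing only the modality changes the seed and therefore the vector.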

Attributes

Supertypes
class Object
trait Matchable
class Any
Self type