org.llm4s.llmconnect.encoding
Members list
Type members
Classlikes
UniversalEncoder handles extracting content from various file types and passing it to the appropriate embedding models.
Encodes files of arbitrary MIME types into embedding vector sequences.
MIME type is detected automatically via Apache Tika. Dispatch then depends on the media type:
- Text-like files (plain text, HTML, PDF, source code, …): text is extracted by UniversalExtractor, optionally chunked, then embedded via the supplied EmbeddingClient. Real embeddings are always produced.
- Image / Audio / Video: behaviour depends on experimentalStubsEnabled. When false, the file bytes are read (bounded by maxMediaFileSize) and forwarded to client.embedMultimodal() to obtain real provider embeddings. When true, a deterministic L2-normalised stub vector is returned instead; the vector is seeded from the file name, size, and last-modified time, so the same file always produces the same stub vector.
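The dispatch rule above can be sketched in a few lines of plain Scala. This is an illustration only, not the llm4s implementation: the Route type, its cases, and DispatchSketch.route are hypothetical names, the MIME type is passed in as a string instead of being detected with Tika, and the sketch assumes that every non-media type takes the text path, which the documentation does not state.

```scala
// Hypothetical routing sketch; none of these names are part of the llm4s API.
sealed trait Route
case object TextPipeline extends Route        // extract text, chunk, embed
case object MultimodalEmbedding extends Route // real provider embeddings
case object StubVector extends Route          // deterministic test vector

object DispatchSketch {
  private val mediaPrefixes = Seq("image/", "audio/", "video/")

  // Decide which path a file takes, given its already-detected MIME type.
  def route(mimeType: String, experimentalStubsEnabled: Boolean): Route = {
    val mime = mimeType.toLowerCase
    if (mediaPrefixes.exists(p => mime.startsWith(p))) {
      // Media files: the flag selects stub vectors or real embeddings.
      if (experimentalStubsEnabled) StubVector else MultimodalEmbedding
    } else {
      // Assumption: everything else is treated as text-like.
      TextPipeline
    }
  }
}
```

Note that in this sketch the flag has no effect on text-like files, which matches the statement that real embeddings are always produced for them.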
== Stub dimensions ==
Stub vectors are capped at MAX_STUB_DIMENSION (8 192) regardless of the configured model dimension, to prevent OOM errors during testing.
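A minimal sketch of the cap, assuming it is a simple minimum over the configured dimension. The object StubDimensionSketch and the helper effectiveDimension are hypothetical names introduced for illustration; only the constant name and its value come from the documentation above.

```scala
object StubDimensionSketch {
  // Name and value as stated in the documentation above.
  val MAX_STUB_DIMENSION: Int = 8192

  // Hypothetical helper: a stub never allocates more than the cap,
  // whatever dimension the configured model reports.
  def effectiveDimension(configuredDimension: Int): Int =
    math.min(configuredDimension, MAX_STUB_DIMENSION)
}
```

Under this assumption a model configured for 1536 dimensions yields a 1536-element stub, while a misconfigured dimension of 1000000 yields an 8192-element stub.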
== Modality disambiguation ==
Each modality (image, audio, video) uses a different XOR seed constant when generating stub vectors, so stubs for the same file differ across modalities.
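One way such seeding could work is sketched below. Everything here is an assumption made for illustration: the salt constants, the hash used to combine file name, size, and last-modified time, and the use of Gaussian samples are not taken from the llm4s source, which may generate its stubs differently.

```scala
import scala.util.Random

object StubVectorSketch {
  // Illustrative XOR constants; the real values are not given in the docs.
  val ImageSalt: Long = 0x1111111111111111L
  val AudioSalt: Long = 0x2222222222222222L
  val VideoSalt: Long = 0x3333333333333333L

  // Combine the file attributes into one value, then XOR in the modality salt.
  def seed(fileName: String, size: Long, lastModified: Long, salt: Long): Long = {
    val base = (fileName.hashCode.toLong * 31L + size) * 31L + lastModified
    base ^ salt
  }

  // Deterministic pseudo-random vector scaled to unit L2 norm.
  def stub(seedValue: Long, dimension: Int): Vector[Double] = {
    val rng  = new Random(seedValue)
    val raw  = Vector.fill(dimension)(rng.nextGaussian())
    val norm = math.sqrt(raw.map(x => x * x).sum)
    if (norm == 0.0) raw else raw.map(_ / norm)
  }
}
```

Because the salts differ, the same file attributes produce a different seed for each modality, while repeated calls with the same seed reproduce the same vector.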
Attributes
- Supertypes
  class Object, trait Matchable, class Any
- Self type
  UniversalEncoder.type