Tokenizer

org.llm4s.context.tokens.Tokenizer

Factory for obtaining StringTokenizer instances backed by jtokkit BPE encodings.

Encodings are looked up from the jtokkit default registry, which bundles the standard TikToken vocabularies (cl100k_base, o200k_base, etc.). For an unknown tokenizerId, meaning one whose name does not match any bundled vocabulary, the lookup returns None; callers must handle the absent case explicitly.

Attributes

See also:
org.llm4s.identity.TokenizerId for vocabulary name constants
org.llm4s.context.tokens.TokenizerMapping for the model → tokenizer mapping
Supertypes: class Object, trait Matchable, class Any
Self type: Tokenizer.type

Members list

Value members

Concrete methods

Returns a StringTokenizer for the given vocabulary, or None if the vocabulary name is not recognised by the bundled jtokkit registry.

Value parameters

tokenizerId: identifies the BPE vocabulary (e.g. TokenizerId("cl100k_base"))

Attributes

Returns: Some tokenizer when the vocabulary is available; None otherwise
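
The Some/None contract described above can be sketched as follows. This is a minimal, self-contained illustration, not the llm4s implementation: the `TokenizerId` and `StringTokenizer` types below are simplified stand-ins, `TokenizerSketch`, `lookup`, `countTokens`, and `countOrError` are hypothetical names, and whitespace splitting stands in for real BPE encoding.

```scala
// Stand-in types for illustration only; they are NOT the llm4s definitions.
final case class TokenizerId(name: String)

trait StringTokenizer {
  def countTokens(text: String): Int
}

object TokenizerSketch {
  // Hypothetical registry of known vocabulary names.
  private val known: Set[String] = Set("cl100k_base", "o200k_base")

  // Mirrors the documented contract: Some for a known vocabulary, None otherwise.
  def lookup(id: TokenizerId): Option[StringTokenizer] =
    if (known.contains(id.name))
      Some(new StringTokenizer {
        // Whitespace splitting is a placeholder for real BPE tokenization.
        def countTokens(text: String): Int =
          text.split("\\s+").count(_.nonEmpty)
      })
    else None

  // The caller decides what an unknown vocabulary means; here it becomes a Left.
  def countOrError(id: TokenizerId, text: String): Either[String, Int] =
    lookup(id) match {
      case Some(tokenizer) => Right(tokenizer.countTokens(text))
      case None            => Left(s"unknown tokenizer: ${id.name}")
    }
}
```

Converting None into an explicit error value at the call site, as `countOrError` does, is one way to keep the absent case visible; falling back to a default vocabulary with `orElse` is another option, depending on what the caller needs.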
