Tokenizer

org.llm4s.context.tokens.Tokenizer
object Tokenizer

Factory for obtaining StringTokenizer instances backed by jtokkit BPE encodings.

Encodings are looked up from the jtokkit default registry, which bundles the standard TikToken vocabularies (cl100k_base, o200k_base, etc.). A lookup with an unknown tokenizerId, meaning one whose name does not match any bundled vocabulary, returns None; callers must handle the absent case explicitly.

Attributes

See also

org.llm4s.identity.TokenizerId for vocabulary name constants

org.llm4s.context.tokens.TokenizerMapping for the model → tokenizer mapping

Supertypes
class Object
trait Matchable
class Any
Self type
Tokenizer.type

Members list

Value members

Concrete methods

Returns a StringTokenizer for the given vocabulary, or None if the vocabulary name is not recognised by the bundled jtokkit registry.

Value parameters

tokenizerId

identifies the BPE vocabulary (e.g. TokenizerId("cl100k_base"))

Attributes

Returns

Some tokenizer when the vocabulary is available; None otherwise
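The Option-returning contract described above can be modelled with a minimal, self-contained sketch. Every name here (VocabId, SketchTokenizer, SketchRegistry, TokenCounting) is a hypothetical stand-in invented for illustration, not an llm4s or jtokkit type, and a whitespace splitter takes the place of real BPE encoding. The sketch only shows the shape of the lookup and one way a caller might handle the absent case.

```scala
// Stand-ins for illustration only; these are NOT the llm4s or jtokkit types.
final case class VocabId(name: String)

trait SketchTokenizer {
  def countTokens(text: String): Int
}

object SketchRegistry {
  // Assumption: a whitespace splitter stands in for a real BPE encoder.
  private val whitespace: SketchTokenizer = new SketchTokenizer {
    def countTokens(text: String): Int =
      text.split("\\s+").count(_.nonEmpty)
  }

  // Hypothetical set of "bundled" vocabulary names.
  private val bundled: Map[String, SketchTokenizer] =
    Map("cl100k_base" -> whitespace, "o200k_base" -> whitespace)

  // Mirrors the documented contract: Some when the name is bundled, None otherwise.
  def lookup(id: VocabId): Option[SketchTokenizer] = bundled.get(id.name)
}

object TokenCounting {
  // Caller-side handling: the None branch becomes an explicit error value
  // rather than an exception or a silent default.
  def countTokens(id: VocabId, text: String): Either[String, Int] =
    SketchRegistry.lookup(id) match {
      case Some(tokenizer) => Right(tokenizer.countTokens(text))
      case None            => Left(s"Unknown tokenizer vocabulary: ${id.name}")
    }
}
```

Turning None into a Left keeps the failure visible in the return type, so a caller cannot forget the missing-vocabulary case. Other reasonable choices include falling back to a default vocabulary with getOrElse, depending on how strict the caller needs to be.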