org.llm4s.context.tokens
Type members
trait StringTokenizer
Converts a plain string into a sequence of Tokens using a specific BPE vocabulary.
Implementations are backed by jtokkit encodings and obtained via Tokenizer.lookupStringTokenizer; the interface is kept minimal to allow test doubles without a real encoding registry.
Attributes
- Supertypes: class Object, trait Matchable, class Any
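Because the interface is deliberately minimal, a test double needs no encoding registry at all. The sketch below is a hypothetical reconstruction: the method name `tokenize` and the shape of `Token` are assumptions, not taken from the llm4s source.

```scala
// Hypothetical sketch of the minimal tokenizer interface; names are assumed.
final case class Token(id: Int)

trait StringTokenizer {
  def tokenize(text: String): Vector[Token]
}

// A test double that needs no real encoding registry: it assigns one
// synthetic token id per whitespace-separated word.
object WordCountTokenizer extends StringTokenizer {
  def tokenize(text: String): Vector[Token] =
    text.split("\\s+").filter(_.nonEmpty).toVector.zipWithIndex
      .map { case (_, i) => Token(i) }
}
```

A double like this lets token-budgeting logic be unit-tested with predictable counts, independent of any bundled vocabulary.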
case class Token
Represents a single BPE token as an integer vocabulary index.
The integer corresponds to the token identifier used by the underlying jtokkit / tiktoken encoding (e.g. cl100k_base for GPT-4).
Attributes
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
object Tokenizer
Factory for obtaining StringTokenizer instances backed by jtokkit BPE encodings.
Encodings are looked up from the jtokkit default registry, which bundles the standard tiktoken vocabularies (cl100k_base, o200k_base, etc.). An unknown tokenizerId (one whose name does not match any bundled vocabulary) returns None; callers must handle the absent case explicitly.
Attributes
- See also:
  - org.llm4s.identity.TokenizerId for vocabulary name constants
  - org.llm4s.context.tokens.TokenizerMapping for the model → tokenizer mapping
- Supertypes: class Object, trait Matchable, class Any
- Self type: Tokenizer.type
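The lookup-or-None behavior can be illustrated directly against jtokkit's default registry. The signature of Tokenizer.lookupStringTokenizer is not shown on this page, so this sketch uses jtokkit's own API (`Encodings.newDefaultEncodingRegistry` and `getEncoding(String)`, which returns an empty `Optional` for names that match no bundled vocabulary):

```scala
import com.knuddels.jtokkit.Encodings
import scala.jdk.OptionConverters._ // enriches java.util.Optional with .toScala

val registry = Encodings.newDefaultEncodingRegistry()

// Known vocabulary name: Some(encoding)
val known = registry.getEncoding("cl100k_base").toScala

// Unknown vocabulary name: None -- the absent case must be handled explicitly.
val unknown = registry.getEncoding("no-such-vocabulary").toScala

known.foreach(enc => println(enc.countTokens("hello world")))
assert(unknown.isEmpty)
```

Surfacing the miss as an Option (rather than throwing) keeps the failure mode explicit at the call site, matching the contract described above.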
sealed trait TokenizerAccuracy
Represents the accuracy of the tokenizer mapping for a model.
This sealed trait hierarchy captures three levels of accuracy:
- '''Exact''': Native tokenizer available, counts are precise
- '''Approximate''': Using a similar tokenizer, counts may differ
- '''Unknown''': Unknown model, using fallback tokenizer
Attributes
- Companion: object
- Supertypes: class Object, trait Matchable, class Any
- Known subtypes: Exact, Approximate, Unknown
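Based on the three levels listed above, the hierarchy plausibly looks like the following sketch; the exact member shapes and the helper method are assumptions for illustration, not llm4s's implementation:

```scala
// Hypothetical reconstruction of the accuracy ADT; member shapes are assumed.
sealed trait TokenizerAccuracy

object TokenizerAccuracy {
  case object Exact extends TokenizerAccuracy       // native tokenizer, precise counts
  case object Approximate extends TokenizerAccuracy // similar tokenizer, counts may differ
  case object Unknown extends TokenizerAccuracy     // unknown model, fallback tokenizer

  // Example consumer: a human-readable warning for non-exact counts.
  def warning(a: TokenizerAccuracy): Option[String] = a match {
    case Exact       => None
    case Approximate => Some("Token counts are approximate for this model.")
    case Unknown     => Some("Unknown model; counts use a fallback tokenizer.")
  }
}
```

A sealed hierarchy like this lets callers exhaustively pattern-match on the accuracy level, with the compiler flagging any unhandled case.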
object TokenizerAccuracy
Accuracy level variants for tokenizer mappings.
Attributes
- Companion: trait
- Supertypes: trait Sum, trait Mirror, class Object, trait Matchable, class Any
- Self type: TokenizerAccuracy.type
object TokenizerMapping
Maps LLM model names to appropriate tokenizers for accurate token counting.
Different LLM providers use different tokenization schemes, and even within a provider, newer models may use updated tokenizers. This object provides the mapping logic to select the correct tokenizer for a given model.
==Supported Providers==
| Provider | Model Pattern | Tokenizer | Accuracy |
|------------|-------------------------|----------------|----------|
| OpenAI | gpt-4o, o1-* | o200k_base | Exact |
| OpenAI | gpt-4, gpt-3.5 | cl100k_base | Exact |
| OpenAI | gpt-3 (legacy) | r50k_base | Exact |
| Azure | (same as OpenAI) | (inherited) | Exact |
| Anthropic | claude-* | cl100k_base | ~75% |
| Ollama | * | cl100k_base | ~80% |
| Unknown | * | cl100k_base | Unknown |
==Model Name Formats==
The mapper accepts various model name formats:
- Plain: gpt-4o, claude-3-sonnet
- Provider-prefixed: openai/gpt-4o, anthropic/claude-3-sonnet
- Azure: azure/my-gpt4o-deployment
- Ollama: ollama/llama2
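The mapping logic for these name formats can be sketched as a prefix strip followed by a pattern match. The rules follow the provider table above, but the function name, return type, and exact patterns here are illustrative assumptions, not llm4s's implementation:

```scala
// Illustrative mapping from a model name to a (tiktoken vocabulary, accuracy label)
// pair, following the provider table above. Patterns are assumptions.
def tokenizerFor(model: String): (String, String) = {
  // Strip an optional provider prefix such as "openai/" or "azure/".
  val (provider, bare) = model.split("/", 2) match {
    case Array(p, m) => (p.toLowerCase, m.toLowerCase)
    case _           => ("", model.toLowerCase)
  }
  (provider, bare) match {
    case (_, m) if m.contains("gpt-4o") || m.startsWith("o1") =>
      ("o200k_base", "Exact")
    case (_, m) if m.startsWith("gpt-4") || m.startsWith("gpt-3.5") =>
      ("cl100k_base", "Exact")
    case (_, m) if m.startsWith("gpt-3") =>
      ("r50k_base", "Exact") // legacy GPT-3 models
    case (_, m) if m.startsWith("claude") =>
      ("cl100k_base", "Approximate") // ~75% per the table above
    case ("ollama", _) =>
      ("cl100k_base", "Approximate") // ~80% per the table above
    case _ =>
      ("cl100k_base", "Unknown") // fallback vocabulary
  }
}
```

Stripping the provider prefix first is what lets Azure deployments inherit the OpenAI rules, since the remaining deployment name is matched the same way as a plain OpenAI model name.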
==Accuracy Considerations==
For non-OpenAI models, token counts are '''approximations'''. Claude uses a proprietary tokenizer whose counts may differ by 20-30% from cl100k_base. Always check TokenizerAccuracy to understand the expected accuracy.
Attributes
- See also:
  - ConversationTokenCounter.forModel for the recommended entry point
  - TokenizerAccuracy for accuracy information
- Supertypes: class Object, trait Matchable, class Any
- Self type: TokenizerMapping.type