TokenizerMapping
org.llm4s.context.tokens.TokenizerMapping
object TokenizerMapping
Maps LLM model names to appropriate tokenizers for accurate token counting.
Different LLM providers use different tokenization schemes, and even within a provider, newer models may use updated tokenizers. This object provides the mapping logic to select the correct tokenizer for a given model.
==Supported Providers==
| Provider | Model Pattern | Tokenizer | Accuracy |
|------------|-------------------------|----------------|----------|
| OpenAI | gpt-4o, o1-* | o200k_base | Exact |
| OpenAI | gpt-4, gpt-3.5 | cl100k_base | Exact |
| OpenAI | gpt-3 (legacy) | r50k_base | Exact |
| Azure | (same as OpenAI) | (inherited) | Exact |
| Anthropic | claude-* | cl100k_base | ~75% |
| Ollama | * | cl100k_base | ~80% |
| Unknown | * | cl100k_base | Unknown |
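The table can be read as an ordered set of prefix checks per provider. The following is a minimal, self-contained sketch of that lookup, not the library's implementation: `TokenizerMappingSketch`, `Choice`, `choose`, and the accuracy types are hypothetical names, and the use of `startsWith` matching and the fallback for unrecognized OpenAI models are assumptions.

```scala
// Hypothetical sketch of the table above; not the llm4s implementation.
object TokenizerMappingSketch {
  sealed trait Accuracy
  case object Exact extends Accuracy
  final case class Approximate(percent: Int) extends Accuracy
  case object UnknownAccuracy extends Accuracy

  // Encoding names are the ones listed in the Tokenizer column.
  final case class Choice(encoding: String, accuracy: Accuracy)

  def choose(provider: String, model: String): Choice = {
    val m = model.toLowerCase
    provider.toLowerCase match {
      // Azure is treated like OpenAI here. Arbitrary Azure deployment names
      // will not match these prefixes and fall through to the default.
      case "openai" | "azure" =>
        // Order matters: "gpt-4o" must be tested before "gpt-4",
        // and "gpt-3.5" before "gpt-3".
        if (m.startsWith("gpt-4o") || m.startsWith("o1-")) Choice("o200k_base", Exact)
        else if (m.startsWith("gpt-4") || m.startsWith("gpt-3.5")) Choice("cl100k_base", Exact)
        else if (m.startsWith("gpt-3")) Choice("r50k_base", Exact)
        else Choice("cl100k_base", UnknownAccuracy) // assumed fallback
      case "anthropic" => Choice("cl100k_base", Approximate(75))
      case "ollama"    => Choice("cl100k_base", Approximate(80))
      case _           => Choice("cl100k_base", UnknownAccuracy)
    }
  }
}
```

Checking the more specific prefix first is what keeps gpt-4o from being classified as gpt-4.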
==Model Name Formats==
The mapper accepts various model name formats:
- Plain: gpt-4o, claude-3-sonnet
- Provider-prefixed: openai/gpt-4o, anthropic/claude-3-sonnet
- Azure: azure/my-gpt4o-deployment
- Ollama: ollama/llama2
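One way to handle these formats is to split an optional provider prefix from the model name before doing any lookup. The sketch below assumes the prefix is everything before the first `/`. `ModelNameSketch`, `ParsedName`, and `parse` are hypothetical names used only for illustration and are not part of the documented API.

```scala
// Hypothetical sketch: separates an optional "provider/" prefix from a model name.
object ModelNameSketch {
  final case class ParsedName(provider: Option[String], model: String)

  def parse(name: String): ParsedName = {
    val trimmed = name.trim
    trimmed.indexOf('/') match {
      // A slash with text on both sides: treat the left part as the provider.
      case i if i > 0 && i < trimmed.length - 1 =>
        ParsedName(Some(trimmed.take(i).toLowerCase), trimmed.drop(i + 1))
      // No usable slash: a plain model name with no provider.
      case _ =>
        ParsedName(None, trimmed)
    }
  }
}
```

With this shape, a plain name such as gpt-4o yields no provider, so the caller has to infer one from the model name itself.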
==Accuracy Considerations==
For non-OpenAI models, token counts are '''approximations'''. Claude uses a proprietary tokenizer, so its real token counts may differ by 20-30% from counts produced with cl100k_base. Always check TokenizerAccuracy to understand the expected accuracy for the model in use.
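When a count is only approximate, a caller may want to pad it before comparing it with a context-window budget. The sketch below shows one conservative way to do that using integer arithmetic. `TokenBudgetSketch`, `conservativeEstimate`, and `fits` are hypothetical helpers, and the deviation percentage is an input the caller chooses (for example, 30 for the upper end of the Claude range above), not a value read from the library.

```scala
// Hypothetical sketch: pads an approximate token count by a worst-case deviation.
object TokenBudgetSketch {
  // Returns counted * (1 + maxDeviationPercent / 100), rounded up.
  // Long arithmetic avoids both overflow and floating-point rounding.
  def conservativeEstimate(counted: Int, maxDeviationPercent: Int): Long =
    (counted.toLong * (100 + maxDeviationPercent) + 99) / 100

  // True when even the padded estimate stays within the budget.
  def fits(counted: Int, maxDeviationPercent: Int, budget: Int): Boolean =
    conservativeEstimate(counted, maxDeviationPercent) <= budget.toLong
}
```

Padding upward errs toward trimming the conversation too early rather than exceeding the model's limit.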
Attributes
- See also
  - ConversationTokenCounter.forModel for the recommended entry point
  - TokenizerAccuracy for accuracy information
- Supertypes
  - class Object
  - trait Matchable
  - class Any
- Self type
  - TokenizerMapping.type
Members list