org.llm4s.context.tokens

StringTokenizer

Converts a plain string into a sequence of Tokens using a specific BPE vocabulary.

Implementations are backed by jtokkit encodings and obtained via Tokenizer.lookupStringTokenizer; the interface is kept minimal to allow test doubles without a real encoding registry.
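As an illustration of the test-double idea, here is a minimal sketch that needs no encoding registry. `Token` and `StringTokenizer` are declared locally as stand-ins; the `tokenId` field and the `encode` method name are assumptions, and the real signatures in this package may differ.

```scala
// Hypothetical stand-ins for the package's types; the real signatures may differ.
final case class Token(tokenId: Int)

trait StringTokenizer {
  def encode(text: String): List[Token]
}

// Test double: one token per whitespace-separated word.
// The token ids are simply word positions, not real vocabulary indices.
object WhitespaceTokenizer extends StringTokenizer {
  def encode(text: String): List[Token] =
    text
      .split("\\s+")
      .toList
      .filter(_.nonEmpty)
      .zipWithIndex
      .map { case (_, index) => Token(index) }
}
```

A double like this gives deterministic counts in unit tests, although those counts will not match any real BPE vocabulary.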

Attributes

Supertypes: class Object, trait Matchable, class Any

Token

Represents a single BPE token as an integer vocabulary index.

The integer corresponds to the token identifier used by the underlying jtokkit / TikToken encoding (e.g. cl100k_base for GPT-4).

Attributes

Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any

object Tokenizer

Factory for obtaining StringTokenizer instances backed by jtokkit BPE encodings.

Encodings are looked up from the jtokkit default registry, which bundles the standard TikToken vocabularies (cl100k_base, o200k_base, etc.). Looking up an unknown tokenizerId (one whose name does not match any bundled vocabulary) returns None; callers must handle the absent case explicitly.
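The sketch below shows one way a caller might handle the absent case. It is self-contained and does not call the real factory: `bundled`, `lookup`, and `describe` are hypothetical names, and only the vocabulary names come from the description above.

```scala
// Hypothetical stand-in for a vocabulary registry; not the jtokkit registry itself.
val bundled: Set[String] = Set("cl100k_base", "o200k_base")

// Mirrors the documented contract: an unknown name yields None rather than an exception.
def lookup(tokenizerId: String): Option[String] =
  if (bundled.contains(tokenizerId)) Some(tokenizerId) else None

// The caller decides what an absent vocabulary means instead of assuming success.
def describe(tokenizerId: String): String =
  lookup(tokenizerId) match {
    case Some(name) => s"using $name"
    case None       => s"no bundled vocabulary named '$tokenizerId'"
  }
```

Matching on both cases keeps the failure visible at the call site, which is the point of returning an Option here.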

Attributes

See also: org.llm4s.identity.TokenizerId for vocabulary name constants; org.llm4s.context.tokens.TokenizerMapping for the model → tokenizer mapping
Supertypes: class Object, trait Matchable, class Any
Self type: Tokenizer.type

sealed trait TokenizerAccuracy

Represents the accuracy of tokenizer mapping for a model.

This sealed trait hierarchy captures three levels of accuracy:

 - '''Exact''': Native tokenizer available, counts are precise
 - '''Approximate''': Using a similar tokenizer, counts may differ
 - '''Unknown''': Unknown model, using fallback tokenizer
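A sealed hierarchy lets the compiler check that every level is handled. The following is a minimal sketch of that pattern, not the library's definition: `Accuracy`, its constructor fields, and `isPrecise` are hypothetical, and the real subtypes may carry different data.

```scala
// Hypothetical model of a three-level accuracy hierarchy; fields are assumptions.
sealed trait Accuracy
object Accuracy {
  final case class Exact(tokenizer: String)       extends Accuracy
  final case class Approximate(tokenizer: String) extends Accuracy
  final case class Unknown(fallback: String)      extends Accuracy
}

// Because Accuracy is sealed, the compiler can warn if a case is missing here.
def isPrecise(accuracy: Accuracy): Boolean = accuracy match {
  case Accuracy.Exact(_)       => true
  case Accuracy.Approximate(_) => false
  case Accuracy.Unknown(_)     => false
}
```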

Attributes

See also: TokenizerMapping.getAccuracyInfo
Companion: object
Supertypes: class Object, trait Matchable, class Any
Known subtypes: class Approximate, class Exact, class Unknown

object TokenizerAccuracy

Accuracy level variants for tokenizer mappings.

Attributes

Companion: trait
Supertypes: trait Sum, trait Mirror, class Object, trait Matchable, class Any
Self type: TokenizerAccuracy.type

object TokenizerMapping

Maps LLM model names to appropriate tokenizers for accurate token counting.

Different LLM providers use different tokenization schemes, and even within a provider, newer models may use updated tokenizers. This object provides the mapping logic to select the correct tokenizer for a given model.

==Supported Providers==

| Provider   | Model Pattern           | Tokenizer      | Accuracy |
|------------|-------------------------|----------------|----------|
| OpenAI     | gpt-4o, o1-*            | o200k_base     | Exact    |
| OpenAI     | gpt-4, gpt-3.5          | cl100k_base    | Exact    |
| OpenAI     | gpt-3 (legacy)          | r50k_base      | Exact    |
| Azure      | (same as OpenAI)        | (inherited)    | Exact    |
| Anthropic  | claude-*                | cl100k_base    | ~75%     |
| Ollama     | *                       | cl100k_base    | ~80%     |
| Unknown    | *                       | cl100k_base    | Unknown  |

==Model Name Formats==

The mapper accepts various model name formats:

 - Plain: gpt-4o, claude-3-sonnet
 - Provider-prefixed: openai/gpt-4o, anthropic/claude-3-sonnet
 - Azure: azure/my-gpt4o-deployment
 - Ollama: ollama/llama2
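To make the table and the name formats concrete, here is a rough sketch of prefix handling plus pattern matching. `tokenizerFor` is a hypothetical helper, not the library's implementation; it uses simple prefix checks and deliberately leaves out the inspection of Azure deployment names.

```scala
// Hypothetical helper: picks a vocabulary name using the patterns from the table.
def tokenizerFor(model: String): String = {
  val lower = model.toLowerCase

  // Split an optional "provider/" prefix from the model name.
  val (provider, name) = lower.split("/", 2) match {
    case Array(p, n) => (Option(p), n)
    case _           => (Option.empty[String], lower)
  }

  if (provider.contains("ollama")) "cl100k_base"
  else if (name.startsWith("gpt-4o") || name.startsWith("o1-")) "o200k_base"
  else if (name.startsWith("gpt-4") || name.startsWith("gpt-3.5")) "cl100k_base"
  else if (name.startsWith("gpt-3")) "r50k_base"
  else "cl100k_base" // claude-* and unrecognised models share the fallback
}
```

The order of the checks matters: gpt-4o must be tested before gpt-4, and gpt-3.5 before gpt-3, because the shorter prefix would otherwise match first.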

==Accuracy Considerations==

For non-OpenAI models, token counts are '''approximations'''. Claude uses a proprietary tokenizer, so its actual token counts may differ by 20-30% from counts computed with cl100k_base. Always check TokenizerAccuracy to understand how far a count can be trusted.
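One way to act on an approximate count is to pad it before comparing it against a context limit. The sketch below is an assumption about how a caller might do that, not something the library provides; `paddedCount` and its percentage parameter are hypothetical.

```scala
// Hypothetical helper: inflates an estimated token count by a deviation percentage,
// rounding up, using integer arithmetic to avoid floating-point surprises.
def paddedCount(estimated: Int, deviationPercent: Int): Int =
  ((estimated.toLong * (100 + deviationPercent) + 99) / 100).toInt
```

For example, with the 30% upper end of the range quoted above, an estimate of 1000 tokens would be budgeted as 1300.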

Attributes

See also: ConversationTokenCounter.forModel for the recommended entry point; TokenizerAccuracy for accuracy information
Supertypes: class Object, trait Matchable, class Any
Self type: TokenizerMapping.type
