org.llm4s.context.tokens

Members list

Type members

Classlikes

trait StringTokenizer

Converts a plain string into a sequence of Tokens using a specific BPE vocabulary.

Implementations are backed by jtokkit encodings and obtained via Tokenizer.lookupStringTokenizer; the interface is kept minimal to allow test doubles without a real encoding registry.
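A minimal sketch of such a test double follows. The `Token` and `StringTokenizer` definitions here are local stand-ins so the example is self-contained, and the `encode(text: String): List[Token]` signature is an assumption, since this page does not show the trait's members. `WhitespaceTokenizer` is a hypothetical name and does not perform BPE.

```scala
// Local stand-ins for illustration only. The real types live in
// org.llm4s.context.tokens; the `encode` signature is an assumption.
object TokenizerDoubleSketch {
  final case class Token(tokenId: Int)

  trait StringTokenizer {
    def encode(text: String): List[Token]
  }

  // Hypothetical test double: one token per whitespace-separated word,
  // with the word's position as the token id. No encoding registry needed.
  object WhitespaceTokenizer extends StringTokenizer {
    def encode(text: String): List[Token] =
      text.trim
        .split("\\s+")
        .toList
        .filter(_.nonEmpty)
        .zipWithIndex
        .map { case (_, index) => Token(index) }
  }
}
```

A double like this makes token counts predictable in unit tests: the count equals the number of words in the input.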

Attributes

Supertypes
class Object
trait Matchable
class Any
case class Token(tokenId: Int)

Represents a single BPE token as an integer vocabulary index.

The integer corresponds to the token identifier used by the underlying jtokkit / TikToken encoding (e.g. cl100k_base for GPT-4).

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object Tokenizer

Factory for obtaining StringTokenizer instances backed by jtokkit BPE encodings.

Encodings are looked up from the jtokkit default registry, which bundles the standard TikToken vocabularies (cl100k_base, o200k_base, etc.). An unknown tokenizerId (one whose name does not match any bundled vocabulary) returns None; callers must handle the absent case explicitly.
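The sketch below illustrates handling the absent case explicitly. It is self-contained and does not call the real library: the in-memory `registry`, the character-level tokenizer, the `String` parameter of `lookupStringTokenizer`, and the `countTokens` helper are all assumptions made for illustration.

```scala
// Stand-in types and a fake registry; not the real jtokkit-backed lookup.
object LookupSketch {
  final case class Token(tokenId: Int)

  trait StringTokenizer {
    def encode(text: String): List[Token]
  }

  // Fake tokenizer: one token per character (not BPE).
  private val charTokenizer: StringTokenizer = new StringTokenizer {
    def encode(text: String): List[Token] = text.toList.map(c => Token(c.toInt))
  }

  // Hypothetical registry keyed by vocabulary name.
  private val registry: Map[String, StringTokenizer] =
    Map("cl100k_base" -> charTokenizer, "o200k_base" -> charTokenizer)

  // Assumed shape: an unknown id yields None rather than throwing.
  def lookupStringTokenizer(tokenizerId: String): Option[StringTokenizer] =
    registry.get(tokenizerId)

  // The caller matches on both cases instead of calling `.get`.
  def countTokens(tokenizerId: String, text: String): Either[String, Int] =
    lookupStringTokenizer(tokenizerId) match {
      case Some(tokenizer) => Right(tokenizer.encode(text).length)
      case None            => Left(s"No tokenizer registered for '$tokenizerId'")
    }
}
```

Returning `Either` here is one design choice among several; falling back to a default vocabulary with `getOrElse` would be another, at the cost of hiding the miss from the caller.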

Attributes

See also

org.llm4s.identity.TokenizerId for vocabulary name constants

org.llm4s.context.tokens.TokenizerMapping for the model → tokenizer mapping

Supertypes
class Object
trait Matchable
class Any
Self type
Tokenizer.type
sealed trait TokenizerAccuracy

Represents the accuracy of tokenizer mapping for a model.

This sealed trait hierarchy captures three levels of accuracy:

  • Exact: a native tokenizer is available, so counts are precise
  • Approximate: a similar tokenizer is used, so counts may differ
  • Unknown: the model is not recognized, so a fallback tokenizer is used
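The three levels above lend themselves to exhaustive pattern matching. The sketch below uses a local stand-in hierarchy modelled as case objects for brevity; the real subtypes are classes whose fields are not shown on this page, so the names `Accuracy` and `describe` are hypothetical.

```scala
// Local stand-in for the sealed hierarchy; the real subtypes are classes
// with members not shown here.
object AccuracySketch {
  sealed trait Accuracy
  case object Exact extends Accuracy
  case object Approximate extends Accuracy
  case object Unknown extends Accuracy

  // Because the trait is sealed, the compiler can warn if a case is missing.
  def describe(accuracy: Accuracy): String = accuracy match {
    case Exact       => "count is precise"
    case Approximate => "count is an estimate from a similar tokenizer"
    case Unknown     => "count comes from a fallback tokenizer"
  }
}
```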

Attributes

See also
Companion
object
Supertypes
class Object
trait Matchable
class Any
Known subtypes
class Approximate
class Exact
class Unknown

object TokenizerAccuracy

Accuracy level variants for tokenizer mappings.

Attributes

Companion
trait
Supertypes
trait Sum
trait Mirror
class Object
trait Matchable
class Any
Self type

object TokenizerMapping

Maps LLM model names to appropriate tokenizers for accurate token counting.

Different LLM providers use different tokenization schemes, and even within a provider, newer models may use updated tokenizers. This object provides the mapping logic to select the correct tokenizer for a given model.

Supported Providers

| Provider   | Model Pattern           | Tokenizer      | Accuracy |
|------------|-------------------------|----------------|----------|
| OpenAI     | gpt-4o, o1-*            | o200k_base     | Exact    |
| OpenAI     | gpt-4, gpt-3.5          | cl100k_base    | Exact    |
| OpenAI     | gpt-3 (legacy)          | r50k_base      | Exact    |
| Azure      | (same as OpenAI)        | (inherited)    | Exact    |
| Anthropic  | claude-*                | cl100k_base    | ~75%     |
| Ollama     | *                       | cl100k_base    | ~80%     |
| Unknown    | *                       | cl100k_base    | Unknown  |

Model Name Formats

The mapper accepts various model name formats:

  • Plain: gpt-4o, claude-3-sonnet
  • Provider-prefixed: openai/gpt-4o, anthropic/claude-3-sonnet
  • Azure: azure/my-gpt4o-deployment
  • Ollama: ollama/llama2
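As a rough illustration of how such names could be resolved, the sketch below splits an optional provider prefix and then applies some of the patterns from the table above. It is not the library's implementation: `splitModelName`, `tokenizerFor`, and `Choice` are hypothetical names, accuracy is represented as a plain string, and the legacy gpt-3 row and Azure deployment-name resolution are deliberately not modelled (an Azure deployment name falls through to the fallback case here).

```scala
// Hypothetical resolver; covers only part of the table above.
object MappingSketch {
  final case class Choice(tokenizer: String, accuracy: String)

  // "provider/model" -> (Some(provider), model); plain names have no provider.
  def splitModelName(name: String): (Option[String], String) =
    name.split("/", 2) match {
      case Array(provider, model) => (Some(provider.toLowerCase), model)
      case _                      => (None, name)
    }

  def tokenizerFor(name: String): Choice = {
    val (provider, rawModel) = splitModelName(name)
    val model = rawModel.toLowerCase
    provider match {
      case Some("ollama")    => Choice("cl100k_base", "Approximate")
      case Some("anthropic") => Choice("cl100k_base", "Approximate")
      case _ if model.startsWith("claude-") =>
        Choice("cl100k_base", "Approximate")
      // gpt-4o must be checked before the broader gpt-4 prefix.
      case _ if model.startsWith("gpt-4o") || model.startsWith("o1-") =>
        Choice("o200k_base", "Exact")
      case _ if model.startsWith("gpt-4") || model.startsWith("gpt-3.5") =>
        Choice("cl100k_base", "Exact")
      case _ => Choice("cl100k_base", "Unknown")
    }
  }
}
```

Note the ordering of the OpenAI cases: testing the more specific `gpt-4o` prefix first keeps it from being captured by the `gpt-4` rule.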

Accuracy Considerations

For non-OpenAI models, token counts are approximations. Claude uses a proprietary tokenizer, so its actual token counts may differ by 20-30% from counts computed with cl100k_base. Always check TokenizerAccuracy to understand how far a count can be trusted.
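One way to act on an approximate count is to pad it with a safety margin before comparing it against a context budget. The helpers below are a hypothetical sketch, not part of the library; the 30% figure in the usage note simply takes the upper end of the range quoted above.

```scala
// Hypothetical helpers for budgeting with approximate token counts.
object MarginSketch {
  // Inflate a count by `marginPercent`, rounding up, using integer arithmetic.
  def paddedCount(approxCount: Int, marginPercent: Int): Int =
    ((approxCount.toLong * (100 + marginPercent) + 99) / 100).toInt

  // True when the padded estimate still fits within the budget.
  def fits(approxCount: Int, marginPercent: Int, budget: Int): Boolean =
    paddedCount(approxCount, marginPercent) <= budget
}
```

For example, an approximate count of 1000 tokens with a 30% margin is treated as 1300 tokens, so it would not fit a 1200-token budget.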

Attributes

See also

ConversationTokenCounter.forModel for the recommended entry point

TokenizerAccuracy for accuracy information

Supertypes
class Object
trait Matchable
class Any
Self type