# Context Window Pruning Strategies

Complete guide to LLM4S context window management and pruning strategies.

## Overview
As conversations grow, token counts increase, leading to:

- **Higher costs** - Most LLM providers charge per token
- **Slower responses** - Longer context = slower processing
- **Context window limits** - Models have maximum input token limits

LLM4S provides automatic context window management via `ContextWindowConfig`, with multiple pruning strategies to keep conversations within budget while preserving important context.
## Configuration Basics

```scala
import org.llm4s.agent.{ Agent, ContextWindowConfig, PruningStrategy }

val contextConfig = ContextWindowConfig(
  maxTokens = Some(4096),             // Keep max 4K tokens
  maxMessages = Some(20),             // OR max 20 messages
  preserveSystemMessage = true,       // Always keep system prompt
  minRecentTurns = 3,                 // Keep last 3 user/assistant pairs
  pruningStrategy = PruningStrategy.OldestFirst // How to prune
)

agent.runMultiTurn(
  initialQuery = "...",
  followUpQueries = Seq(...),
  tools = tools,
  contextWindowConfig = Some(contextConfig)
)
```
## Pruning Strategies

### 1. OldestFirst (Recommended for most use cases)

How it works: Removes the oldest messages first while preserving the system message and the most recent turns (a sketch follows the lists below).
Best for:

- ✅ General-purpose multi-turn conversations
- ✅ FAQ bots with rolling history
- ✅ Customer support chats
- ✅ When you want to keep the latest context fresh

Cost:

- 🟢 Good - Removes old/repetitive context first
- 🟢 Predictable token savings

Quality:

- 🟡 Medium - May lose context from early turns
- ✅ Recent context stays intact
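To make the FIFO behavior concrete, here is a minimal sketch of the idea. It is illustrative only, not the library's internal implementation, and it assumes the `Message` and `MessageRole` types used elsewhere in this guide are in scope:

```scala
// Illustrative sketch only - not the library's internals.
// Keep the system message(s), then fill the remaining slots newest-first.
def oldestFirst(messages: Seq[Message], maxMessages: Int): Seq[Message] = {
  val (system, rest) = messages.partition(_.role == MessageRole.System)
  system ++ rest.takeRight(maxMessages - system.length)
}
```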
Example:

```scala
val config = ContextWindowConfig(
  maxMessages = Some(15),
  pruningStrategy = PruningStrategy.OldestFirst
)

// Conversation with 30 messages → the last 15 messages are kept
// The system message is always preserved
```
Trade-offs:

| Aspect | Rating | Notes |
|---|---|---|
| Token efficiency | ⭐⭐⭐⭐ | Removes metadata/greetings first |
| Context continuity | ⭐⭐⭐ | Recent context preserved |
| Implementation | ⭐⭐⭐⭐⭐ | Simple FIFO removal |
| Use case fit | Broad | Works for most scenarios |
### 2. MiddleOut (For complex reasoning)

How it works: Keeps the start (system + initial setup) and the end (recent exchanges), and removes messages from the middle (a sketch follows the lists below).
Best for:

- ✅ Complex problem-solving requiring both initial context and recent progress
- ✅ Long code reviews or technical documents
- ✅ When you need to preserve both "setup" and "current state"
- ✅ Scientific or analytical conversations

Cost:

- 🟡 Medium - Loses middle context but keeps the important bookends
- 🟡 Less predictable than OldestFirst

Quality:

- 🟡 Medium - Loses intermediate steps
- ✅ Good - Keeps initial and recent context
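A minimal sketch of the keep-both-ends idea (illustrative only, not the library's internals; same `Message` type assumption as above):

```scala
// Illustrative sketch only - not the library's internals.
// Keep the first `head` and the last `tail` messages; drop the middle.
def middleOut(messages: Seq[Message], head: Int, tail: Int): Seq[Message] =
  if (messages.length <= head + tail) messages
  else messages.take(head) ++ messages.takeRight(tail)
```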
Example:

```scala
val config = ContextWindowConfig(
  maxMessages = Some(20),
  pruningStrategy = PruningStrategy.MiddleOut
)

// Conversation:
// [System] [Initial Context] [Step 1] [Step 2] ... [Step N] [Recent Q&A]
//                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
//                            the middle steps are pruned; both ends are kept
```
Trade-offs:

| Aspect | Rating | Notes |
|---|---|---|
| Token efficiency | ⭐⭐⭐ | Medium removal |
| Context continuity | ⭐⭐⭐⭐ | Preserves bookends |
| Implementation | ⭐⭐⭐⭐ | Straightforward middle removal |
| Use case fit | Specialized | Good for reasoning chains |
### 3. RecentTurnsOnly (For strict turn-based systems)

How it works: Keeps only the last N complete conversation turns (user message + assistant response + tool calls); a sketch follows the lists below.
Best for:

- ✅ Turn-based games or dialogues
- ✅ Systems that value the latest N interactions equally
- ✅ When conversation structure matters more than time
- ✅ Predictable conversation patterns

Cost:

- 🟢 Good - Focused token reduction
- ✅ Very predictable

Quality:

- 🟡 Medium - Only recent exchanges (may need earlier context)
- ✅ Good - Full recent turns preserved
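To make the turn grouping concrete, here is a minimal sketch (illustrative only, not the library's internals). It assumes a turn starts at each User message and runs until the next one:

```scala
// Illustrative sketch only - not the library's internals.
def recentTurnsOnly(messages: Seq[Message], n: Int): Seq[Message] = {
  val (system, rest) = messages.partition(_.role == MessageRole.System)
  // Split the remaining messages into turn groups at each User message
  val turns = rest.foldLeft(Vector.empty[Vector[Message]]) { (acc, msg) =>
    if (msg.role == MessageRole.User || acc.isEmpty) acc :+ Vector(msg)
    else acc.init :+ (acc.last :+ msg)
  }
  system ++ turns.takeRight(n).flatten
}
```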
Example:

```scala
val config = ContextWindowConfig(
  pruningStrategy = PruningStrategy.RecentTurnsOnly(5)
)

// Keeps exactly the last 5 user/assistant turn pairs.
// A "turn" = [User message] → [Assistant response] + optional [Tool messages]
//
// Turn N-4: User asks → Assistant responds (+ tool calls)
// ...
// Turn N:   User asks → Assistant responds (+ tool calls)
// Turns 1 to N-5: pruned
```
Trade-offs:

| Aspect | Rating | Notes |
|---|---|---|
| Token efficiency | ⭐⭐⭐⭐ | Removes entire turn groups |
| Context continuity | ⭐⭐ | May lose early turns |
| Implementation | ⭐⭐⭐⭐ | Turn-based filtering |
| Use case fit | Specific | Best for turn-based systems |
### 4. Custom (Maximum flexibility)

How it works: You provide a pure function `Seq[Message] => Seq[Message]` that decides which messages to keep.
Best for:

- ✅ Domain-specific pruning logic
- ✅ Removing marketing messages
- ✅ Metadata-first removal
- ✅ Complex importance scoring
- ✅ Hybrid strategies

Cost:

- 🟢 Excellent - You decide what's expensive
- ⚠️ Depends on your logic

Quality:

- 🟢 Excellent - You decide what matters
- ⚠️ Depends on your logic
Example:

```scala
// Remove verbose debug messages, keep concise ones
val pruneDebugMessages: Seq[Message] => Seq[Message] = { messages =>
  messages.filter { msg =>
    !msg.content.toLowerCase.contains("[debug]") ||
    msg.content.length < 100 // Keep short debug messages
  }
}

val config = ContextWindowConfig(
  maxMessages = Some(50),
  pruningStrategy = PruningStrategy.Custom(pruneDebugMessages)
)
```
Advanced Example (importance scoring):

```scala
// A sketch of importance-based pruning: score each message, then drop the
// least important messages until the conversation fits a token budget.
// `limit` and `estimateTokens` are illustrative placeholders - swap in a real
// tokenizer (e.g. ConversationTokenCounter) for accurate counts.
val limit = 4096
def estimateTokens(msg: Message): Int = msg.content.length / 4 // rough heuristic

val scoreImportance: Seq[Message] => Seq[Message] = { messages =>
  def score(msg: Message): Int = msg.role match {
    case MessageRole.System    => 100 // Always keep
    case MessageRole.User      => 80  // Keep user messages
    case MessageRole.Assistant => 70  // Keep responses
    case MessageRole.Tool      => 40  // Lower priority
    case _                     => 0
  }

  val indexed = messages.zipWithIndex
  var budget  = messages.map(estimateTokens).sum
  val removed = scala.collection.mutable.BitSet.empty

  // Walk candidates from least to most important, dropping until under limit
  for ((msg, idx) <- indexed.sortBy { case (m, _) => score(m) } if budget > limit) {
    removed += idx
    budget -= estimateTokens(msg)
  }
  indexed.collect { case (msg, idx) if !removed(idx) => msg }
}
```
Trade-offs:

| Aspect | Rating | Notes |
|---|---|---|
| Token efficiency | ⭐⭐⭐⭐⭐ | Complete control |
| Context continuity | ⭐⭐⭐⭐⭐ | Complete control |
| Implementation | ⭐⭐ | Requires custom code |
| Use case fit | Very specific | Best for special requirements |
### 5. AdaptiveWindowing (Intelligent auto-tuning)

How it works: Automatically calculates the optimal context window based on:

- The model's context size
- The model tier (small, medium, large, extra-large)
- Token pricing (input vs. output cost)
- Your cost sensitivity preference
Best for:

- ✅ Multi-model deployments (switch models without config changes)
- ✅ Cost-conscious production systems
- ✅ Quality-sensitive applications
- ✅ Hybrid cost/quality optimization
- ✅ When you want "set it and forget it"

Cost:

- 🟢 Excellent - Optimizes based on actual pricing
- ✅ Adapts to model changes

Quality:

- 🟢 Good - Uses 70-80% of the context window
- ✅ Leaves room for model reasoning
Example:

```scala
import org.llm4s.model.ModelMetadata
import org.llm4s.agent.PruningStrategy

// Parameters come from ModelMetadata (automatically available)
val strategy = PruningStrategy.AdaptiveWindowing(
  contextWindowSize = 128_000,         // gpt-4o context
  inputCostPerToken = Some(0.000005),  // $5 per 1M input tokens
  outputCostPerToken = Some(0.000015), // $15 per 1M output tokens
  preserveMinTurns = 3,
  costSensitivity = 0.5                // Balanced (0 = quality, 1 = cost)
)

val config = ContextWindowConfig(
  pruningStrategy = strategy
)

println(strategy.calculateOptimalWindow)
// Output: 89,600 tokens (70% * 128K)
println(strategy.explanation)
// "AdaptiveWindowing: using 89,600 tokens (70% of 128K large model)"
```
Model tier adjustments:
| Model Size | Context | Multiplier | Adjusted Window | Use Case |
|---|---|---|---|---|
| Small | 8K | 60% | 4.8K | Budget-conscious |
| Medium | 100K | 70% | 70K | Balanced |
| Large | 200K | 75% | 150K | Quality-focused |
| Extra Large | 1M+ | 80% | 800K+ | Complex reasoning |
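The tier adjustment is just a percentage of the model's context window. A quick worked sketch of that arithmetic, using a hypothetical helper (not an LLM4S API; the exact tier boundaries are assumptions inferred from the table and the 128K example above):

```scala
// Hypothetical helper mirroring the tier table - not an LLM4S API.
def adjustedWindow(contextSize: Int): Int = {
  val multiplier =
    if (contextSize >= 1_000_000) 0.80    // extra large: complex reasoning
    else if (contextSize >= 200_000) 0.75 // large: quality-focused
    else if (contextSize > 8_000) 0.70    // medium (includes 128K, as above)
    else 0.60                             // small: budget-conscious
  (contextSize * multiplier).toInt
}

// adjustedWindow(8_000)     == 4800    (4.8K)
// adjustedWindow(100_000)   == 70000   (70K)
// adjustedWindow(128_000)   == 89600   (matches the example above)
// adjustedWindow(200_000)   == 150000  (150K)
// adjustedWindow(1_000_000) == 800000  (800K)
```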
Cost sensitivity examples:

```scala
// 🟡 Balanced (default)
// Sensitivity = 0.5
// Minimizes cost moderately while preserving quality
val balanced = PruningStrategy.AdaptiveWindowing(
  contextWindowSize = 128_000,
  inputCostPerToken = Some(0.000005),
  outputCostPerToken = Some(0.000015),
  costSensitivity = 0.5
)

// 🟢 Quality-focused
// Sensitivity = 0.1
// Use most of the available context; cost matters less
val qualityFocused = PruningStrategy.AdaptiveWindowing(
  contextWindowSize = 128_000,
  costSensitivity = 0.1 // Low cost sensitivity
)

// 💰 Cost-optimized
// Sensitivity = 0.9
// Aggressively minimize token usage
val costOptimized = PruningStrategy.AdaptiveWindowing(
  contextWindowSize = 128_000,
  costSensitivity = 0.9 // High cost sensitivity
)
```
Trade-offs:

| Aspect | Rating | Notes |
|---|---|---|
| Token efficiency | ⭐⭐⭐⭐ | Auto-optimized |
| Context continuity | ⭐⭐⭐⭐ | Preserves minimum turns |
| Implementation | ⭐⭐⭐⭐⭐ | Recommended for production |
| Use case fit | Universal | Works everywhere |
## Strategy Comparison Matrix

| Strategy | Pros | Cons | Cost | Quality | When to Use |
|---|---|---|---|---|---|
| OldestFirst | Simple, predictable | May lose early context | ⭐⭐⭐⭐ | ⭐⭐⭐ | Default choice |
| MiddleOut | Preserves bookends | Loses intermediate steps | ⭐⭐⭐ | ⭐⭐⭐⭐ | Complex reasoning |
| RecentTurnsOnly | Consistent turn count | Loses older turns | ⭐⭐⭐⭐ | ⭐⭐ | Turn-based systems |
| Custom | Maximum flexibility | Requires implementation | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Special needs |
| AdaptiveWindowing | Auto-optimized, scales | Needs model metadata | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Production systems |
## Configuration Patterns

### Pattern 1: Token-based Limiting (Recommended)

```scala
val config = ContextWindowConfig(
  maxTokens = Some(4096),
  pruningStrategy = PruningStrategy.OldestFirst
)

// More precise - based on actual token usage
// Handles variable-length messages correctly
```
### Pattern 2: Message-based Limiting (Simpler)

```scala
val config = ContextWindowConfig(
  maxMessages = Some(20),
  pruningStrategy = PruningStrategy.OldestFirst
)

// Simpler to understand
// Less precise, but easier to reason about
```
### Pattern 3: Hybrid (Both limits)

```scala
val config = ContextWindowConfig(
  maxTokens = Some(4096),
  maxMessages = Some(50),
  pruningStrategy = PruningStrategy.OldestFirst
)

// Enforces BOTH limits
// Prunes when either limit is exceeded (see the sketch below)
```
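Conceptually, the hybrid configuration triggers pruning as soon as either budget is exhausted. A minimal sketch of that check, as a hypothetical predicate (not library code):

```scala
// Hypothetical predicate - prune when EITHER limit is exceeded.
def needsPruning(messageCount: Int, tokenCount: Int,
                 maxMessages: Int, maxTokens: Int): Boolean =
  messageCount > maxMessages || tokenCount > maxTokens
```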
### Pattern 4: Preserve Minimum Context

```scala
val config = ContextWindowConfig(
  maxMessages = Some(15),
  preserveSystemMessage = true,
  minRecentTurns = 3, // Always keep the last 3 turns
  pruningStrategy = PruningStrategy.RecentTurnsOnly(5)
)

// Guarantees that a minimum of context is always available,
// even if other limits would be exceeded
```
### Pattern 5: Adaptive Mode (Multi-model)

```scala
// isExpensiveModel and modelSize come from your own model-selection logic
val strategy = if (isExpensiveModel) {
  PruningStrategy.AdaptiveWindowing(
    contextWindowSize = modelSize,
    inputCostPerToken = Some(0.000010),
    costSensitivity = 0.8 // Aggressive cost optimization
  )
} else {
  PruningStrategy.AdaptiveWindowing(
    contextWindowSize = modelSize,
    inputCostPerToken = Some(0.000001),
    costSensitivity = 0.3 // More quality-focused
  )
}

val config = ContextWindowConfig(
  pruningStrategy = strategy
)
```
## Cost Analysis

### Calculate conversation cost

```scala
val tokenCounter = ConversationTokenCounter.forModel("gpt-4o").get
val conversationTokens = tokenCounter.countConversation(conversation)

// Cost calculation
val inputCostPerToken = 0.000005 // $5 per 1M tokens
val totalCost = conversationTokens * inputCostPerToken

println(s"Conversation: $conversationTokens tokens = $$$totalCost")
```
### Compare strategies

```scala
// Without pruning:                        10,000 tokens = $0.05
// With OldestFirst (keeping 50 messages): ~3,000 tokens = $0.015 (70% savings)
// With AdaptiveWindowing:                 ~4,096 tokens = $0.020 (60% savings)

// For 100 conversations per day:
// Without pruning: $5.00/day
// With pruning:    $1.50-2.00/day (savings: $3.00-3.50/day)
```
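The same arithmetic as a runnable check (plain Scala, no library calls):

```scala
// Reproduce the numbers above: cost = tokens * $5 per 1M input tokens
val costPerToken = 0.000005

def dailyCost(tokensPerConversation: Int, conversationsPerDay: Int): Double =
  tokensPerConversation * costPerToken * conversationsPerDay

val withoutPruning  = dailyCost(10_000, 100) // $5.00/day
val withOldestFirst = dailyCost(3_000, 100)  // $1.50/day
val withAdaptive    = dailyCost(4_096, 100)  // ~$2.05/day

println(f"Savings with OldestFirst: $$${withoutPruning - withOldestFirst}%.2f/day")
```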
## Best Practices

### ✅ DO

- **Use token-based limits** - More accurate for variable-length conversations

  ```scala
  maxTokens = Some(4096) // Better than maxMessages
  ```

- **Preserve the system message** - Keeps instructions intact

  ```scala
  preserveSystemMessage = true // Default; always set this
  ```

- **Use AdaptiveWindowing for production** - Handles model changes automatically

  ```scala
  PruningStrategy.AdaptiveWindowing(contextWindowSize, costPerToken)
  ```

- **Monitor pruning events** - Log when and what gets removed (see Observability & Monitoring below)
- **Test with real conversations** - Benchmark different strategies

### ❌ DON'T

- **Set maxTokens too low** - Breaks context coherence

  ```scala
  maxTokens = Some(256) // Too small!
  ```

- **Forget about the output budget** - Leave room for model responses (see the sketch after this list)

  ```scala
  // If the context is 100K, use ~70K for input (leave 30K for output)
  maxTokens = Some(70_000)
  ```

- **Use token limits without a tokenizer** - Falls back to word-count estimation

  ```scala
  // Integrate ConversationTokenCounter for accuracy
  ```

- **Hard-code window sizes** - Use AdaptiveWindowing instead

  ```scala
  // Bad:  a fixed 4K window works only for specific models
  // Good: AdaptiveWindowing scales with model changes
  ```
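One way to compute an input budget that reserves output headroom, as a hypothetical helper (not an LLM4S API):

```scala
// Hypothetical helper - reserve part of the context window for the response.
def inputBudget(contextWindow: Int, reservedForOutput: Int): Int =
  contextWindow - reservedForOutput

val config = ContextWindowConfig(
  maxTokens = Some(inputBudget(100_000, 30_000)), // 70K input budget
  pruningStrategy = PruningStrategy.OldestFirst
)
```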
## Observability & Monitoring

### Inspect pruning results

Pruning happens automatically inside `runMultiTurn`. To observe it explicitly, apply `AgentState.pruneConversation` before running and compare message counts:

```scala
import org.llm4s.agent.AgentState

val before = state.conversation.messages.length
val pruned = AgentState.pruneConversation(state, config)
val after = pruned.conversation.messages.length

if (after < before)
  logger.info(s"Pruned ${before - after} messages ($before → $after)")
```
### Log pruning performance

```scala
val costPerToken = 0.000005 // e.g. $5 per 1M input tokens

val pruned = AgentState.pruneConversation(state, config, tokenCounter)
val tokenBefore = tokenCounter.countConversation(state.conversation)
val tokenAfter = tokenCounter.countConversation(pruned.conversation)

logger.info(s"Pruning efficiency: saved ${tokenBefore - tokenAfter} tokens")
logger.info(s"Cost savings: $$${(tokenBefore - tokenAfter) * costPerToken}")
```
## Troubleshooting

### Issue: Conversation keeps growing despite pruning

Cause: `contextWindowConfig` not passed to the agent method.

Solution:

```scala
// ❌ Wrong: pruning not enabled
agent.runMultiTurn(query, followUps, tools)

// ✅ Right: pruning enabled
agent.runMultiTurn(
  query,
  followUps,
  tools,
  contextWindowConfig = Some(config) // Must specify!
)
```
### Issue: Important context is being removed

Cause: Pruning strategy is too aggressive.

Solution:

```scala
// Increase limits
val config = ContextWindowConfig(
  maxTokens = Some(8192), // Was 4096
  pruningStrategy = PruningStrategy.MiddleOut // Preserves bookends
)

// Or use a Custom strategy to protect important messages
```
### Issue: Conversations are too short

Cause: `minRecentTurns` or `preserveSystemMessage` settings.

Solution:

```scala
val config = ContextWindowConfig(
  maxTokens = Some(8192),
  minRecentTurns = 1, // Fewer forced turns
  preserveSystemMessage = false, // Allow the system message to be pruned if needed
  pruningStrategy = PruningStrategy.OldestFirst
)
```
## Advanced Examples

### Example 1: Cost-conscious chatbot

```scala
val config = ContextWindowConfig(
  pruningStrategy = PruningStrategy.AdaptiveWindowing(
    contextWindowSize = 8000,          // Smaller model
    inputCostPerToken = Some(0.00001), // Expensive
    outputCostPerToken = Some(0.00003),
    costSensitivity = 0.9              // Minimize cost
  )
)
```
### Example 2: Research assistant (preserve as much context as possible)

```scala
val config = ContextWindowConfig(
  maxTokens = Some(200_000), // Large window
  pruningStrategy = PruningStrategy.MiddleOut, // Keep setup + recent
  preserveSystemMessage = true
)
```
### Example 3: Code review assistant

````scala
val config = ContextWindowConfig(
  pruningStrategy = PruningStrategy.Custom { messages =>
    // Keep code snippets and review comments
    val recentCutoff = messages.length - 10
    messages.zipWithIndex.collect {
      case (msg, idx)
          if msg.role == MessageRole.System || // Keep instructions
            msg.content.contains("```") ||     // Keep code blocks
            msg.content.contains("review") ||  // Keep review comments
            idx >= recentCutoff =>             // Keep the 10 most recent
        msg
    }
  }
)
````
### Example 4: Turn-based game

```scala
val config = ContextWindowConfig(
  pruningStrategy = PruningStrategy.RecentTurnsOnly(10), // Last 10 turns
  preserveSystemMessage = true,
  minRecentTurns = 1 // Always keep the last turn
)
```
## Summary

| Need | Strategy | Reason |
|---|---|---|
| I just want a working solution | OldestFirst | Simplest, most reliable |
| I'm building a production system | AdaptiveWindowing | Auto-scales with models |
| I have complex reasoning chains | MiddleOut | Preserves bookends |
| I need precise control | Custom | Maximum flexibility |
| I have turn-based logic | RecentTurnsOnly | Natural for turns |

Choose AdaptiveWindowing for production systems that work across multiple models. Choose OldestFirst for everything else.