llm4s Agent Framework Roadmap

Date: 2025-11-16
Purpose: Strategic roadmap for enhancing llm4s agent capabilities while maintaining functional programming principles
Status: Analysis Complete
Context: Comprehensive comparison against OpenAI Agents SDK, PydanticAI, and CrewAI, focused on llm4s-specific improvements


Table of Contents

  1. Executive Summary
  2. Framework Landscape Comparison
  3. llm4s Design Philosophy
  4. Detailed Feature Comparison
  5. Gap Analysis
  6. Implementation Roadmap
  7. Priority Recommendations
  8. Appendix: Architecture Notes

Executive Summary

Current State

llm4s provides a solid foundation for agent-based workflows with:

  • ✅ Single-agent execution with tool calling
  • ✅ Multi-agent orchestration via DAG-based plans
  • ✅ Type-safe agent composition
  • ✅ Parallel and sequential execution
  • ✅ Result-based error handling
  • ✅ Markdown trace logging
  • ✅ MCP (Model Context Protocol) integration
  • ✅ Cross-version Scala support (2.13 & 3.x)

OpenAI Agents SDK offers additional capabilities for production workflows:

  • Advanced session management with automatic conversation history
  • Input/output guardrails for validation
  • Native handoff mechanism for agent delegation
  • Built-in tools (web search, file search, computer use)
  • Multiple streaming event types
  • Temporal integration for durable workflows
  • Extensive observability integrations (Logfire, AgentOps, Braintrust, etc.)
  • Provider-agnostic design (100+ LLM providers)

Gap Score

| Category | llm4s Score | OpenAI SDK Score | Gap |
|---|---|---|---|
| Core Agent Execution | 9/10 | 10/10 | Small |
| Multi-Agent Orchestration | 8/10 | 9/10 | Small |
| Tool Management | 8/10 | 10/10 | Moderate |
| State & Session Management | 6/10 | 10/10 | Large |
| Error Handling & Validation | 7/10 | 10/10 | Moderate |
| Streaming | 4/10 | 10/10 | Large |
| Observability | 6/10 | 10/10 | Moderate |
| Production Features | 5/10 | 10/10 | Large |
| Built-in Tools | 2/10 | 10/10 | Large |

Overall Assessment: llm4s has a strong foundation but lacks several production-critical features that OpenAI Agents SDK provides out-of-the-box.


Framework Landscape Comparison

To properly position llm4s, we compare it against three leading Python agent frameworks: OpenAI Agents SDK, PydanticAI, and CrewAI. Each framework takes a different approach to agent development.

Framework Overview

| Framework | Language | Primary Focus | Design Philosophy | Target Use Case |
|---|---|---|---|---|
| llm4s | Scala | Type-safe, functional agent framework | Functional purity, immutability, compile-time safety | Enterprise Scala teams, FP practitioners, mission-critical systems |
| OpenAI Agents SDK | Python | Production-ready multi-agent workflows | Practical, feature-rich, mutable sessions | Python developers building production agents |
| PydanticAI | Python | Type-safe Python agents with validation | Type safety via Pydantic, FastAPI-like DX | Python developers wanting type safety and validation |
| CrewAI | Python | Role-based multi-agent orchestration | Collaborative agents with roles, sequential/hierarchical processes | Teams building role-based agent workflows |

Core Architecture Comparison

State Management

| Framework | Approach | Pros | Cons |
|---|---|---|---|
| llm4s | Immutable AgentState with explicit threading | Pure, testable, composable | More verbose, requires manual threading |
| OpenAI SDK | Mutable Session objects | Convenient, automatic history | Hidden mutations, side effects |
| PydanticAI | Dependency injection with RunContext | Type-safe, flexible, testable | Still mutable under the hood |
| CrewAI | Crew/task state managed internally | Simple API, automatic | Opaque state, hard to debug |

llm4s Advantage: Only framework with pure functional state management.

Type Safety

| Framework | Type System | Validation | Compile-time Checking |
|---|---|---|---|
| llm4s | Scala's strong type system | Result types, case classes | ✅ Full compile-time checking |
| OpenAI SDK | Python type hints (optional) | Runtime only | ❌ Runtime validation only |
| PydanticAI | Pydantic models + type hints | ✅ Pydantic validation | ⚠️ Type hints checked by mypy, not enforced |
| CrewAI | Python type hints (minimal) | Minimal | ❌ Runtime validation only |

llm4s Advantage: Only framework with true compile-time type safety and enforcement.

Multi-Agent Orchestration

| Framework | Orchestration Model | Type Safety | Parallel Execution | Complexity Control |
|---|---|---|---|---|
| llm4s | DAG-based with typed edges Edge[A, B] | ✅ Compile-time | ✅ Batch-based | ⚠️ Requires explicit DAG construction |
| OpenAI SDK | Handoffs + agent-as-tool | ❌ Runtime | ✅ asyncio.gather | ✅ Simple delegation API |
| PydanticAI | Graph support via type hints | ⚠️ Type hints only | ✅ Async support | ✅ Flexible graph definition |
| CrewAI | Sequential / hierarchical processes | ❌ Runtime | ⚠️ Sequential by default | ✅ Role-based with manager |

llm4s Advantage: Only framework with compile-time type checking for agent composition.

CrewAI Advantage: Highest-level abstractions with role-based agents and built-in hierarchical management.

Feature Matrix

| Feature | llm4s | OpenAI SDK | PydanticAI | CrewAI |
|---|---|---|---|---|
| **Core Features** | | | | |
| Single-agent execution | ✅ | ✅ | ✅ | ✅ |
| Multi-agent orchestration | ✅ DAG | ✅ Handoffs | ✅ Graphs | ✅ Crews |
| Tool calling | ✅ | ✅ | ✅ | ✅ |
| Streaming | ⚠️ Basic | ✅ Advanced | ✅ Validated | ⚠️ Limited |
| **Type Safety** | | | | |
| Compile-time checking | ✅ | ❌ | ❌ | ❌ |
| Runtime validation | ✅ | ✅ | ✅ Pydantic | ⚠️ Minimal |
| Type-safe composition | ✅ | ❌ | ⚠️ Partial | ❌ |
| **State Management** | | | | |
| Immutable state | ✅ | ❌ | ❌ | ❌ |
| Explicit state flow | ✅ | ❌ | ⚠️ DI-based | ❌ |
| Session persistence | ⚠️ Manual | ✅ | ⚠️ Manual | ⚠️ Manual |
| Context window mgmt | ❌ | ✅ | ❌ | ❌ |
| **Validation & Safety** | | | | |
| Input guardrails | ❌ | ✅ | ✅ Pydantic | ❌ |
| Output guardrails | ❌ | ✅ | ✅ Pydantic | ❌ |
| Structured output | ✅ | ✅ | ✅ Strong | ⚠️ Basic |
| **Developer Experience** | | | | |
| Dependency injection | ❌ | ❌ | ✅ | ❌ |
| Error handling | ✅ Result | ⚠️ Exceptions | ⚠️ Exceptions | ⚠️ Exceptions |
| Debugging/tracing | ✅ Markdown | ✅ Logfire+ | ✅ Logfire | ⚠️ Basic |
| **Production Features** | | | | |
| Durable execution | ❌ | ✅ Temporal | ✅ Built-in | ❌ |
| Human-in-the-loop | ❌ | ✅ Temporal | ✅ Built-in | ⚠️ Manual |
| Model agnostic | ✅ 4 providers | ✅ 100+ | ✅ All major | ✅ LangChain models |
| Built-in tools | ⚠️ Minimal | ✅ Web/file/computer | ❌ | ⚠️ Via integrations |
| **Unique Features** | | | | |
| Workspace isolation | ✅ Docker | ❌ | ❌ | ❌ |
| MCP integration | ✅ | ⚠️ Planned | ✅ | ❌ |
| Cross-version support | ✅ 2.13/3.x | N/A | N/A | N/A |
| Role-based agents | ❌ | ❌ | ❌ | ✅ |
| Hierarchical mgmt | ⚠️ Via DAG | ❌ | ❌ | ✅ |

Design Philosophy Comparison

1. PydanticAI vs llm4s

Similarities:

  • Both prioritize type safety (PydanticAI via Pydantic, llm4s via Scala)
  • Both aim for great developer experience
  • Both are model-agnostic
  • Both have strong validation

Key Differences:

| Aspect | llm4s | PydanticAI |
|---|---|---|
| Type Safety | Compile-time (Scala) | Runtime (Pydantic) |
| State | Immutable, pure functions | Mutable with DI |
| Error Handling | Result types | Exceptions |
| Language | Scala (functional) | Python (imperative) |
| Philosophy | Correctness first | Developer experience first |

PydanticAI Advantages:

  • ✅ Dependency injection system (cleaner than manual DI)
  • ✅ Pydantic validation (industry standard in Python)
  • ✅ Durable execution built-in
  • ✅ Human-in-the-loop built-in
  • ✅ Larger Python ecosystem

llm4s Advantages:

  • ✅ True compile-time safety (catches errors before runtime)
  • ✅ Functional purity (no hidden mutations)
  • ✅ Better for mission-critical systems (immutability guarantees)
  • ✅ Workspace isolation (security)

Quote from PydanticAI docs: "Built with one simple aim: to bring that FastAPI feeling to GenAI app and agent development"

llm4s counterpart: "Build the correct agent framework for functional programming"

2. CrewAI vs llm4s

Similarities:

  • Both support multi-agent orchestration
  • Both have parallel execution capabilities
  • Both are extensible

Key Differences:

| Aspect | llm4s | CrewAI |
|---|---|---|
| Abstraction Level | Low-level (DAGs, edges) | High-level (roles, crews) |
| Orchestration | DAG-based, type-safe | Role-based, sequential/hierarchical |
| Learning Curve | Steeper (FP concepts) | Gentler (intuitive roles) |
| Control | Fine-grained | Abstracted away |
| Type Safety | Compile-time | Runtime (minimal) |

CrewAI Advantages:

  • ✅ Extremely intuitive API (roles, tasks, crews)
  • ✅ Built-in hierarchical management with manager agents
  • ✅ Sequential and hierarchical process types
  • ✅ 10M+ agents executed in production
  • ✅ Faster iteration for common patterns

llm4s Advantages:

  • ✅ Fine-grained control over agent flow
  • ✅ Type-safe agent composition (compile-time)
  • ✅ Concurrency control (maxConcurrentNodes)
  • ✅ Cancellation support (CancellationToken)
  • ✅ Predictable execution (no hidden manager logic)

CrewAI Quote: "Easily orchestrate autonomous agents through intuitive Crews"

llm4s counterpart: "Type-safe agent composition with explicit control flow"

3. OpenAI SDK vs llm4s

(See detailed comparison in main sections)

Key Distinction: OpenAI SDK optimizes for features and convenience; llm4s optimizes for correctness and functional purity.

Strategic Insights

Where Each Framework Excels

llm4s - Best For:

  • Enterprise Scala environments
  • Mission-critical systems requiring correctness guarantees
  • Teams valuing functional programming
  • Applications requiring compile-time safety
  • Long-term maintainability over rapid prototyping

OpenAI SDK - Best For:

  • Python teams needing production-ready agents quickly
  • Projects requiring extensive built-in tools (web search, file search)
  • Teams wanting Temporal integration for durability
  • Applications needing broad model provider support (100+)

PydanticAI - Best For:

  • Python teams wanting type safety and validation
  • Projects already using Pydantic/FastAPI
  • Applications needing dependency injection
  • Teams wanting FastAPI-like developer experience
  • Human-in-the-loop workflows

CrewAI - Best For:

  • Teams modeling real-world organizational structures
  • Role-based agent systems (manager, researcher, writer, etc.)
  • Sequential workflows with task delegation
  • Rapid prototyping of multi-agent systems
  • Python teams prioritizing ease of use over type safety

Competitive Positioning

Type Safety & Correctness
        ↑
        │
   llm4s│
        │                    PydanticAI
        │                         ↓
        │
        │
        ├─────────────────────────────────→
        │                              Ease of Use
        │                              & Speed
        │
        │    OpenAI SDK
        │              ↓
        │                         CrewAI
        ↓

llm4s Unique Position: The only type-safe, functional agent framework - serving the Scala/FP niche that none of the Python frameworks can address.

Key Takeaways

  1. llm4s is NOT competing directly with Python frameworks - different languages, different ecosystems, different philosophies

  2. Python frameworks converge on convenience and features; llm4s diverges toward correctness and functional purity

  3. Feature gaps are real but many features (mutable sessions, exceptions) would violate llm4s principles

  4. The right comparison is not "what features do they have?" but "what can we achieve functionally that provides equivalent value?"

  5. llm4s's target audience values compile-time safety, immutability, and functional correctness - these users won't choose Python frameworks regardless of features

Lessons for llm4s Development

From PydanticAI, we learn:

  • ✅ Dependency injection improves testability (can be done functionally with Reader monad or explicit passing)
  • ✅ Strong validation is valuable (llm4s already has this via case classes)
  • ✅ Model-agnostic design is table stakes (llm4s has 4 providers, should expand)
  • ✅ Developer experience matters (functional doesn't mean verbose - need helper methods)
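The dependency-injection lesson above can be honoured without a framework. A minimal, hypothetical sketch in the spirit of a Reader monad (the names `Deps`, `buildRequest`, and the `Result` alias are illustrative, not llm4s API):

```scala
// Hypothetical sketch: dependency injection as plain function passing.
object DiSketch {
  final case class Deps(apiKey: String, endpoint: String)

  type Result[A] = Either[String, A]

  // A "Reader" is just a function from the dependencies to a result;
  // the core logic never touches a global.
  def buildRequest(query: String): Deps => Result[String] =
    deps =>
      if (deps.apiKey.isEmpty) Left("missing API key")
      else Right(s"POST ${deps.endpoint} body=$query")

  // Wiring happens once, at the edge of the program - easy to swap in tests.
  val wired: Result[String] =
    buildRequest("hello")(Deps("key-123", "https://api.example.test"))
}
```

Because the dependency is an ordinary parameter, tests substitute a fake `Deps` with no mocking framework required.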

From CrewAI, we learn:

  • ✅ High-level abstractions attract users (consider role-based DSL on top of DAG)
  • ✅ Hierarchical workflows are common (could provide pre-built DAG patterns)
  • ✅ Simplicity wins for adoption (document common patterns extensively)
  • ⚠️ But don't sacrifice correctness for convenience

From OpenAI SDK, we learn:

  • ✅ Built-in tools are essential (need llm4s-tools module)
  • ✅ Streaming events improve UX (implement functionally as Iterators)
  • ✅ Observability integration is expected (expand beyond Langfuse)
  • ⚠️ But maintain functional purity in all implementations

llm4s Design Philosophy

Before comparing features, it's essential to understand llm4s's core design principles. These principles guide all architectural decisions and differentiate llm4s from other agent frameworks.

1. Prefer Functional and Immutable

Principle: All data structures are immutable; all operations are pure functions that return new states.

Rationale:

  • Correctness - Immutability eliminates entire classes of bugs (race conditions, unexpected mutations)
  • Testability - Pure functions are trivially testable with no setup/teardown
  • Composability - Pure functions compose naturally via for-comprehensions
  • Reasoning - Code behavior is locally understandable without tracking global state

Example:

// ❌ BAD: Mutable session (OpenAI SDK style)
session = Session()
session.add_message("Hello")  // Mutates session
result = runner.run(agent, session)  // More mutation

// ✅ GOOD: Immutable state (llm4s style)
val state1 = agent.initialize("Hello", tools)
val state2 = agent.run(state1)  // Returns new state, state1 unchanged
val state3 = agent.continueConversation(state2, "Next query")  // Pure

Implication for Feature Design:

  • Multi-turn conversations use state threading, not mutable sessions
  • Configuration is passed explicitly, not stored in mutable objects
  • All agent operations return Result[AgentState], never mutate in place

2. Framework Agnostic

Principle: Minimize dependencies on heavyweight frameworks; remain composable with any functional effect system.

Rationale:

  • Flexibility - Users can integrate with Cats Effect, ZIO, or plain Scala
  • Simplicity - Less coupling = easier to understand and maintain
  • Long-term stability - Don't tie users to framework version churn

Example:

// llm4s doesn't require cats-effect, but works seamlessly with it
import cats.effect.IO

val program: IO[AgentState] = for {
  client <- IO.fromEither(LLMConnect.fromEnv())
  state1 <- IO.fromEither(agent.run("Query", tools))
  state2 <- IO.fromEither(agent.continueConversation(state1, "Next"))
} yield state2

Implication for Feature Design:

  • Use scala.concurrent.Future for async (universally compatible)
  • Provide Result[A] (simple Either) instead of custom effect types
  • Don't force users into a specific effect system (IO, Task, etc.)

3. Simplicity Over Cleverness

Principle: APIs should be literate, clear, and properly documented. Prefer explicit over implicit.

Rationale:

  • Discoverability - New users can understand code by reading it
  • Maintainability - Clever code is hard to change; simple code is easy to evolve
  • Debugging - Explicit control flow makes debugging straightforward

Example:

// ✅ GOOD: Explicit, clear intent
val result = for {
  state1 <- agent.run("First query", tools)
  state2 <- agent.continueConversation(state1, "Second query")
} yield state2

// ❌ BAD: Too clever, hard to understand
implicit class AgentOps(state: AgentState) {
  def >>(query: String)(implicit agent: Agent): Result[AgentState] =
    agent.continueConversation(state, query)
}
val result = state >> "Next query"  // What does >> mean?

Implication for Feature Design:

  • Descriptive method names (continueConversation, not continue or +)
  • Avoid operator overloading for domain operations
  • Comprehensive ScalaDoc on all public APIs
  • Examples in documentation showing common use cases

4. Principle of Least Surprise

Principle: Follow established conventions; behave as users would expect.

Rationale:

  • Learnability - Users can leverage existing knowledge
  • Trust - Predictable behavior builds confidence
  • Productivity - Less time reading docs, more time building

Example:

// ✅ Expected: Conversation grows with each message
val state1 = agent.initialize("Hello", tools)
state1.conversation.messageCount  // 1

val state2 = state1.copy(
  conversation = state1.conversation.addMessage(UserMessage("Hi again"))
)
state2.conversation.messageCount  // 2 ✓ As expected

// ❌ Surprising: Mutating would violate immutability
// state1.conversation.messages += UserMessage("...")  // Doesn't compile ✓

Implication for Feature Design:

  • Immutable collections behave as expected (returns new collection)
  • Method names follow Scala conventions (map, flatMap, fold, etc.)
  • Error handling via Either (standard Scala pattern)
  • No magic behavior or hidden side effects
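As a concrete illustration of the Either convention, a standalone sketch (here `Result` is modelled as a plain type alias; the real llm4s error type differs):

```scala
object ResultSketch {
  type Result[A] = Either[String, A]

  // Errors are ordinary values, never thrown.
  def parsePositive(s: String): Result[Int] =
    s.toIntOption match {
      case Some(n) if n > 0 => Right(n)
      case Some(_)          => Left(s"not positive: $s")
      case None             => Left(s"not a number: $s")
    }

  // map/flatMap/fold behave exactly as on any Either - no surprises.
  val doubled: Result[Int] = parsePositive("21").map(_ * 2)
  val message: String      = doubled.fold(err => s"error: $err", n => s"ok: $n")
}
```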

Design Philosophy Summary

| Principle | What It Means | How It Differs from OpenAI SDK |
|---|---|---|
| Functional & Immutable | All data immutable, operations pure | OpenAI uses mutable Session objects |
| Framework Agnostic | Works with any effect system | OpenAI is Python-specific, asyncio-based |
| Simplicity Over Cleverness | Explicit, well-documented APIs | Both SDKs value simplicity |
| Least Surprise | Follow Scala conventions | OpenAI follows Python conventions |

Key Insight: llm4s prioritizes correctness and composability over convenience. However, through careful API design, we achieve both - functional purity AND ergonomic developer experience.

Reference: See Phase 1.1: Functional Conversation Management for detailed application of these principles to multi-turn conversations.


Detailed Feature Comparison

1. Core Agent Primitives

| Feature | llm4s | OpenAI Agents SDK | Notes |
|---|---|---|---|
| Agent Definition | ✅ Agent class with client injection | ✅ Agent with instructions, tools, handoffs | Similar concepts |
| Tool Calling | ✅ ToolRegistry with type-safe tools | ✅ Function tools with Pydantic validation | llm4s has good type safety |
| System Prompts | ✅ SystemMessage support | ✅ Instructions field | Equivalent |
| Completion Options | ✅ CompletionOptions (temp, maxTokens, etc.) | ✅ ModelSettings (reasoning, temp, etc.) | OpenAI has reasoning modes |
| Agent State | ✅ AgentState with conversation + status | ✅ Implicit via session | Different approaches |

2. Multi-Agent Orchestration

| Feature | llm4s | OpenAI Agents SDK | Notes |
|---|---|---|---|
| Orchestration Pattern | ✅ DAG-based with PlanRunner | ✅ Handoffs + agent-as-tool | Different paradigms |
| Type Safety | ✅ Compile-time type checking | ⚠️ Runtime validation | llm4s advantage |
| Parallel Execution | ✅ Batch-based parallelism | ✅ asyncio.gather support | Similar |
| Sequential Execution | ✅ Topological ordering | ✅ Control flow in code | Similar |
| Agent Delegation | ⚠️ Manual via DAG edges | ✅ Native handoffs | OpenAI cleaner API |
| Concurrency Control | ✅ maxConcurrentNodes | ⚠️ Manual with asyncio | llm4s advantage |
| Cancellation | ✅ CancellationToken | ⚠️ Not documented | llm4s advantage |

3. Session & State Management

| Feature | llm4s | OpenAI Agents SDK | Notes |
|---|---|---|---|
| Conversation History | ✅ Manual via AgentState.conversation | ✅ Automatic via Session | GAP: No auto-session |
| Session Persistence | ❌ Not built-in | ✅ Built-in with .to_input_list() | GAP: Need session storage |
| Multi-Turn Support | ⚠️ Manual state threading | ✅ Automatic across runs | GAP: Manual effort |
| Session Serialization | ⚠️ Partial (ujson support) | ✅ Full support | GAP: Incomplete |
| Context Management | ⚠️ Manual message pruning | ✅ Automatic with sessions | GAP: No auto-pruning |

4. Guardrails & Validation

| Feature | llm4s | OpenAI Agents SDK | Notes |
|---|---|---|---|
| Input Validation | ⚠️ Manual via Result | ✅ Input guardrails | GAP: No framework |
| Output Validation | ⚠️ Manual via Result | ✅ Output guardrails | GAP: No framework |
| Parallel Validation | ❌ Not supported | ✅ Runs in parallel | GAP: Need framework |
| Debounced Validation | ❌ Not supported | ✅ For realtime agents | GAP: For streaming |
| Safety Checks | ⚠️ Manual implementation | ✅ Configurable framework | GAP: Need declarative API |

5. Tool Ecosystem

| Feature | llm4s | OpenAI Agents SDK | Notes |
|---|---|---|---|
| Custom Tools | ✅ ToolFunction with schema gen | ✅ Function tools with Pydantic | Similar |
| Tool Registry | ✅ ToolRegistry | ✅ Agent.tools list | Similar |
| Tool Execution | ✅ Synchronous | ✅ Sync and async | OpenAI more flexible |
| Web Search | ❌ Not built-in | ✅ WebSearchTool | GAP: No built-in |
| File Search | ❌ Not built-in | ✅ FileSearchTool with vector stores | GAP: No built-in |
| Computer Use | ❌ Not built-in | ✅ ComputerTool (preview) | GAP: No built-in |
| MCP Support | ✅ Via integration | ⚠️ Not documented | llm4s advantage |
| Tool Error Handling | ✅ Result-based | ✅ Exception-based | Different approaches |

6. Streaming

| Feature | llm4s | OpenAI Agents SDK | Notes |
|---|---|---|---|
| Streaming Support | ⚠️ Basic via StreamResult | ✅ run_streamed() | GAP: Limited |
| Token-level Events | ❌ Not supported | ✅ RawResponsesStreamEvent | GAP: Need fine-grained |
| Item-level Events | ❌ Not supported | ✅ RunItemStreamEvents | GAP: Need coarse-grained |
| Progress Updates | ⚠️ Via logs only | ✅ Via stream events | GAP: Need event system |
| Partial Responses | ❌ Not supported | ✅ Via deltas | GAP: Need delta support |

7. Observability & Tracing

| Feature | llm4s | OpenAI Agents SDK | Notes |
|---|---|---|---|
| Built-in Tracing | ✅ Langfuse integration | ✅ Automatic + extensible | Similar |
| Markdown Traces | ✅ writeTraceLog() | ❌ Not built-in | llm4s advantage |
| Structured Logging | ✅ SLF4J with MDC | ✅ Standard logging | Similar |
| External Integrations | ⚠️ Langfuse only | ✅ Logfire, AgentOps, Braintrust, etc. | GAP: Fewer integrations |
| Custom Spans | ⚠️ Not documented | ✅ Supported | GAP: Need custom spans |
| Debug Mode | ✅ debug flag | ⚠️ Not documented | llm4s advantage |

8. Production Features

| Feature | llm4s | OpenAI Agents SDK | Notes |
|---|---|---|---|
| Durable Execution | ❌ Not supported | ✅ Temporal integration | GAP: No workflow engine |
| Human-in-the-Loop | ❌ Not supported | ✅ Via Temporal | GAP: No HITL framework |
| Automatic Retries | ⚠️ Manual via client | ⚠️ Manual | Similar |
| State Recovery | ❌ Not supported | ✅ Via Temporal | GAP: No crash recovery |
| Long-running Tasks | ⚠️ Limited by timeouts | ✅ Via Temporal | GAP: No persistence |
| Workspace Isolation | ✅ Docker containers | ❌ Not built-in | llm4s advantage |

9. Configuration & Flexibility

| Feature | llm4s | OpenAI Agents SDK | Notes |
|---|---|---|---|
| Multi-Provider Support | ✅ OpenAI, Anthropic, Azure, Ollama | ✅ 100+ providers | OpenAI broader support |
| Configuration System | ✅ ConfigReader (type-safe) | ⚠️ Standard env vars | llm4s advantage |
| Model Selection | ✅ Per-request override | ✅ Per-agent config | Similar |
| Temperature Control | ✅ CompletionOptions | ✅ ModelSettings | Similar |
| Reasoning Modes | ❌ Not supported | ✅ none/low/medium/high | GAP: No reasoning config |
| Cross-version Support | ✅ Scala 2.13 & 3.x | N/A (Python-only) | llm4s advantage |

Gap Analysis

Critical Gaps (High Priority)

1. Conversation Management ⭐⭐⭐⭐⭐

Gap: llm4s lacks ergonomic APIs for multi-turn conversations while maintaining functional purity.

Impact:

  • More verbose multi-turn conversation code
  • No continuation helper methods
  • No automatic context window management
  • Samples show imperative patterns (using var)

OpenAI Approach (Mutable Sessions):

# Mutable session object
session = Session()
result1 = runner.run(agent, "What's the weather?", session=session)
result2 = runner.run(agent, "And tomorrow?", session=session)  # Mutates session

llm4s Current (Verbose but Functional):

// Manual state threading - verbose
val state1 = agent.initialize(query1, tools)
val result1 = agent.run(state1, ...)
// Must manually construct continuation
val state2 = result1.map(s => s.copy(
  conversation = s.conversation.addMessage(UserMessage(query2)),
  status = AgentStatus.InProgress
))
val result2 = state2.flatMap(agent.run(_, ...))

Proposed Solution (Functional & Ergonomic):

// Functional state threading with helper methods
val result = for {
  state1 <- agent.run("What's the weather?", tools)
  state2 <- agent.continueConversation(state1, "And tomorrow?")  // Pure function!
} yield state2

Design Philosophy Alignment:

  • ❌ NO mutable Session objects (violates functional principle)
  • ✅ YES pure functions that return new states
  • ✅ YES helper methods for common patterns (continueConversation, runMultiTurn)
  • ✅ YES explicit state flow via for-comprehensions
  • ✅ YES context window management as pure functions (returns new state)

Recommendation: Implement functional conversation APIs (see Phase 1.1 Design).


2. Guardrails Framework ⭐⭐⭐⭐⭐

Gap: No declarative validation framework for input/output safety.

Impact:

  • Manual validation increases code complexity
  • No standardized approach to safety checks
  • Harder to compose and reuse validation logic

OpenAI Advantage:

# Declarative validation
agent = Agent(
    input_guardrails=[ProfanityFilter(), LengthCheck(max=1000)],
    output_guardrails=[FactCheck(), ToneValidator()]
)

llm4s Current:

// Manual validation
def validateInput(input: String): Result[String] =
  if (input.contains("badword")) Left(ValidationError("..."))
  else Right(input)

Recommendation: Build Guardrail trait with composable validators.
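A minimal, runnable sketch of what such a composable trait could look like (names and signatures are illustrative only; the final API is specified in Phase 1.2):

```scala
object GuardrailSketch {
  type Result[A] = Either[String, A]

  // Single abstract method, so Scala's SAM conversion lets us write
  // guardrails as plain lambdas.
  trait Guardrail[A] { self =>
    def validate(value: A): Result[A]

    // Sequential composition: run this check, then the next one.
    def andThen(next: Guardrail[A]): Guardrail[A] =
      (value: A) => self.validate(value).flatMap(next.validate)
  }

  val nonEmpty: Guardrail[String] =
    s => if (s.trim.nonEmpty) Right(s) else Left("input is empty")

  val maxLength: Guardrail[String] =
    s => if (s.length <= 1000) Right(s) else Left("input too long")

  // Validators compose into new validators - no framework machinery.
  val inputCheck: Guardrail[String] = nonEmpty.andThen(maxLength)
}
```

Because each guardrail is a pure function `A => Result[A]`, composition is just `flatMap`, and failures short-circuit with an explicit error value.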


3. Streaming Events ⭐⭐⭐⭐

Gap: Limited streaming support with no event system.

Impact:

  • Poor UX for long-running agents (no progress updates)
  • Cannot show partial responses to users
  • No fine-grained control over streaming behavior

OpenAI Advantage:

# Rich streaming events
for event in runner.run_streamed(agent, prompt):
    if event.type == "output_text.delta":
        print(event.data, end="")
    elif event.type == "tool_call.started":
        print(f"\n[Tool: {event.data.tool_name}]")

llm4s Current:

// Limited to basic streaming
val stream: Iterator[String] = client.streamComplete(...)
stream.foreach(println)  // No event types, just raw text

Recommendation: Implement event-based streaming with multiple event types.
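One way to express such an event system functionally, sketched with hypothetical event names (the real hierarchy would live in the streaming module):

```scala
object StreamingSketch {
  sealed trait AgentEvent
  final case class TextDelta(text: String)       extends AgentEvent
  final case class ToolCallStarted(name: String) extends AgentEvent

  // Consumers pattern-match on the event type; the sealed hierarchy
  // makes unhandled event kinds a compile-time exhaustivity warning.
  def render(events: Iterator[AgentEvent]): String =
    events.map {
      case TextDelta(t)       => t
      case ToolCallStarted(n) => s"\n[Tool: $n]"
    }.mkString

  val demo: String = render(Iterator(
    TextDelta("Weather is "),
    ToolCallStarted("get_weather"),
    TextDelta("sunny")
  ))
}
```

An `Iterator[AgentEvent]` keeps the API framework agnostic: it adapts trivially to fs2/ZIO streams or a plain while loop.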


4. Built-in Tools ⭐⭐⭐⭐

Gap: No production-ready tools for common tasks (web search, file search).

Impact:

  • Users must implement common tools from scratch
  • Inconsistent quality of tool implementations
  • Longer time-to-production for agent applications

OpenAI Advantage:

  • WebSearchTool (ChatGPT search quality)
  • FileSearchTool (vector store integration)
  • ComputerTool (screen automation)

llm4s Current:

  • WeatherTool (demo only)
  • Users implement custom tools

Recommendation: Build llm4s-tools module with production-grade tools.


5. Durable Execution ⭐⭐⭐⭐

Gap: No integration with workflow engines for long-running tasks.

Impact:

  • Agents cannot survive crashes or restarts
  • No support for multi-day workflows
  • Human-in-the-loop patterns require custom infrastructure

OpenAI Advantage:

# Temporal integration for durability
@workflow
async def approval_workflow(request):
    result = await runner.run(agent, request)
    approved = await human_approval(result)  # Can wait days
    if approved:
        return await runner.run(executor_agent, result)

llm4s Current:

  • No workflow engine integration
  • Manual state persistence required
  • No HITL framework

Recommendation: Explore integration with Camunda, Temporal, or build native workflow support.
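Until such an engine is integrated, durable behaviour can be approximated by checkpointing the immutable state after each step. A toy sketch, with an in-memory store standing in for real persistence (all names hypothetical):

```scala
object CheckpointSketch {
  // Simplified stand-in for the real AgentState.
  final case class AgentState(step: Int, log: Vector[String])

  // Real code would serialize to a database; an immutable Map models the idea.
  final case class Store(snapshots: Map[String, AgentState]) {
    def save(id: String, s: AgentState): Store = copy(snapshots + (id -> s))
    def load(id: String): Option[AgentState]   = snapshots.get(id)
  }

  def runStep(s: AgentState): AgentState =
    AgentState(s.step + 1, s.log :+ s"completed step ${s.step + 1}")

  // After a crash, resume from the last snapshot instead of step 0.
  val store0  = Store(Map.empty)
  val s1      = runStep(AgentState(0, Vector.empty))
  val store1  = store0.save("run-42", s1)
  val resumed = store1.load("run-42").map(runStep)
}
```

Because `AgentState` is already immutable, every snapshot is a consistent point-in-time value, which is exactly what workflow engines like Temporal rely on.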


Moderate Gaps (Medium Priority)

6. Handoff Mechanism ⭐⭐⭐

Gap: No native API for agent-to-agent delegation.

Current: Must explicitly model handoffs as DAG edges or tool calls.

Recommendation: Add Handoff tool type for cleaner delegation semantics.


7. Observability Integrations ⭐⭐⭐

Gap: Limited to Langfuse only.

OpenAI Support: Logfire, AgentOps, Braintrust, Scorecard, Keywords AI

Recommendation: Build plugin architecture for observability backends.


8. Reasoning Modes ⭐⭐⭐

Gap: No support for configuring reasoning effort (none/low/medium/high).

Impact: Cannot optimize latency vs. quality tradeoff for reasoning models.

Recommendation: Add reasoning field to CompletionOptions.
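A sketch of the suggested addition (the field name and ADT are illustrative; the real `CompletionOptions` has more fields):

```scala
object ReasoningSketch {
  // Mirrors the none/low/medium/high levels exposed by reasoning models.
  sealed trait ReasoningEffort
  object ReasoningEffort {
    case object None   extends ReasoningEffort
    case object Low    extends ReasoningEffort
    case object Medium extends ReasoningEffort
    case object High   extends ReasoningEffort
  }

  // Existing options plus an optional reasoning field; providers that
  // do not support reasoning simply ignore it.
  final case class CompletionOptions(
    temperature: Double = 0.7,
    maxTokens: Option[Int] = None,
    reasoning: Option[ReasoningEffort] = None
  )

  val opts = CompletionOptions(reasoning = Some(ReasoningEffort.High))
}
```

Making the field an `Option` keeps existing call sites source-compatible while letting callers trade latency for quality explicitly.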


Minor Gaps (Low Priority)

9. Provider Breadth ⭐⭐

Gap: Supports 4 providers vs. OpenAI's 100+.

Impact: Limited for users wanting niche models.

Recommendation: Consider LiteLLM integration for broader provider support.


10. Async Tool Execution ⭐⭐

Gap: Tools are synchronous only.

Impact: Blocking I/O in tools can slow down agent execution.

Recommendation: Support AsyncResult in ToolFunction.
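A sketch of a Future-based tool signature (hypothetical; the real `ToolFunction` API differs):

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object AsyncToolSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  type Result[A]      = Either[String, A]
  type AsyncResult[A] = Future[Result[A]]

  // An async tool returns a Future immediately; the agent loop can
  // launch several tool calls concurrently and sequence the results.
  def fetchWeather(city: String): AsyncResult[String] =
    Future(Right(s"sunny in $city")) // real code would do non-blocking I/O

  // Both futures start before either is awaited, so they run concurrently.
  val combined: AsyncResult[(String, String)] = {
    val paris = fetchWeather("Paris")
    val tokyo = fetchWeather("Tokyo")
    for { a <- paris; b <- tokyo } yield
      for { x <- a; y <- b } yield (x, y)
  }

  val demo: Result[(String, String)] = Await.result(combined, 5.seconds)
}
```

Using plain `scala.concurrent.Future` keeps the design framework agnostic: Cats Effect and ZIO users can lift it into their own effect types.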


Unique llm4s Strengths

  1. Type Safety ⭐⭐⭐⭐⭐
    • Compile-time type checking for agent composition
    • Type-safe DAG construction with Edge[A, B]
    • Superior to Python's runtime validation
  2. Result-based Error Handling ⭐⭐⭐⭐
    • Explicit error handling via Result[A]
    • No hidden exceptions
    • Easier to reason about failure modes
  3. Workspace Isolation ⭐⭐⭐⭐
    • Docker-based workspace for tool execution
    • Security advantage over OpenAI SDK
    • Production-ready sandboxing
  4. MCP Integration ⭐⭐⭐
    • Native Model Context Protocol support
    • Standardized tool sharing across providers
  5. Cross-version Support ⭐⭐⭐
    • Scala 2.13 and 3.x compatibility
    • Valuable for enterprise Scala users
  6. Configuration System ⭐⭐⭐
    • Type-safe ConfigReader
    • Better than raw environment variables
    • Centralized configuration management
  7. Markdown Trace Logs ⭐⭐⭐
    • Built-in writeTraceLog() for debugging
    • Human-readable execution traces
    • Useful for development and debugging

Implementation Roadmap

Phase 1: Core Usability (Q1 2026 - 3 months)

Goal: Improve developer experience for multi-turn conversations while maintaining functional purity.

Design Philosophy Applied:

  • All APIs remain pure functions (no mutable sessions)
  • Helper methods reduce boilerplate while maintaining explicit state flow
  • Framework agnostic - works with plain Scala, Cats Effect, ZIO, etc.
  • Simple, well-documented APIs following principle of least surprise

1.1 Functional Conversation APIs ⭐⭐⭐⭐⭐

Effort: 2-3 weeks

Deliverables:

package org.llm4s.agent

// Pure continuation API
class Agent(client: LLMClient) {
  /**
   * Continue a conversation with a new user message.
   * Pure function - returns new state, does not mutate.
   */
  def continueConversation(
    previousState: AgentState,
    newUserMessage: String,
    maxSteps: Option[Int] = None,
    contextWindowConfig: Option[ContextWindowConfig] = None
  ): Result[AgentState]

  /**
   * Run multiple turns sequentially using functional fold.
   * No mutable state required.
   */
  def runMultiTurn(
    initialQuery: String,
    followUpQueries: Seq[String],
    tools: ToolRegistry,
    maxStepsPerTurn: Option[Int] = None
  ): Result[AgentState]
}

// Context window management (pure functions)
case class ContextWindowConfig(
  maxTokens: Option[Int] = None,
  maxMessages: Option[Int] = None,
  preserveSystemMessage: Boolean = true,
  pruningStrategy: PruningStrategy = PruningStrategy.OldestFirst
)

object AgentState {
  /**
   * Prune conversation - returns new state, does not mutate.
   */
  def pruneConversation(
    state: AgentState,
    config: ContextWindowConfig
  ): AgentState
}
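To make the pruning semantics concrete, here is a standalone sketch of the oldest-first strategy over a simplified conversation (real llm4s message types differ; this assumes the system message, if any, sits at the head of the list):

```scala
object PruneSketch {
  // Simplified stand-in: messages as strings.
  final case class Conversation(messages: Vector[String])

  // Pure pruning: keep the system message (if requested) plus the most
  // recent maxMessages entries; the input conversation is never mutated.
  def pruneOldestFirst(
    c: Conversation,
    maxMessages: Int,
    preserveSystemMessage: Boolean = true
  ): Conversation = {
    val (system, rest) =
      if (preserveSystemMessage) c.messages.splitAt(1)
      else (Vector.empty[String], c.messages)
    val budget = math.max(0, maxMessages - system.size)
    Conversation(system ++ rest.takeRight(budget))
  }

  val before = Conversation(Vector("system", "u1", "a1", "u2", "a2"))
  val after  = pruneOldestFirst(before, maxMessages = 3)
}
```

Because pruning returns a new `Conversation`, callers can keep the full history around (for tracing) while sending only the pruned view to the model.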

Testing:

  • Multi-turn conversation flows (all functional)
  • Context window pruning strategies
  • State serialization/deserialization
  • Integration with effect systems (IO, Task)

Documentation:

  • Functional conversation management guide
  • Context window management tutorial
  • Migration from imperative to functional style
  • Examples showing composition with Cats Effect, ZIO

Reference: See Phase 1.1 Design Document


1.2 Guardrails Framework ⭐⭐⭐⭐⭐

Effort: 2-3 weeks

Deliverables:

package org.llm4s.agent.guardrails

trait Guardrail[A] {
  def validate(value: A): Result[A]
  def name: String
  def description: Option[String] = None
}

trait InputGuardrail extends Guardrail[String]
trait OutputGuardrail extends Guardrail[String]

// Built-in guardrails
class ProfanityFilter extends InputGuardrail with OutputGuardrail
class LengthCheck(min: Int, max: Int) extends InputGuardrail
class JSONValidator(schema: JsonSchema) extends OutputGuardrail
class RegexValidator(pattern: Regex) extends Guardrail[String]

// Composable validators
class CompositeGuardrail[A](
  guardrails: Seq[Guardrail[A]],
  mode: ValidationMode = ValidationMode.All  // All, Any, First
) extends Guardrail[A]

// Enhanced Agent API
class Agent(client: LLMClient) {
  def run(
    query: String,
    tools: ToolRegistry,
    inputGuardrails: Seq[InputGuardrail] = Seq.empty,   // NEW
    outputGuardrails: Seq[OutputGuardrail] = Seq.empty  // NEW
  ): Result[AgentState]
}
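
A custom guardrail then composes with the built-ins through `CompositeGuardrail`. The sketch below is illustrative only: the error constructor (`ValidationError`) is an assumption, not existing llm4s API.

```scala
// Sketch: a custom input guardrail plus composition (error type assumed).
object NoCredentialsGuardrail extends InputGuardrail {
  val name = "no-credentials"
  private val secretPattern = "(?i)(api[_-]?key|password)\\s*[:=]".r

  def validate(value: String): Result[String] =
    if (secretPattern.findFirstIn(value).isDefined)
      Left(ValidationError(s"[$name] input appears to contain a credential"))
    else Right(value)
}

val inputChecks = new CompositeGuardrail[String](
  guardrails = Seq(new LengthCheck(1, 4000), NoCredentialsGuardrail),
  mode       = ValidationMode.All // every guardrail must pass
)
```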

Testing:

  • Individual guardrail validation
  • Composite guardrail logic
  • Parallel validation execution
  • Guardrail error aggregation

Documentation:

  • Guardrails user guide
  • Custom guardrail tutorial
  • Best practices for safety validation

1.3 Handoff Mechanism ⭐⭐⭐⭐

Effort: 1-2 weeks

Deliverables:

package org.llm4s.agent

case class Handoff(
  targetAgent: Agent,
  transferReason: Option[String] = None,
  preserveContext: Boolean = true
)

// Enhanced Agent API
class Agent(client: LLMClient) {
  def initialize(
    query: String,
    tools: ToolRegistry,
    handoffs: Seq[Handoff] = Seq.empty  // NEW
  ): AgentState
}

// Handoff execution in agent loop
sealed trait AgentStatus
object AgentStatus {
  case object InProgress extends AgentStatus
  case object WaitingForTools extends AgentStatus
  case class HandoffRequested(handoff: Handoff) extends AgentStatus  // NEW
  case object Complete extends AgentStatus
  case class Failed(error: String) extends AgentStatus
}
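
Driving handoffs from the outside could then be a small recursive function. This is a sketch under several assumptions (a `run(state)` overload, a `Conversation.empty` helper, and an `AgentError` constructor), shown mainly to illustrate loop prevention via a depth limit:

```scala
// Sketch: execute handoffs with a depth limit to prevent delegation loops.
def runWithHandoffs(
  agent: Agent,
  state: AgentState,
  remainingHandoffs: Int = 5
): Result[AgentState] =
  agent.run(state).flatMap { completed => // run(state) overload assumed
    completed.status match {
      case AgentStatus.HandoffRequested(h) if remainingHandoffs > 0 =>
        val next =
          if (h.preserveContext) completed
          else completed.copy(conversation = Conversation.empty) // helper assumed
        runWithHandoffs(h.targetAgent, next, remainingHandoffs - 1)
      case AgentStatus.HandoffRequested(_) =>
        Left(AgentError("handoff depth limit exceeded")) // error type assumed
      case _ =>
        Right(completed)
    }
  }
```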

Testing:

  • Single handoff execution
  • Chained handoffs
  • Context preservation across handoffs
  • Handoff loop prevention

Documentation:

  • Handoff patterns guide
  • Multi-agent coordination examples
  • Comparison with DAG orchestration

Phase 2: Streaming & Events (Q2 2026 - 2 months)

Goal: Enable real-time UX with fine-grained progress updates.

2.1 Event-based Streaming ⭐⭐⭐⭐⭐

Effort: 3-4 weeks

Deliverables:

package org.llm4s.agent.streaming

sealed trait AgentEvent {
  def timestamp: Instant
  def eventId: String
}

object AgentEvent {
  // Token-level events
  case class TextDelta(delta: String, ...) extends AgentEvent
  case class ToolCallStarted(toolName: String, toolCallId: String, ...) extends AgentEvent
  case class ToolCallCompleted(toolCallId: String, result: ujson.Value, ...) extends AgentEvent

  // Item-level events
  case class MessageGenerated(message: Message, ...) extends AgentEvent
  case class StepCompleted(stepIndex: Int, ...) extends AgentEvent

  // Status events
  case class AgentStarted(...) extends AgentEvent
  case class AgentCompleted(finalState: AgentState, ...) extends AgentEvent
  case class AgentFailed(error: LLMError, ...) extends AgentEvent
}

class Agent(client: LLMClient) {
  def runStreamed(
    query: String,
    tools: ToolRegistry,
    ...
  ): Iterator[Result[AgentEvent]]  // NEW
}
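
Consuming the stream would then be ordinary iterator processing; a sketch against the proposed event types (field names follow the definitions above):

```scala
// Sketch: rendering a streamed agent run to the console.
agent.runStreamed("Summarize the quarterly report", tools).foreach {
  case Right(e: AgentEvent.TextDelta)       => print(e.delta)
  case Right(e: AgentEvent.ToolCallStarted) => println(s"\n[calling ${e.toolName}...]")
  case Right(_: AgentEvent.AgentCompleted)  => println("\n[done]")
  case Right(_)                             => () // ignore other event types here
  case Left(error)                          => println(s"\n[stream error: $error]")
}
```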

Testing:

  • Event ordering guarantees
  • Backpressure handling
  • Event filtering and transformation
  • Stream error recovery

Documentation:

  • Streaming events guide
  • Building real-time UIs
  • Event handling patterns

2.2 Async Tool Execution ⭐⭐⭐

Effort: 1-2 weeks

Deliverables:

package org.llm4s.toolapi

trait AsyncToolFunction {
  def execute(request: ToolCallRequest): AsyncResult[ujson.Value]
  def schema: ToolSchema
  def name: String
}

// Enhanced ToolRegistry
class ToolRegistry(
  syncTools: Seq[ToolFunction],
  asyncTools: Seq[AsyncToolFunction]  // NEW
)(implicit ec: ExecutionContext)
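
An async tool could wrap any Future-producing computation. The sketch below assumes `AsyncResult[A]` is (something like) `Future[Result[A]]` and that `ToolCallRequest` exposes its arguments as ujson; both are assumptions about the proposed API, not existing llm4s types.

```scala
// Sketch: an AsyncToolFunction that does its work off the calling thread.
class WeatherLookupTool(implicit ec: ExecutionContext) extends AsyncToolFunction {
  val name = "get_weather"
  def schema: ToolSchema = ??? // schema DSL elided for brevity

  def execute(request: ToolCallRequest): AsyncResult[ujson.Value] =
    Future {
      val city = request.arguments("city").str // arguments shape assumed
      Right(ujson.Obj("city" -> city, "conditions" -> "sunny")) // stubbed response
    }
}
```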

Testing:

  • Async tool execution
  • Concurrent tool calls
  • Timeout handling
  • Error propagation

Phase 3: Production Features (Q3 2026 - 3 months)

Goal: Enterprise-grade reliability and durability.

3.1 Workflow Engine Integration ⭐⭐⭐⭐⭐

Effort: 4-6 weeks

Deliverables:

package org.llm4s.agent.workflow

trait WorkflowEngine {
  def startWorkflow[I, O](
    workflow: Workflow[I, O],
    input: I
  ): AsyncResult[WorkflowExecution[O]]

  def resumeWorkflow[O](
    executionId: WorkflowExecutionId
  ): AsyncResult[WorkflowExecution[O]]
}

// Camunda integration (preferred for Scala ecosystem)
class CamundaWorkflowEngine(camunda: CamundaClient) extends WorkflowEngine

// Human-in-the-loop support
trait HumanTask[I, O] {
  def submit(input: I): AsyncResult[TaskId]
  def await(taskId: TaskId): AsyncResult[O]
}
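
A human-in-the-loop step would then compose like any other async operation; a sketch that assumes `AsyncResult` supports `map`/`flatMap` (e.g. as a Future-based type), with placeholder domain types:

```scala
// Sketch: gating a workflow step on human approval. Draft/Approval are
// placeholder domain types; AsyncResult is assumed to compose monadically.
def approveDraft(
  approvals: HumanTask[Draft, Approval],
  draft: Draft
): AsyncResult[Approval] =
  for {
    taskId   <- approvals.submit(draft)  // surfaces the task to a reviewer
    decision <- approvals.await(taskId)  // suspends until the human responds
  } yield decision
```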

Testing:

  • Workflow persistence
  • Crash recovery
  • Long-running workflows (days)
  • Human approval flows

Documentation:

  • Workflow integration guide
  • HITL patterns
  • Durable agent examples

3.2 Built-in Tools Module ⭐⭐⭐⭐

Effort: 4-6 weeks

Deliverables:

package org.llm4s.toolapi.builtin

// Web search via multiple providers
trait WebSearchTool extends AsyncToolFunction {
  def search(query: String): AsyncResult[SearchResults]
}

class BraveSearchTool(apiKey: ApiKey) extends WebSearchTool
class GoogleSearchTool(apiKey: ApiKey, cseId: String) extends WebSearchTool
class DuckDuckGoSearchTool() extends WebSearchTool  // Free, no API key

// Vector store / file search
trait VectorSearchTool extends AsyncToolFunction {
  def search(query: String, topK: Int): AsyncResult[Seq[Document]]
}

class PineconeSearchTool(pinecone: PineconeClient) extends VectorSearchTool
class WeaviateSearchTool(weaviate: WeaviateClient) extends VectorSearchTool
class LocalVectorSearchTool(embeddings: EmbeddingClient) extends VectorSearchTool

// Filesystem tools
object FileSystemTools {
  val readFile: ToolFunction = ...
  val writeFile: ToolFunction = ...
  val listDirectory: ToolFunction = ...
}

// HTTP tools
class HTTPTool extends AsyncToolFunction {
  def get(url: String): AsyncResult[HTTPResponse]
  def post(url: String, body: ujson.Value): AsyncResult[HTTPResponse]
}
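
With async tools in place (Phase 2.2), built-ins and custom tools could share one registry; a sketch against the proposed constructors above:

```scala
// Sketch: one registry mixing built-in sync and async tools.
implicit val ec: ExecutionContext = ExecutionContext.global

val registry = new ToolRegistry(
  syncTools  = Seq(FileSystemTools.readFile, FileSystemTools.listDirectory),
  asyncTools = Seq(new DuckDuckGoSearchTool(), new HTTPTool())
)
```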

Testing:

  • Integration tests with real APIs
  • Error handling for API failures
  • Rate limiting and retries
  • Tool safety (e.g., filesystem access limits)

Documentation:

  • Built-in tools catalog
  • Tool configuration guide
  • Safety and sandboxing recommendations

3.3 Enhanced Observability ⭐⭐⭐

Effort: 2-3 weeks

Deliverables:

package org.llm4s.trace

trait TracingBackend {
  def trace(span: Span): Result[Unit]
  def flush(): Result[Unit]
}

// New integrations
class LogfireBackend(config: LogfireConfig) extends TracingBackend
class AgentOpsBackend(config: AgentOpsConfig) extends TracingBackend
class BraintrustBackend(config: BraintrustConfig) extends TracingBackend

// Plugin architecture
class CompositeTracingBackend(backends: Seq[TracingBackend]) extends TracingBackend

// Custom spans
class Agent(client: LLMClient) {
  def runWithSpans(
    query: String,
    tools: ToolRegistry,
    customSpans: Seq[CustomSpan] = Seq.empty  // NEW
  ): Result[AgentState]
}
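
Fanning spans out to multiple observability vendors is then a single composition (sketch; the config values are assumed to be constructed elsewhere):

```scala
// Sketch: send every span to two backends at once via the composite.
val tracing: TracingBackend = new CompositeTracingBackend(Seq(
  new LogfireBackend(logfireConfig),   // configs assumed in scope
  new AgentOpsBackend(agentOpsConfig)
))
```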

Testing:

  • Multi-backend tracing
  • Custom span integration
  • Performance overhead measurement

Phase 4: Advanced Features (Q4 2026 - 2 months)

Goal: Reach, and where possible exceed, feature parity with the OpenAI SDK.

4.1 Reasoning Modes ⭐⭐⭐

Effort: 1 week

Deliverables:

package org.llm4s.llmconnect.model

sealed trait ReasoningEffort
object ReasoningEffort {
  case object None extends ReasoningEffort
  case object Minimal extends ReasoningEffort
  case object Low extends ReasoningEffort
  case object Medium extends ReasoningEffort
  case object High extends ReasoningEffort
}

case class CompletionOptions(
  temperature: Option[Double] = None,
  maxTokens: Option[Int] = None,
  reasoning: Option[ReasoningEffort] = None,  // NEW
  ...
)
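
Usage would be a one-line addition to existing completion calls; a sketch in which the `client.complete` signature is assumed from the existing llm4s API:

```scala
// Sketch: asking a reasoning-capable model to think harder on this request.
val options = CompletionOptions(
  maxTokens = Some(2048),
  reasoning = Some(ReasoningEffort.High) // ignored by non-reasoning models
)
val completion = client.complete(conversation, options)
```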

4.2 Provider Expansion ⭐⭐

Effort: 2-3 weeks

Deliverables:

// Litellm integration for 100+ providers
class LiteLLMClient(config: LiteLLMConfig) extends LLMClient

// Or direct integrations
class CohereClient(config: CohereConfig) extends LLMClient
class MistralClient(config: MistralConfig) extends LLMClient
class GeminiClient(config: GeminiConfig) extends LLMClient

4.3 Session Serialization Enhancements ⭐⭐

Effort: 1 week

Deliverables:

// Complete serialization support
object AgentState {
  implicit val rw: ReadWriter[AgentState] = macroRW
}

// Session export/import
class Session {
  def toJson: ujson.Value
  def toInputList: Seq[Message]  // OpenAI compatibility
}

object Session {
  def fromJson(json: ujson.Value): Result[Session]
  def fromInputList(messages: Seq[Message]): Session
}
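
Round-tripping a session would then look like this (sketch against the proposed API; error handling elided):

```scala
// Sketch: persist a session as JSON and restore it later.
val json: ujson.Value         = session.toJson
val restored: Result[Session] = Session.fromJson(json)

// Interop with OpenAI-style message lists:
val messages: Seq[Message] = session.toInputList
val rebuilt: Session       = Session.fromInputList(messages)
```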

Priority Recommendations

Immediate Action (Next 3 Months)

  1. Session Management - Critical for usability
  2. Guardrails Framework - Critical for production safety
  3. Event-based Streaming - Critical for UX

Short-term (3-6 Months)

  1. Built-in Tools Module - High value, reduces friction
  2. Handoff Mechanism - Improves multi-agent patterns
  3. Async Tool Execution - Performance improvement

Medium-term (6-12 Months)

  1. Workflow Engine Integration - Production durability
  2. Enhanced Observability - Enterprise requirement
  3. Reasoning Modes - Model optimization

Long-term (12+ Months)

  1. Provider Expansion - Nice-to-have for broader adoption

Appendix: Architecture Notes

Design Principles for Gap Closure

All enhancements must adhere to llm4s core design philosophy:

1. Functional and Immutable First

Preserve Type Safety:

  • Don’t sacrifice Scala’s type system for feature parity
  • Use compile-time type checking where OpenAI uses runtime validation
  • Keep compile-time guarantees for agent composition

Result-based Error Handling:

  • Continue using Result[A] for all fallible operations
  • Avoid exceptions in public APIs
  • Provide conversion utilities for exception-heavy libraries (Try.toResult)

Functional Core, Imperative Shell:

  • Keep agent core logic pure and testable
  • Push effects (I/O, state mutations) to boundaries
  • All operations return new states, never mutate

Example:

// ❌ Don't add mutable sessions
class Session {
  var messages: List[Message] = List.empty
  def add(msg: Message): Unit = { messages = messages :+ msg }
}

// βœ… Do add pure functions
def continueConversation(state: AgentState, msg: String): Result[AgentState] =
  Right(state.copy(conversation = state.conversation.addMessage(UserMessage(msg))))

2. Framework Agnostic

Minimal Dependencies:

  • Use scala.concurrent.Future for async (universally compatible)
  • Don’t require Cats Effect, ZIO, or any specific effect system
  • Provide integration examples for popular frameworks

Composability:

  • Ensure all APIs work with plain Scala, Cats Effect IO, ZIO Task, etc.
  • Use Result[A] which naturally converts to any effect type
  • Avoid tying users to framework-specific abstractions

Example:

// βœ… Framework agnostic - works with any effect system
val result: Result[AgentState] = agent.run(query, tools)

// Users can lift to their preferred effect system
val io: IO[AgentState] = IO.fromEither(result)
val task: Task[AgentState] = ZIO.fromEither(result)

3. Simplicity Over Cleverness

Literate APIs:

  • Descriptive method names (continueConversation, not >> or +)
  • Avoid operator overloading for domain operations
  • Comprehensive ScalaDoc on all public APIs
  • Examples in documentation showing common use cases

Explicit Over Implicit:

  • Minimize use of implicit parameters
  • Explicit state flow (visible in code)
  • No magic behavior or hidden side effects

Example:

// ❌ Too clever
state1 >> "query" >> "followup"  // What does >> mean?

// βœ… Clear and explicit
for {
  state1 <- agent.run("query", tools)
  state2 <- agent.continueConversation(state1, "followup")
} yield state2

4. Principle of Least Surprise

Follow Conventions:

  • Method names follow Scala conventions (map, flatMap, fold)
  • Error handling via Either (standard Scala pattern)
  • Immutable collections behave as expected (return new collections)

Predictable Behavior:

  • No hidden mutations
  • No global state
  • Operations compose as expected

Backward Compatibility:

  • Add new features as optional parameters
  • Provide migration guides for breaking changes
  • Maintain cross-version Scala support

5. Modularity

Separation of Concerns:

  • Keep core agent framework separate from built-in tools
  • Make integrations (workflow engines, observability) pluggable
  • Allow users to opt-out of features they don’t need

Pure Core, Effectful Edges:

  • Core business logic is pure (easy to test, reason about)
  • I/O and effects pushed to module boundaries
  • Clear separation between pure and effectful code

Architectural Patterns

Functional Conversation Flow

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Initial Query     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          β–Ό
   agent.run(query, tools) ──────► Result[AgentState]
          β”‚                              β”‚
          β”‚                              β”‚ (immutable state1)
          β”‚                              β”‚
          β–Ό                              β–Ό
   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
   β”‚  User wants to continue conversation     β”‚
   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                      β”‚
                      β–Ό
   agent.continueConversation(state1, "next query")
                      β”‚
                      β”œβ”€β†’ Validate state (must be Complete/Failed)
                      β”œβ”€β†’ Add user message (pure function)
                      β”œβ”€β†’ Optionally prune context (pure function)
                      └─→ Run agent ──────► Result[AgentState]
                                                   β”‚
                                                   β”‚ (immutable state2)
                                                   β–Ό
                                          Continue as needed...

Key: All arrows represent pure functions returning new immutable states

Conversation Persistence (Optional)

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  AgentState     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β”œβ”€β†’ AgentState.toJson(state) ──► ujson.Value (pure)
         β”‚
         β”œβ”€β†’ AgentState.saveToFile(state, path) ──► Result[Unit] (I/O)
         β”‚
         └─→ AgentState.loadFromFile(path, tools) ──► Result[AgentState] (I/O)

Key: Pure serialization separated from I/O operations

Guardrails Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  User Query      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          β”œβ”€β†’ InputGuardrails (parallel)
          β”‚   β”œβ”€β†’ ProfanityFilter
          β”‚   β”œβ”€β†’ LengthCheck
          β”‚   └─→ CustomValidator
          β”‚
          β”œβ”€β†’ Agent.run() if all pass
          β”‚
          β”œβ”€β†’ OutputGuardrails (parallel)
          β”‚   β”œβ”€β†’ FactChecker
          β”‚   β”œβ”€β†’ JSONValidator
          β”‚   └─→ ToneValidator
          β”‚
          └─→ Return result if all pass

Streaming Events Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Agent.runStreamed()
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β”œβ”€β†’ LLM Streaming
         β”‚   └─→ TextDelta events
         β”‚
         β”œβ”€β†’ Tool Execution
         β”‚   β”œβ”€β†’ ToolCallStarted events
         β”‚   └─→ ToolCallCompleted events
         β”‚
         └─→ Agent Status
             β”œβ”€β†’ StepCompleted events
             └─→ AgentCompleted event

Code Organization

Recommended module structure after implementation:

modules/core/src/main/scala/org/llm4s/
β”œβ”€β”€ agent/
β”‚   β”œβ”€β”€ Agent.scala                    # Core agent (enhanced)
β”‚   β”œβ”€β”€ AgentState.scala               # State management (enhanced)
β”‚   β”œβ”€β”€ Session.scala                  # NEW: Session management
β”‚   β”œβ”€β”€ SessionStore.scala             # NEW: Session persistence
β”‚   β”œβ”€β”€ Handoff.scala                  # NEW: Agent delegation
β”‚   β”œβ”€β”€ guardrails/                    # NEW: Guardrails framework
β”‚   β”‚   β”œβ”€β”€ Guardrail.scala
β”‚   β”‚   β”œβ”€β”€ InputGuardrail.scala
β”‚   β”‚   β”œβ”€β”€ OutputGuardrail.scala
β”‚   β”‚   └── builtin/
β”‚   β”‚       β”œβ”€β”€ ProfanityFilter.scala
β”‚   β”‚       β”œβ”€β”€ LengthCheck.scala
β”‚   β”‚       └── JSONValidator.scala
β”‚   β”œβ”€β”€ streaming/                     # NEW: Streaming events
β”‚   β”‚   β”œβ”€β”€ AgentEvent.scala
β”‚   β”‚   └── EventStream.scala
β”‚   β”œβ”€β”€ workflow/                      # NEW: Workflow integration
β”‚   β”‚   β”œβ”€β”€ WorkflowEngine.scala
β”‚   β”‚   β”œβ”€β”€ CamundaWorkflowEngine.scala
β”‚   β”‚   └── HumanTask.scala
β”‚   └── orchestration/                 # Existing multi-agent
β”‚       β”œβ”€β”€ Agent.scala
β”‚       β”œβ”€β”€ DAG.scala
β”‚       └── PlanRunner.scala
β”œβ”€β”€ toolapi/
β”‚   β”œβ”€β”€ ToolFunction.scala             # Existing
β”‚   β”œβ”€β”€ AsyncToolFunction.scala        # NEW: Async tools
β”‚   β”œβ”€β”€ ToolRegistry.scala             # Enhanced
β”‚   └── builtin/                       # NEW: Built-in tools
β”‚       β”œβ”€β”€ WebSearchTool.scala
β”‚       β”œβ”€β”€ VectorSearchTool.scala
β”‚       β”œβ”€β”€ FileSystemTools.scala
β”‚       └── HTTPTool.scala
└── trace/
    β”œβ”€β”€ TracingBackend.scala           # Enhanced
    β”œβ”€β”€ LogfireBackend.scala           # NEW
    β”œβ”€β”€ AgentOpsBackend.scala          # NEW
    └── CustomSpan.scala               # NEW

Conclusion

llm4s has a strong foundation built on solid design principles. While OpenAI Agents SDK provides more features out-of-the-box, llm4s offers a fundamentally different and more correct approach grounded in functional programming.

Strategic Focus Areas

To enhance llm4s while maintaining its design philosophy:

  1. Functional Developer Experience - Ergonomic APIs for multi-turn conversations without sacrificing purity
  2. Production Readiness - Workflow integration and durability (explored functionally)
  3. Tool Ecosystem - Built-in tools as pure, composable functions
  4. Real-time UX - Streaming events as functional streams (Iterators, FS2, etc.)

The roadmap is achievable over 12 months with 1-2 dedicated developers, with one critical constraint: all implementations must adhere to llm4s design philosophy.

Unique Value Proposition

After closing gaps, llm4s will offer a unique combination not found in any other agent framework:

Functional Correctness:

  • βœ… Pure functions and immutable data (no mutable sessions)
  • βœ… Explicit state flow via for-comprehensions
  • βœ… Referential transparency - code behaves as written
  • βœ… Composable with any effect system (Cats Effect, ZIO, plain Scala)

Type Safety:

  • βœ… Compile-time safety for multi-agent composition
  • βœ… Type-safe DAG construction with Edge[A, B]
  • βœ… Result-based error handling (no hidden exceptions)

Production Features:

  • βœ… Workspace isolation for secure tool execution
  • βœ… Cross-version Scala support (2.13 & 3.x)
  • βœ… MCP integration for standardized tool protocols

Developer Experience:

  • βœ… Simple, literate APIs (principle of least surprise)
  • βœ… Framework agnostic - bring your own stack
  • βœ… Well-documented with comprehensive examples

Positioning

llm4s is not trying to be a Scala port of OpenAI SDK. Instead, it’s building the correct agent framework for functional programming:

| Aspect           | OpenAI SDK            | llm4s                    |
|------------------|-----------------------|--------------------------|
| Philosophy       | Convenient, practical | Correct, composable      |
| State Management | Mutable objects       | Immutable, explicit flow |
| Error Handling   | Exceptions            | Result types             |
| Effect System    | Python asyncio        | Framework agnostic       |
| Type Safety      | Runtime validation    | Compile-time checking    |
| Target Audience  | Python developers     | Scala/FP developers      |

The llm4s Way: We don’t compromise functional principles for convenience. Instead, we design APIs that are both functionally pure AND ergonomic - proving that correctness and usability are not mutually exclusive.

This positions llm4s as the premier choice for:

  • Enterprise Scala teams valuing correctness and maintainability
  • Functional programming practitioners
  • Teams building mission-critical agent systems
  • Organizations requiring compile-time safety guarantees

Final Note: Feature gaps should be closed with solutions that align with llm4s philosophy. The Phase 1.1 Design demonstrates this approach - achieving OpenAI SDK ergonomics while maintaining functional purity.


End of Report