Ollama Quick Start Guide

Run LLM4S with Ollama for free, local LLM inference - no API keys required!

Table of contents

  1. Why Ollama?
  2. Prerequisites
  3. Step 1: Install Ollama
    1. macOS / Linux
    2. Windows
    3. Verify Installation
  4. Step 2: Start Ollama Server
    1. Start the Server
    2. Verify Server is Running
  5. Step 3: Pull a Model
    1. Available Models
    2. List Downloaded Models
  6. Step 4: Configure LLM4S
    1. Option A: Environment Variables (Recommended)
    2. Option B: Application Config
  7. Step 5: Write Your First LLM4S + Ollama App
    1. Run It!
    2. Expected Output
  8. Step 6: Try Different Models
  9. Streaming Responses
  10. Tool Calling with Ollama
  11. Model Comparison
  12. Configuration Options
    1. Temperature Control
    2. Context Length
  13. Troubleshooting
    1. “Connection refused” error
    2. “Model not found” error
    3. Slow inference
    4. Managing disk space
  14. Running the Examples
  15. Production Deployment
    1. Docker Compose Setup
    2. Pre-pull Models
  16. Ollama vs Cloud Providers
  17. Next Steps
  18. Resources

Why Ollama?

Ollama is the easiest way to run large language models locally on your machine:

  • 100% Free - No API costs or rate limits
  • Private - Your data never leaves your machine
  • Fast - Low latency for local inference
  • Offline - Works without internet connection
  • Multiple Models - Easy model switching (llama2, mistral, phi, etc.)

Perfect for development, testing, and production workloads where privacy matters.


Prerequisites

  • Java 11+ (JDK 21 recommended)
  • Scala 2.13.16 or 3.7.1
  • SBT 1.10.6+
  • 4-8GB RAM (depending on model size)

No API keys needed! 🎉


Step 1: Install Ollama

macOS / Linux

curl -fsSL https://ollama.com/install.sh | sh

Windows

Download the installer from ollama.com/download

Or use PowerShell:

# Download and install Ollama
winget install Ollama.Ollama

Verify Installation

ollama --version
# Should output: ollama version is 0.x.x

Step 2: Start Ollama Server

Start the Server

ollama serve

The server will start on http://localhost:11434 by default.

On Windows/macOS, Ollama may start automatically as a background service. Check your system tray/menu bar.

Verify Server is Running

curl http://localhost:11434
# Should output: Ollama is running
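
If you prefer to check from Scala instead of curl, here is a minimal sketch using the JDK's built-in HTTP client (available since Java 11). It assumes the default port and is not part of LLM4S:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object OllamaHealthCheck extends App {
  // A plain GET on the root endpoint returns "Ollama is running" when the server is up
  val request  = HttpRequest.newBuilder(URI.create("http://localhost:11434")).GET().build()
  val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode()} ${response.body()}") // expect: 200 Ollama is running
}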

Step 3: Pull a Model

Ollama models are pulled on-demand. Let’s start with Mistral 7B (fast and capable):

ollama pull mistral

Available Models

Model | Size | RAM Required | Best For | Pull Command
mistral | 4.1GB | 8GB | General purpose, fast | ollama pull mistral
llama2 | 3.8GB | 8GB | Good balance | ollama pull llama2
phi | 1.6GB | 4GB | Lightweight, fast | ollama pull phi
neural-chat | 4.1GB | 8GB | Conversational | ollama pull neural-chat
codellama | 3.8GB | 8GB | Code generation | ollama pull codellama
llama3.2 | 2.0GB | 8GB | Latest Llama | ollama pull llama3.2
gemma2 | 5.4GB | 8GB | Google’s model | ollama pull gemma2

Recommendation: Start with mistral for the best balance of speed and quality.

List Downloaded Models

ollama list

Step 4: Configure LLM4S

Option A: Environment Variables (Recommended)

In your terminal (or add to a .env file):

# Linux / macOS
export LLM_MODEL=ollama/mistral
export OLLAMA_BASE_URL=http://localhost:11434

# Windows PowerShell
$env:LLM_MODEL = "ollama/mistral"
$env:OLLAMA_BASE_URL = "http://localhost:11434"

Option B: Application Config

Create src/main/resources/application.conf:

llm4s {
  provider = "ollama"
  model = "mistral"
  ollama {
    baseUrl = "http://localhost:11434"
  }
}

Step 5: Write Your First LLM4S + Ollama App

Create HelloOllama.scala:

import org.llm4s.config.Llm4sConfig
import org.llm4s.llmconnect.LLMConnect
import org.llm4s.llmconnect.model._

object HelloOllama extends App {
  // Create a conversation with system and user messages
  val conversation = Conversation(Seq(
    SystemMessage("You are a helpful AI assistant."),
    UserMessage("Explain what Scala is in one sentence.")
  ))

  // Load config and make the request
  val result = for {
    providerConfig <- Llm4sConfig.provider()
    client <- LLMConnect.getClient(providerConfig)
    completion <- client.complete(conversation)
  } yield completion

  result match {
    case Right(completion) =>
      println(s"Response from ${completion.model}:")
      println(completion.message.content)
    case Left(error) =>
      println(s"Error: ${error.formatted}")
  }
}

Run It!

sbt run

Expected Output

Response from mistral:
Scala is a statically-typed programming language that combines
object-oriented and functional programming paradigms, running on
the Java Virtual Machine (JVM).

Step 6: Try Different Models

You can easily switch models:

# Try Llama 2
export LLM_MODEL=ollama/llama2

# Try Phi (faster, smaller)
export LLM_MODEL=ollama/phi

# Try CodeLlama (for coding tasks)
export LLM_MODEL=ollama/codellama

Then run your program again without code changes!


Streaming Responses

Get real-time token streaming (like ChatGPT):

import org.llm4s.config.Llm4sConfig
import org.llm4s.llmconnect.LLMConnect
import org.llm4s.llmconnect.model._

object StreamingOllama extends App {
  val conversation = Conversation(Seq(
    SystemMessage("You are a concise assistant."),
    UserMessage("Write a haiku about Scala programming.")
  ))

  val result = for {
    providerConfig <- Llm4sConfig.provider()
    client <- LLMConnect.getClient(providerConfig)
    completion <- client.streamComplete(
      conversation,
      CompletionOptions(),
      chunk => chunk.content.foreach(print)  // Print tokens as they arrive
    )
  } yield completion

  result match {
    case Right(completion) =>
      println("\n--- Streaming complete! ---")
      println(s"Total content: ${completion.message.content}")
    case Left(error) =>
      println(s"Error: ${error.formatted}")
  }
}

Tool Calling with Ollama

Ollama supports tool calling (function calling) with compatible models:

import org.llm4s.agent.Agent
import org.llm4s.config.Llm4sConfig
import org.llm4s.llmconnect.LLMConnect
import org.llm4s.toolapi._
import upickle.default._

object OllamaTools extends App {
  // Define result type
  case class WeatherResult(forecast: String)
  implicit val weatherResultRW: ReadWriter[WeatherResult] = macroRW

  // Define a weather tool with proper schema
  val weatherSchema = Schema
    .`object`[Map[String, Any]]("Weather parameters")
    .withProperty(
      Schema.property("location", Schema.string("City or location name"))
    )

  val getWeather = ToolBuilder[Map[String, Any], WeatherResult](
    "get_weather",
    "Get the current weather in a location",
    weatherSchema
  ).withHandler { extractor =>
    extractor.getString("location").map { location =>
      // Mock implementation
      WeatherResult(s"Weather in $location: Sunny, 72F")
    }
  }.build()

  val tools = new ToolRegistry(Seq(getWeather))

  val result = for {
    providerConfig <- Llm4sConfig.provider()
    client <- LLMConnect.getClient(providerConfig)
    agent = new Agent(client)
    state <- agent.run("What's the weather in San Francisco?", tools)
  } yield state

  result match {
    case Right(state) =>
      println("Final response:")
      println(state.conversation.messages.last.content)
    case Left(error) =>
      println(s"Error: ${error.formatted}")
  }
}

Model Comparison

Performance comparison on an Apple M1 Mac:

Model | Speed (tokens/sec) | Quality | Memory | Best Use Case
mistral | ~40 | ⭐⭐⭐⭐ | 8GB | General purpose, great balance
llama2 | ~35 | ⭐⭐⭐⭐ | 8GB | Conversational, creative
phi | ~80 | ⭐⭐⭐ | 4GB | Quick tests, development
codellama | ~35 | ⭐⭐⭐⭐ | 8GB | Code generation
neural-chat | ~40 | ⭐⭐⭐⭐ | 8GB | Dialogue, chat apps

Performance varies by hardware. These are approximate values on M1 MacBook Pro 16GB RAM.


Configuration Options

Temperature Control

CompletionOptions(
  temperature = 0.7,  // Higher = more creative (0.0-2.0)
  maxTokens = Some(1000),
  topP = Some(0.9)
)
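
These options are passed along with the conversation when you call the client. A minimal sketch, assuming client.complete accepts a CompletionOptions argument the same way streamComplete does in the streaming example above, and reusing the Step 5 setup:

import org.llm4s.config.Llm4sConfig
import org.llm4s.llmconnect.LLMConnect
import org.llm4s.llmconnect.model._

object TunedOllama extends App {
  val conversation = Conversation(Seq(
    SystemMessage("You are a precise technical assistant."),
    UserMessage("Summarise what a Scala case class is.")
  ))

  // Lower temperature for more deterministic, focused answers
  val options = CompletionOptions(
    temperature = 0.2,
    maxTokens = Some(500),
    topP = Some(0.9)
  )

  val result = for {
    providerConfig <- Llm4sConfig.provider()
    client         <- LLMConnect.getClient(providerConfig)
    completion     <- client.complete(conversation, options) // assumed overload, mirroring streamComplete
  } yield completion

  result.fold(err => println(s"Error: ${err.formatted}"), c => println(c.message.content))
}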

Context Length

Ollama models have different context windows (a sketch for trimming long conversations follows this list):

  • mistral: 8k tokens
  • llama2: 4k tokens
  • llama3.2: 128k tokens
  • codellama: 16k tokens
  • gemma2: 8k tokens
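
If a chat history grows past these limits, older turns need to be dropped or summarised before each request. A rough, illustrative sketch: ContextBudget and the ~4 characters-per-token estimate are assumptions, not part of LLM4S; the message types and .content field are as used in the earlier examples.

import org.llm4s.llmconnect.model._

object ContextBudget {
  // Crude heuristic: roughly 4 characters per token for English text
  private def estimateTokens(text: String): Int = math.ceil(text.length / 4.0).toInt

  // Keep system prompts, then drop the oldest remaining messages until the estimate fits
  def trim(conversation: Conversation, maxTokens: Int): Conversation = {
    val (system, rest) = conversation.messages.partition {
      case _: SystemMessage => true
      case _                => false
    }
    var kept = rest.toList
    def estimatedTotal = (system ++ kept).map(m => estimateTokens(m.content)).sum
    while (kept.nonEmpty && estimatedTotal > maxTokens)
      kept = kept.tail // oldest non-system message goes first
    Conversation(system ++ kept)
  }
}

For example, ContextBudget.trim(conversation, 7000) keeps a conversation comfortably under mistral's 8k window while leaving room for the response.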

Troubleshooting

“Connection refused” error

Problem: Ollama server not running

Solution:

ollama serve

“Model not found” error

Problem: Model not pulled

Solution:

ollama pull mistral

Slow inference

Problem: Not enough RAM or CPU

Solutions:

  • Use a smaller model: ollama pull phi
  • Close other applications
  • Check if running on GPU (M-series Mac, CUDA GPU)

Managing disk space

# List all models with sizes
ollama list

# Delete a model to free space
ollama rm llama2

Running the Examples

Try the built-in LLM4S Ollama samples:

# Set environment
export LLM_MODEL=ollama/mistral
export OLLAMA_BASE_URL=http://localhost:11434

# Run basic Ollama example
sbt "samples/runMain org.llm4s.samples.basic.OllamaExample"

# Run Ollama streaming example
sbt "samples/runMain org.llm4s.samples.basic.OllamaStreamingExample"

# Run tool calling example (works with any provider)
sbt "samples/runMain org.llm4s.samples.toolapi.BuiltinToolsExample"

Production Deployment

Docker Compose Setup

# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    mem_limit: 8g

  llm4s-app:
    build: .
    environment:
      - LLM_MODEL=ollama/mistral
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama-data:
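
Note that depends_on only controls container start order; it does not wait until Ollama is actually accepting requests. One option is a small startup wait in the app itself. A minimal sketch (WaitForOllama is a hypothetical helper, not part of LLM4S), polling the URL from OLLAMA_BASE_URL:

import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import scala.util.Try

object WaitForOllama {
  private val baseUrl = sys.env.getOrElse("OLLAMA_BASE_URL", "http://localhost:11434")
  private val http    = HttpClient.newHttpClient()

  // Poll the Ollama root endpoint until it answers 200, or give up after maxAttempts
  def await(maxAttempts: Int = 30, delayMillis: Long = 2000L): Boolean = {
    val request = HttpRequest.newBuilder(URI.create(baseUrl)).GET().build()
    (1 to maxAttempts).exists { attempt =>
      val up = Try(http.send(request, HttpResponse.BodyHandlers.ofString()).statusCode() == 200)
        .getOrElse(false)
      if (!up) {
        println(s"Ollama not ready yet (attempt $attempt/$maxAttempts), retrying...")
        Thread.sleep(delayMillis)
      }
      up
    }
  }
}

Call WaitForOllama.await() at the start of your main, before building the LLM client.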

Pre-pull Models

# Pull models into Docker volume
docker-compose exec ollama ollama pull mistral
docker-compose exec ollama ollama pull llama2

Ollama vs Cloud Providers

Feature | Ollama | OpenAI | Anthropic
Cost | Free | $0.01-0.06/1K tokens | $0.003-0.015/1K tokens
Privacy | 100% local | Cloud-based | Cloud-based
Speed | Depends on hardware | Fast (API) | Fast (API)
Offline | ✅ Yes | ❌ No | ❌ No
Model quality | Good (7B-70B) | Excellent (GPT-4) | Excellent (Claude)
Setup | 5 minutes | Instant | Instant

Use Ollama for:

  • Development and testing
  • Privacy-sensitive applications
  • Cost-conscious deployments
  • Offline environments

Use cloud providers for:

  • Highest quality responses
  • Scale without hardware limits
  • Latest model capabilities

Next Steps


Resources


💡 Pro Tip

Use Ollama for development and testing, then switch to cloud providers for production by just changing the LLM_MODEL environment variable:

# Development
export LLM_MODEL=ollama/mistral

# Production
export LLM_MODEL=openai/gpt-4o
export OPENAI_API_KEY=sk-...

Your code stays exactly the same! 🎉