Ollama Quick Start Guide
Run LLM4S with Ollama for free, local LLM inference - no API keys required!
Table of contents
- Why Ollama?
- Prerequisites
- Step 1: Install Ollama
- Step 2: Start Ollama Server
- Step 3: Pull a Model
- Step 4: Configure LLM4S
- Step 5: Write Your First LLM4S + Ollama App
- Step 6: Try Different Models
- Streaming Responses
- Tool Calling with Ollama
- Model Comparison
- Configuration Options
- Troubleshooting
- Running the Examples
- Production Deployment
- Ollama vs Cloud Providers
- Next Steps
- Resources
Why Ollama?
Ollama is the easiest way to run large language models locally on your machine:
- ✅ 100% Free - No API costs or rate limits
- ✅ Private - Your data never leaves your machine
- ✅ Fast - Low latency for local inference
- ✅ Offline - Works without internet connection
- ✅ Multiple Models - Easy model switching (llama2, mistral, phi, etc.)
Perfect for development, testing, and production workloads where privacy matters.
Prerequisites
- Java 11+ (JDK 21 recommended)
- Scala 2.13.16 or 3.7.1
- SBT 1.10.6+
- 4-8GB RAM (depending on model size)
No API keys needed! 🎉
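If you are adding LLM4S to a fresh SBT project, you will also need the library on your classpath. Below is a minimal build.sbt sketch; the organization, artifact name, and version are placeholders, so check the LLM4S README for the current coordinates.

```scala
// build.sbt -- illustrative sketch; the coordinates below are assumptions,
// use the ones published in the LLM4S README.
ThisBuild / scalaVersion := "3.7.1"

libraryDependencies += "org.llm4s" %% "llm4s" % "<latest-version>"
```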
Step 1: Install Ollama
macOS / Linux
```bash
curl -fsSL https://ollama.com/install.sh | sh
```
Windows
Download the installer from ollama.com/download
Or use PowerShell:
```powershell
# Download and install Ollama
winget install Ollama.Ollama
```
Verify Installation
```bash
ollama --version
# Should output: ollama version is 0.x.x
```
Step 2: Start Ollama Server
Start the Server
```bash
ollama serve
```
The server will start on http://localhost:11434 by default.
On Windows/macOS, Ollama may start automatically as a background service. Check your system tray/menu bar.
Verify Server is Running
```bash
curl http://localhost:11434
# Should output: Ollama is running
```
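You can do the same check from Scala using the JDK's built-in HTTP client (available since Java 11, which the prerequisites already require). This is just a plain HTTP GET, not part of the LLM4S API:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object OllamaHealthCheck extends App {
  // Plain GET against the Ollama server; the body should read "Ollama is running"
  val request  = HttpRequest.newBuilder(URI.create("http://localhost:11434")).GET().build()
  val response = HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())
  println(s"HTTP ${response.statusCode()}: ${response.body()}")
}
```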
Step 3: Pull a Model
Ollama models are pulled on-demand. Let’s start with Mistral 7B (fast and capable):
```bash
ollama pull mistral
```
Available Models
| Model | Size | RAM Required | Best For | Pull Command |
|---|---|---|---|---|
| mistral | 4.1GB | 8GB | General purpose, fast | ollama pull mistral |
| llama2 | 3.8GB | 8GB | Good balance | ollama pull llama2 |
| phi | 1.6GB | 4GB | Lightweight, fast | ollama pull phi |
| neural-chat | 4.1GB | 8GB | Conversational | ollama pull neural-chat |
| codellama | 3.8GB | 8GB | Code generation | ollama pull codellama |
| llama3.2 | 2.0GB | 8GB | Latest Llama | ollama pull llama3.2 |
| gemma2 | 5.4GB | 8GB | Google’s model | ollama pull gemma2 |
Recommendation: Start with `mistral` for the best balance of speed and quality.
List Downloaded Models
```bash
ollama list
```
Step 4: Configure LLM4S
Option A: Environment Variables (Recommended)
In your terminal (or add to a `.env` file):
```bash
# Linux / macOS
export LLM_MODEL=ollama/mistral
export OLLAMA_BASE_URL=http://localhost:11434

# Windows PowerShell
$env:LLM_MODEL = "ollama/mistral"
$env:OLLAMA_BASE_URL = "http://localhost:11434"
```
Option B: Application Config
Create src/main/resources/application.conf:
```hocon
llm4s {
  provider = "ollama"
  model = "mistral"

  ollama {
    baseUrl = "http://localhost:11434"
  }
}
```
Step 5: Write Your First LLM4S + Ollama App
Create HelloOllama.scala:
```scala
import org.llm4s.config.Llm4sConfig
import org.llm4s.llmconnect.LLMConnect
import org.llm4s.llmconnect.model._

object HelloOllama extends App {
  // Create a conversation with system and user messages
  val conversation = Conversation(Seq(
    SystemMessage("You are a helpful AI assistant."),
    UserMessage("Explain what Scala is in one sentence.")
  ))

  // Load config and make the request
  val result = for {
    providerConfig <- Llm4sConfig.provider()
    client         <- LLMConnect.getClient(providerConfig)
    completion     <- client.complete(conversation)
  } yield completion

  result match {
    case Right(completion) =>
      println(s"Response from ${completion.model}:")
      println(completion.message.content)
    case Left(error) =>
      println(s"Error: ${error.formatted}")
  }
}
```
Run It!
```bash
sbt run
```
Expected Output
```
✓ Response from mistral:
Scala is a statically-typed programming language that combines
object-oriented and functional programming paradigms, running on
the Java Virtual Machine (JVM).
```
Step 6: Try Different Models
You can easily switch models:
```bash
# Try Llama 2
export LLM_MODEL=ollama/llama2

# Try Phi (faster, smaller)
export LLM_MODEL=ollama/phi

# Try CodeLlama (for coding tasks)
export LLM_MODEL=ollama/codellama
```
Then run your program again without code changes!
Streaming Responses
Get real-time token streaming (like ChatGPT):
```scala
import org.llm4s.config.Llm4sConfig
import org.llm4s.llmconnect.LLMConnect
import org.llm4s.llmconnect.model._

object StreamingOllama extends App {
  val conversation = Conversation(Seq(
    SystemMessage("You are a concise assistant."),
    UserMessage("Write a haiku about Scala programming.")
  ))

  val result = for {
    providerConfig <- Llm4sConfig.provider()
    client         <- LLMConnect.getClient(providerConfig)
    completion     <- client.streamComplete(
      conversation,
      CompletionOptions(),
      chunk => chunk.content.foreach(print) // Print tokens as they arrive
    )
  } yield completion

  result match {
    case Right(completion) =>
      println("\n--- Streaming complete! ---")
      println(s"Total content: ${completion.message.content}")
    case Left(error) =>
      println(s"Error: ${error.formatted}")
  }
}
```
Tool Calling with Ollama
Ollama supports tool calling (function calling) with compatible models:
```scala
import org.llm4s.agent.Agent
import org.llm4s.config.Llm4sConfig
import org.llm4s.llmconnect.LLMConnect
import org.llm4s.toolapi._
import upickle.default._

object OllamaTools extends App {
  // Define result type
  case class WeatherResult(forecast: String)
  implicit val weatherResultRW: ReadWriter[WeatherResult] = macroRW

  // Define a weather tool with proper schema
  val weatherSchema = Schema
    .`object`[Map[String, Any]]("Weather parameters")
    .withProperty(
      Schema.property("location", Schema.string("City or location name"))
    )

  val getWeather = ToolBuilder[Map[String, Any], WeatherResult](
    "get_weather",
    "Get the current weather in a location",
    weatherSchema
  ).withHandler { extractor =>
    extractor.getString("location").map { location =>
      // Mock implementation
      WeatherResult(s"Weather in $location: Sunny, 72F")
    }
  }.build()

  val tools = new ToolRegistry(Seq(getWeather))

  val result = for {
    providerConfig <- Llm4sConfig.provider()
    client         <- LLMConnect.getClient(providerConfig)
    agent           = new Agent(client)
    state          <- agent.run("What's the weather in San Francisco?", tools)
  } yield state

  result match {
    case Right(state) =>
      println("Final response:")
      println(state.conversation.messages.last.content)
    case Left(error) =>
      println(s"Error: ${error.formatted}")
  }
}
```
Model Comparison
Performance comparison on an Apple M1 Mac:
| Model | Speed (tokens/sec) | Quality | Memory | Best Use Case |
|---|---|---|---|---|
| mistral | ~40 | ⭐⭐⭐⭐ | 8GB | General purpose, great balance |
| llama2 | ~35 | ⭐⭐⭐⭐ | 8GB | Conversational, creative |
| phi | ~80 | ⭐⭐⭐ | 4GB | Quick tests, development |
| codellama | ~35 | ⭐⭐⭐⭐ | 8GB | Code generation |
| neural-chat | ~40 | ⭐⭐⭐⭐ | 8GB | Dialogue, chat apps |
Performance varies by hardware. These are approximate values on an M1 MacBook Pro with 16GB of RAM.
Configuration Options
Temperature Control
```scala
CompletionOptions(
  temperature = 0.7,      // Higher = more creative (0.0-2.0)
  maxTokens = Some(1000),
  topP = Some(0.9)
)
```
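To apply these options, pass them along with the conversation. The sketch below reuses only the calls already shown in the streaming example above, with a lower temperature and a token cap for shorter, more deterministic output:

```scala
import org.llm4s.config.Llm4sConfig
import org.llm4s.llmconnect.LLMConnect
import org.llm4s.llmconnect.model._

object TunedOllama extends App {
  val conversation = Conversation(Seq(
    SystemMessage("You are a precise assistant."),
    UserMessage("List three key features of Scala, briefly.")
  ))

  // Non-default options: less randomness, bounded response length
  val options = CompletionOptions(
    temperature = 0.2,
    maxTokens = Some(500)
  )

  val result = for {
    providerConfig <- Llm4sConfig.provider()
    client         <- LLMConnect.getClient(providerConfig)
    completion     <- client.streamComplete(
      conversation,
      options,
      chunk => chunk.content.foreach(print)
    )
  } yield completion

  result match {
    case Right(_)    => println("\n--- Done ---")
    case Left(error) => println(s"Error: ${error.formatted}")
  }
}
```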
Context Length
Ollama models have different context windows:
- mistral: 8k tokens
- llama2: 4k tokens
- llama3.2: 128k tokens
- codellama: 16k tokens
- gemma2: 8k tokens
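If you feed long prompts to a small-context model, it helps to sanity-check the size first. The sketch below uses the common rule of thumb of roughly 4 characters per token for English text; it is only a heuristic, not the model's real tokenizer.

```scala
object ContextBudget extends App {
  // Rough heuristic: ~4 characters per token of English text.
  // Not the model's real tokenizer, just a cheap check before sending.
  def estimateTokens(text: String): Int = math.ceil(text.length / 4.0).toInt

  val contextWindow = 8192 // e.g. mistral's 8k window
  val prompt        = "Summarise this document: " + ("lorem ipsum " * 5000)

  val estimate = estimateTokens(prompt)
  if (estimate > contextWindow)
    println(s"~$estimate tokens: likely exceeds the window, trim the prompt or pick a larger-context model")
  else
    println(s"~$estimate tokens: should fit")
}
```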
Troubleshooting
“Connection refused” error
Problem: Ollama server not running
Solution:
```bash
ollama serve
```
“Model not found” error
Problem: Model not pulled
Solution:
```bash
ollama pull mistral
```
Slow inference
Problem: Not enough RAM or CPU
Solutions:
- Use a smaller model: `ollama pull phi`
- Close other applications
- Check if running on GPU (M-series Mac, CUDA GPU)
Managing models
```bash
# List all models with sizes
ollama list

# Delete a model to free space
ollama rm llama2
```
Running the Examples
Try the built-in LLM4S Ollama samples:
```bash
# Set environment
export LLM_MODEL=ollama/mistral
export OLLAMA_BASE_URL=http://localhost:11434

# Run basic Ollama example
sbt "samples/runMain org.llm4s.samples.basic.OllamaExample"

# Run Ollama streaming example
sbt "samples/runMain org.llm4s.samples.basic.OllamaStreamingExample"

# Run tool calling example (works with any provider)
sbt "samples/runMain org.llm4s.samples.toolapi.BuiltinToolsExample"
```
Production Deployment
Docker Compose Setup
```yaml
# docker-compose.yml
version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    mem_limit: 8g

  llm4s-app:
    build: .
    environment:
      - LLM_MODEL=ollama/mistral
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama

volumes:
  ollama-data:
```
Pre-pull Models
```bash
# Pull models into Docker volume
docker-compose exec ollama ollama pull mistral
docker-compose exec ollama ollama pull llama2
```
Ollama vs Cloud Providers
| Feature | Ollama | OpenAI | Anthropic |
|---|---|---|---|
| Cost | Free | $0.01-0.06/1K tokens | $0.003-0.015/1K tokens |
| Privacy | 100% local | Cloud-based | Cloud-based |
| Speed | Depends on hardware | Fast (API) | Fast (API) |
| Offline | ✅ Yes | ❌ No | ❌ No |
| Model quality | Good (7B-70B) | Excellent (GPT-4) | Excellent (Claude) |
| Setup | 5 minutes | Instant | Instant |
Use Ollama for:
- Development and testing
- Privacy-sensitive applications
- Cost-conscious deployments
- Offline environments
Use cloud providers for:
- Highest quality responses
- Scale without hardware limits
- Latest model capabilities
Next Steps
- Configuration Guide - Advanced Ollama settings
- First Example - Build more complex agents
- Tool Calling - Add custom tools
- RAG with Ollama - Retrieval-augmented generation
Resources
- Ollama Official Site
- Ollama GitHub
- Ollama Model Library
- LLM4S Discord - Get help from the community
💡 Pro Tip
Use Ollama for development and testing, then switch to cloud providers for production by just changing the `LLM_MODEL` environment variable:

```bash
# Development
export LLM_MODEL=ollama/mistral

# Production
export LLM_MODEL=openai/gpt-4o
export OPENAI_API_KEY=sk-...
```

Your code stays exactly the same! 🎉