RAG Evaluation & Benchmarking

Systematically measure and improve your RAG pipeline quality using RAGAS metrics and the benchmarking harness.

Table of contents

  1. Overview
  2. RAGAS Metrics Explained
    1. Faithfulness (0.0 - 1.0)
    2. Answer Relevancy (0.0 - 1.0)
    3. Context Precision (0.0 - 1.0)
    4. Context Recall (0.0 - 1.0)
    5. Overall RAGAS Score
  3. Quick Start: Evaluate Your RAG
    1. Basic Evaluation
    2. Batch Evaluation
  4. Benchmarking Harness
    1. Running Pre-Built Benchmarks
    2. Understanding the Output
  5. What Each Benchmark Tests
    1. Fusion Strategy Benchmark
    2. Chunking Strategy Benchmark
    3. Embedding Provider Benchmark
  6. Programmatic Benchmarking
    1. Custom Benchmark Suite
    2. Custom Dataset Format
  7. Improving Your RAG with Benchmarks
    1. Optimization Workflow
    2. Common Optimization Patterns
      1. Low Faithfulness Score
      2. Low Context Precision
      3. Low Context Recall
      4. Low Answer Relevancy
  8. Interpreting Results
    1. What’s a Good RAGAS Score?
    2. Comparing Configurations
  9. Benchmark Results Reference
    1. Fusion Strategy Results
    2. Chunking Strategy Results
  10. Best Practices
    1. 1. Start with Baselines
    2. 2. Change One Variable at a Time
    3. 3. Use Representative Data
    4. 4. Consider the Full Picture
    5. 5. Monitor in Production
    6. 6. Re-benchmark After Changes
  11. API Reference
    1. RAGASEvaluator
    2. EvalSample
    3. RAGASMetrics
    4. BenchmarkRunner
  12. Next Steps

Overview

Building a RAG (Retrieval-Augmented Generation) system is just the beginning. The real challenge is knowing whether your retrieval is actually helping your LLM produce better answers. LLM4S provides two complementary tools:

  1. RAGAS Evaluator - Measure retrieval and generation quality with industry-standard metrics
  2. Benchmarking Harness - Systematically compare different RAG configurations

Key Benefits:

  • Data-driven optimization of chunking strategies
  • Objective comparison of fusion methods (RRF vs weighted vs vector-only)
  • Embedding provider evaluation
  • Reproducible experiments with JSON reports

RAGAS Metrics Explained

RAGAS (Retrieval Augmented Generation Assessment) provides four core metrics:

Faithfulness (0.0 - 1.0)

Measures whether the generated answer is grounded in the retrieved context.

  • 1.0: Answer fully supported by context - no hallucination
  • 0.0: Answer contains claims not in the retrieved documents
Question: "What year was Scala released?"
Context: "Scala was first released in 2004."
Answer: "Scala was released in 2004."
Faithfulness: 1.0 (fully grounded)

Answer Relevancy (0.0 - 1.0)

Measures how relevant the answer is to the question asked.

  • 1.0: Answer directly addresses the question
  • 0.0: Answer is completely off-topic
Question: "What is Scala?"
Answer: "Scala is a programming language."
Relevancy: ~0.85 (relevant but could be more complete)

Context Precision (0.0 - 1.0)

Measures whether the retrieved chunks are relevant to answering the question.

  • 1.0: All retrieved chunks contribute to the answer
  • 0.0: None of the retrieved chunks are relevant

High precision = no wasted context tokens
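
For example:

Question: "What year was Scala released?"
Contexts: "Scala was first released in 2004.", "Java was released in 1995."
Precision: ~0.5 (only one of the two retrieved chunks helps answer the question)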

Context Recall (0.0 - 1.0)

Measures whether all information needed to answer was retrieved.

  • 1.0: All necessary information was retrieved
  • 0.0: Critical information was missed

High recall = comprehensive retrieval
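
For example:

Question: "When was Scala released, and who created it?"
Contexts: "Scala was first released in 2004."
Ground truth: "Scala was released in 2004 by Martin Odersky."
Recall: ~0.5 (the release year was retrieved, but the creator was not)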

Overall RAGAS Score

The harmonic mean of all metrics, providing a single quality score. This is what the benchmarking harness uses for ranking.
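
For example, faithfulness 1.0, relevancy 0.80, precision 0.90, and recall 0.85 give 4 / (1/1.0 + 1/0.80 + 1/0.90 + 1/0.85) ≈ 0.88. Because the harmonic mean is dominated by the lowest value, a single weak metric drags the overall score down sharply.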


Quick Start: Evaluate Your RAG

Basic Evaluation

import org.llm4s.rag.evaluation._
import org.llm4s.llmconnect.LLMConnect

// Initialize
val llm = LLMConnect.fromEnv().getOrElse(???)
val evaluator = new RAGASEvaluator(llm)

// Create an evaluation sample
val sample = EvalSample(
  question = "What is Scala?",
  answer = "Scala is a programming language that runs on the JVM.",
  contexts = Seq("Scala is a functional and object-oriented programming language."),
  groundTruth = Some("Scala is a programming language.")
)

// Evaluate
val result = evaluator.evaluate(sample)

result.foreach { metrics =>
  println(s"Faithfulness: ${metrics.faithfulness}")
  println(s"Answer Relevancy: ${metrics.answerRelevancy}")
  println(s"Context Precision: ${metrics.contextPrecision}")
  println(s"Context Recall: ${metrics.contextRecall}")
  println(s"Overall RAGAS: ${metrics.ragasScore}")
}

Batch Evaluation

val samples = Seq(sample1, sample2, sample3)
val results = evaluator.evaluateBatch(samples)

results.foreach { allMetrics =>
  val avgRagas = allMetrics.map(_.ragasScore).sum / allMetrics.size
  println(s"Average RAGAS Score: $avgRagas")
}

Benchmarking Harness

The benchmarking harness lets you systematically compare different RAG configurations to find what works best for your data.

Running Pre-Built Benchmarks

# Download test dataset
./scripts/download-datasets.sh ragbench

# Set environment variables
export LLM_MODEL=openai/gpt-4o
export OPENAI_API_KEY=sk-...
export EMBEDDING_PROVIDER=openai
export OPENAI_EMBEDDING_BASE_URL=https://api.openai.com  # Note: no /v1
export OPENAI_EMBEDDING_MODEL=text-embedding-3-small

# Run fusion strategy comparison
sbt "samples/runMain org.llm4s.samples.rag.BenchmarkExample --suite fusion"

# Run chunking strategy comparison
sbt "samples/runMain org.llm4s.samples.rag.BenchmarkExample --suite chunking"

# Run embedding provider comparison
sbt "samples/runMain org.llm4s.samples.rag.BenchmarkExample --suite embedding"

# Quick test (5 samples)
sbt "samples/runMain org.llm4s.samples.rag.BenchmarkExample --quick"

Understanding the Output

======================================================================
BENCHMARK RESULTS: fusion-comparison
Compare different search fusion strategies
======================================================================

Duration: 327.83s
Experiments: 6 (6 passed, 0 failed)

RANKINGS BY RAGAS SCORE
----------------------------------------------------------------------
Rank   Config           RAGAS      Faith      Relevancy
----------------------------------------------------------------------
1      rrf-20           0.911      1.000      0.844
2      rrf-60           0.894      1.000      0.841
3      weighted-70v     0.893      1.000      0.841
4      weighted-50v     0.892      1.000      0.836
5      vector-only      0.884      0.960      0.845
6      keyword-only     0.809      N/A        0.809

======================================================================
WINNER: rrf-20 (RAGAS: 0.911)
======================================================================

What Each Benchmark Tests

Fusion Strategy Benchmark

Compares how to combine vector (semantic) and keyword (BM25) search results.

Strategy        Description                     When to Use
rrf-20          Reciprocal Rank Fusion, k=20    Best overall performance
rrf-60          Reciprocal Rank Fusion, k=60    More conservative ranking
weighted-70v    70% vector + 30% keyword        Semantic-focused
weighted-50v    50% vector + 50% keyword        Balanced
vector-only     Pure embedding similarity       Conceptual queries
keyword-only    Pure BM25                       Exact term matching

Typical Finding: Hybrid search (RRF) outperforms single-method search by 3-5%.
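
For intuition on how RRF combines result lists, here is a minimal conceptual sketch (not the library's internal implementation; rrfFuse is a hypothetical name). Each document earns 1 / (k + rank) from every result list it appears in, so a smaller k weights top-ranked hits more heavily, which is why rrf-20 and rrf-60 rank slightly differently.

// Conceptual sketch of Reciprocal Rank Fusion; rank is 0-based within each list
def rrfFuse(resultLists: Seq[Seq[String]], k: Int = 20): Seq[(String, Double)] =
  resultLists
    .flatMap(_.zipWithIndex.map { case (docId, rank) => docId -> 1.0 / (k + rank + 1) })
    .groupMapReduce(_._1)(_._2)(_ + _)   // sum contributions per document
    .toSeq
    .sortBy(-_._2)                       // highest fused score first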

Chunking Strategy Benchmark

Compares different document splitting approaches.

Strategy            Target Size    Best For
sentence-small      400 chars      Focused factual queries
sentence-default    800 chars      General use
sentence-large      1200 chars     Complex reasoning
markdown-default    800 chars      Structured documents
simple-default      800 chars      Baseline comparison

Typical Finding: Sentence-aware chunking with smaller chunks improves precision.
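
For intuition, sentence-aware chunking splits on sentence boundaries and then packs whole sentences up to the target size, so a chunk never cuts a sentence in half. A rough illustrative sketch (sentenceChunks is a hypothetical helper, not ChunkerFactory's actual code):

// Pack whole sentences into chunks of roughly targetSize characters
def sentenceChunks(text: String, targetSize: Int = 400): Seq[String] =
  text.split("(?<=[.!?])\\s+").foldLeft(Vector.empty[String]) { (chunks, sentence) =>
    chunks.lastOption match {
      case Some(last) if last.length + sentence.length + 1 <= targetSize =>
        chunks.init :+ (last + " " + sentence)   // append to the current chunk
      case _ =>
        chunks :+ sentence                        // start a new chunk
    }
  }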

Embedding Provider Benchmark

Compares embedding quality across providers.

Provider    Model                     Dimensions    Notes
OpenAI      text-embedding-3-small    1536          Good balance of cost/quality
OpenAI      text-embedding-3-large    3072          Highest quality
Voyage      voyage-3                  1024          Optimized for retrieval
Ollama      nomic-embed-text          768           Free, local
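
As a rough sizing note, a 1536-dimension float32 vector takes about 6 KB (1536 × 4 bytes), so 100,000 chunks need on the order of 600 MB of raw vector storage; larger dimensions trade storage and search latency for quality.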

Programmatic Benchmarking

Custom Benchmark Suite

import org.llm4s.rag.benchmark._
import org.llm4s.llmconnect.{LLMConnect, EmbeddingClient}

// Initialize clients
val llm = LLMConnect.fromEnv().getOrElse(???)
val embed = EmbeddingClient.fromEnv().getOrElse(???)
val runner = new BenchmarkRunner(llm, embed)

// Define custom experiments
val experiments = Seq(
  RAGExperimentConfig(
    name = "my-config-1",
    chunkingStrategy = ChunkerFactory.Strategy.Sentence,
    fusionStrategy = FusionStrategy.RRF(30),
    topK = 5
  ),
  RAGExperimentConfig(
    name = "my-config-2",
    chunkingStrategy = ChunkerFactory.Strategy.Markdown,
    fusionStrategy = FusionStrategy.WeightedScore(0.8, 0.2),
    topK = 10
  )
)

// Create suite
val suite = BenchmarkSuite(
  name = "my-comparison",
  description = "Custom experiment comparison",
  experiments = experiments,
  datasetPath = "path/to/dataset.jsonl"
)

// Run and report
val results = runner.runSuite(suite)
println(BenchmarkReport.console(results))
BenchmarkReport.saveJson(results, "data/results/my-comparison.json")

Custom Dataset Format

Create a JSONL file with your own test data:

{"question": "What is Scala?", "documents": ["Scala is a programming language..."], "answer": "Scala is a JVM language."}
{"question": "How does pattern matching work?", "documents": ["Pattern matching in Scala..."], "answer": "Pattern matching uses case expressions."}

Improving Your RAG with Benchmarks

Optimization Workflow

1. BASELINE
   Run benchmark with default settings
   Record baseline RAGAS score

2. HYPOTHESIS
   "Smaller chunks might improve precision"

3. EXPERIMENT
   Run chunking benchmark
   Compare sentence-small vs sentence-default

4. ANALYZE
   Did RAGAS score improve?
   Which sub-metrics changed?

5. ITERATE
   Apply winning config
   Test new hypothesis

Common Optimization Patterns

Low Faithfulness Score

Problem: LLM generating claims not in context.

Solutions:

  • Increase topK to retrieve more context
  • Switch to larger chunk sizes
  • Add reranking to improve context quality
  • Use more explicit RAG prompts (see the example below)
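
For the last point, one possible "grounded" prompt looks like the sketch below (illustrative only, not a prescribed LLM4S template; groundedPrompt is a hypothetical helper):

// Instructs the model to answer only from the retrieved context,
// which tends to raise faithfulness at the cost of more "I don't know" answers
def groundedPrompt(question: String, contexts: Seq[String]): String =
  s"""Answer the question using ONLY the context below.
     |If the context does not contain the answer, reply "I don't know."
     |
     |Context:
     |${contexts.mkString("\n---\n")}
     |
     |Question: $question""".stripMargin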

Low Context Precision

Problem: Retrieving irrelevant chunks.

Solutions:

  • Try smaller chunk sizes for tighter matches
  • Use hybrid search (RRF) to combine semantic + keyword
  • Add metadata filtering
  • Tune embedding model

Low Context Recall

Problem: Missing relevant information.

Solutions:

  • Increase topK retrieval count
  • Use larger chunk overlap
  • Add query expansion
  • Try different embedding models

Low Answer Relevancy

Problem: Answers don’t address the question directly.

Solutions:

  • Improve RAG prompt template
  • Use query rewriting
  • Check if context contains the answer

Interpreting Results

What’s a Good RAGAS Score?

Score Range    Interpretation
0.90+          Excellent - production ready
0.80-0.90      Good - suitable for most use cases
0.70-0.80      Acceptable - room for improvement
0.60-0.70      Fair - needs optimization
<0.60          Poor - significant issues

Comparing Configurations

When comparing configurations:

  1. Statistical Significance: Small differences (<0.02) may be noise (see the sketch after this list)
  2. Trade-offs: Higher precision often means lower recall
  3. Domain Specificity: Best config varies by data type
  4. Cost Considerations: More retrieval = more embedding calls
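
When eyeballing deltas between two runs, a small helper like the sketch below keeps the noise threshold explicit (illustrative, not part of LLM4S; it assumes the evaluation imports from the Quick Start section):

// Compare average RAGAS across two evaluateBatch runs; treat < 0.02 as noise
def compareRuns(baseline: Seq[RAGASMetrics], candidate: Seq[RAGASMetrics]): String = {
  def mean(ms: Seq[RAGASMetrics]): Double = ms.map(_.ragasScore).sum / ms.size
  val delta = mean(candidate) - mean(baseline)
  if (math.abs(delta) < 0.02) f"Delta of ${delta}%.3f is within noise; keep the simpler config"
  else if (delta > 0) f"Candidate improves average RAGAS by ${delta}%.3f"
  else f"Baseline is still ahead by ${-delta}%.3f"
}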

Benchmark Results Reference

Based on actual benchmark runs with the RAGBench dataset:

Fusion Strategy Results

Config          RAGAS    Faithfulness    Relevancy
rrf-20          0.911    1.000           0.844
rrf-60          0.894    1.000           0.841
weighted-70v    0.893    1.000           0.841
weighted-50v    0.892    1.000           0.836
vector-only     0.884    0.960           0.845
keyword-only    0.809    N/A             0.809

Key Insight: Hybrid search (RRF) outperforms pure vector or keyword search.

Chunking Strategy Results

Config              RAGAS    Faithfulness    Relevancy
sentence-small      0.910    1.000           0.840
sentence-large      0.893    1.000           0.838
sentence-default    0.892    1.000           0.834
markdown-default    0.885    0.960           0.847
simple-default      0.882    0.960           0.833

Key Insight: Smaller, sentence-aware chunks perform best for focused queries.


Best Practices

1. Start with Baselines

Always run benchmarks with default settings first to establish a baseline.

2. Change One Variable at a Time

When optimizing, change only one configuration at a time to understand impact.

3. Use Representative Data

Your benchmark dataset should reflect real-world queries your system will handle.

4. Consider the Full Picture

Don’t optimize for one metric at the expense of others. Balance is key.

5. Monitor in Production

Benchmark scores are offline metrics. Monitor real user satisfaction in production.

6. Re-benchmark After Changes

When you update your document corpus, re-run benchmarks to verify quality.


API Reference

RAGASEvaluator

class RAGASEvaluator(llmClient: LLMClient) {
  def evaluate(sample: EvalSample): Result[RAGASMetrics]
  def evaluateBatch(samples: Seq[EvalSample]): Result[Seq[RAGASMetrics]]
}

EvalSample

final case class EvalSample(
  question: String,
  answer: String,
  contexts: Seq[String],
  groundTruth: Option[String] = None
)

RAGASMetrics

final case class RAGASMetrics(
  faithfulness: Option[Double],
  answerRelevancy: Option[Double],
  contextPrecision: Option[Double],
  contextRecall: Option[Double]
) {
  def ragasScore: Double  // Harmonic mean of available metrics
}

BenchmarkRunner

class BenchmarkRunner(llmClient: LLMClient, embeddingClient: EmbeddingClient) {
  def runSuite(suite: BenchmarkSuite): BenchmarkResults
  def runExperiment(config: RAGExperimentConfig, dataset: Dataset): ExperimentResult
}

Next Steps