org.llm4s.rag.evaluation

Members list

Type members

Classlikes

final case class ClaimVerification(claim: String, supported: Boolean, evidence: Option[String])

Verification result for a single claim in the Faithfulness metric.

Value parameters

claim

The extracted claim from the answer

evidence

Optional evidence from context that supports/refutes the claim

supported

Whether the claim is supported by the context

Attributes
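
Example

An illustrative construction (the claim and evidence strings are made up):

val verification = ClaimVerification(
  claim = "Paris is the capital of France.",
  supported = true,
  evidence = Some("Paris is the capital and largest city of France.")
)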

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
final case class EvalResult(sample: EvalSample, metrics: Seq[MetricResult], ragasScore: Double, evaluatedAt: Long)

Complete evaluation result for a single sample.

Value parameters

evaluatedAt

Timestamp of evaluation

metrics

Results from each evaluated metric

ragasScore

Composite RAGAS score (mean of all metric scores)

sample

The evaluated sample

Attributes
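
Example

An illustrative sketch of building and reading an EvalResult; the metric values are made up, and ragasScore is shown as the documented mean of the metric scores:

val result = EvalResult(
  sample = sample, // some previously constructed EvalSample
  metrics = Seq(
    MetricResult("faithfulness", 0.9, Map.empty),
    MetricResult("answer_relevancy", 0.7, Map.empty)
  ),
  ragasScore = 0.8, // mean of 0.9 and 0.7
  evaluatedAt = System.currentTimeMillis()
)
result.metrics.foreach(m => println(s"${m.metricName}: ${m.score}"))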

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
final case class EvalSample(question: String, answer: String, contexts: Seq[String], groundTruth: Option[String], metadata: Map[String, String])

A single evaluation sample containing all inputs needed for RAGAS metrics.

Value parameters

answer

The generated answer from the RAG system

contexts

The retrieved context documents used to generate the answer

groundTruth

Optional ground truth answer (required for precision/recall metrics)

metadata

Additional metadata for tracking/filtering

question

The user's query

Attributes
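
Example

An illustrative construction; all fields are passed explicitly here, and the values are made up:

val sample = EvalSample(
  question = "What is the capital of France?",
  answer = "Paris is the capital of France.",
  contexts = Seq("Paris is the capital and largest city of France."),
  groundTruth = Some("The capital of France is Paris."), // needed for precision/recall metrics
  metadata = Map("source" -> "smoke-test")
)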

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
final case class EvalSummary(results: Seq[EvalResult], averages: Map[String, Double], overallRagasScore: Double, sampleCount: Int)

Summary of batch evaluation across multiple samples.

Value parameters

averages

Average score per metric across all samples

overallRagasScore

Average RAGAS score across all samples

results

Individual results for each sample

sampleCount

Number of samples evaluated

Attributes
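
Example

A sketch of reporting an EvalSummary, using only the fields documented above (the summary value is assumed to come from a batch evaluation):

def report(summary: EvalSummary): Unit = {
  println(s"Evaluated ${summary.sampleCount} samples")
  println(s"Overall RAGAS score: ${summary.overallRagasScore}")
  summary.averages.foreach { case (metric, avg) =>
    println(s"  $metric: $avg")
  }
}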

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
final case class EvaluationError(code: Option[String], message: String) extends LLMError

Error type for evaluation failures.

Attributes

Companion
object
Supertypes
trait LLMError
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object EvaluationError

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type
final case class EvaluatorOptions(parallelEvaluation: Boolean, maxConcurrency: Int, timeoutMs: Long)

Configuration options for the RAGAS evaluator.

Value parameters

maxConcurrency

Maximum concurrent metric evaluations

parallelEvaluation

Whether to evaluate metrics in parallel

timeoutMs

Timeout per metric evaluation in milliseconds

Attributes
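
Example

An illustrative configuration; the specific values (and whether the constructor provides defaults) are assumptions:

val options = EvaluatorOptions(
  parallelEvaluation = true, // run metrics concurrently
  maxConcurrency = 4,        // at most four metric evaluations in flight
  timeoutMs = 60000L         // one minute per metric
)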

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
final case class MetricResult(metricName: String, score: Double, details: Map[String, Any])

Result of evaluating a single metric.

Value parameters

details

Metric-specific breakdown (e.g., individual claim scores)

metricName

Unique identifier of the metric (e.g., "faithfulness")

score

Score between 0.0 (worst) and 1.0 (best)

Attributes
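
Example

An illustrative result; the keys in details are metric-specific, and the ones shown here are hypothetical:

val result = MetricResult(
  metricName = "faithfulness",
  score = 0.75,
  details = Map("totalClaims" -> 4, "supportedClaims" -> 3) // hypothetical breakdown keys
)
println(s"${result.metricName}: ${result.score}")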

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
class RAGASEvaluator(llmClient: LLMClient, embeddingClient: EmbeddingClient, embeddingModelConfig: EmbeddingModelConfig, metrics: Seq[RAGASMetric], options: EvaluatorOptions, tracer: Option[Tracing])

Main RAGAS evaluator that orchestrates all metrics.

RAGAS (Retrieval Augmented Generation Assessment) evaluates RAG pipelines across four dimensions:

  • Faithfulness: Are claims in the answer supported by context?
  • Answer Relevancy: Does the answer address the question?
  • Context Precision: Are relevant docs ranked at the top?
  • Context Recall: Were all relevant docs retrieved?

The composite RAGAS score is the mean of all evaluated metric scores.

Value parameters

embeddingClient

Embedding client for similarity calculations

embeddingModelConfig

Configuration for the embedding model

llmClient

LLM client for semantic evaluation (claim verification, relevance)

metrics

Custom metrics to use (defaults to all four RAGAS metrics)

options

Evaluation options (parallelism, timeouts)

tracer

Optional tracer for cost tracking

Attributes

Example
val evaluator = RAGASEvaluator(llmClient, embeddingClient, embeddingConfig)
val sample = EvalSample(
  question = "What is the capital of France?",
  answer = "Paris is the capital of France.",
  contexts = Seq("Paris is the capital and largest city of France."),
  groundTruth = Some("The capital of France is Paris.")
)
val result = evaluator.evaluate(sample)
result match {
  case Right(eval) =>
    println(s"RAGAS Score: ${eval.ragasScore}")
    eval.metrics.foreach { m =>
      println(s"  ${m.metricName}: ${m.score}")
    }
  case Left(error) =>
    println(s"Evaluation failed: ${error.message}")
}

Companion
object
Supertypes
class Object
trait Matchable
class Any
object RAGASEvaluator

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
object RAGASFactory

Factory for creating RAGAS evaluators and individual metrics.

Provides convenient methods to create evaluators from environment configuration or with specific settings.

Attributes

Example
// Create from environment
val evaluator = RAGASFactory.fromConfigs(providerCfg, embeddingCfg)

// Create with specific metrics
val basicEvaluator = RAGASFactory.withMetrics(
  llmClient, embeddingClient, embeddingConfig,
  Set("faithfulness", "answer_relevancy")
)

// Create individual metrics
val faithfulness = RAGASFactory.faithfulness(llmClient)

Supertypes
class Object
trait Matchable
class Any
Self type
class RAGASLangfuseObserver(langfuseUrl: String, publicKey: String, secretKey: String, environment: String, release: String, version: String, batchSender: LangfuseBatchSender)

Observer that logs RAGAS evaluation results to Langfuse.

Integrates with existing Langfuse tracing infrastructure to log:

  • Individual metric scores
  • Composite RAGAS scores
  • Evaluation details and metadata

Value parameters

environment

Environment name (e.g., "production", "development")

langfuseUrl

The Langfuse API URL

publicKey

Langfuse public key

release

Release version

secretKey

Langfuse secret key

version

API version

Attributes

Example
val observer = RAGASLangfuseObserver.fromTracingSettings(tracingSettings)
val result = evaluator.evaluate(sample)
result.foreach { evalResult =>
  observer.logEvaluation(evalResult)
}

Companion
object
Supertypes
class Object
trait Matchable
class Any
object RAGASLangfuseObserver

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
trait RAGASMetric

Base trait for RAGAS evaluation metrics.

Each metric evaluates a specific aspect of RAG quality and returns a score between 0.0 (worst) and 1.0 (best).

Implementations should:

  • Use LLM calls for semantic evaluation (faithfulness, relevancy)
  • Use embeddings for similarity calculations (answer relevancy)
  • Return detailed breakdowns in the MetricResult.details map

Attributes

Example
val faithfulness = new Faithfulness(llmClient)
val result = faithfulness.evaluate(sample)
result match {
  case Right(r) => println(s"Faithfulness: ${r.score}")
  case Left(e)  => println(s"Error: ${e.message}")
}
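
The trait's abstract members are not listed on this page, so the sketch below is only a guess at a custom metric: the member names and the evaluate signature are assumptions inferred from the usage above and from MetricResult/EvaluationError, and RAGASMetric may declare additional members (for example, required inputs) not shown here.

// Hypothetical sketch: assumes RAGASMetric exposes a metric name and an
// evaluate method returning Either[EvaluationError, MetricResult].
class AnswerLength extends RAGASMetric {
  val name: String = "answer_length"

  def evaluate(sample: EvalSample): Either[EvaluationError, MetricResult] = {
    // Score 1.0 for concise answers, falling linearly to 0.0 at 500 characters.
    val score = math.max(0.0, 1.0 - sample.answer.length / 500.0)
    Right(MetricResult(name, score, Map("answerLength" -> sample.answer.length)))
  }
}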
Supertypes
class Object
trait Matchable
class Any
Known subtypes
sealed trait RequiredInput

Enumeration of possible required inputs for RAGAS metrics.

Attributes
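
Example

An illustrative check that a sample provides a given input; this assumes the case objects are nested in the RequiredInput companion, and the mapping simply follows the EvalSample fields:

def isProvided(input: RequiredInput, sample: EvalSample): Boolean = input match {
  case RequiredInput.Question    => sample.question.nonEmpty
  case RequiredInput.Answer      => sample.answer.nonEmpty
  case RequiredInput.Contexts    => sample.contexts.nonEmpty
  case RequiredInput.GroundTruth => sample.groundTruth.isDefined
}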

Companion
object
Supertypes
class Object
trait Matchable
class Any
Known subtypes
object Answer
object Contexts
object GroundTruth
object Question
object RequiredInput

Attributes

Companion
trait
Supertypes
trait Sum
trait Mirror
class Object
trait Matchable
class Any
Self type
final case class TestDataset(name: String, samples: Seq[EvalSample], metadata: Map[String, String])

Test dataset for RAG evaluation.

Supports loading from JSON files and generating synthetic test cases from documents using an LLM.

Value parameters

metadata

Additional metadata for tracking/filtering

name

Name identifier for this dataset

samples

The evaluation samples

Attributes

Example
// Load from file
val dataset = TestDataset.fromJsonFile("test_cases.json")

// Generate synthetic test cases
val generated = TestDataset.generateFromDocuments(
  documents = Seq("Paris is the capital of France...", "Tokyo is the capital of Japan..."),
  llmClient = client,
  samplesPerDoc = 3
)

// Save to file
TestDataset.save(dataset, "output.json")

Companion
object
Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
object TestDataset

Attributes

Companion
class
Supertypes
trait Product
trait Mirror
class Object
trait Matchable
class Any
Self type