Each metric evaluates a specific aspect of RAG quality and returns a score between 0.0 (worst) and 1.0 (best).
Implementations should:

- Use LLM calls for semantic evaluation (faithfulness, relevancy)
- Use embeddings for similarity calculations (answer relevancy)
- Return detailed breakdowns in the MetricResult.details map
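
To make this contract concrete, below is a minimal sketch of a faithfulness-style metric built on an LLM judge. The Metric trait, LlmClient interface, EvaluationSample fields, and the judge prompt are illustrative assumptions, not the library's actual API; only the 0.0-1.0 score and the MetricResult.details map follow the description above.

// Minimal sketch of a metric implementation. The type definitions below are
// simplified assumptions for illustration, not the library's real signatures.

final case class EvaluationSample(question: String, answer: String, contexts: List[String])
final case class MetricError(message: String)
final case class MetricResult(score: Double, details: Map[String, String])

trait LlmClient {
  // Assumed judge interface: returns the model's raw text reply or an error.
  def complete(prompt: String): Either[MetricError, String]
}

trait Metric {
  def name: String
  def evaluate(sample: EvaluationSample): Either[MetricError, MetricResult]
}

// Faithfulness via an LLM judge: ask how many answer claims the retrieved
// contexts support, then normalise the count into a 0.0-1.0 score.
final class Faithfulness(llm: LlmClient) extends Metric {
  val name = "faithfulness"

  def evaluate(sample: EvaluationSample): Either[MetricError, MetricResult] = {
    val prompt =
      s"""Count how many claims in the answer are supported by the context.
         |Reply with two integers only: <supported> <total>
         |Context:
         |${sample.contexts.mkString("\n")}
         |Answer:
         |${sample.answer}""".stripMargin

    llm.complete(prompt).flatMap { reply =>
      reply.trim.split("\\s+").toList match {
        case supported :: total :: _ =>
          (supported.toIntOption, total.toIntOption) match {
            case (Some(s), Some(t)) if t > 0 =>
              val score = (s.toDouble / t).max(0.0).min(1.0)
              Right(MetricResult(score, Map(
                "supported_claims" -> s.toString,
                "total_claims"     -> t.toString
              )))
            case _ => Left(MetricError(s"Judge reply was not two integers: $reply"))
          }
        case _ => Left(MetricError(s"Judge reply was not two integers: $reply"))
      }
    }
  }
}

An embedding-based metric such as answer relevancy would follow the same shape, replacing the judge call with a similarity computation and recording the individual similarities in the details map.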
Example
val faithfulness = new Faithfulness(llmClient)
val result = faithfulness.evaluate(sample)
// evaluate returns a Right(result) on success or a Left(error) on failure
result match {
  case Right(r) => println(s"Faithfulness: ${r.score}")
  case Left(e)  => println(s"Error: ${e.message}")
}