org.llm4s.rag.evaluation.metrics

Members list

Type members

Classlikes

class AnswerRelevancy(llmClient: LLMClient, embeddingClient: EmbeddingClient, modelConfig: EmbeddingModelConfig, numGeneratedQuestions: Int) extends RAGASMetric

Answer Relevancy metric: measures how well the answer addresses the question.

Algorithm:

  1. Generate N questions that the provided answer would address
  2. Compute embedding for the original question
  3. Compute embeddings for the generated questions
  4. Calculate cosine similarity between original and generated question embeddings
  5. Score = average similarity across generated questions

The intuition: if the answer is relevant to the question, then questions generated from the answer should be semantically similar to the original question.
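
A minimal sketch of steps 4–5, assuming embeddings are plain Double vectors; the helper names below are illustrative and not part of the library API:

def cosineSimilarity(a: Vector[Double], b: Vector[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Score = mean similarity between the original question and each generated question
def relevancyScore(original: Vector[Double], generated: Seq[Vector[Double]]): Double =
  if (generated.isEmpty) 0.0
  else generated.map(cosineSimilarity(original, _)).sum / generated.size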

Value parameters

embeddingClient

Client for computing embeddings

llmClient

LLM client for generating questions from the answer

modelConfig

Embedding model configuration

numGeneratedQuestions

Number of questions to generate (default: 3)

Attributes

Example
val metric = AnswerRelevancy(llmClient, embeddingClient, modelConfig)
val sample = EvalSample(
 question = "What is machine learning?",
 answer = "Machine learning is a subset of AI that enables systems to learn from data.",
 contexts = Seq("...") // contexts not used for this metric
)
val result = metric.evaluate(sample)
// High score if generated questions are similar to "What is machine learning?"
Companion
object
Supertypes
trait RAGASMetric
class Object
trait Matchable
class Any

object AnswerRelevancy

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
AnswerRelevancy.type
class ContextPrecision(llmClient: LLMClient) extends RAGASMetric

Context Precision metric: measures if relevant contexts are ranked at the top.

Algorithm:

  1. For each retrieved context, determine if it's relevant to the question/ground_truth
  2. Calculate precision@k for each position where a relevant doc appears
  3. Score = Average Precision (AP) = sum of (precision@k * relevance@k) / total_relevant

The intuition: if your retrieval system ranks relevant documents at the top, you get a higher score. Documents ranked lower contribute less to the score.
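
A minimal sketch of the Average Precision computation, where the k-th element of relevance is 1 if the context at 1-indexed position k was judged relevant and 0 otherwise (illustrative helper, not the library API):

def averagePrecision(relevance: Seq[Int]): Double = {
  val totalRelevant = relevance.sum
  if (totalRelevant == 0) 0.0
  else {
    val weighted = relevance.zipWithIndex.map { case (rel, i) =>
      val k = i + 1
      (relevance.take(k).sum.toDouble / k) * rel // precision@k, counted only at relevant positions
    }.sum
    weighted / totalRelevant
  }
}

// Relevant docs at positions 1 and 3: averagePrecision(Seq(1, 0, 1)) = (1.0 + 2.0/3.0) / 2 ≈ 0.83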

Value parameters

llmClient

The LLM client for relevance assessment

Attributes

Example
val metric = ContextPrecision(llmClient)
val sample = EvalSample(
 question = "What is the capital of France?",
 answer = "Paris is the capital of France.",
 contexts = Seq(
   "Paris is the capital and largest city of France.",  // relevant
   "France has beautiful countryside.",                  // less relevant
   "Paris has the Eiffel Tower."                         // relevant
 ),
 groundTruth = Some("The capital of France is Paris.")
)
val result = metric.evaluate(sample)
// High score if relevant contexts are at positions 1 and 2 vs scattered

Companion
object
Supertypes
trait RAGASMetric
class Object
trait Matchable
class Any

object ContextPrecision

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
ContextPrecision.type
class ContextRecall(llmClient: LLMClient) extends RAGASMetric

Context Recall metric: measures if all relevant information was retrieved.

Algorithm:

  1. Extract key facts/sentences from the ground truth answer
  2. For each fact, check if it can be attributed to the retrieved contexts
  3. Score = Number of facts covered by contexts / Total facts in ground truth

The intuition: if all facts needed to answer the question correctly are present in the retrieved contexts, recall is 1.0. Missing facts lower the score.
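
A minimal sketch of the final scoring step, assuming the per-fact attribution has already been produced by the LLM (see FactAttribution below); the helper name is illustrative, not part of the library API:

def recallScore(attributions: Seq[FactAttribution]): Double =
  if (attributions.isEmpty) 0.0
  else attributions.count(_.covered).toDouble / attributions.size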

Value parameters

llmClient

The LLM client for fact extraction and attribution

Attributes

Example
val metric = ContextRecall(llmClient)
val sample = EvalSample(
 question = "What are the symptoms of diabetes?",
 answer = "...",  // answer not used for this metric
 contexts = Seq(
   "Diabetes symptoms include excessive thirst and frequent urination.",
   "Type 2 diabetes may cause fatigue and blurred vision."
 ),
 groundTruth = Some("Symptoms of diabetes include increased thirst, frequent urination, fatigue, and blurred vision.")
)
val result = metric.evaluate(sample)
// Score = facts covered / total facts from ground truth

Companion
object
Supertypes
trait RAGASMetric
class Object
trait Matchable
class Any
object ContextRecall

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
ContextRecall.type
final case class FactAttribution(fact: String, covered: Boolean, sourceContext: Option[Int])

Result of attributing a fact to contexts.

Value parameters

covered

Whether the fact is covered by any context

fact

The fact from ground truth

sourceContext

The context number (1-indexed) that covers this fact, if any
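
For illustration, a hand-built set of attributions (values are made up) and the recall they would imply:

val attributions = Seq(
  FactAttribution("Diabetes causes increased thirst", covered = true, sourceContext = Some(1)),
  FactAttribution("Diabetes may cause blurred vision", covered = true, sourceContext = Some(2)),
  FactAttribution("Diabetes raises heart-disease risk", covered = false, sourceContext = None)
)
// 2 of 3 facts are covered, so context recall over these facts would be 2.0 / 3 ≈ 0.67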

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
class Faithfulness(llmClient: LLMClient, batchSize: Int) extends RAGASMetric

Faithfulness metric: measures factual accuracy of the answer relative to the retrieved contexts.

Algorithm:

  1. Extract factual claims from the generated answer using LLM
  2. For each claim, verify if it can be inferred from the contexts
  3. Score = Number of supported claims / Total number of claims

A score of 1.0 means all claims in the answer can be verified from the retrieved context. Lower scores indicate hallucination.
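
A minimal sketch of the scoring step; verifyBatch stands in for the LLM verification call and is not part of the library API:

def faithfulnessScore(claims: Seq[String], batchSize: Int)(verifyBatch: Seq[String] => Seq[Boolean]): Double =
  if (claims.isEmpty) 0.0
  else {
    // Verify claims in batches of batchSize, then take the fraction of supported claims
    val supported = claims.grouped(batchSize).flatMap(verifyBatch).count(identity)
    supported.toDouble / claims.size
  }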

Value parameters

batchSize

Number of claims to verify per LLM call (default: 5)

llmClient

The LLM client for claim extraction and verification

Attributes

Example
val faithfulness = Faithfulness(llmClient)
val sample = EvalSample(
 question = "What is the capital of France?",
 answer = "Paris is the capital of France and has a population of 2.1 million.",
 contexts = Seq("Paris is the capital and largest city of France.")
)
val result = faithfulness.evaluate(sample)
// Result: score ~0.5 (capital claim supported, population claim not supported)
Companion
object
Supertypes
trait RAGASMetric
class Object
trait Matchable
class Any
object Faithfulness

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
Faithfulness.type