org.llm4s.rag.evaluation.metrics

Members list

Type members

Classlikes

class AnswerRelevancy(llmClient: LLMClient, embeddingClient: EmbeddingClient, modelConfig: EmbeddingModelConfig, numGeneratedQuestions: Int) extends RAGASMetric

Answer Relevancy metric: measures how well the answer addresses the question.

Algorithm:

  1. Generate N questions that the provided answer would address
  2. Compute embedding for the original question
  3. Compute embeddings for the generated questions
  4. Calculate cosine similarity between original and generated question embeddings
  5. Score = average similarity across generated questions

The intuition: if the answer is relevant to the question, then questions generated from the answer should be semantically similar to the original question.
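
A minimal sketch of steps 4–5, assuming embeddings are plain Double vectors; the helper names below are illustrative and not part of the library API:

def cosineSimilarity(a: Vector[Double], b: Vector[Double]): Double = {
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
}

// Score = mean similarity between the original question and each generated question
def relevancyScore(original: Vector[Double], generated: Seq[Vector[Double]]): Double =
  if (generated.isEmpty) 0.0
  else generated.map(cosineSimilarity(original, _)).sum / generated.size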

Value parameters

embeddingClient

Client for computing embeddings

llmClient

LLM client for generating questions from the answer

modelConfig

Embedding model configuration

numGeneratedQuestions

Number of questions to generate (default: 3)

Attributes

Example
val metric = AnswerRelevancy(llmClient, embeddingClient, modelConfig)
val sample = EvalSample(
 question = "What is machine learning?",
 answer = "Machine learning is a subset of AI that enables systems to learn from data.",
 contexts = Seq("...") // contexts not used for this metric
)
val result = metric.evaluate(sample)
// High score if generated questions are similar to "What is machine learning?"
Companion
object
Supertypes
trait RAGASMetric
class Object
trait Matchable
class Any

object AnswerRelevancy

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
AnswerRelevancy.type
class ContextPrecision(llmClient: LLMClient) extends RAGASMetric

Context Precision metric: measures if relevant contexts are ranked at the top.

Algorithm:

  1. For each retrieved context, determine if it's relevant to the question/ground_truth
  2. Calculate precision@k for each position where a relevant doc appears
  3. Score = Average Precision (AP) = sum of (precision@k * relevance@k) / total_relevant

The intuition: if your retrieval system ranks relevant documents at the top, you get a higher score. Documents ranked lower contribute less to the score.
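
A minimal sketch of the Average Precision computation, where the k-th element of relevance is 1 if the context at 1-indexed position k was judged relevant and 0 otherwise (illustrative helper, not the library API):

def averagePrecision(relevance: Seq[Int]): Double = {
  val totalRelevant = relevance.sum
  if (totalRelevant == 0) 0.0
  else {
    val weighted = relevance.zipWithIndex.map { case (rel, i) =>
      val k = i + 1
      (relevance.take(k).sum.toDouble / k) * rel // precision@k, counted only at relevant positions
    }.sum
    weighted / totalRelevant
  }
}

// Relevant docs at positions 1 and 3: averagePrecision(Seq(1, 0, 1)) = (1.0 + 2.0/3.0) / 2 ≈ 0.83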

Value parameters

llmClient

The LLM client for relevance assessment

Attributes

Example
val metric = ContextPrecision(llmClient)
val sample = EvalSample(
 question = "What is the capital of France?",
 answer = "Paris is the capital of France.",
 contexts = Seq(
   "Paris is the capital and largest city of France.",  // relevant
   "France has beautiful countryside.",                  // less relevant
   "Paris has the Eiffel Tower."                         // relevant
 ),
 groundTruth = Some("The capital of France is Paris.")
)
val result = metric.evaluate(sample)
// High score if relevant contexts are at positions 1 and 2 vs scattered

Companion
object
Supertypes
trait RAGASMetric
class Object
trait Matchable
class Any

object ContextPrecision

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
ContextPrecision.type
class ContextRecall(llmClient: LLMClient) extends RAGASMetric

Context Recall metric: measures if all relevant information was retrieved.

Algorithm:

  1. Extract key facts/sentences from the ground truth answer
  2. For each fact, check if it can be attributed to the retrieved contexts
  3. Score = Number of facts covered by contexts / Total facts in ground truth

The intuition: if all facts needed to answer the question correctly are present in the retrieved contexts, recall is 1.0. Missing facts lower the score.
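
A minimal sketch of the final scoring step, assuming the per-fact attribution has already been produced by the LLM (see FactAttribution below); the helper name is illustrative, not part of the library API:

def recallScore(attributions: Seq[FactAttribution]): Double =
  if (attributions.isEmpty) 0.0
  else attributions.count(_.covered).toDouble / attributions.size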

Value parameters

llmClient

The LLM client for fact extraction and attribution

Attributes

Example
val metric = ContextRecall(llmClient)
val sample = EvalSample(
 question = "What are the symptoms of diabetes?",
 answer = "...",  // answer not used for this metric
 contexts = Seq(
   "Diabetes symptoms include excessive thirst and frequent urination.",
   "Type 2 diabetes may cause fatigue and blurred vision."
 ),
 groundTruth = Some("Symptoms of diabetes include increased thirst, frequent urination, fatigue, and blurred vision.")
)
val result = metric.evaluate(sample)
// Score = facts covered / total facts from ground truth

Companion
object
Supertypes
trait RAGASMetric
class Object
trait Matchable
class Any
object ContextRecall

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
ContextRecall.type
final case class FactAttribution(fact: String, covered: Boolean, sourceContext: Option[Int])

Result of attributing a fact to contexts.

Value parameters

covered

Whether the fact is covered by any context

fact

The fact from ground truth

sourceContext

The context number (1-indexed) that covers this fact, if any
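
For illustration, a hand-built set of attributions (values are made up) and the recall they would imply:

val attributions = Seq(
  FactAttribution("Diabetes causes increased thirst", covered = true, sourceContext = Some(1)),
  FactAttribution("Diabetes may cause blurred vision", covered = true, sourceContext = Some(2)),
  FactAttribution("Diabetes raises heart-disease risk", covered = false, sourceContext = None)
)
// 2 of 3 facts are covered, so context recall over these facts would be 2.0 / 3 ≈ 0.67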

Attributes

Supertypes
trait Serializable
trait Product
trait Equals
class Object
trait Matchable
class Any
class Faithfulness(llmClient: LLMClient, batchSize: Int) extends RAGASMetric

Faithfulness metric: measures factual accuracy of the answer relative to the retrieved contexts.

Algorithm:

  1. Extract factual claims from the generated answer using LLM
  2. For each claim, verify if it can be inferred from the contexts
  3. Score = Number of supported claims / Total number of claims

A score of 1.0 means all claims in the answer can be verified from the retrieved context. Lower scores indicate hallucination.
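
A minimal sketch of the scoring step; verifyBatch stands in for the LLM verification call and is not part of the library API:

def faithfulnessScore(claims: Seq[String], batchSize: Int)(verifyBatch: Seq[String] => Seq[Boolean]): Double =
  if (claims.isEmpty) 0.0
  else {
    // Verify claims in batches of batchSize, then take the fraction of supported claims
    val supported = claims.grouped(batchSize).flatMap(verifyBatch).count(identity)
    supported.toDouble / claims.size
  }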

Value parameters

batchSize

Number of claims to verify per LLM call (default: 5)

llmClient

The LLM client for claim extraction and verification

Attributes

Example
val faithfulness = Faithfulness(llmClient)
val sample = EvalSample(
 question = "What is the capital of France?",
 answer = "Paris is the capital of France and has a population of 2.1 million.",
 contexts = Seq("Paris is the capital and largest city of France.")
)
val result = faithfulness.evaluate(sample)
// Result: score ~0.5 (capital claim supported, population claim not supported)
Companion
object
Supertypes
trait RAGASMetric
class Object
trait Matchable
class Any
object Faithfulness

Attributes

Companion
class
Supertypes
class Object
trait Matchable
class Any
Self type
Faithfulness.type