org.llm4s.rag.evaluation.metrics
Members list
Type members
Classlikes
Answer Relevancy metric: measures how well the answer addresses the question.
Algorithm:
- Generate N questions that the provided answer would address
- Compute embedding for the original question
- Compute embeddings for the generated questions
- Calculate cosine similarity between original and generated question embeddings
- Score = average similarity across generated questions
The intuition: if the answer is relevant to the question, then questions generated from the answer should be semantically similar to the original question.
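The final score is simply the mean cosine similarity between the original question's embedding and the embeddings of the generated questions. A minimal sketch of that step, assuming the embeddings are already available as plain vectors (helper names below are illustrative, not part of the library API):

  def cosine(a: Vector[Double], b: Vector[Double]): Double = {
    val dot   = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def answerRelevancyScore(original: Vector[Double], generated: Seq[Vector[Double]]): Double =
    if (generated.isEmpty) 0.0
    else generated.map(cosine(original, _)).sum / generated.size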
Value parameters
- embeddingClient: Client for computing embeddings
- llmClient: LLM client for generating questions from the answer
- modelConfig: Embedding model configuration
- numGeneratedQuestions: Number of questions to generate (default: 3)
Attributes
- Example:
  val metric = AnswerRelevancy(llmClient, embeddingClient, modelConfig)
  val sample = EvalSample(
    question = "What is machine learning?",
    answer = "Machine learning is a subset of AI that enables systems to learn from data.",
    contexts = Seq("...") // contexts not used for this metric
  )
  val result = metric.evaluate(sample)
  // High score if generated questions are similar to "What is machine learning?"
- Companion: object
- Supertypes
Attributes
- Companion: class
- Supertypes: class Object, trait Matchable, class Any
- Self type: AnswerRelevancy.type
Context Precision metric: measures if relevant contexts are ranked at the top.
Algorithm:
- For each retrieved context, determine if it's relevant to the question/ground_truth
- Calculate precision@k for each position where a relevant doc appears
- Score = Average Precision (AP) = sum of (precision@k * relevance@k) / total_relevant
The intuition: if your retrieval system ranks relevant documents at the top, you get a higher score. Documents ranked lower contribute less to the score.
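A minimal sketch of the Average Precision calculation, assuming the per-context relevance judgments (produced by the LLM in the real metric) are already available as booleans in retrieval order; the function name is illustrative:

  def averagePrecision(relevance: Seq[Boolean]): Double = {
    val totalRelevant = relevance.count(identity)
    if (totalRelevant == 0) 0.0
    else {
      val precisionAtRelevantRanks = relevance.zipWithIndex.collect { case (true, idx) =>
        relevance.take(idx + 1).count(identity).toDouble / (idx + 1) // precision@k
      }
      precisionAtRelevantRanks.sum / totalRelevant
    }
  }

  // Relevant documents ranked first score higher than the same documents ranked lower:
  averagePrecision(Seq(true, true, false)) // 1.0
  averagePrecision(Seq(false, true, true)) // ~0.58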
Value parameters
- llmClient: The LLM client for relevance assessment
Attributes
- Example:
  val metric = ContextPrecision(llmClient)
  val sample = EvalSample(
    question = "What is the capital of France?",
    answer = "Paris is the capital of France.",
    contexts = Seq(
      "Paris is the capital and largest city of France.", // relevant
      "France has beautiful countryside.",                // less relevant
      "Paris has the Eiffel Tower."                       // relevant
    ),
    groundTruth = Some("The capital of France is Paris.")
  )
  val result = metric.evaluate(sample)
  // High score if relevant contexts are at positions 1 and 2 vs scattered
- Companion: object
- Supertypes
Attributes
- Companion: class
- Supertypes: class Object, trait Matchable, class Any
- Self type: ContextPrecision.type
Context Recall metric: measures if all relevant information was retrieved.
Algorithm:
- Extract key facts/sentences from the ground truth answer
- For each fact, check if it can be attributed to the retrieved contexts
- Score = Number of facts covered by contexts / Total facts in ground truth
The intuition: if all facts needed to answer the question correctly are present in the retrieved contexts, recall is 1.0. Missing facts lower the score.
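A minimal sketch of the final recall computation, assuming the per-fact coverage decisions (made by the LLM in the real metric) are already available; FactCoverage here is an illustrative helper, not part of the library:

  final case class FactCoverage(fact: String, covered: Boolean)

  def contextRecall(facts: Seq[FactCoverage]): Double =
    if (facts.isEmpty) 0.0
    else facts.count(_.covered).toDouble / facts.size

  contextRecall(Seq(
    FactCoverage("increased thirst", covered = true),
    FactCoverage("frequent urination", covered = true),
    FactCoverage("fatigue", covered = true),
    FactCoverage("blurred vision", covered = false)
  )) // 0.75: one ground-truth fact is missing from the retrieved contexts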
Value parameters
- llmClient: The LLM client for fact extraction and attribution
Attributes
- Example:
  val metric = ContextRecall(llmClient)
  val sample = EvalSample(
    question = "What are the symptoms of diabetes?",
    answer = "...", // answer not used for this metric
    contexts = Seq(
      "Diabetes symptoms include excessive thirst and frequent urination.",
      "Type 2 diabetes may cause fatigue and blurred vision."
    ),
    groundTruth = Some("Symptoms of diabetes include increased thirst, frequent urination, fatigue, and blurred vision.")
  )
  val result = metric.evaluate(sample)
  // Score = facts covered / total facts from ground truth
- Companion: object
- Supertypes
Attributes
- Companion: class
- Supertypes: class Object, trait Matchable, class Any
- Self type: ContextRecall.type
Result of attributing a fact to contexts.
Value parameters
- covered: Whether the fact is covered by any context
- fact: The fact from ground truth
- sourceContext: The context number (1-indexed) that covers this fact, if any
Attributes
- Supertypes: trait Serializable, trait Product, trait Equals, class Object, trait Matchable, class Any
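A hypothetical construction of two attribution results, assuming the field types are fact: String, covered: Boolean, and sourceContext: Option[Int], as the parameter descriptions suggest:

  val attributed   = FactAttribution(fact = "fatigue is a symptom", covered = true, sourceContext = Some(2))
  val unattributed = FactAttribution(fact = "weight loss is a symptom", covered = false, sourceContext = None)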
Faithfulness metric: measures factual accuracy of the answer relative to the retrieved contexts.
Algorithm:
- Extract factual claims from the generated answer using LLM
- For each claim, verify if it can be inferred from the contexts
- Score = Number of supported claims / Total number of claims
A score of 1.0 means all claims in the answer can be verified from the retrieved context. Lower scores indicate hallucination.
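A minimal sketch of the score itself, assuming claim extraction and verification (done by the LLM in the real metric) have already produced one boolean per claim; the function name is illustrative:

  def faithfulnessScore(claimSupported: Seq[Boolean]): Double =
    if (claimSupported.isEmpty) 0.0
    else claimSupported.count(identity).toDouble / claimSupported.size

  // Two claims extracted from the answer, only the first supported by the context:
  faithfulnessScore(Seq(true, false)) // 0.5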
Value parameters
- batchSize: Number of claims to verify per LLM call (default: 5)
- llmClient: The LLM client for claim extraction and verification
Attributes
- Example:
  val faithfulness = Faithfulness(llmClient)
  val sample = EvalSample(
    question = "What is the capital of France?",
    answer = "Paris is the capital of France and has a population of 2.1 million.",
    contexts = Seq("Paris is the capital and largest city of France.")
  )
  val result = faithfulness.evaluate(sample)
  // Result: score ~0.5 (capital claim supported, population claim not supported)
- Companion: object
- Supertypes
Attributes
- Companion: class
- Supertypes: class Object, trait Matchable, class Any
- Self type: Faithfulness.type