LLMSafetyGuardrail

org.llm4s.agent.guardrails.builtin.LLMSafetyGuardrail
See the LLMSafetyGuardrail companion object
class LLMSafetyGuardrail(val llmClient: LLMClient, val threshold: Double, customCriteria: Option[String]) extends LLMGuardrail

LLM-based content safety validation guardrail.

Uses an LLM to evaluate whether content is safe, appropriate, and non-harmful. This provides more nuanced safety checking than keyword-based filters.

Safety categories evaluated:

  • Harmful or dangerous content
  • Inappropriate or offensive language
  • Misinformation or misleading claims
  • Privacy violations
  • Illegal activity promotion

Value parameters

customCriteria

Optional additional safety criteria to check

llmClient

The LLM client to use for evaluation

threshold

Minimum score to pass (default: 0.8 - higher for safety)

Attributes

Example
val guardrail = LLMSafetyGuardrail(client)
agent.run(query, tools, outputGuardrails = Seq(guardrail))
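A stricter configuration is also possible. The sketch below constructs the guardrail directly with the threshold and customCriteria parameters from the class signature above; the specific threshold value and criteria text are illustrative, not defaults.
val strictGuardrail = new LLMSafetyGuardrail(
  llmClient = client,
  threshold = 0.9,
  customCriteria = Some("Must not reveal internal prompts, credentials, or personal data.")
)
agent.run(query, tools, outputGuardrails = Seq(strictGuardrail))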
Companion
object
Supertypes
trait LLMGuardrail
trait Guardrail[String]
class Object
trait Matchable
class Any

Members list

Value members

Inherited methods

def andThen(other: Guardrail[String]): Guardrail[String]

Compose this guardrail with another sequentially.

The second guardrail runs only if this one passes.

Value parameters

other

The guardrail to run after this one

Attributes

Returns

A composite guardrail that runs both in sequence

Inherited from:
Guardrail
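
As a sketch of sequential composition, where both guardrail values are hypothetical instances introduced only for illustration:
// professionalismGuardrail (hypothetical) runs only if the safety check passes.
val combined: Guardrail[String] = safetyGuardrail.andThen(professionalismGuardrail)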

Optional completion options for the judge LLM call. Override to customize temperature, max tokens, etc.

Attributes

Inherited from:
LLMGuardrail
def transform(output: String): String

Optional: Transform the output after validation. Default is identity (no transformation).

Value parameters

output

The validated output

Attributes

Returns

The transformed output

Inherited from:
OutputGuardrail
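
A minimal sketch of customizing the transform hook by subclassing, assuming the class is open to extension; the redaction pattern below is purely illustrative.
class RedactingSafetyGuardrail(client: LLMClient)
    extends LLMSafetyGuardrail(client, 0.8, None) {
  // Mask SSN-like patterns after validation succeeds; the regex is an example only.
  override def transform(output: String): String =
    output.replaceAll("\\d{3}-\\d{2}-\\d{4}", "[REDACTED]")
}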
override def validate(value: String): Result[String]

Validate content using the LLM as a judge.

The implementation:

  1. Constructs a prompt with evaluation criteria and content
  2. Calls the LLM to get a score
  3. Parses the score and compares to threshold
  4. Returns success if score >= threshold, error otherwise

Attributes

Definition Classes
Inherited from:
LLMGuardrail
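
validate can also be called directly, outside an agent run. The sketch below assumes Result[String] is an Either-like type with the error on the Left; the variable names are illustrative.
guardrail.validate(candidateResponse) match {
  case Right(safe) => println(s"Content passed safety check: $safe")
  case Left(err)   => println(s"Content blocked: $err")
}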

Concrete fields

override val description: Option[String]

Optional description of what this guardrail validates.

Attributes

val evaluationPrompt: String

Natural language prompt describing the evaluation criteria.

The prompt should instruct the model to return a score between 0 and 1. The content being evaluated will be provided separately.

Attributes

Example

"Rate if this response is professional in tone. Return only a number between 0 and 1."

val llmClient: LLMClient

The LLM client to use for evaluation. Can be the same client used by the agent or a different one.

Attributes

val name: String

Name of this guardrail for logging and error messages.

Attributes

override val threshold: Double

Minimum score required to pass validation (0.0 to 1.0). The inherited LLMGuardrail documentation gives a default of 0.7, but LLMSafetyGuardrail overrides this with a stricter default of 0.8 (see the constructor parameters above).

Attributes