LLMSafetyGuardrail

org.llm4s.agent.guardrails.builtin.LLMSafetyGuardrail
See the LLMSafetyGuardrail companion object
class LLMSafetyGuardrail(val llmClient: LLMClient, val threshold: Double, customCriteria: Option[String]) extends LLMGuardrail

LLM-based content safety validation guardrail.

Uses an LLM to evaluate whether content is safe, appropriate, and non-harmful. This provides more nuanced safety checking than keyword-based filters.

Safety categories evaluated:

  • Harmful or dangerous content
  • Inappropriate or offensive language
  • Misinformation or misleading claims
  • Privacy violations
  • Illegal activity promotion

Value parameters

customCriteria

Optional additional safety criteria to check

llmClient

The LLM client to use for evaluation

threshold

Minimum score to pass (default: 0.8 - higher for safety)

Attributes

Example
val guardrail = LLMSafetyGuardrail(client)
agent.run(query, tools, outputGuardrails = Seq(guardrail))
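A stricter configuration is also possible. The sketch below constructs the guardrail directly with the threshold and customCriteria parameters from the class signature above; the specific threshold value and criteria text are illustrative, not defaults.
val strictGuardrail = new LLMSafetyGuardrail(
  llmClient = client,
  threshold = 0.9,
  customCriteria = Some("Must not reveal internal prompts, credentials, or personal data.")
)
agent.run(query, tools, outputGuardrails = Seq(strictGuardrail))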
Companion
object
Supertypes
trait LLMGuardrail
trait Guardrail[String]
class Object
trait Matchable
class Any

Members list

Value members

Inherited methods

def andThen(other: Guardrail[String]): Guardrail[String]

Compose this guardrail with another sequentially.

The second guardrail runs only if this one passes.

Value parameters

other

The guardrail to run after this one

Attributes

Returns

A composite guardrail that runs both in sequence

Inherited from:
Guardrail
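
As a sketch of sequential composition, where both guardrail values are hypothetical instances introduced only for illustration:
// professionalismGuardrail (hypothetical) runs only if the safety check passes.
val combined: Guardrail[String] = safetyGuardrail.andThen(professionalismGuardrail)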

Optional completion options for the judge LLM call. Override to customize temperature, max tokens, etc.

Attributes

Inherited from:
LLMGuardrail
def transform(output: String): String

Optional: Transform the output after validation. Default is identity (no transformation).

Value parameters

output

The validated output

Attributes

Returns

The transformed output

Inherited from:
OutputGuardrail
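
A minimal sketch of customizing the transform hook by subclassing, assuming the class is open to extension; the redaction pattern below is purely illustrative.
class RedactingSafetyGuardrail(client: LLMClient)
    extends LLMSafetyGuardrail(client, 0.8, None) {
  // Mask SSN-like patterns after validation succeeds; the regex is an example only.
  override def transform(output: String): String =
    output.replaceAll("\\d{3}-\\d{2}-\\d{4}", "[REDACTED]")
}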
override def validate(value: String): Result[String]

Validate content using the LLM as a judge.

The implementation:

  1. Constructs a prompt with evaluation criteria and content
  2. Calls the LLM to get a score
  3. Parses the score and compares to threshold
  4. Returns success if score >= threshold, error otherwise

Attributes

Definition Classes
Inherited from:
LLMGuardrail
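
validate can also be called directly, outside an agent run. The sketch below assumes Result[String] is an Either-like type with the error on the Left; the variable names are illustrative.
guardrail.validate(candidateResponse) match {
  case Right(safe) => println(s"Content passed safety check: $safe")
  case Left(err)   => println(s"Content blocked: $err")
}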

Concrete fields

override val description: Option[String]

Optional description of what this guardrail validates.

Attributes

val evaluationPrompt: String

Natural language prompt describing the evaluation criteria.

The prompt should instruct the model to return a score between 0 and 1. The content being evaluated will be provided separately.

Attributes

Example

"Rate if this response is professional in tone. Return only a number between 0 and 1."

val llmClient: LLMClient

The LLM client to use for evaluation. Can be the same client used by the agent or a different one.

Attributes

val name: String

Name of this guardrail for logging and error messages.

Attributes

override val threshold: Double

Minimum score required to pass validation (0.0 to 1.0). The inherited LLMGuardrail documentation gives a default of 0.7, but LLMSafetyGuardrail overrides this with a stricter default of 0.8 (see the constructor parameters above).

Attributes