Detection Overview

What is Detection?

Avaliar’s detection system automatically analyzes LLM inputs and outputs to identify safety issues. Every trace — whether captured via the SDK or the Proxy — can be scanned by a suite of detectors that flag problems like prompt injection, toxicity, PII leakage, bias, jailbreak attempts, and hallucinations. Detection is the core of Avaliar’s safety layer. It turns raw traces into actionable findings with severity classifications and, when configured, triggers real-time alerts.

Three ways to use detection

1. SDK — integrated with tracing

Add detection directly to any @traceable LLM span. Results are automatically attached to your traces.

from avaliar import traceable
from avaliar.detectors import DetectorType

@traceable(
    "llm",
    model="gpt-4o",
    provider="openai",
    detection=True,
    detectors=[DetectorType.PROMPT_INJECTION, DetectorType.TOXICITY],
    detection_mode="cloud",
)
async def generate(messages: list) -> str:
    ...

2. SDK — standalone

Run detectors independently for custom validation workflows:

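import asyncio

from avaliar.detectors import Detector, DetectorType

detector = Detector([DetectorType.PROMPT_INJECTION, DetectorType.PII])

async def main() -> None:
    # evaluate_full is awaited, so it has to run inside a coroutine
    result = await detector.evaluate_full(
        prompt="User input here",
        response="LLM response here",
    )

    # Report each finding with its type, message, and severity
    if result.has_issues:
        for issue in result.issues:
            print(f"{issue.type}: {issue.message} (severity: {issue.severity})")

asyncio.run(main())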

3. Proxy Detection

When you route LLM calls through the Avaliar Proxy, detection runs automatically. Prompt injection detection runs synchronously, so harmful inputs can be blocked, while all other detectors run asynchronously on Avaliar’s backend. Beyond routing calls through the proxy, no detection code is needed in your application.
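
The split between a blocking check and background scanning can be illustrated with a small, self-contained sketch. This is not Avaliar's implementation or API: `looks_like_injection`, `background_detectors`, `call_model`, and `proxied_call` are hypothetical stand-ins, and the string checks are toys chosen only to make the control flow visible.

```python
import asyncio
from typing import List, Tuple

# All names below are hypothetical stand-ins, not Avaliar APIs.

def looks_like_injection(prompt: str) -> bool:
    """Toy synchronous check: flags one well-known override phrase."""
    return "ignore previous instructions" in prompt.lower()

async def background_detectors(prompt: str, response: str, findings: List[str]) -> None:
    """Toy asynchronous scan that runs after the response has been returned."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    if "@" in response:
        findings.append("possible PII in response")

async def call_model(prompt: str) -> str:
    """Placeholder for the upstream LLM call."""
    return f"echo: {prompt}"

async def proxied_call(prompt: str, findings: List[str]) -> Tuple[str, asyncio.Task]:
    # Step 1: blocking check, so a flagged prompt never reaches the model.
    if looks_like_injection(prompt):
        raise PermissionError("blocked: possible prompt injection")
    # Step 2: forward the request to the model.
    response = await call_model(prompt)
    # Step 3: schedule the remaining detectors without delaying the caller.
    task = asyncio.create_task(background_detectors(prompt, response, findings))
    return response, task
```

The caller gets the response as soon as the model returns; the background task is handed back here only so a demo or test can wait for the asynchronous findings.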

Detection Pipeline

Every trace follows the same pipeline:

Input (prompt + response)
  → Detectors (6 types)
    → Issues identified
      → Severity classification (LOW → CRITICAL)
        → Alerts triggered (if configured)
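
The stages above can be sketched in a few lines of plain Python. This is a conceptual model under stated assumptions, not Avaliar's internals: the `Issue` dataclass, the two toy detectors, and `run_pipeline` are hypothetical, and the alert rule (alert on HIGH and CRITICAL) is an illustrative choice rather than a documented default.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

@dataclass
class Issue:
    type: str
    message: str
    severity: str  # one of "LOW", "MEDIUM", "HIGH", "CRITICAL"

# A detector takes (prompt, response) and returns an Issue or None.
DetectorFn = Callable[[str, str], Optional[Issue]]

def toy_injection_detector(prompt: str, response: str) -> Optional[Issue]:
    # Hypothetical rule: a single override phrase in the prompt.
    if "ignore previous instructions" in prompt.lower():
        return Issue("prompt_injection", "override phrase in prompt", "HIGH")
    return None

def toy_pii_detector(prompt: str, response: str) -> Optional[Issue]:
    # Hypothetical rule: an "@" in the response is treated as email-like text.
    if "@" in response:
        return Issue("pii", "email-like text in response", "MEDIUM")
    return None

def run_pipeline(
    prompt: str,
    response: str,
    detectors: Sequence[DetectorFn],
    alert_on: Sequence[str] = ("HIGH", "CRITICAL"),
) -> Tuple[List[Issue], List[Issue]]:
    # Input -> detectors -> issues identified (each already carries a severity)
    issues: List[Issue] = []
    for detector in detectors:
        issue = detector(prompt, response)
        if issue is not None:
            issues.append(issue)
    # Alerts are the subset of issues whose severity matches the configured levels.
    alerts = [issue for issue in issues if issue.severity in alert_on]
    return issues, alerts
```

Calling `run_pipeline` with both toy detectors returns every finding plus the subset that would raise an alert under the assumed rule.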

Detector Types

Avaliar includes six built-in detector types:

Prompt Injection — Detects attempts to manipulate the LLM through crafted inputs
Jailbreak — Identifies attempts to bypass LLM safety constraints
Toxicity — Flags offensive, harmful, or inappropriate content
PII Detection — Finds personally identifiable information in inputs or outputs
Bias — Identifies biased or discriminatory content
Hallucination — Detects factually incorrect or fabricated information

See Detector Types for detailed descriptions and examples of each.

Severity Levels

Every detected issue is assigned a severity level:

Severity	Description
LOW	Minor issues that may warrant review but do not pose an immediate risk. Informational findings.
MEDIUM	Notable issues that should be addressed. May indicate a pattern that could escalate.
HIGH	Significant safety concerns that require prompt attention. Active risk to users or data.
CRITICAL	Severe issues demanding immediate action. Active exploitation, data exposure, or dangerous outputs.
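
Because the four levels are ordered, it is often convenient to compare them numerically when filtering findings. The sketch below assumes a local `Severity` enum defined by the reader; how the SDK itself represents `issue.severity` (string, enum, or something else) is not specified here, so map its values onto this enum as needed. The helper names `at_least` and `worst` are hypothetical.

```python
from enum import IntEnum
from typing import Iterable, List, Optional

class Severity(IntEnum):
    # Local, illustrative ordering that mirrors the table above.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4

def at_least(severities: Iterable[Severity], threshold: Severity) -> List[Severity]:
    """Keep only the severities at or above the threshold."""
    return [s for s in severities if s >= threshold]

def worst(severities: Iterable[Severity]) -> Optional[Severity]:
    """Return the highest severity seen, or None when there are no findings."""
    return max(severities, default=None)
```

For example, `at_least(found, Severity.HIGH)` keeps only the findings that the table describes as requiring prompt or immediate attention, and `Severity["MEDIUM"]` converts a level name into the enum member.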

Next Steps

Detector Types

Detection Modes