Avaliar is built around a small set of core primitives. This page gives you a quick mental model of each concept. Each section links to the full documentation where you can go deeper.
A trace represents a single LLM interaction from input to output. Traces are the fundamental unit of observability in Avaliar. Every function decorated with @traceable (or every request routed through the Proxy) creates a trace that captures the full request-response lifecycle, including the prompt, completion, model, provider, latency, token counts, and any detected issues.
Traces are stored in your organization’s workspace and visible in the Traces dashboard. You can filter, search, and inspect them to understand how your LLM application behaves in production.
Traces on the Platform
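To make the idea of a decorator-created trace concrete, here is a minimal sketch in plain Python. It is not the Avaliar SDK: `traceable_sketch`, `TraceRecord`, and `CAPTURED` are hypothetical names invented for illustration, and the real @traceable decorator captures more fields (model, provider, token counts, detected issues) than this stand-in does.

```python
import functools
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class TraceRecord:
    """Hypothetical in-memory trace; field names are illustrative only."""
    name: str
    inputs: Dict[str, Any]
    output: Any = None
    latency_ms: float = 0.0
    error: Optional[str] = None


# Stand-in for a trace store; a real SDK would ship records to a backend.
CAPTURED: List[TraceRecord] = []


def traceable_sketch(fn: Callable) -> Callable:
    """Wrap fn so that every call appends one TraceRecord to CAPTURED."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = TraceRecord(name=fn.__name__,
                             inputs={"args": args, "kwargs": kwargs})
        start = time.perf_counter()
        try:
            record.output = fn(*args, **kwargs)
            return record.output
        except Exception as exc:
            # Failed calls are recorded too, then re-raised unchanged.
            record.error = repr(exc)
            raise
        finally:
            record.latency_ms = (time.perf_counter() - start) * 1000.0
            CAPTURED.append(record)
    return wrapper


@traceable_sketch
def fake_completion(prompt: str) -> str:
    # Placeholder for a real model call.
    return prompt.upper()
```

Calling `fake_completion("hello")` appends one record to `CAPTURED` holding the function name, its arguments, the returned value, and the measured latency.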
Spans are individual operations within a trace. A trace can contain multiple nested spans in a parent-child hierarchy, letting you see exactly how a complex LLM pipeline executes step by step.
Avaliar supports four span types:

| Type | Description |
| --- | --- |
| llm | A direct call to a language model |
| tool | A tool or function call made by the model |
| agent | An autonomous agent step that may contain child spans |
| generic | Any other operation you want to track |

Nesting spans lets you trace multi-step workflows such as RAG pipelines, agent loops, or chain-of-thought sequences. Parent-child relationships are tracked automatically, with no manual wiring needed.
Traceable Decorator
Detection is the automated analysis of traces for safety issues. Avaliar provides six built-in detectors:

| Detector | What It Finds |
| --- | --- |
| Prompt Injection | Attempts to override system instructions via user input |
| Jailbreak | Techniques designed to bypass model safety guardrails |
| Toxicity | Harmful, abusive, or offensive language in inputs or outputs |
| PII | Personally identifiable information such as emails, phone numbers, and SSNs |
| Bias | Stereotyping, demographic bias, or unfair treatment in model outputs |
| Hallucination | Factually incorrect or fabricated information in model responses |

Detection runs in one of two modes: local (data stays on your infrastructure) or cloud (higher throughput, zero ops overhead).
Detection Overview · Detector Reference · Detection Modes
An issue is a finding produced by a detector during trace analysis. Each issue contains:
  • Type — the detector that found it (e.g., toxicity, pii)
  • Severity — one of low, medium, high, or critical
  • Confidence — a score from 0 to 1 indicating certainty
  • Description — a human-readable explanation of the finding
  • Excerpt — the specific portion of the input or output that triggered detection
Issues are attached to their parent trace and visible in both the trace detail view and the aggregated Issues section of the dashboard.
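The issue fields listed above map naturally onto a small record type. The following sketch models them as a Python dataclass and pairs it with a toy regex-based email detector. Everything here is illustrative: the `Issue` class, `detect_emails`, the fixed severity and confidence values, and the regex are assumptions made for the example, and Avaliar's built-in detectors are not claimed to work this way.

```python
import re
from dataclasses import dataclass
from typing import List

SEVERITIES = ("low", "medium", "high", "critical")


@dataclass(frozen=True)
class Issue:
    """Illustrative record mirroring the five issue fields described above."""
    type: str
    severity: str
    confidence: float
    description: str
    excerpt: str

    def __post_init__(self):
        if self.severity not in SEVERITIES:
            raise ValueError(f"invalid severity: {self.severity!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


# Deliberately simple pattern; real PII detection needs far more than a regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def detect_emails(text: str) -> List[Issue]:
    """Return one Issue per email-like substring found in text."""
    return [
        Issue(
            type="pii",
            severity="medium",   # placeholder value chosen for the example
            confidence=0.9,      # placeholder value chosen for the example
            description="Possible email address in text",
            excerpt=match.group(0),
        )
        for match in EMAIL_RE.finditer(text)
    ]
```

For the input `"Contact jane.doe@example.com today"`, `detect_emails` returns a single issue whose excerpt is the matched address.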
Benchmarks are standardized tests that measure an LLM’s general capabilities across well-known academic datasets:

| Benchmark | Measures |
| --- | --- |
| MMLU | Broad multi-task knowledge across 57 subjects |
| DROP | Discrete reasoning over paragraphs |
| HellaSwag | Commonsense NLI and sentence completion |
| TruthfulQA | Tendency to generate truthful vs. imitative-falsehood answers |
| BigBenchHard | Multi-step reasoning tasks |
| HumanEval | Functional code generation correctness |

Evals are safety-focused evaluations that measure bias and harm rather than general capability:

| Eval | Focus |
| --- | --- |
| BBQ | Social bias across 11 demographic categories |
| BOLD | Bias and toxicity in open-ended text generation |
| HExPHI | Whether models follow harmful instructions |
| RealToxicityPrompts | Likelihood of generating toxic continuations |

Benchmarks Overview · Available Benchmarks · Evals Overview
Alerts are automated notifications triggered when issues meet conditions you define. Each alert rule specifies:
  • Condition type: threshold (count exceeds N), trend (rate increasing), pattern (repeated issue type), or anomaly (statistical outlier)
  • Channels: where to send the notification (email, Slack, or webhook)
You can scope alerts to specific detectors, severity levels, models, or environments. Full alert history is stored and available for auditing.
Alerts on the Platform
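As a rough illustration of the threshold condition type, the sketch below evaluates a rule against a batch of issues and fires only when the matching count exceeds N. The `ThresholdRule` class and its fields are hypothetical, and treating the severity scope as a minimum severity is an assumption made for the example, not Avaliar's documented rule schema.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

# Ordering used to compare severities; the four levels come from the issue model.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


@dataclass
class ThresholdRule:
    """Hypothetical threshold alert rule scoped by detector and severity."""
    detector: Optional[str]      # None means "any detector"
    min_severity: str            # assumed scoping: this level or higher
    max_count: int               # the N in "count exceeds N"
    channels: Tuple[str, ...]    # e.g. ("email", "slack", "webhook")

    def matches(self, issue_type: str, severity: str) -> bool:
        if self.detector is not None and issue_type != self.detector:
            return False
        return SEVERITY_RANK[severity] >= SEVERITY_RANK[self.min_severity]

    def should_fire(self, issues: Iterable[Tuple[str, str]]) -> bool:
        """issues is an iterable of (issue_type, severity) pairs."""
        count = sum(1 for issue_type, severity in issues
                    if self.matches(issue_type, severity))
        return count > self.max_count
```

With `max_count=2`, a batch containing exactly two matching issues does not trigger the rule; a third matching issue does. Trend, pattern, and anomaly conditions would need a time dimension that this sketch leaves out.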
Reports are generated documents that aggregate trace data, issues, and benchmark results into a structured format for compliance and stakeholder review.

| Report | Purpose |
| --- | --- |
| Security Risk | Detected threats, attack patterns, and mitigation status |
| Model Cost | Token usage, latency, and spend by model and provider |
| Platform Ops | System health, uptime, trace volume, and error rates |
| AI Risk Posture | Overall safety score combining detection, benchmark, and eval results |

Reports can be generated on demand or scheduled, and exported as PDF or JSON.
Reports on the Platform