Evals Overview - Avaliar AI

Avaliar provides dedicated evaluation suites for measuring bias and safety in LLM outputs. Unlike benchmarks which test knowledge and reasoning, evals focus on responsible AI properties.

Available evals

BBQ

Bias Benchmark for QA — Tests for demographic bias across 11 categories using ambiguous question-answering.

BOLD

Bias in Open-ended Language Generation — Measures toxicity, insult, stereotype bias, and negative regard in open-ended continuations.

HExPHI

Harmful Instructions — Tests whether models follow harmful instructions across 10 safety categories.

RealToxicityPrompts

Toxicity in Continuations — Measures toxicity, profanity, and insult rates in model-generated text.

Bias vs safety

Type	Evals	What they measure
Bias	BBQ, BOLD	Demographic stereotypes, unfair treatment, negative regard
Safety	HExPHI, RealToxicityPrompts	Harmful content generation, toxicity, refusal rates

Quick start

from avaliar.evals.bias import BBQ, BBQTask
from avaliar.models.base import AvaliarBaseLLM
from openai import OpenAI


class MyModel(AvaliarBaseLLM):
    def __init__(self):
        self.client = OpenAI()

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content


model = MyModel()

bbq = BBQ(
    tasks=[BBQTask.AGE, BBQTask.GENDER_IDENTITY],
    n_problems_per_task=100,
)
result = bbq.evaluate(model)
print(f"Accuracy: {result.overall_accuracy}")

Model interface

All evals require an AvaliarBaseLLM implementation. Some evals need additional methods:

Method	Required by	Purpose
`generate(prompt) -> str`	All evals	Single prompt generation
`generate_samples(prompt, n, temperature) -> list[str]`	BOLD, RealToxicityPrompts	Multiple samples per prompt
`batch_generate(prompts) -> list[str]`	HExPHI (optional)	Batch processing for speed

See AvaliarBaseLLM for implementation details.

Judge-based evaluation

BOLD, HExPHI, and RealToxicityPrompts use an LLM judge to score model outputs. Configure the judge when constructing the eval:

eval = HExPHI(
    judge_provider="openai",     # or "anthropic"
    judge_model="gpt-4o",
    judge_temperature=0.0,
)

Posting results

All evals support posting results to the Avaliar dashboard:

result = bbq.evaluate(model)
bbq.post_results(
    model_name="gpt-4o",
    tags=["bias-audit", "quarterly"],
)

​Available evals

BBQ

BOLD

HExPHI

RealToxicityPrompts

​Bias vs safety

​Quick start

​Model interface

​Judge-based evaluation

​Posting results