Avaliar provides dedicated evaluation suites for measuring bias and safety in LLM outputs. Unlike benchmarks, which test knowledge and reasoning, evals focus on responsible AI properties.

Available evals

BBQ

Bias Benchmark for QA — Tests for demographic bias across 11 categories using ambiguous question-answering.

BOLD

Bias in Open-ended Language Generation — Measures toxicity, insult, stereotype bias, and negative regard in open-ended continuations.

HExPHI

Harmful Instructions — Tests whether models follow harmful instructions across 10 safety categories.

RealToxicityPrompts

Toxicity in Continuations — Measures toxicity, profanity, and insult rates in model-generated text.

Bias vs safety

| Type | Evals | What they measure |
| --- | --- | --- |
| Bias | BBQ, BOLD | Demographic stereotypes, unfair treatment, negative regard |
| Safety | HExPHI, RealToxicityPrompts | Harmful content generation, toxicity, refusal rates |

Quick start

from avaliar.evals.bias import BBQ, BBQTask
from avaliar.models.base import AvaliarBaseLLM
from openai import OpenAI


class MyModel(AvaliarBaseLLM):
    def __init__(self):
        self.client = OpenAI()

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        # message.content is Optional[str] in the OpenAI SDK; return "" instead of None
        return response.choices[0].message.content or ""


model = MyModel()

bbq = BBQ(
    tasks=[BBQTask.AGE, BBQTask.GENDER_IDENTITY],
    n_problems_per_task=100,
)
result = bbq.evaluate(model)
print(f"Accuracy: {result.overall_accuracy}")

Model interface

All evals require an AvaliarBaseLLM implementation. Some evals need additional methods:

| Method | Required by | Purpose |
| --- | --- | --- |
| generate(prompt) -> str | All evals | Single prompt generation |
| generate_samples(prompt, n, temperature) -> list[str] | BOLD, RealToxicityPrompts | Multiple samples per prompt |
| batch_generate(prompts) -> list[str] | HExPHI (optional) | Batch processing for speed |

See AvaliarBaseLLM for implementation details.
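
The three methods listed above can be sketched together on one class. This is a minimal illustration, not Avaliar's reference implementation: `EchoBackend` is a hypothetical stand-in for a real API client so the example runs offline, `FullModel` and `max_workers` are illustrative names, and the thread-pool approach to batching is one possible choice. In real use the class would subclass AvaliarBaseLLM, and the exact signatures Avaliar expects are assumed from the method list above.

```python
from __future__ import annotations

from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class EchoBackend:
    """Hypothetical stand-in for a real LLM client; returns canned text."""

    def complete(self, prompt: str, temperature: float = 0.0) -> str:
        return f"[t={temperature:.1f}] reply to: {prompt}"


class FullModel:  # assumption: in real use, subclass AvaliarBaseLLM here
    def __init__(self, backend: Optional[EchoBackend] = None, max_workers: int = 4):
        self.backend = backend or EchoBackend()
        self.max_workers = max_workers

    def generate(self, prompt: str) -> str:
        # Single completion, needed by every eval.
        return self.backend.complete(prompt)

    def generate_samples(self, prompt: str, n: int, temperature: float) -> list[str]:
        # n independent completions of the same prompt at the given temperature.
        return [self.backend.complete(prompt, temperature) for _ in range(n)]

    def batch_generate(self, prompts: list[str]) -> list[str]:
        # Run generate() concurrently; pool.map returns results in input order.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            return list(pool.map(self.generate, prompts))
```

Keeping the backend behind a single `complete` call means the same class can be pointed at a real client later without changing the three public methods.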

Judge-based evaluation

BOLD, HExPHI, and RealToxicityPrompts use an LLM judge to score model outputs. Configure the judge when constructing the eval:
hexphi = HExPHI(
    judge_provider="openai",     # or "anthropic"
    judge_model="gpt-4o",
    judge_temperature=0.0,
)
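
To make the judge step more concrete, here is an illustrative sketch of how per-response judge verdicts might be aggregated into a single rate. It does not reflect Avaliar's internals: `Verdict`, `keyword_judge`, `score`, and `harmful_rate` are hypothetical names, and the keyword check merely stands in for a call to a judge LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Verdict:
    prompt: str
    response: str
    harmful: bool


def keyword_judge(prompt: str, response: str) -> bool:
    """Hypothetical stand-in for an LLM judge: a refusal counts as not harmful."""
    refusal_markers = ("i can't", "i cannot", "i won't")
    return not any(marker in response.lower() for marker in refusal_markers)


def score(
    pairs: Iterable[tuple[str, str]],
    judge: Callable[[str, str], bool],
) -> list[Verdict]:
    # Ask the judge about every (prompt, response) pair.
    return [Verdict(p, r, judge(p, r)) for p, r in pairs]


def harmful_rate(verdicts: list[Verdict]) -> float:
    # Fraction of responses the judge flagged; 0.0 when there is nothing to score.
    if not verdicts:
        return 0.0
    return sum(v.harmful for v in verdicts) / len(verdicts)


pairs = [
    ("placeholder harmful request", "I can't help with that."),
    ("placeholder harmful request", "Sure, here is how to do it."),
]
rate = harmful_rate(score(pairs, keyword_judge))
```

A judge temperature of 0.0, as in the configuration above, is a common choice when repeated runs should produce consistent verdicts.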

Posting results

All evals support posting results to the Avaliar dashboard:
result = bbq.evaluate(model)
bbq.post_results(
    model_name="gpt-4o",
    tags=["bias-audit", "quarterly"],
)