Skip to main content
BBQ (Bias Benchmark for QA) evaluates whether models exhibit social biases when answering ambiguous questions. It covers 11 demographic categories and uses multiple-choice questions where the correct answer for ambiguous contexts is “Unknown” rather than a stereotyped guess.

Import

from avaliar.evals.bias import BBQ, BBQTask, BBQEvalResult

Constructor

BBQ(
    tasks=None,                       # list[BBQTask] | None — default: all tasks
    n_shots=5,                        # int — few-shot examples (max 5)
    n_problems_per_task=None,         # int | None — limit problems per task
    verbose_mode=False,               # bool — print detailed output
    confinement_instructions=None,    # str | None — custom output instructions
)

Categories

class BBQTask(Enum):
    AGE = "Age"
    DISABILITY_STATUS = "Disability_status"
    GENDER_IDENTITY = "Gender_identity"
    NATIONALITY = "Nationality"
    PHYSICAL_APPEARANCE = "Physical_appearance"
    RACE_ETHNICITY = "Race_ethnicity"
    RACE_X_SES = "Race_x_SES"
    RACE_X_GENDER = "Race_x_gender"
    RELIGION = "Religion"
    SES = "SES"
    SEXUAL_ORIENTATION = "Sexual_orientation"

Usage

from avaliar.evals.bias import BBQ, BBQTask
from avaliar.models.base import AvaliarBaseLLM
from openai import OpenAI


class MyModel(AvaliarBaseLLM):
    def __init__(self):
        self.client = OpenAI()

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content


bbq = BBQ(
    tasks=[BBQTask.AGE, BBQTask.GENDER_IDENTITY, BBQTask.RACE_ETHNICITY],
    n_shots=5,
    n_problems_per_task=200,
)

result = bbq.evaluate(MyModel())
print(f"Overall accuracy: {result.overall_accuracy}")
print(f"Overall score: {result.overall_score}")

Results

BBQEvalResult

FieldTypeDescription
overall_scorefloatOverall evaluation score
overall_accuracyfloatPercentage of correct answers

Accessing predictions

# Detailed predictions (pandas DataFrame)
print(bbq.predictions)
# Columns: Task, Input, Prediction, Expected Output, Correct

# Per-task scores
print(bbq.task_scores)
# Columns: Task, Score

How it works

  1. For each task, loads test cases from HuggingFace (heegyu/bbq).
  2. Generates prompts with few-shot examples.
  3. Model selects an answer (A, B, or C).
  4. Scores with exact match — for ambiguous questions, the expected answer is “Unknown”.

Posting results

bbq.post_results(
    model_name="gpt-4o",
    tags=["bias-audit", "quarterly"],
)