Running Benchmarks

The AvaliarBaseLLM interface

Every benchmark requires a model that implements AvaliarBaseLLM:

from avaliar.models.base import AvaliarBaseLLM

class MyModel(AvaliarBaseLLM):
    def __init__(self):
        # Initialize your LLM client
        ...

    def generate(self, prompt: str) -> str:
        # Return model response as a string
        ...

    def batch_generate(self, prompts: list[str], **kwargs) -> list[str]:
        # Optional: process multiple prompts concurrently
        ...

See AvaliarBaseLLM for full interface details and provider examples.

Batch generation

If your model supports parallel requests, implement batch_generate so that prompts are sent concurrently rather than one at a time, which can shorten benchmark runs considerably. Benchmarks automatically detect and use it when available.

import concurrent.futures
from avaliar.models.base import AvaliarBaseLLM
from openai import OpenAI


class MyModel(AvaliarBaseLLM):
    def __init__(self):
        self.client = OpenAI()

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
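        # content can be None (for example on a refusal); fall back to "" to honor the str return type
        return response.choices[0].message.content or ""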

    def batch_generate(self, prompts: list[str], **kwargs) -> list[str]:
        with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
            return list(pool.map(self.generate, prompts))

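Use temperature=0.0 during benchmarking to make outputs as repeatable as possible. This keeps scores consistent across runs, although some providers can still return slightly different outputs at temperature 0.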
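The batch_generate implementation above relies on Executor.map returning results in input order, so each response lines up with its prompt even when requests finish out of order. A minimal self-contained sketch of that behavior follows; fake_generate is a hypothetical stand-in for a real model call and is not part of Avaliar.

```python
import concurrent.futures
import time


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a network call; shorter prompts finish first.
    time.sleep(0.01 * len(prompt))
    return prompt.upper()


def batch_generate(prompts: list[str], max_workers: int = 4) -> list[str]:
    # Executor.map yields results in input order, not completion order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_generate, prompts))


prompts = ["a long prompt", "mid", "x"]
outputs = batch_generate(prompts)
print(outputs)  # ['A LONG PROMPT', 'MID', 'X']
```

Threads suit this workload because each call spends most of its time waiting on the network. Keep max_workers below your provider's rate limit.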

Configuration options

Few-shot examples

Most benchmarks support few-shot prompting. The n_shots parameter controls how many examples are included in the prompt.

benchmark = MMLU(
    tasks=[MMLUTask.MACHINE_LEARNING],
    n_shots=5,   # Include 5 examples in each prompt
)

Benchmark	Max shots
MMLU	5
HellaSwag	10
DROP	5
BigBenchHard	3
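To illustrate what n_shots controls, here is a rough sketch of how solved examples might be prepended to a question. The function name build_few_shot_prompt and the "Question:/Answer:" template are assumptions made for illustration; the prompt format Avaliar actually uses may differ.

```python
def build_few_shot_prompt(examples, question, n_shots):
    """Prepend up to n_shots solved (question, answer) pairs to a new question.

    Hypothetical helper: the template here is illustrative only.
    """
    if n_shots < 0:
        raise ValueError("n_shots must be non-negative")
    shots = examples[:n_shots]
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


examples = [("2+2?", "4"), ("3+3?", "6"), ("1+1?", "2")]
prompt = build_few_shot_prompt(examples, "5+5?", n_shots=2)
print(prompt)
```

More shots usually give the model a clearer picture of the expected answer format, at the cost of longer prompts.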

Limiting test cases

For faster iteration, limit the number of problems evaluated per task:

benchmark = HellaSwag(
    tasks=[HellaSwagTask.ACTIVITY_NET],
    n_problems_per_task=50,  # Only evaluate 50 problems per task
)

Custom output instructions

Override the default output format instructions:

benchmark = MMLU(
    tasks=[MMLUTask.MACHINE_LEARNING],
    confinement_instructions="Answer with only the letter (A, B, C, or D).",
)

Scoring

Exact match

MMLU, HellaSwag, and BigBenchHard use exact match scoring. The model’s output is normalized (lowercased, punctuation removed) and compared to the expected answer.
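The normalization described above can be sketched as follows. This is an approximation under stated assumptions, not Avaliar's scorer: it strips ASCII punctuation only, and the whitespace collapsing is an extra step added here for robustness.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    # Lowercase, drop punctuation, then collapse whitespace (the last step is an assumption).
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match(prediction: str, expected: str) -> int:
    # 1 if the normalized strings are identical, otherwise 0.
    return int(normalize(prediction) == normalize(expected))


print(exact_match("B.", "b"))       # 1
print(exact_match(" (C) ", "C"))    # 1
print(exact_match("A", "B"))        # 0
```

Because any extra words break the match, exact match scoring works best when the output instructions confine the model to a short answer.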

DROP scoring

DROP uses custom metrics that handle:

Numerical answers (with tolerance)
Span-based answers
Multi-span answers

The primary metrics are exact match (EM) and F1 score.
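For intuition, F1 on span answers is commonly computed over the bag of tokens shared by the prediction and the expected answer. The sketch below shows that simplified form only. It omits the number handling and multi-span alignment listed above, and the name token_f1 is an assumption, not an Avaliar function.

```python
from collections import Counter


def token_f1(prediction: str, expected: str) -> float:
    # Simplified bag-of-tokens F1; not the full DROP metric.
    pred_tokens = prediction.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("red car", "red car"))  # 1.0
print(token_f1("truck", "red car"))    # 0.0
```

Unlike exact match, F1 gives partial credit: a prediction of "the red car" against "red car" has precision 2/3 and recall 1, for an F1 of 0.8.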

HumanEval scoring

HumanEval uses pass@k scoring: the generated code is executed against hidden test cases. k is the number of code samples generated per problem, and a problem counts as solved if at least one of those samples passes all of its tests.
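When more than k samples are generated per problem, pass@k is often estimated with the unbiased formula 1 - C(n - c, k) / C(n, k), where n is the number of samples and c is the number that pass. The sketch below implements that estimator. Whether Avaliar computes pass@k this way is an assumption, and the function name pass_at_k is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k samples include a pass.
        return 1.0
    # Probability that all k drawn samples come from the n - c failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=10, c=0, k=1))   # 0.0
print(pass_at_k(n=10, c=10, k=1))  # 1.0
```

With n equal to k, the estimate reduces to 1.0 if any sample passed and 0.0 otherwise. The benchmark score is the mean of the per-problem values.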

Accessing results

After evaluation:

result = benchmark.evaluate(model)

# Overall accuracy
print(result.overall_accuracy)

# Detailed predictions (pandas DataFrame)
print(benchmark.predictions)
# Columns: task, input, prediction, expected_output, score

# Per-task accuracy breakdown
print(benchmark.task_scores)
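To show how the per-task breakdown relates to the predictions rows, the sketch below averages the score column by task using plain dictionaries. It assumes that score is 0 or 1 and that task accuracy is the mean score, which is not confirmed here; per_task_accuracy is a hypothetical helper, not an Avaliar function.

```python
from collections import defaultdict


def per_task_accuracy(rows):
    # Maps task -> [sum of scores, number of rows].
    totals = defaultdict(lambda: [0, 0])
    for row in rows:
        totals[row["task"]][0] += row["score"]
        totals[row["task"]][1] += 1
    return {task: total / count for task, (total, count) in totals.items()}


# Rows shaped like the predictions columns (task and score only, for brevity).
rows = [
    {"task": "machine_learning", "score": 1},
    {"task": "machine_learning", "score": 0},
    {"task": "astronomy", "score": 1},
]
accuracy = per_task_accuracy(rows)
print(accuracy)  # {'machine_learning': 0.5, 'astronomy': 1.0}
```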

Benchmark datasets are downloaded from HuggingFace on first run and cached locally. Subsequent runs use the cache. Make sure you have an internet connection for the initial run.

Posting to the dashboard

Send results to Avaliar for tracking and comparison across runs:

benchmark.post_results(
    model_name="gpt-4o",
    tags=["nightly", "v2.1"],
)

Results appear in the Avaliar dashboard where you can compare models, view historical trends, and include benchmark data in compliance reports.

Comparing models

Run the same benchmark against multiple models to make data-driven selection decisions:

from avaliar.benchmarks.compare import compare_models

# GPT4o and Claude stand for your own AvaliarBaseLLM subclasses (definitions not shown).
result1, result2 = compare_models(
    model1=GPT4o(),
    model2=Claude(),
    model1_name="GPT-4o",
    model2_name="Claude Sonnet",
    benchmark=MMLU(tasks=[MMLUTask.MACHINE_LEARNING]),
)

Next Steps

Available Benchmarks

Safety & Bias Evals