The AvaliarBaseLLM interface

Every benchmark requires a model that implements AvaliarBaseLLM:
from avaliar.models.base import AvaliarBaseLLM

class MyModel(AvaliarBaseLLM):
    def __init__(self):
        # Initialize your LLM client
        ...

    def generate(self, prompt: str) -> str:
        # Return model response as a string
        ...

    def batch_generate(self, prompts: list[str], **kwargs) -> list[str]:
        # Optional: process multiple prompts concurrently
        ...
See AvaliarBaseLLM for full interface details and provider examples.

Batch generation

If your model supports parallel requests, implement batch_generate for significant speedups. Benchmarks automatically detect and use it when available.
import concurrent.futures
from avaliar.models.base import AvaliarBaseLLM
from openai import OpenAI


class MyModel(AvaliarBaseLLM):
    def __init__(self):
        self.client = OpenAI()

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        # content can be None (e.g. on a refusal); fall back to an empty string
        return response.choices[0].message.content or ""

    def batch_generate(self, prompts: list[str], **kwargs) -> list[str]:
        with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
            return list(pool.map(self.generate, prompts))
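Use temperature=0.0 to make outputs as deterministic as possible during benchmarking. This keeps scores consistent across runs, although some providers can still return slightly different outputs at temperature 0.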
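One property of the thread-pool pattern is worth spelling out: `Executor.map` returns results in input order, not completion order, so the returned list lines up with `prompts`. The sketch below illustrates this with `fake_generate`, a hypothetical stand-in for a real API call that deliberately finishes sooner for shorter prompts.

```python
import concurrent.futures
import time


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for a model call; shorter prompts return sooner.
    time.sleep(0.01 * len(prompt))
    return prompt.upper()


def batch_generate(prompts: list[str], max_workers: int = 4) -> list[str]:
    # Executor.map yields results in the order of the inputs, regardless of
    # which worker thread finishes first.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fake_generate, prompts))


# The slowest prompt comes first, yet its result is still first in the output.
print(batch_generate(["cccc", "bb", "a"]))
```

If a single call raises, `map` re-raises that exception when its result is reached, so one failing prompt aborts the whole batch unless `generate` handles errors itself.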

Configuration options

Few-shot examples

Most benchmarks support few-shot prompting. The n_shots parameter controls how many examples are included in the prompt.
benchmark = MMLU(
    tasks=[MMLUTask.MACHINE_LEARNING],
    n_shots=5,   # Include 5 examples in each prompt
)
Benchmark       Max shots
MMLU            5
HellaSwag       10
DROP            5
BigBenchHard    3
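To make the effect of n_shots concrete, here is a rough sketch of how a few-shot prompt could be assembled. The `build_few_shot_prompt` helper and the `Q:`/`A:` layout are assumptions made for illustration; the templates Avaliar actually uses for each benchmark may look different.

```python
def build_few_shot_prompt(
    examples: list[tuple[str, str]], question: str, n_shots: int
) -> str:
    # Hypothetical helper: prepend the first n_shots solved examples,
    # then end with the unanswered target question.
    blocks = [f"Q: {q}\nA: {a}" for q, a in examples[:n_shots]]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


solved = [("1 + 1 = ?", "2"), ("2 + 3 = ?", "5"), ("4 + 4 = ?", "8")]
print(build_few_shot_prompt(solved, "3 + 6 = ?", n_shots=2))
```

More shots generally give the model a clearer picture of the expected answer format, at the cost of a longer prompt for every test case.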

Limiting test cases

For faster iteration, limit the number of problems evaluated per task:
benchmark = HellaSwag(
    tasks=[HellaSwagTask.ACTIVITY_NET],
    n_problems_per_task=50,  # Only evaluate 50 problems per task
)

Custom output instructions

Override the default output format instructions:
benchmark = MMLU(
    tasks=[MMLUTask.MACHINE_LEARNING],
    confinement_instructions="Answer with only the letter (A, B, C, or D).",
)

Scoring

Exact match

MMLU, HellaSwag, and BigBenchHard use exact match scoring. The model’s output is normalized (lowercased, punctuation removed) and compared to the expected answer.
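The normalization described above can be sketched as follows. This is an approximation rather than Avaliar's actual implementation: `normalize` and `exact_match` are hypothetical helpers, and collapsing whitespace is an added assumption.

```python
import string


def normalize(text: str) -> str:
    # Lowercase, drop ASCII punctuation, and collapse whitespace
    # (the whitespace step is an assumption, not documented behavior).
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def exact_match(prediction: str, expected: str) -> int:
    # 1 if the normalized strings are identical, else 0.
    return int(normalize(prediction) == normalize(expected))


print(exact_match("B.", "B"))        # 1
print(exact_match("(b)", "B"))       # 1
print(exact_match("B) Paris", "B"))  # 0
```

The last case shows why tight output instructions matter: any extra words in the response break the match, even when the chosen letter is correct.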

DROP scoring

DROP uses custom metrics that handle:
  • Numerical answers (with tolerance)
  • Span-based answers
  • Multi-span answers
The primary metrics are exact match (EM) and F1 score.
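For intuition about the F1 side, here is a simplified token-overlap F1 of the kind commonly used for reading-comprehension answers. It is only a sketch: `token_f1` is a hypothetical helper, and it leaves out the numeric tolerance and multi-span handling listed above.

```python
from collections import Counter


def token_f1(prediction: str, expected: str) -> float:
    # Bag-of-words F1 between the predicted and expected answers.
    pred_tokens = prediction.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Partial credit: the prediction contains the answer plus one extra token.
print(round(token_f1("the red car", "red car"), 3))
```

Unlike exact match, F1 rewards answers that overlap with the expected span, which is why the two metrics are usually reported together.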

HumanEval scoring

HumanEval uses pass@k scoring: the generated code is executed against hidden test cases, and a problem counts as solved if at least one of the k samples passes. k is the number of code samples generated per problem.
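When more samples are generated than k, pass@k is usually computed with the unbiased estimator from the original HumanEval paper: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The sketch below implements that formula. Whether Avaliar uses this exact estimator is an assumption, and `pass_at_k` is a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    # n: samples generated per problem, c: samples that passed the tests,
    # k: number of samples the metric is allowed to "draw".
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 correct: the chance of a pass grows quickly with k.
for k in (1, 5, 10):
    print(k, round(pass_at_k(n=10, c=3, k=k), 4))
```

The per-problem values are then averaged over all problems to give the benchmark score.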

Accessing results

After evaluation:
result = benchmark.evaluate(model)

# Overall accuracy
print(result.overall_accuracy)

# Detailed predictions (pandas DataFrame)
print(benchmark.predictions)
# Columns: task, input, prediction, expected_output, score

# Per-task accuracy breakdown
print(benchmark.task_scores)
Benchmark datasets are downloaded from Hugging Face on first run and cached locally, so the initial run needs an internet connection. Subsequent runs use the cache.
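The per-task breakdown is essentially the mean of the score column grouped by task. The sketch below reproduces that from plain records using the prediction columns listed above. The rows are made up for illustration, and `accuracy_by_task` is a hypothetical helper. If `benchmark.predictions` is a regular pandas DataFrame, `benchmark.predictions.to_dict("records")` would yield rows of this shape.

```python
from collections import defaultdict


def accuracy_by_task(rows: list[dict]) -> dict[str, float]:
    # Accumulate [sum of scores, row count] per task, then take the mean.
    totals: dict[str, list[float]] = defaultdict(lambda: [0.0, 0.0])
    for row in rows:
        totals[row["task"]][0] += row["score"]
        totals[row["task"]][1] += 1
    return {task: total / count for task, (total, count) in totals.items()}


# Illustrative rows only; real rows also carry the input column.
rows = [
    {"task": "machine_learning", "prediction": "A", "expected_output": "A", "score": 1},
    {"task": "machine_learning", "prediction": "C", "expected_output": "B", "score": 0},
    {"task": "astronomy", "prediction": "D", "expected_output": "D", "score": 1},
]
print(accuracy_by_task(rows))
```

Recomputing the breakdown this way is a quick sanity check when filtering predictions, for example to look at accuracy on a subset of inputs.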

Posting to the dashboard

Send results to Avaliar for tracking and comparison across runs:
benchmark.post_results(
    model_name="gpt-4o",
    tags=["nightly", "v2.1"],
)
Results appear in the Avaliar dashboard where you can compare models, view historical trends, and include benchmark data in compliance reports.

Comparing models

Run the same benchmark against multiple models to make data-driven selection decisions:
from avaliar.benchmarks.compare import compare_models

result1, result2 = compare_models(
    model1=GPT4o(),
    model2=Claude(),
    model1_name="GPT-4o",
    model2_name="Claude Sonnet",
    benchmark=MMLU(tasks=[MMLUTask.MACHINE_LEARNING]),
)

Next steps

Available Benchmarks

Full constructor reference for all 6 benchmarks.

Safety & Bias Evals

Evaluate your model for bias, toxicity, and safety risks.