The AvaliarBaseLLM interface
Every benchmark requires a model that implements AvaliarBaseLLM:
from avaliar.models.base import AvaliarBaseLLM


class MyModel(AvaliarBaseLLM):
    def __init__(self):
        # Initialize your LLM client
        ...

    def generate(self, prompt: str) -> str:
        # Return model response as a string
        ...

    def batch_generate(self, prompts: list[str], **kwargs) -> list[str]:
        # Optional: process multiple prompts concurrently
        ...
See AvaliarBaseLLM for full interface details and provider examples.
Batch generation
If your model supports parallel requests, implement batch_generate for significant speedups. Benchmarks automatically detect and use it when available.
import concurrent.futures

from avaliar.models.base import AvaliarBaseLLM
from openai import OpenAI


class MyModel(AvaliarBaseLLM):
    def __init__(self):
        # The OpenAI client reads OPENAI_API_KEY from the environment
        self.client = OpenAI()

    def generate(self, prompt: str) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,
        )
        # message.content can be None, so fall back to an empty string
        return response.choices[0].message.content or ""

    def batch_generate(self, prompts: list[str], **kwargs) -> list[str]:
        # pool.map returns results in the same order as the input prompts
        with concurrent.futures.ThreadPoolExecutor(max_workers=10) as pool:
            return list(pool.map(self.generate, prompts))
Use temperature=0.0 during benchmarking to make outputs as deterministic as possible. This keeps scores consistent across runs, although hosted APIs can still show small run-to-run variation.
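The automatic detection of batch_generate described in this section could work along the following lines. This is a minimal sketch, not Avaliar's actual implementation: `run_prompts`, `EchoModel`, and `BatchEchoModel` are hypothetical names, and the sketch assumes a model without batch support simply has no callable `batch_generate` attribute.

```python
from typing import Any


def run_prompts(model: Any, prompts: list[str]) -> list[str]:
    """Use batch_generate when the model provides it, else loop over generate."""
    batch = getattr(model, "batch_generate", None)
    if callable(batch):
        outputs = list(batch(prompts))
        # A batch implementation must return one output per prompt, in order.
        if len(outputs) != len(prompts):
            raise ValueError("batch_generate returned the wrong number of outputs")
        return outputs
    # Fallback: one request at a time.
    return [model.generate(p) for p in prompts]


class EchoModel:
    """Hypothetical stand-in model with no batch support."""

    def generate(self, prompt: str) -> str:
        return prompt.upper()


class BatchEchoModel(EchoModel):
    """Same stand-in, plus a batch method that counts how often it is called."""

    def __init__(self) -> None:
        self.batch_calls = 0

    def batch_generate(self, prompts: list[str], **kwargs) -> list[str]:
        self.batch_calls += 1
        return [self.generate(p) for p in prompts]
```

The length check matters because scores are matched to problems by position: a batch method that drops or reorders outputs would silently corrupt the results.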
Configuration options
Few-shot examples
Most benchmarks support few-shot prompting. The n_shots parameter controls how many examples are included in the prompt.
benchmark = MMLU(
    tasks=[MMLUTask.MACHINE_LEARNING],
    n_shots=5,  # Include 5 examples in each prompt
)
| Benchmark | Max shots |
| --- | --- |
| MMLU | 5 |
| HellaSwag | 10 |
| DROP | 5 |
| BigBenchHard | 3 |
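To make the effect of n_shots concrete, here is a rough sketch of how worked examples might be prepended to the question being evaluated. The function name `build_few_shot_prompt` and the "Question:/Answer:" template are assumptions for illustration; the templates Avaliar actually uses may differ per benchmark.

```python
def build_few_shot_prompt(
    examples: list[tuple[str, str]],
    question: str,
    n_shots: int,
) -> str:
    """Prepend the first n_shots solved examples to the target question."""
    if n_shots < 0:
        raise ValueError("n_shots must be non-negative")
    # Each shot is a complete question/answer pair.
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:n_shots]]
    # The target question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)
```

With n_shots=0 the prompt contains only the target question, which corresponds to zero-shot evaluation.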
Limiting test cases
For faster iteration, limit the number of problems evaluated per task:
benchmark = HellaSwag(
    tasks=[HellaSwagTask.ACTIVITY_NET],
    n_problems_per_task=50,  # Only evaluate 50 problems per task
)
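Note that the limit applies per task, so the total number of evaluated problems grows with the number of tasks. The sketch below illustrates that behaviour on plain dictionaries. `limit_problems` is a hypothetical helper, and taking the first N problems (rather than a random sample) is an assumption, not something this page confirms.

```python
from typing import Optional


def limit_problems(
    problems_by_task: dict[str, list],
    n_problems_per_task: Optional[int] = None,
) -> dict[str, list]:
    """Keep at most n_problems_per_task problems for each task."""
    if n_problems_per_task is None:
        # No limit: return a shallow copy of every task's problems.
        return {task: list(items) for task, items in problems_by_task.items()}
    if n_problems_per_task < 1:
        raise ValueError("n_problems_per_task must be at least 1")
    # Assumption: keep the first N problems of each task.
    return {
        task: list(items[:n_problems_per_task])
        for task, items in problems_by_task.items()
    }
```

A task with fewer problems than the limit is simply evaluated in full.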
Custom output instructions
Override the default output format instructions:
benchmark = MMLU(
    tasks=[MMLUTask.MACHINE_LEARNING],
    confinement_instructions="Answer with only the letter (A, B, C, or D).",
)
Scoring
Exact match
MMLU, HellaSwag, and BigBenchHard use exact match scoring. The model’s output is normalized (lowercased, punctuation removed) and compared to the expected answer.
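The normalization step can be sketched as follows. This is an illustrative approximation rather than Avaliar's actual code: `normalize` and `exact_match` are hypothetical names, and collapsing whitespace is an extra assumption beyond the lowercasing and punctuation removal described above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (assumed)."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())


def exact_match(prediction: str, expected: str) -> int:
    """Return 1 if the normalized strings are identical, else 0."""
    return int(normalize(prediction) == normalize(expected))
```

Under this scheme "B." and "(b)" both match "B", but "The answer is B" does not, which is why output format instructions that confine the model to a bare answer matter for exact match scoring.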
DROP scoring
DROP uses custom metrics that handle:
- Numerical answers (with tolerance)
- Span-based answers
- Multi-span answers
The primary metrics are exact match (EM) and F1 score.
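For intuition, here is a simplified sketch of a token-level F1 score and a tolerant numeric comparison. It follows the bag-of-words F1 commonly used for reading comprehension benchmarks. The function names, the whitespace tokenization, and the default tolerance are all assumptions, and Avaliar's DROP metrics (in particular multi-span alignment) may be more involved.

```python
import math
from collections import Counter


def token_f1(prediction: str, expected: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers agree; an empty answer never matches a non-empty one.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def numbers_match(prediction: str, expected: str, tol: float = 1e-6) -> bool:
    """Compare two numeric strings within an absolute tolerance (assumed value)."""
    try:
        return math.isclose(float(prediction), float(expected), abs_tol=tol)
    except ValueError:
        # At least one side is not parseable as a number.
        return False
```

For example, predicting "the cat sat" against the expected span "the cat" gives precision 2/3 and recall 1, so F1 is 0.8 even though exact match is 0.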
HumanEval scoring
HumanEval uses pass@k scoring: the generated code is executed against hidden test cases, and a problem counts as solved if at least one of its k samples passes. k is the number of code samples generated per problem.
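A common way to compute pass@k is the unbiased estimator from the original HumanEval evaluation, which generates n >= k samples per problem and counts the c samples that pass. Whether Avaliar uses this estimator is an assumption; the sketch below (with the hypothetical name `pass_at_k`) only shows the arithmetic for a single problem.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed all tests, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, so the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

When exactly k samples are generated (n == k), this reduces to 1.0 if any sample passes and 0.0 otherwise. The benchmark-level score would then be the mean over all problems.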
Accessing results
After evaluation:
result = benchmark.evaluate(model)

# Overall accuracy
print(result.overall_accuracy)

# Detailed predictions (pandas DataFrame)
# Columns: task, input, prediction, expected_output, score
print(benchmark.predictions)

# Per-task accuracy breakdown
print(benchmark.task_scores)
Benchmark datasets are downloaded from Hugging Face on first run and cached locally. Subsequent runs use the cache, so an internet connection is required only for the initial run.
Posting to the dashboard
Send results to Avaliar for tracking and comparison across runs:
benchmark.post_results(
    model_name="gpt-4o",
    tags=["nightly", "v2.1"],
)
Results appear in the Avaliar dashboard where you can compare models, view historical trends, and include benchmark data in compliance reports.
Comparing models
Run the same benchmark against multiple models to make data-driven selection decisions:
from avaliar.benchmarks.compare import compare_models

# GPT4o and Claude are your own AvaliarBaseLLM subclasses, defined elsewhere
result1, result2 = compare_models(
    model1=GPT4o(),
    model2=Claude(),
    model1_name="GPT-4o",
    model2_name="Claude Sonnet",
    benchmark=MMLU(tasks=[MMLUTask.MACHINE_LEARNING]),
)
Next Steps
Available Benchmarks: Full constructor reference for all 6 benchmarks.
Safety & Bias Evals: Evaluate your model for bias, toxicity, and safety risks.