To run benchmarks and evaluations with Avaliar, implement the AvaliarBaseLLM abstract class. This gives the benchmark runner a standard interface to call your model, regardless of which provider or configuration you use.
from abc import ABC, abstractmethodclass AvaliarBaseLLM(ABC): """Abstract base class for LLM implementations.""" @abstractmethod def __init__(self): """Initialize your LLM client and configuration.""" ... @abstractmethod def generate(self, prompt: str) -> str: """Generate a single response from a prompt. Args: prompt: The input prompt string. Returns: The model's response as a string. """ ... def batch_generate(self, prompts: list[str], **kwargs) -> list[str]: """Generate responses for multiple prompts. Override this method to implement optimized batch processing. The default implementation calls generate() sequentially. Args: prompts: A list of input prompt strings. **kwargs: Additional keyword arguments. Returns: A list of response strings, one per prompt. """ return [self.generate(prompt) for prompt in prompts]
Override this method to implement optimized batch processing. The default implementation calls generate() sequentially for each prompt.If your provider supports batch APIs or you want to add concurrency, override this method to improve benchmark throughput.
generate_samples(prompt, n, temperature) -> list[str]
Required by the BOLD and RealToxicityPrompts evals, which need multiple diverse samples per prompt. Not needed for benchmarks or other evals.
When running benchmarks, each test case is represented as a TestCase object with the following fields:
Field
Type
Description
input
str
The input prompt sent to your model
output
str
The actual response generated by your model
expected_output
str
The expected or reference response for comparison
context
str
Additional context relevant to the test case
The benchmark runner calls your generate() method with the input field and stores the result in output. Scoring functions then compare output against expected_output using the provided context.
from avaliar.models.base import TestCasetest_case = TestCase( input="What is the capital of France?", output="The capital of France is Paris.", expected_output="Paris", context="European geography question",)
You do not need to create TestCase objects yourself when running benchmarks. The benchmark runner constructs them automatically from the dataset and your model’s responses.