Available Benchmarks

MMLU

Massive Multitask Language Understanding — Tests knowledge across 57 subjects including STEM, humanities, and social sciences.

from avaliar.benchmarks import MMLU
from avaliar.benchmarks.mmlu.task import MMLUTask

benchmark = MMLU(
    tasks=[MMLUTask.MACHINE_LEARNING, MMLUTask.HIGH_SCHOOL_MATHEMATICS],
    n_shots=5,                          # Up to 5 (default: 5)
    confinement_instructions=None,      # Optional custom output instructions
)
result = benchmark.evaluate(model)      # `model` is the model under evaluation, defined elsewhere

Format: Multiple choice (A/B/C/D)
Scoring: Exact match
Dataset: cais/mmlu on the Hugging Face Hub
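
To make the exact-match scoring for A/B/C/D answers concrete, here is a minimal sketch. It is an illustration only, not avaliar's scorer: the helper names `extract_choice` and `exact_match_accuracy` are hypothetical, and the extraction rule (take the first standalone letter) is an assumption.

```python
import re
from typing import List, Optional


def extract_choice(output: str) -> Optional[str]:
    """Return the first standalone A-D letter in the model output, or None."""
    match = re.search(r"\b([ABCD])\b", output.strip().upper())
    return match.group(1) if match else None


def exact_match_accuracy(outputs: List[str], targets: List[str]) -> float:
    """Fraction of outputs whose extracted letter equals the target letter."""
    if not outputs:
        return 0.0
    hits = sum(extract_choice(o) == t.upper() for o, t in zip(outputs, targets))
    return hits / len(outputs)


# Two of the four outputs match their targets.
print(exact_match_accuracy(["B", "(c)", "Answer: D", "maybe"], ["B", "C", "A", "A"]))  # prints 0.5
```

A rule this simple misreads free-form answers that start with the article "A", which is one reason to confine the model's output format with something like `confinement_instructions`.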

All 57 MMLU tasks

ABSTRACT_ALGEBRA, ANATOMY, ASTRONOMY, BUSINESS_ETHICS, CLINICAL_KNOWLEDGE, COLLEGE_BIOLOGY, COLLEGE_CHEMISTRY, COLLEGE_COMPUTER_SCIENCE, COLLEGE_MATHEMATICS, COLLEGE_MEDICINE, COLLEGE_PHYSICS, COMPUTER_SECURITY, CONCEPTUAL_PHYSICS, ECONOMETRICS, ELECTRICAL_ENGINEERING, ELEMENTARY_MATHEMATICS, FORMAL_LOGIC, GLOBAL_FACTS, HIGH_SCHOOL_BIOLOGY, HIGH_SCHOOL_CHEMISTRY, HIGH_SCHOOL_COMPUTER_SCIENCE, HIGH_SCHOOL_EUROPEAN_HISTORY, HIGH_SCHOOL_GEOGRAPHY, HIGH_SCHOOL_GOVERNMENT_AND_POLITICS, HIGH_SCHOOL_MACROECONOMICS, HIGH_SCHOOL_MATHEMATICS, HIGH_SCHOOL_MICROECONOMICS, HIGH_SCHOOL_PHYSICS, HIGH_SCHOOL_PSYCHOLOGY, HIGH_SCHOOL_STATISTICS, HIGH_SCHOOL_US_HISTORY, HIGH_SCHOOL_WORLD_HISTORY, HUMAN_AGING, HUMAN_SEXUALITY, INTERNATIONAL_LAW, JURISPRUDENCE, LOGICAL_FALLACIES, MACHINE_LEARNING, MANAGEMENT, MARKETING, MEDICAL_GENETICS, MISCELLANEOUS, MORAL_DISPUTES, MORAL_SCENARIOS, NUTRITION, PHILOSOPHY, PREHISTORY, PROFESSIONAL_ACCOUNTING, PROFESSIONAL_LAW, PROFESSIONAL_MEDICINE, PROFESSIONAL_PSYCHOLOGY, PUBLIC_RELATIONS, SECURITY_STUDIES, SOCIOLOGY, US_FOREIGN_POLICY, VIROLOGY, WORLD_RELIGIONS

HellaSwag

Commonsense reasoning — Tests the ability to predict what happens next in real-world scenarios.

from avaliar.benchmarks import HellaSwag
from avaliar.benchmarks.hellaswag.task import HellaSwagTask

benchmark = HellaSwag(
    tasks=[HellaSwagTask.ACTIVITY_NET],
    n_shots=10,                         # Up to 10 (default: 10)
    n_problems_per_task=None,           # Limit problems per task (optional)
    confinement_instructions=None,
)
result = benchmark.evaluate(model)

Format: Sentence completion (choose the best ending)
Scoring: Exact match
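
The `n_shots` parameter controls how many solved examples precede the question in the prompt. The sketch below shows one plausible way to assemble such a few-shot prompt for sentence completion. The prompt layout, the `Example` tuple, and the function names are assumptions for illustration; avaliar's actual template may differ.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical layout: (context, candidate endings, index of the correct ending)
Example = Tuple[str, Sequence[str], int]

LETTERS = "ABCD"


def format_example(context: str, endings: Sequence[str], answer: Optional[int] = None) -> str:
    """Render one problem; leave the answer blank when `answer` is None."""
    lines = [f"Context: {context}"]
    lines += [f"{LETTERS[i]}. {ending}" for i, ending in enumerate(endings)]
    lines.append("Answer:" + (f" {LETTERS[answer]}" if answer is not None else ""))
    return "\n".join(lines)


def build_prompt(shots: List[Example], context: str, endings: Sequence[str], n_shots: int) -> str:
    """Prepend up to `n_shots` solved examples to the unsolved query."""
    blocks = [format_example(c, e, a) for c, e, a in shots[:n_shots]]
    blocks.append(format_example(context, endings))
    return "\n\n".join(blocks)


shots = [("A man lifts a barbell.", ["He lowers it slowly.", "He starts to sing."], 0)]
print(build_prompt(shots, "A dog chases a ball.", ["It catches the ball.", "It flies away."], n_shots=1))
```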

DROP

Discrete Reasoning Over Paragraphs — Tests reading comprehension and numerical reasoning.

from avaliar.benchmarks import DROP
from avaliar.benchmarks.drop.task import DROPTask  # only needed to select specific tasks

benchmark = DROP(
    tasks=None,                         # None = all tasks
    n_shots=5,                          # Up to 5 (default: 5)
    n_problems_per_task=None,
)
result = benchmark.evaluate(model)

Format: Free-form answer (text or number)
Scoring: Custom metrics handling numerical, span, and multi-span answers
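
As a rough illustration of why DROP needs answer-type-aware scoring, the sketch below compares numbers by value, single spans by token-level F1 after normalization, and multi-span answers as unordered sets. This is a simplified stand-in, not avaliar's metric: all function names are hypothetical, and splitting a multi-span prediction on commas is an assumption.

```python
import re
import string
from collections import Counter
from typing import List, Optional, Union


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def as_number(text: str) -> Optional[float]:
    """Parse '1,200' or '3.5' as a float; return None for non-numeric text."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between normalized prediction and gold span."""
    pred, ref = normalize(prediction).split(), normalize(gold).split()
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score(prediction: str, gold: Union[str, List[str]]) -> float:
    if isinstance(gold, list):  # multi-span: order-insensitive set match
        predicted = {normalize(s) for s in prediction.split(",")}
        return float(predicted == {normalize(s) for s in gold})
    gold_number = as_number(gold)
    if gold_number is not None:  # numeric: compare by value, not by string
        return float(as_number(prediction) == gold_number)
    return token_f1(prediction, gold)  # single span: partial credit via F1


print(score("1,200", "1200"))                        # prints 1.0
print(score("the Denver Broncos", "Denver Broncos"))  # prints 1.0
```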

TruthfulQA

Truthfulness evaluation — Tests whether models generate truthful answers and avoid common misconceptions.

from avaliar.benchmarks import TruthfulQA
from avaliar.benchmarks.truthful_qa.truthful_qa import TruthfulQAMode

benchmark = TruthfulQA(
    tasks=None,                         # None = all tasks
    mode=TruthfulQAMode.MC1,            # MC1 or MC2
    n_problems_per_task=None,
    confinement_instructions_dict=None, # Per-mode custom instructions
)
result = benchmark.evaluate(model, batch_size=None)

Format: Multiple choice
Scoring: Truth identification score
Modes:

MC1 — Single correct answer among options
MC2 — Multiple correct answers possible
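
The difference between the two modes is easiest to see in how they are scored in the original TruthfulQA benchmark, which works from the likelihood the model assigns to each option. The sketch below follows those definitions. Whether avaliar scores from likelihoods or from generated answers is not stated here, so treat the function names and inputs as assumptions.

```python
import math
from typing import List


def mc1_score(log_probs: List[float], is_true: List[bool]) -> float:
    """1.0 if the single most likely option is a true answer, else 0.0."""
    best = max(range(len(log_probs)), key=log_probs.__getitem__)
    return float(is_true[best])


def mc2_score(log_probs: List[float], is_true: List[bool]) -> float:
    """Share of total probability mass that falls on the true answers."""
    probs = [math.exp(lp) for lp in log_probs]
    true_mass = sum(p for p, t in zip(probs, is_true) if t)
    return true_mass / sum(probs)


# Three options; the model prefers the false one but still puts half
# of its probability mass on the two true answers.
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]
is_true = [False, True, True]
print(mc1_score(log_probs, is_true))            # prints 0.0
print(round(mc2_score(log_probs, is_true), 3))  # prints 0.5
```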

BigBenchHard

Complex reasoning — A curated set of challenging tasks from the BIG-Bench suite that require multi-step reasoning.

from avaliar.benchmarks import BigBenchHard
from avaliar.benchmarks.big_bench_hard.task import BigBenchHardTask  # only needed to select specific tasks

benchmark = BigBenchHard(
    tasks=None,                         # None = all tasks
    n_shots=3,                          # Up to 3 (default: 3)
    n_problems_per_task=None,
    verbose_mode=False,                 # Print detailed output
    confinement_instructions_dict=None, # Per-task custom instructions
)
result = benchmark.evaluate(model)

Format: Task-specific (includes chain-of-thought prompts)
Scoring: Exact match
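
With chain-of-thought prompts, the model's completion contains reasoning followed by a final answer, so exact match has to be applied to the extracted answer rather than the whole output. A minimal sketch, assuming completions end with a phrase like "So the answer is X." (the convention used in the published BIG-Bench Hard chain-of-thought prompts); the function names are hypothetical and avaliar's extraction logic may differ:

```python
from typing import Optional

MARKER = "answer is"


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'answer is', without the trailing period."""
    index = completion.lower().rfind(MARKER)
    if index == -1:
        return None
    tail = completion[index + len(MARKER):].strip()
    first_line = tail.splitlines()[0] if tail else ""
    return first_line.rstrip(".").strip() or None


def exact_match(completion: str, target: str) -> bool:
    """Compare only the extracted final answer against the target."""
    return extract_final_answer(completion) == target.strip()


completion = "not True is False. False and True is False. So the answer is False."
print(extract_final_answer(completion))  # prints False
print(exact_match(completion, "False"))  # prints True
```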

Example BigBenchHard tasks

BOOLEAN_EXPRESSIONS, CAUSAL_JUDGEMENT, DATE_UNDERSTANDING, FORMAL_FALLACIES, OBJECT_COUNTING, PENGUINS_IN_A_TABLE, REASONING_ABOUT_COLORED_OBJECTS, SPORTS_UNDERSTANDING, TEMPORAL_SEQUENCES, WEB_OF_LIES

HumanEval

Code generation — Tests the ability to generate correct Python functions from docstrings.

from avaliar.benchmarks import HumanEval
from avaliar.benchmarks.human_eval.task import HumanEvalTask  # only needed to select specific tasks

benchmark = HumanEval(
    tasks=None,                         # None = all tasks
    n=200,                              # Number of samples per problem (default: 200)
)
result = benchmark.evaluate(model, k=1)  # pass@k

Format: Python code generation
Scoring: pass@k (generated code is executed against test cases)

HumanEval executes generated code in a sandboxed environment. Make sure your evaluation environment supports code execution.
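
The relationship between `n` (samples generated per problem) and `k` in pass@k can be seen in the standard unbiased estimator introduced with HumanEval: with `c` of the `n` samples passing, pass@k is 1 - C(n - c, k) / C(n, k). The sketch below implements that formula for a single problem. It is not taken from avaliar's source, and `pass_at_k` is a hypothetical name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn from n with c correct, passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 200 samples for one problem, 20 of which pass the tests.
print(round(pass_at_k(n=200, c=20, k=1), 4))  # prints 0.1
```

The benchmark-level score is then the mean of this value over all problems. Generating more samples than `k` (for example `n=200` with `k=1`) reduces the variance of the estimate.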

Next Steps

Running Benchmarks

Safety & Bias Evals
