MMLU
Massive Multitask Language Understanding: tests knowledge across 57 subjects, including STEM, humanities, and social sciences.
Scoring: Exact match
Dataset: cais/mmlu on HuggingFace
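As a rough illustration of exact-match scoring on a multiple-choice benchmark such as MMLU, the sketch below compares predicted answer letters against gold letters. The function name `exact_match_accuracy`, the whitespace and case normalization, and the sample data are assumptions made for illustration, not this library's implementation.

```python
def exact_match_accuracy(predictions, targets):
    """Fraction of predictions equal to their target after light normalization."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    # Assumption: ignore surrounding whitespace and letter case before comparing.
    hits = sum(
        p.strip().upper() == t.strip().upper()
        for p, t in zip(predictions, targets)
    )
    return hits / len(targets)


# Hypothetical model outputs for four multiple-choice questions.
predictions = ["A", "c ", "B", "D"]
targets = ["A", "C", "D", "D"]
score = exact_match_accuracy(predictions, targets)  # 3 of 4 match
```

Exact match gives no partial credit, so how the model's raw output is reduced to a single letter matters as much as the comparison itself.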
All 57 MMLU tasks
ABSTRACT_ALGEBRA, ANATOMY, ASTRONOMY, BUSINESS_ETHICS, CLINICAL_KNOWLEDGE, COLLEGE_BIOLOGY, COLLEGE_CHEMISTRY, COLLEGE_COMPUTER_SCIENCE, COLLEGE_MATHEMATICS, COLLEGE_MEDICINE, COLLEGE_PHYSICS, COMPUTER_SECURITY, CONCEPTUAL_PHYSICS, ECONOMETRICS, ELECTRICAL_ENGINEERING, ELEMENTARY_MATHEMATICS, FORMAL_LOGIC, GLOBAL_FACTS, HIGH_SCHOOL_BIOLOGY, HIGH_SCHOOL_CHEMISTRY, HIGH_SCHOOL_COMPUTER_SCIENCE, HIGH_SCHOOL_EUROPEAN_HISTORY, HIGH_SCHOOL_GEOGRAPHY, HIGH_SCHOOL_GOVERNMENT_AND_POLITICS, HIGH_SCHOOL_MACROECONOMICS, HIGH_SCHOOL_MATHEMATICS, HIGH_SCHOOL_MICROECONOMICS, HIGH_SCHOOL_PHYSICS, HIGH_SCHOOL_PSYCHOLOGY, HIGH_SCHOOL_STATISTICS, HIGH_SCHOOL_US_HISTORY, HIGH_SCHOOL_WORLD_HISTORY, HUMAN_AGING, HUMAN_SEXUALITY, INTERNATIONAL_LAW, JURISPRUDENCE, LOGICAL_FALLACIES, MACHINE_LEARNING, MANAGEMENT, MARKETING, MEDICAL_GENETICS, MISCELLANEOUS, MORAL_DISPUTES, MORAL_SCENARIOS, NUTRITION, PHILOSOPHY, PREHISTORY, PROFESSIONAL_ACCOUNTING, PROFESSIONAL_LAW, PROFESSIONAL_MEDICINE, PROFESSIONAL_PSYCHOLOGY, PUBLIC_RELATIONS, SECURITY_STUDIES, SOCIOLOGY, US_FOREIGN_POLICY, VIROLOGY, WORLD_RELIGIONS
HellaSwag
Commonsense reasoning: tests the ability to predict what happens next in real-world scenarios.
Scoring: Exact match
DROP
Discrete Reasoning Over Paragraphs: tests reading comprehension and numerical reasoning.
Scoring: Custom metrics handling numerical, span, and multi-span answers
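To give a feel for what such custom metrics can look like, here is a simplified sketch of answer normalization with exact-match and token-level F1 scoring. It is not the official DROP evaluator and not necessarily what this library runs: the helper names are hypothetical, a multi-span answer is treated as a single bag of tokens, and the official rules for aligning spans and penalizing number mismatches are left out.

```python
import string
from collections import Counter

ARTICLES = {"a", "an", "the"}


def _normalize_token(token):
    """Canonicalize numbers (so "7" equals "7.0"); strip punctuation otherwise."""
    try:
        return str(float(token.replace(",", "")))
    except ValueError:
        return "".join(ch for ch in token if ch not in string.punctuation)


def normalize(text):
    """Lowercase, split on whitespace, and drop articles and empty tokens."""
    tokens = [_normalize_token(t) for t in text.lower().split()]
    return [t for t in tokens if t and t not in ARTICLES]


def token_f1(prediction, gold):
    """Bag-of-tokens F1 between a prediction and one gold answer."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def score(prediction, gold_answers):
    """Best exact match and best F1 over all acceptable gold answers."""
    pred = normalize(prediction)
    em = max(float(pred == normalize(g)) for g in gold_answers)
    f1 = max(token_f1(prediction, g) for g in gold_answers)
    return em, f1


numeric = score("7", ["7.0"])                       # numbers compare by value
span = score("the Green Bay Packers", ["Green Bay Packers"])
partial = score("Tom Brady", ["Tom Brady Randy Moss"])  # half of a two-span answer
```

In this sketch a prediction that recovers only one of two gold spans gets no exact-match credit but keeps partial F1 credit.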
TruthfulQA
Truthfulness evaluation: tests whether models generate truthful answers and avoid common misconceptions.
Scoring: Truth identification score
Modes:
MC1: Single correct answer among options
MC2: Multiple correct answers possible
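The difference between the two modes can be sketched using the MC1 and MC2 definitions from the original TruthfulQA benchmark, which score the log-probability a model assigns to each option. This is an assumption about the scoring: a tool that instead asks the model to name its chosen options would compute the score differently, and the function names and sample values below are hypothetical.

```python
import math


def mc1_score(log_probs, correct_index):
    """1.0 if the highest-scored option is the single correct one, else 0.0."""
    best = max(range(len(log_probs)), key=lambda i: log_probs[i])
    return float(best == correct_index)


def mc2_score(log_probs, labels):
    """Share of probability mass on true options; labels[i] is 1 for a true option."""
    probs = [math.exp(lp) for lp in log_probs]
    true_mass = sum(p for p, y in zip(probs, labels) if y == 1)
    return true_mass / sum(probs)


# Hypothetical per-option log-probabilities for one question with three options.
log_probs = [math.log(0.5), math.log(0.3), math.log(0.2)]

mc1 = mc1_score(log_probs, correct_index=0)   # top-scored option is the correct one
mc2 = mc2_score(log_probs, labels=[1, 0, 1])  # true options hold 0.5 + 0.2 of the mass
```

MC1 is all-or-nothing per question, while MC2 rewards a model for spreading probability across every true option.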
BigBenchHard
Complex reasoning: a curated set of challenging tasks from the BIG-Bench suite that require multi-step reasoning.
Scoring: Exact match
Example BigBenchHard tasks
BOOLEAN_EXPRESSIONS, CAUSAL_JUDGEMENT, DATE_UNDERSTANDING, FORMAL_FALLACIES, OBJECT_COUNTING, PENGUINS_IN_A_TABLE, REASONING_ABOUT_COLORED_OBJECTS, SPORTS_UNDERSTANDING, TEMPORAL_SEQUENCES, WEB_OF_LIES
HumanEval
Code generation: tests the ability to generate correct Python functions from docstrings.
Scoring: pass@k (generated code is executed against test cases)
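For reference, pass@k is commonly computed with the unbiased estimator introduced alongside HumanEval: generate n samples per problem, count the c samples that pass all tests, and estimate the chance that at least one of k randomly drawn samples passes. The sketch below assumes that estimator; whether this library uses it or a simpler variant is not stated here, and the function name is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k from n generated samples, of which c passed all tests.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    k samples drawn without replacement are all failures.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples generated, 3 of them pass the tests.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

The benchmark-level score is the mean of this per-problem estimate over all problems.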
Next Steps
Running Benchmarks
Configuration options, batch generation, and best practices.
Safety & Bias Evals
Evaluate your model for bias, toxicity, and safety risks.