flexMMLU Fine-tune vs. Base Model Evaluator
A fast, flexible, fully subject-balanced MMLU benchmark comparison tool for fine-tuned adapters and their base models
Compare your adapter to its base model head-to-head in just minutes using this accelerated MMLU (Massive Multitask Language Understanding) benchmark evaluation.
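As a rough sketch of the head-to-head setup, the adapter can be attached to a fresh copy of its base model with `peft`, so both variants are scored on the same questions. The model IDs below are placeholders, not defaults used by this Space.

```python
# A minimal sketch, assuming a causal LM base model and a PEFT/LoRA adapter;
# the model IDs are placeholders, not values used by this Space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "your-org/your-base-model"       # placeholder base model ID
ADAPTER_ID = "your-org/your-lora-adapter"  # placeholder adapter ID

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# Base model, evaluated as-is.
base_model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float32)

# Fine-tuned variant: the same base weights with the adapter attached.
tuned_model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float32),
    ADAPTER_ID,
)
```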
This app uses efficient, subject-balanced random sampling to run a zero-shot evaluation across all 57 academic and professional subjects in MMLU. It provides a relative performance comparison of an adapter and its base model, with a side-by-side breakdown of overall, subject-by-subject, and domain-by-domain scores.
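Below is a minimal sketch of the two ideas in this paragraph, subject-balanced random sampling and zero-shot scoring, assuming the `cais/mmlu` dataset on the Hub; the per-subject sample size, seed, and prompt format are illustrative assumptions rather than the app's exact settings.

```python
# Sketch: draw the same number of questions from each of the 57 MMLU subjects,
# then score a question zero-shot by comparing the logits of the answer letters.
# SAMPLES_PER_SUBJECT, the seed, and the prompt format are assumptions.
import random
from collections import defaultdict

import torch
from datasets import load_dataset

SAMPLES_PER_SUBJECT = 10   # hypothetical per-subject sample size
rng = random.Random(42)    # fixed seed so both models see the same questions
LETTERS = ["A", "B", "C", "D"]

# Combined test split covering all 57 subjects.
mmlu_test = load_dataset("cais/mmlu", "all", split="test")

# Group question indices by subject and sample every subject equally.
by_subject = defaultdict(list)
for idx, subject in enumerate(mmlu_test["subject"]):
    by_subject[subject].append(idx)

sampled = [i for indices in by_subject.values()
           for i in rng.sample(indices, min(SAMPLES_PER_SUBJECT, len(indices)))]
balanced_subset = mmlu_test.select(sampled)

def predict_letter(model, tokenizer, question, choices):
    """Zero-shot: pick the answer letter whose next-token logit is highest."""
    prompt = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)
    ) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    letter_ids = [tokenizer.encode(" " + l, add_special_tokens=False)[-1] for l in LETTERS]
    return LETTERS[int(torch.stack([logits[i] for i in letter_ids]).argmax())]
```

Running `predict_letter` for both the base and adapter-equipped models over `balanced_subset` and tallying accuracy per subject yields the kind of side-by-side breakdown described above.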
Whether you’re validating a new LoRA, demonstrating gains for a project, or just curious how much your fine-tuning really moved the needle, this tool gives you a fast, standardized report and visualization of the results.
This Space runs on the free “CPU basic” (2 vCPU, 16 GB RAM) hardware option. You can duplicate this Space to your own account and configure it to use more powerful paid hardware for an even faster evaluation.
To use your own private models or public gated models, obtain the necessary gated-model user access, duplicate this Space, and set your own account’s HF_TOKEN as an environment variable/secret for the duplicated Space.
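For example, once HF_TOKEN is set as a Space secret, code in the Space can authenticate before downloading gated or private weights; the model ID below is a placeholder.

```python
# Sketch: authenticate with the HF_TOKEN secret before loading gated/private models.
import os

from huggingface_hub import login
from transformers import AutoModelForCausalLM

login(token=os.environ["HF_TOKEN"])  # HF_TOKEN is the Space secret described above
model = AutoModelForCausalLM.from_pretrained("your-org/your-private-or-gated-model")
```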