flexMMLU Fine-tune vs. Base Model Evaluator
A fast, flexible, fully subject-balanced MMLU benchmark comparison tool for fine-tuned adapters and their base models
Compare your adapter to its base model head-to-head in just minutes using this accelerated MMLU (Massive Multitask Language Understanding) benchmark evaluation.
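As a rough sketch of the head-to-head setup, the adapter can be attached to a fresh copy of its base model with `peft`, so both variants are scored on the same questions. The model IDs below are placeholders, not defaults used by this Space.

```python
# A minimal sketch, assuming a causal LM base model and a PEFT/LoRA adapter;
# the model IDs are placeholders, not values used by this Space.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE_ID = "your-org/your-base-model"       # placeholder base model ID
ADAPTER_ID = "your-org/your-lora-adapter"  # placeholder adapter ID

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

# Base model, evaluated as-is.
base_model = AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float32)

# Fine-tuned variant: the same base weights with the adapter attached.
tuned_model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(BASE_ID, torch_dtype=torch.float32),
    ADAPTER_ID,
)
```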
This app uses efficient, subject-balanced random sampling to run a zero-shot evaluation across all 57 academic and professional subjects in MMLU. It provides a relative performance comparison of an adapter and its base model, with a side-by-side breakdown of overall, subject-by-subject, and domain-by-domain scores.
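Below is a minimal sketch of the two ideas in this paragraph, subject-balanced random sampling and zero-shot scoring, assuming the `cais/mmlu` dataset on the Hub; the per-subject sample size, seed, and prompt format are illustrative assumptions rather than the app's exact settings.

```python
# Sketch: draw the same number of questions from each of the 57 MMLU subjects,
# then score a question zero-shot by comparing the logits of the answer letters.
# SAMPLES_PER_SUBJECT, the seed, and the prompt format are assumptions.
import random
from collections import defaultdict

import torch
from datasets import load_dataset

SAMPLES_PER_SUBJECT = 10   # hypothetical per-subject sample size
rng = random.Random(42)    # fixed seed so both models see the same questions
LETTERS = ["A", "B", "C", "D"]

# Combined test split covering all 57 subjects.
mmlu_test = load_dataset("cais/mmlu", "all", split="test")

# Group question indices by subject and sample every subject equally.
by_subject = defaultdict(list)
for idx, subject in enumerate(mmlu_test["subject"]):
    by_subject[subject].append(idx)

sampled = [i for indices in by_subject.values()
           for i in rng.sample(indices, min(SAMPLES_PER_SUBJECT, len(indices)))]
balanced_subset = mmlu_test.select(sampled)

def predict_letter(model, tokenizer, question, choices):
    """Zero-shot: pick the answer letter whose next-token logit is highest."""
    prompt = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(LETTERS, choices)
    ) + "\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    letter_ids = [tokenizer.encode(" " + l, add_special_tokens=False)[-1] for l in LETTERS]
    return LETTERS[int(torch.stack([logits[i] for i in letter_ids]).argmax())]
```

Running `predict_letter` for both the base and adapter-equipped models over `balanced_subset` and tallying accuracy per subject yields the kind of side-by-side breakdown described above.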
Whether you’re validating a new LoRA, demonstrating gains for a project, or just curious how much your fine-tuning really moved the needle, this tool gives you a fast, standardized report and visualization of the results.
This Space runs on the free “CPU basic” (2 vCPU, 16 GB RAM) hardware option. You can duplicate this Space to your own account and configure it to use more powerful paid hardware for an even faster evaluation.
To use your own private models or public gated models, obtain the necessary gated-model user access, duplicate this Space, and set your own account’s HF_TOKEN as an environment variable/secret for the duplicated Space.
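For example, once HF_TOKEN is set as a Space secret, code in the Space can authenticate before downloading gated or private weights; the model ID below is a placeholder.

```python
# Sketch: authenticate with the HF_TOKEN secret before loading gated/private models.
import os

from huggingface_hub import login
from transformers import AutoModelForCausalLM

login(token=os.environ["HF_TOKEN"])  # HF_TOKEN is the Space secret described above
model = AutoModelForCausalLM.from_pretrained("your-org/your-private-or-gated-model")
```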