Published January 27, 2026 | Version v0.4.10
Software | Open Access

EleutherAI/lm-evaluation-harness: v0.4.10


Description

Highlights

The big change in this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.

Breaking Change: Lightweight Core with Optional Backends

pip install lm_eval no longer installs the HuggingFace/torch stack by default. (#3428)

The core package no longer includes backends. Install them explicitly:

pip install lm_eval          # core only, no model backends
pip install lm_eval[hf]      # HuggingFace backend (transformers, torch, accelerate)
pip install lm_eval[vllm]    # vLLM backend
pip install lm_eval[api]     # API backends (OpenAI, Anthropic, etc.)
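
If your scripts import the HuggingFace backend directly, they will now fail with an ImportError in a core-only environment until the hf extra is installed. A minimal defensive-import sketch (the hint message and the example checkpoint are illustrative, not the library's own wording):

# Guarded import of the HuggingFace backend. In a core-only install
# (pip install lm_eval) the torch/transformers stack is absent, so the
# import below raises ImportError.
try:
    from lm_eval.models.huggingface import HFLM
except ImportError as err:
    raise SystemExit(
        "HuggingFace backend not available; install it with "
        "`pip install lm_eval[hf]`"
    ) from err

# With the [hf] extra installed this proceeds as before, e.g.:
# lm = HFLM(pretrained="EleutherAI/pythia-70m")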

Additional breaking change: accessing model classes as attributes of lm_eval.models no longer works:

# This still works:
from lm_eval.models.huggingface import HFLM

# This now raises AttributeError:
import lm_eval.models
lm_eval.models.huggingface.HFLM
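
If you were relying on attribute access to look up model classes, the model registry is the supported path instead. A minimal sketch, assuming the get_model helper in lm_eval.api.registry (which resolves a registered backend name to its class) and an illustrative checkpoint:

from lm_eval.api.registry import get_model

# Resolve the registered backend name to its model class, then
# instantiate it as before. "hf" is the HuggingFace backend's
# registered name; pretrained is the usual HFLM constructor argument.
model_cls = get_model("hf")
lm = model_cls(pretrained="EleutherAI/pythia-70m")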

CLI Refactor

The CLI now uses explicit subcommands and supports YAML config files (#3440):

lm-eval run --model hf --tasks hellaswag      # run evaluations
lm-eval run --config my_config.yaml           # load args from YAML config
lm-eval ls tasks                               # list available tasks
lm-eval validate --tasks hellaswag,arc_easy   # validate task configs

Backward compatibility is preserved: omitting the run subcommand still works, e.g. lm-eval --model hf --tasks hellaswag

See lm-eval --help or the CLI documentation for details.
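
For scripts, lm_eval.simple_evaluate remains the programmatic entry point. A rough equivalent of the first run command above (a sketch; the model_args checkpoint is illustrative and the hf extra must be installed):

import lm_eval

# Roughly equivalent to: lm-eval run --model hf --tasks hellaswag
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m",
    tasks=["hellaswag"],
)
print(results["results"]["hellaswag"])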

Other Improvements

  • Decoupled ContextSampler with new build_qa_turn helper (#3429)
  • Normalized gen_kwargs with truncation_side support for vLLM (#3509)

New Benchmarks & Tasks

  • PISA task by @HallerPatrick in #3412
  • SLR-Bench (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305
  • OpenAI Multilingual MMLU by @Helw150 in #3473
  • ULQA benchmark by @keramjan in #3340
  • IFEval in Spanish and Catalan by @juliafalcao in #3467
  • TruthfulQA-VA for Catalan by @sgs97ua in #3469
  • Multiple Bangla benchmarks by @Ismail-Hossain-1 in #3454
  • NeurIPS E2LM Competition submissions: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444

Model Support

  • Ministral-3 adapter (hf-mistral3) by @medhakimbedhief in #3487

Fixes & Improvements

Task Fixes

  • Fixed leading whitespace leakage in MMLU-Pro by @baberabb in #3500
  • Fixed gen_prefix delimiter handling in multiple-choice tasks by @baberabb in #3508
  • Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465
  • Fixed build_qa_turn to treat 0 as a valid answer index by @ezylopx5 in #3488
  • Fixed fewshot_config not being applied to fewshot docs by @baberabb in #3461
  • Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527
  • Fixed gsm8k_cot_llama target_delimiter issue by @baberabb in #3526
  • Updated LIBRA task utils by @bond005 in #3520

Backend Fixes

  • Fixed vLLM off-by-one max_length error by @baberabb in #3503
  • Resolved deprecated vllm.transformers_utils.get_tokenizer import by @DarkLight1337 in #3482
  • Fixed SGLang import and removed duplicate tasks by @baberabb in #3492
  • Removed deprecated AutoModelForVision2Seq by @baberabb in #3522
  • Fixed Anthropic chat model mapping by @lucafossen in #3453
  • Fixed bug preventing = sign in checkpoint names by @mrinaldi97 in #3517
  • Fixed pretty_print_task for external custom configs by @safikhanSoofiyani in #3436
  • Fixed CLI regressions by @fxmarty-amd in #3449

New Contributors

  • @safikhanSoofiyani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3436
  • @lucafossen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3453
  • @Ahmad21Omar made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3305
  • @ezylopx5 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3488
  • @juliafalcao made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3467
  • @medhakimbedhief made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3487
  • @ntenenz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3489
  • @keramjan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3340
  • @bond005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3520
  • @mrinaldi97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3517
  • @wogns3623 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3523

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.2...v0.4.10

Files

EleutherAI/lm-evaluation-harness-v0.4.10.zip (10.6 MB)
md5:cabd407d013a3b1433334760c7f30465
