Published January 27, 2026 | Version v0.4.10
Software | Open Access

EleutherAI/lm-evaluation-harness: v0.4.10


Description

Highlights

The big change in this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.

Breaking Change: Lightweight Core with Optional Backends

pip install lm_eval no longer installs the HuggingFace/torch stack by default. (#3428)

The core package no longer includes backends. Install them explicitly:

pip install lm_eval          # core only, no model backends
pip install lm_eval[hf]      # HuggingFace backend (transformers, torch, accelerate)
pip install lm_eval[vllm]    # vLLM backend
pip install lm_eval[api]     # API backends (OpenAI, Anthropic, etc.)
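
If your scripts import the HuggingFace backend directly, they will now fail with an ImportError in a core-only environment until the hf extra is installed. A minimal defensive-import sketch (the hint message and the example checkpoint are illustrative, not the library's own wording):

# Guarded import of the HuggingFace backend. In a core-only install
# (pip install lm_eval) the torch/transformers stack is absent, so the
# import below raises ImportError.
try:
    from lm_eval.models.huggingface import HFLM
except ImportError as err:
    raise SystemExit(
        "HuggingFace backend not available; install it with "
        "`pip install lm_eval[hf]`"
    ) from err

# With the [hf] extra installed this proceeds as before, e.g.:
# lm = HFLM(pretrained="EleutherAI/pythia-70m")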

Additional breaking change: accessing model classes as attributes of lm_eval.models no longer works:

# This still works:
from lm_eval.models.huggingface import HFLM

# This now raises AttributeError:
import lm_eval.models
lm_eval.models.huggingface.HFLM
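
If you were relying on attribute access to look up model classes, the model registry is the supported path instead. A minimal sketch, assuming the get_model helper in lm_eval.api.registry (which resolves a registered backend name to its class) and an illustrative checkpoint:

from lm_eval.api.registry import get_model

# Resolve the registered backend name to its model class, then
# instantiate it as before. "hf" is the HuggingFace backend's
# registered name; pretrained is the usual HFLM constructor argument.
model_cls = get_model("hf")
lm = model_cls(pretrained="EleutherAI/pythia-70m")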

CLI Refactor

The CLI now uses explicit subcommands and supports YAML config files (#3440):

lm-eval run --model hf --tasks hellaswag      # run evaluations
lm-eval run --config my_config.yaml           # load args from YAML config
lm-eval ls tasks                               # list available tasks
lm-eval validate --tasks hellaswag,arc_easy   # validate task configs

Backward compatibility is preserved: omitting the run subcommand still works, e.g. lm-eval --model hf --tasks hellaswag

See lm-eval --help or the CLI documentation for details.
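
For scripts, lm_eval.simple_evaluate remains the programmatic entry point. A rough equivalent of the first run command above (a sketch; the model_args checkpoint is illustrative and the hf extra must be installed):

import lm_eval

# Roughly equivalent to: lm-eval run --model hf --tasks hellaswag
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=EleutherAI/pythia-70m",
    tasks=["hellaswag"],
)
print(results["results"]["hellaswag"])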

Other Improvements

  • Decoupled ContextSampler with new build_qa_turn helper (#3429)
  • Normalized gen_kwargs with truncation_side support for vLLM (#3509)

New Benchmarks & Tasks

  • PISA task by @HallerPatrick in #3412
  • SLR-Bench (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305
  • OpenAI Multilingual MMLU by @Helw150 in #3473
  • ULQA benchmark by @keramjan in #3340
  • IFEval in Spanish and Catalan by @juliafalcao in #3467
  • TruthfulQA-VA for Catalan by @sgs97ua in #3469
  • Multiple Bangla benchmarks by @Ismail-Hossain-1 in #3454
  • NeurIPS E2LM Competition submissions: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444

Model Support

  • Ministral-3 adapter (hf-mistral3) by @medhakimbedhief in #3487

Fixes & Improvements

Task Fixes

  • Fixed leading whitespace leakage in MMLU-Pro by @baberabb in #3500
  • Fixed gen_prefix delimiter handling in multiple-choice tasks by @baberabb in #3508
  • Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465
  • Fixed build_qa_turn to treat 0 as a valid answer index by @ezylopx5 in #3488
  • Fixed fewshot_config not being applied to fewshot docs by @baberabb in #3461
  • Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527
  • Fixed gsm8k_cot_llama target_delimiter issue by @baberabb in #3526
  • Updated LIBRA task utils by @bond005 in #3520

Backend Fixes

  • Fixed vLLM off-by-one max_length error by @baberabb in #3503
  • Resolved deprecated vllm.transformers_utils.get_tokenizer import by @DarkLight1337 in #3482
  • Fixed SGLang import and removed duplicate tasks by @baberabb in #3492
  • Removed deprecated AutoModelForVision2Seq by @baberabb in #3522
  • Fixed Anthropic chat model mapping by @lucafossen in #3453
  • Fixed bug preventing = sign in checkpoint names by @mrinaldi97 in #3517
  • Fixed pretty_print_task for external custom configs by @safikhanSoofiyani in #3436
  • Fixed CLI regressions by @fxmarty-amd in #3449

New Contributors

  • @safikhanSoofiyani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3436
  • @lucafossen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3453
  • @Ahmad21Omar made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3305
  • @ezylopx5 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3488
  • @juliafalcao made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3467
  • @medhakimbedhief made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3487
  • @ntenenz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3489
  • @keramjan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3340
  • @bond005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3520
  • @mrinaldi97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3517
  • @wogns3623 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3523

Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.2...v0.4.10

Files

EleutherAI/lm-evaluation-harness-v0.4.10.zip (10.6 MB)
md5:cabd407d013a3b1433334760c7f30465
