EleutherAI/lm-evaluation-harness: v0.4.10
Authors/Creators
- Lintang Sutawika (Language Technologies Institute, CMU)
- Hailey Schoelkopf
- Leo Gao
- Baber Abbasi
- Stella Biderman (Booz Allen Hamilton, EleutherAI)
- Jonathan Tow
- ben fattori
- Charles Lovering
- farzanehnakhaee70
- Jason Phang
- Anish Thite (playscape.gg)
- Fazz
- Aflah (Max Planck Institute for Software Systems: MPI SWS)
- Niklas
- Thomas Wang (MistralAI)
- sdtblck
- nopperl
- gakada
- tttyuntian
- researcher2
- Julen Etxaniz (Hitz Zentroa EHU)
- Chris (@azurro)
- Hanwool Albert Lee (Shinhan Securities Co.)
- James A. Michaelov (MIT)
- Leonid Sinev
- Janna
- Zdeněk Kasner (Charles University)
- Kiersten Stokes (Open Source Developer @ IBM)
- Khalid
- KonradSzafer
Description
Highlights
The big change this release: the base package no longer installs model backends by default. We've also added new benchmarks and expanded multilingual support.
Breaking Change: Lightweight Core with Optional Backends
`pip install lm_eval` no longer installs the HuggingFace/torch stack by default (#3428). The core package no longer includes backends; install them explicitly:

```bash
pip install lm_eval        # core only, no model backends
pip install lm_eval[hf]    # HuggingFace backend (transformers, torch, accelerate)
pip install lm_eval[vllm]  # vLLM backend
pip install lm_eval[api]   # API backends (OpenAI, Anthropic, etc.)
```
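If your code imports the HuggingFace backend directly, you may want to fail fast with a clearer message when the extra is missing. A minimal sketch (the guard and error message are illustrative, not part of the library):

```python
# Minimal sketch: give a clear error if the optional HuggingFace
# backend extra has not been installed alongside the core package.
try:
    from lm_eval.models.huggingface import HFLM  # requires `pip install lm_eval[hf]`
except ImportError as err:
    raise SystemExit(
        "HuggingFace backend not installed; run `pip install lm_eval[hf]`."
    ) from err
```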
Additional breaking change: accessing model classes by attribute on the `lm_eval.models` package no longer works:

```python
# This still works:
from lm_eval.models.huggingface import HFLM

# This now raises AttributeError:
import lm_eval.models
lm_eval.models.huggingface.HFLM
```
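If you were resolving model classes dynamically through `lm_eval.models`, one alternative is to look them up by their registered name instead. A sketch, assuming the registry helper `lm_eval.api.registry.get_model` is available in your version (it is what the evaluator uses to resolve `--model` names):

```python
# Sketch: resolve a model class via the model registry rather than
# attribute access on lm_eval.models (assumes lm_eval.api.registry.get_model
# exists in your installed version).
from lm_eval.api.registry import get_model

hf_cls = get_model("hf")  # the class registered under the "hf" model name
```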
CLI Refactor
The CLI now uses explicit subcommands and supports YAML config files (#3440):
```bash
lm-eval run --model hf --tasks hellaswag      # run evaluations
lm-eval run --config my_config.yaml           # load args from YAML config
lm-eval ls tasks                              # list available tasks
lm-eval validate --tasks hellaswag,arc_easy   # validate task configs
```
Backward compatibility is preserved: omitting `run` still works, e.g. `lm-eval --model hf --tasks hellaswag`.
See `lm-eval --help` or the CLI documentation for details.
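For `--config`, a plausible minimal YAML file simply mirrors the CLI flags. The sketch below is illustrative; the field names assume a one-to-one mapping with the command-line options, so check the CLI documentation for the exact supported keys:

```yaml
# my_config.yaml -- illustrative sketch; keys assumed to mirror CLI flags
model: hf
model_args: pretrained=EleutherAI/pythia-160m
tasks:
  - hellaswag
  - arc_easy
num_fewshot: 0
batch_size: 8
output_path: results/
```

Running `lm-eval run --config my_config.yaml` would then be equivalent to spelling the same options out on the command line.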
Other Improvements
- Decoupled `ContextSampler` with a new `build_qa_turn` helper (#3429)
- Normalized `gen_kwargs` with `truncation_side` support for vLLM (#3509)
New Benchmarks & Tasks
- PISA task by @HallerPatrick in #3412
- SLR-Bench (Scalable Logical Reasoning Benchmark) by @Ahmad21Omar in #3305
- OpenAI Multilingual MMLU by @Helw150 in #3473
- ULQA benchmark by @keramjan in #3340
- IFEval in Spanish and Catalan by @juliafalcao in #3467
- TruthfulQA-VA for Catalan by @sgs97ua in #3469
- Multiple Bangla benchmarks by @Ismail-Hossain-1 in #3454
- NeurIPS E2LM Competition submissions: Team Shaikespear, Morai, and Noor by @younesbelkada in #3437, #3443, #3444
Model Support
- Ministral-3 adapter (`hf-mistral3`) by @medhakimbedhief in #3487
Fixes & Improvements
Task Fixes
- Fixed leading whitespace leakage in MMLU-Pro by @baberabb in #3500
- Fixed `gen_prefix` delimiter handling in multiple-choice tasks by @baberabb in #3508
- Fixed MGSM stop criteria in Iberian languages by @juliafalcao in #3465
- Fixed `a=0` as a valid answer index in `build_qa_turn` by @ezylopx5 in #3488
- Fixed `fewshot_config` not being applied to fewshot docs by @baberabb in #3461
- Updated GSM8K, WinoGrande, and SuperGLUE to use full HF dataset paths by @baberabb in #3523, #3525, #3527
- Fixed `gsm8k_cot_llama` `target_delimiter` issue by @baberabb in #3526
- Updated LIBRA task utils by @bond005 in #3520
Backend Fixes
- Fixed vLLM off-by-one `max_length` error by @baberabb in #3503
- Resolved deprecated `vllm.transformers_utils.get_tokenizer` import by @DarkLight1337 in #3482
- Fixed SGLang import and removed duplicate tasks by @baberabb in #3492
- Removed deprecated `AutoModelForVision2Seq` by @baberabb in #3522
- Fixed Anthropic chat model mapping by @lucafossen in #3453
- Fixed a bug preventing the `=` sign in checkpoint names by @mrinaldi97 in #3517
- Fixed `pretty_print_task` for external custom configs by @safikhanSoofiyani in #3436
- Fixed CLI regressions by @fxmarty-amd in #3449
New Contributors
- @safikhanSoofiyani made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3436
- @lucafossen made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3453
- @Ahmad21Omar made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3305
- @ezylopx5 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3488
- @juliafalcao made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3467
- @medhakimbedhief made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3487
- @ntenenz made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3489
- @keramjan made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3340
- @bond005 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3520
- @mrinaldi97 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3517
- @wogns3623 made their first contribution in https://github.com/EleutherAI/lm-evaluation-harness/pull/3523
Full Changelog: https://github.com/EleutherAI/lm-evaluation-harness/compare/v0.4.9.2...v0.4.10
Files

| Name | Size |
|---|---|
| EleutherAI/lm-evaluation-harness-v0.4.10.zip (md5:cabd407d013a3b1433334760c7f30465) | 10.6 MB |
Additional details
Related works
- Is supplement to:
  - Software: https://github.com/EleutherAI/lm-evaluation-harness/tree/v0.4.10 (URL)
Software
- Repository URL: https://github.com/EleutherAI/lm-evaluation-harness