ClinDEF is a dynamic framework for assessing clinical reasoning in large language models (LLMs) through simulated multi-turn doctor-patient diagnostic dialogues. Unlike existing evaluation paradigms that primarily rely on static benchmarks, ClinDEF models diagnosis as an interactive process involving hypothesis generation, information gathering, test justification, and differential revision. By integrating disease knowledge graphs with a multi-agent dialogue environment, ClinDEF enables contamination-resistant case generation and process-aware evaluation across accuracy, efficiency, and reasoning quality. Its fine-grained, rubric-based assessment exposes clinically meaningful reasoning gaps in state-of-the-art LLMs, providing a robust and scalable paradigm for advancing reliable clinical AI evaluation.
First, clone the repository:

```bash
git clone https://github.com/HICAI-ZJU/ClinDEF
```

Next, set up a conda environment:

```bash
conda create -n ClinDEF python=3.10.9
conda activate ClinDEF
```

Then, install the dependencies using pip:

```bash
pip install -r requirements.txt
```

Step 1: API and Model Endpoint Setup
- Deploy your model as an OpenAI-compatible server.
- Configure API settings in `request.py`.
Step 2: Run Multi-Agent Simulation
```bash
cd src
bash run.sh
```

`run.sh` iterates over the different Doctor Agents and calls `interact.py` to simulate up to 15 turns. The interaction proceeds as follows:
- Patient Agent (`patient.py`) generates the initial chief complaint and answers subsequent questions.
- Doctor Agent (`doctor.py`) reads the evolving `chat_history` and produces a Thought + Action; the dialogue terminates when `[!Diag!]` is output.
- Examiner Agent (`assistant.py`) returns formatted `[!Positive!]`/`[!Negative!]` examination results and counts findings.
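The turn loop above can be sketched as follows. This is a minimal illustration with stub agents and made-up messages, not the actual LLM-backed implementation in `interact.py`, `patient.py`, `doctor.py`, and `assistant.py`:

```python
# Sketch of the Patient/Doctor/Examiner turn loop. Stub functions and
# message contents are illustrative assumptions, not ClinDEF's real agents.

MAX_TURNS = 15

def stub_patient(history):
    # The real agent answers conditioned on the hidden case profile.
    return "Patient: The fever and cough started three days ago."

def stub_doctor(history):
    # Produces Thought + Action; emits [!Diag!] to end the dialogue.
    if len(history) >= 4:
        return "Doctor: [!Diag!] Likely community-acquired pneumonia."
    return "Doctor: [!Test!] Order a chest X-ray."

def stub_examiner(history):
    # Returns formatted examination results.
    return "Examiner: [!Positive!] Infiltrate in the right lower lobe."

def run_dialogue():
    history = ["Patient: My chief complaint is fever and cough."]
    for _ in range(MAX_TURNS):
        doctor_msg = stub_doctor(history)
        history.append(doctor_msg)
        if "[!Diag!]" in doctor_msg:
            break  # final diagnosis reached, stop the dialogue
        if "[!Test!]" in doctor_msg:
            history.append(stub_examiner(history))
        else:
            history.append(stub_patient(history))
    return history

dialogue = run_dialogue()
```

With these stubs the loop terminates as soon as the Doctor emits the `[!Diag!]` token, mirroring the termination condition described above.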
Step 3: Fine-Grained Quality Assessment
```bash
cd ../eval
python quality_eval.py
```

`quality_eval.py` reads `result_<model>.jsonl` and uses an LLM as a judge to score 7 quality dimensions plus a total score. Results are written to:

- `score/<model>_quality_evaluation_scored.jsonl`
- `score/<model>_quality_evaluation_failed.jsonl`
- `all_models_quality_summary.json`
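As a rough sketch of how the scored JSONL can be aggregated into a model-level summary (the field names `total_score` and the sample values here are assumptions for illustration; the actual schema is defined by `quality_eval.py`):

```python
# Aggregate per-case judge scores into a model-level summary.
# Field names and values below are illustrative assumptions.
import json
from statistics import mean

scored_lines = [
    '{"name": "pneumonia", "total_score": 31}',
    '{"name": "asthma", "total_score": 27}',
]

records = [json.loads(line) for line in scored_lines]
summary = {
    "n_cases": len(records),
    "avg_total": mean(r["total_score"] for r in records),
}
```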
Step 4: Patient-Side Binary Checks
```bash
python patient_quality_eval.py
```

This script evaluates two boolean criteria: Information Leakage and Clinical Fidelity. It outputs:

- `patient_score/new/<model>_binary_eval.jsonl`
- `patient_score/new/<model>_binary_eval_failed.jsonl`
- `binary_evaluation_summary.json`
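Binary checks like these reduce to simple pass rates. A minimal sketch, assuming hypothetical key names (`information_leakage`, `clinical_fidelity`) for the two criteria:

```python
# Compute pass rates for two boolean patient-side criteria.
# Key names and sample records are illustrative assumptions.
import json

lines = [
    '{"information_leakage": false, "clinical_fidelity": true}',
    '{"information_leakage": false, "clinical_fidelity": false}',
]
records = [json.loads(line) for line in lines]
leak_rate = sum(r["information_leakage"] for r in records) / len(records)
fidelity_rate = sum(r["clinical_fidelity"] for r in records) / len(records)
```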
Step 5: Retry Missing Quality Scores (Optional)
```bash
python retry_quality_eval.py
```

When `score/<model>_quality_evaluation_scored.jsonl` has missing items, this script re-evaluates only the missing entries and appends the successful results to the original file.
- KG dataset format: `source/data.jsonl` is the dataset constructed from a knowledge graph (KG). Each line is a JSON object with at least:
  - `name`: disease name
  - `desc`: case profile + disease description
- Dialogue result format: `src/result_<model>.jsonl` stores per-case outputs:

  ```json
  {"name": "<disease>", "final_answer": "<diagnosis>", "turns": "<int>", "positive_findings": "<int>", "negative_findings": "<int>", "chat_history": "<full dialogue text>"}
  ```

- Evaluation artifacts:
  - `eval/score/` contains the 7-dimension scores.
  - `eval/patient_score/new/` contains the patient-side binary checks.
- Summary files: `all_models_quality_summary.json` and `binary_evaluation_summary.json`.
- Model access: Set `OPENAI_API_BASE`/`OPENAI_API_KEY`, `DASHSCOPE_API_BASE`/`DASHSCOPE_API_KEY`, and `MODELSCOPE_API_BASE`/`MODELSCOPE_API_KEY` as needed (different models may use different API providers).
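For example (placeholder keys; the base URLs shown are the providers' public OpenAI-compatible endpoints and may differ for a self-hosted deployment):

```shell
export OPENAI_API_BASE="https://api.openai.com/v1"
export OPENAI_API_KEY="sk-..."
export DASHSCOPE_API_BASE="https://dashscope.aliyuncs.com/compatible-mode/v1"
export DASHSCOPE_API_KEY="sk-..."
```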
If you use ClinDEF in your research, please cite our paper:
```bibtex
@article{tang2025clindef,
  title={ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning},
  author={Tang, Yuqi and Yu, Jing and Su, Zichang and Feng, Kehua and Zhu, Zhihui and Wang, Libin and Liang, Lei and Zhang, Qiang and Ding, Keyan and Chen, Huajun},
  journal={arXiv preprint arXiv:2512.23440},
  year={2025}
}
```
