Topology-Regularized Multi-Hop Clinical Reasoning Benchmark for Exposing Shortcut Learning in Medical LLMs
10,558 Questions · 21 LLMs Evaluated · Bilingual (EN & ZH) · 700 Expert-Validated
While Large Language Models (LLMs) achieve expert-level scores on standard medical benchmarks through single-hop factual recall, they struggle with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning": models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades.
To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel k-Shattering algorithm, which structurally prunes generic hubs to explicitly sever logical shortcuts. We then synthesize evaluation vignettes via implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors rather than rely on superficial elimination.
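The intuition behind k-Shattering can be illustrated with a toy knowledge graph. The sketch below is a minimal, illustrative implementation — the benchmark's exact shattering criterion, threshold, and graph schema are assumptions here; it simply removes nodes whose degree exceeds k, so that a generic hub like "inflammation" can no longer provide a short logical bridge between otherwise distant clinical entities:

```python
from collections import deque

def k_shatter(graph, k):
    """Prune hub nodes whose degree exceeds k, severing shortcut paths.

    A minimal sketch of the k-Shattering idea (the degree-threshold
    criterion is an assumption). graph: dict node -> set of neighbours.
    """
    hubs = {n for n, nbrs in graph.items() if len(nbrs) > k}
    return {
        n: {m for m in nbrs if m not in hubs}
        for n, nbrs in graph.items() if n not in hubs
    }

def shortest_hops(graph, src, dst):
    """BFS hop count between src and dst; None if unreachable."""
    if src not in graph or dst not in graph:
        return None
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == dst:
            return depth
        for m in graph[node]:
            if m not in seen:
                seen.add(m)
                queue.append((m, depth + 1))
    return None

# Toy KG: "inflammation" is a generic hub linking unrelated conditions,
# while the specific cascade runs H. pylori -> gastritis -> atrophy -> ulcer.
kg = {
    "H. pylori": {"inflammation", "chronic gastritis"},
    "chronic gastritis": {"H. pylori", "mucosal atrophy"},
    "mucosal atrophy": {"chronic gastritis", "gastric ulcer"},
    "gastric ulcer": {"mucosal atrophy", "inflammation"},
    "inflammation": {"H. pylori", "gastric ulcer", "asthma", "gout"},
    "asthma": {"inflammation"},
    "gout": {"inflammation"},
}

before = shortest_hops(kg, "H. pylori", "gastric ulcer")               # 2 hops via the hub
after = shortest_hops(k_shatter(kg, 3), "H. pylori", "gastric ulcer")  # 3 hops via the cascade
```

After shattering, the only surviving route follows the specific pathological cascade, which is exactly the reasoning path the benchmark aims to test.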
Comprehensive evaluations of 21 LLMs reveal severe performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and demonstrating its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/
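The RAG recovery check amounts to comparing a closed-book prompt against the same prompt with the masked bridge-entity evidence restored in-context. A minimal sketch — the field names (`question`, `options`, `masked_evidence`) are hypothetical, not the benchmark's actual schema:

```python
def build_prompts(item):
    """Return a (closed-book, evidence-augmented) prompt pair for one QA item.

    Hypothetical schema: 'question', 'options', and 'masked_evidence' are
    illustrative field names, not the benchmark's real format.
    """
    options = "\n".join(f"{k}. {v}" for k, v in sorted(item["options"].items()))
    closed_book = f"{item['question']}\n{options}\nAnswer with a single letter."
    # The RAG condition simply re-supplies the masked evidence in-context.
    with_evidence = f"Evidence:\n{item['masked_evidence']}\n\n{closed_book}"
    return closed_book, with_evidence

item = {
    "question": "Which mechanism most likely underlies this presentation?",
    "options": {"A": "Autoimmune atrophy", "B": "H. pylori-driven cascade"},
    "masked_evidence": "Biopsy shows antral-predominant chronic gastritis.",
}
closed_book, with_evidence = build_prompts(item)
```

If accuracy recovers under the evidence-augmented condition but not the closed-book one, the failure was a missing reasoning bridge rather than a flaw in the question itself.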
An end-to-end framework combining topology-regularized KG construction with constrained multi-hop benchmark synthesis.
Figure 1: Overview of the ShatterMed-QA construction pipeline. Phase I builds a topology-regularized Knowledge Graph via dynamic semantic chunking, hierarchical soft clustering, and k-Shattering regularization. Phase II synthesizes multi-hop clinical questions through implicit bridge entity path mining and LLM-based vignette generation with topology-driven hard negative distractors.
10,558 meticulously synthesized clinical questions across 5 primary clinical tasks, split by language (EN/ZH) and difficulty (Easy/Hard).
Using this framework, we release a bilingual (English and Chinese) dataset of 10,558 multi-hop clinical QA pairs. It includes a rigorously physician-vetted Golden Subset of 264 highly complex diagnostic vignettes, providing a clean evaluation bed for frontier LLMs.
| Metric | EN Easy | EN Hard | ZH Easy | ZH Hard | Overall |
|---|---|---|---|---|---|
| Total QA Pairs | 5,923 | 1,692 | 2,616 | 327 | 10,558 |
| Avg. Question Length | 155.9 | 199.9 | 77.5 | 107.2 | — |
| Avg. Explanation Length | 805.3 | 869.8 | 629.7 | 593.9 | — |
| Avg. Distractor Similarity | 0.579 | 0.606 | 0.629 | 0.658 | 0.598 |
| ROUGE-1 vs Content A | 0.096 | 0.104 | 0.081 | 0.076 | 0.093 |
| ROUGE-1 vs Content B | 0.103 | 0.113 | 0.088 | 0.082 | 0.100 |
| BLEU-1 vs Evidence | 0.704 | 0.681 | 0.412 | 0.311 | 0.609 |
| LLM-Judged Clarity (1-5) | 4.72 | 4.60 | 4.71 | 4.52 | 4.70 |
| LLM-Judged Validity (1-5) | 4.89 | 4.83 | 4.84 | 4.73 | 4.87 |
| LLM-Judged Difficulty (1-5) | 3.11 | 3.11 | 3.15 | 3.18 | 3.11 |
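The ROUGE-1 rows above measure unigram overlap between generated questions and their source content; low scores indicate the vignettes are abstracted rather than copied verbatim. A minimal pure-Python ROUGE-1 F1 sketch (whitespace tokenization is an assumption; the benchmark's actual tokenizer and any library implementation may differ):

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

question = "a patient presents with epigastric pain and weight loss"
source = "epigastric pain radiating to the back suggests pancreatic disease"
score = rouge1_f1(question, source)  # low overlap: only two shared unigrams
```

A score near 0.1, as reported in the table, means roughly one in ten unigrams of a generated question also appears in its source passage.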
Comprehensive zero-shot evaluation of 21 LLMs reveals systemic performance degradation on multi-hop tasks.
Explore real questions from the ShatterMed-QA benchmark. Click cards to expand detailed model outputs.