Under Review

ShatterMed-QA

Topology-Regularized Multi-Hop Clinical Reasoning Benchmark for Exposing Shortcut Learning in Medical LLMs

10,558 Questions · 21 LLMs Evaluated · Bilingual (EN & ZH) · 700 Expert-Validated

Abstract

While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they struggle severely with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning," where models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades.

To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel k-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination.

Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/

Large Language Models · Medical QA · Multi-hop Reasoning · Shortcut Learning · Knowledge Graphs · Topology Regularization

Methodology

An end-to-end framework combining topology-regularized KG construction with constrained multi-hop benchmark synthesis.


Figure 1: Overview of the ShatterMed-QA construction pipeline. Phase I builds a topology-regularized Knowledge Graph via dynamic semantic chunking, hierarchical soft clustering, and k-Shattering regularization. Phase II synthesizes multi-hop clinical questions through implicit bridge entity path mining and LLM-based vignette generation with topology-driven hard negative distractors.

Phase I: KG Construction

  • Dynamic Semantic Chunking — Preserves clinical cascades by splitting only where adjacent-sentence cosine distance exceeds the 95th-percentile threshold (see the sketch after this list)
  • Hierarchical Soft Clustering — GMM + BIC optimization for overlapping medical topics
  • k-Shattering Regularization — Physically prunes hub nodes, ensuring d_shattered(u, v) ≥ d_original(u, v)
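
The chunking rule is compact enough to sketch. A minimal sketch, assuming an `embed` callable that maps a sentence to an embedding vector (any sentence-embedding model would do); the 95th-percentile threshold comes from the item above, and everything else is illustrative:

```python
import numpy as np

def semantic_chunks(sentences, embed, percentile=95):
    """Split a sentence sequence at semantic breakpoints.

    A boundary is placed wherever the cosine distance between adjacent
    sentence embeddings exceeds the `percentile`-th percentile of all
    adjacent distances, so tightly coupled clinical cascades stay in
    one chunk.
    """
    vecs = np.stack([embed(s) for s in sentences]).astype(float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    # Cosine distance between each pair of consecutive sentences.
    dists = 1.0 - np.sum(vecs[:-1] * vecs[1:], axis=1)
    threshold = np.percentile(dists, percentile)

    chunks, current = [], [sentences[0]]
    for sent, dist in zip(sentences[1:], dists):
        if dist > threshold:  # semantic break: start a new chunk
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```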

Phase II: QA Synthesis

  • Bridge Entity Masking — The implicit bridge entity e_bridge is strictly excluded from the vignette
  • Topology-Driven Distractors — Sibling nodes sampled from the pathological hierarchy as hard negatives (see the sketch after this list)
  • Evidence-Grounded — Every reasoning chain anchored to exact sentence-level source text
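
Both constraints are straightforward to enforce over a graph. A minimal sketch, assuming the KG is a networkx DiGraph whose hierarchy is encoded as child→parent edges labeled "is_a"; the helper names are illustrative, not the paper's actual API:

```python
import random
import networkx as nx

def sample_hard_negatives(kg: nx.DiGraph, answer: str, k: int = 3):
    """Sample sibling nodes of `answer` (those sharing an 'is_a' parent)
    as hard-negative distractors."""
    parents = [p for _, p, d in kg.out_edges(answer, data=True)
               if d.get("rel") == "is_a"]
    siblings = {c for p in parents
                for c, _, d in kg.in_edges(p, data=True)
                if d.get("rel") == "is_a" and c != answer}
    return random.sample(sorted(siblings), min(k, len(siblings)))

def mask_bridge(vignette: str, e_bridge: str) -> str:
    """Enforce that the implicit bridge entity never surfaces verbatim
    in the generated vignette."""
    assert e_bridge.lower() not in vignette.lower(), \
        f"bridge entity '{e_bridge}' leaked into the vignette"
    return vignette
```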

k-Shattering: From Shortcuts to Deep Reasoning
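
A minimal sketch of the core idea, assuming generic hubs are identified by a simple degree threshold k (the paper's actual hub criterion may be more refined). Because pruning can only lengthen or disconnect surviving paths, the guarantee d_shattered(u, v) ≥ d_original(u, v) holds by construction; the check below makes that invariant explicit:

```python
import networkx as nx

def k_shatter(kg: nx.Graph, k: int) -> nx.Graph:
    """Physically prune hub nodes whose degree exceeds k, severing
    shortcut paths that route through generic concepts."""
    shattered = kg.copy()
    hubs = [n for n, deg in shattered.degree() if deg > k]
    shattered.remove_nodes_from(hubs)
    return shattered

def check_no_shortcuts(original: nx.Graph, shattered: nx.Graph, pairs):
    """Spot-check d_shattered(u, v) >= d_original(u, v) on node pairs."""
    for u, v in pairs:
        if shattered.has_node(u) and shattered.has_node(v) \
                and nx.has_path(shattered, u, v):
            d_new = nx.shortest_path_length(shattered, u, v)
            d_old = nx.shortest_path_length(original, u, v)
            assert d_new >= d_old, f"shortcut survived between {u} and {v}"
```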

Dataset Statistics

10,558 meticulously synthesized clinical questions across 5 primary clinical tasks, split along two axes: language (English vs. Chinese) and difficulty (Easy vs. Hard).

Utilizing our framework, we introduce a bilingual (English and Chinese) dataset of 10,558 multi-hop clinical QA pairs. This includes a rigorously physician-vetted Golden Subset of 264 highly complex diagnostic vignettes, establishing a pristine evaluation ground for frontier LLMs.
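
For concreteness, a single record might look like the sketch below. This is a hypothetical layout assembled only from the components described on this page; field names are illustrative, not the dataset's actual schema:

```python
# Hypothetical record layout; consult the project website for the real format.
example_item = {
    "id": "shattermed-en-hard-00001",      # illustrative identifier
    "language": "en",                      # "en" or "zh"
    "split": "hard",                       # "easy" or "hard"
    "vignette": "...",                     # bridge entity strictly excluded
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "A",
    "distractors": ["..."],                # topology-driven hard negatives
    "evidence": ["..."],                   # sentence-level source anchors
    "explanation": "...",                  # full reasoning chain
}
```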

Figures: Clinical Task Distribution; Task Distribution by Split.

Dataset Quality Metrics

Overall Statistics

| Metric | EN Easy | EN Hard | ZH Easy | ZH Hard | Overall |
|---|---|---|---|---|---|
| Total QA Pairs | 5,923 | 1,692 | 2,616 | 327 | 10,558 |
| Avg. Question Length | 155.9 | 199.9 | 77.5 | 107.2 | |
| Avg. Explanation Length | 805.3 | 869.8 | 629.7 | 593.9 | |
| Avg. Distractor Similarity | 0.579 | 0.606 | 0.629 | 0.658 | 0.598 |
| ROUGE-1 vs Content A | 0.096 | 0.104 | 0.081 | 0.076 | 0.093 |
| ROUGE-1 vs Content B | 0.103 | 0.113 | 0.088 | 0.082 | 0.100 |
| BLEU-1 vs Evidence | 0.704 | 0.681 | 0.412 | 0.311 | 0.609 |
| LLM-Judged Clarity (1-5) | 4.72 | 4.60 | 4.71 | 4.52 | 4.70 |
| LLM-Judged Validity (1-5) | 4.89 | 4.83 | 4.84 | 4.73 | 4.87 |
| LLM-Judged Difficulty (1-5) | 3.11 | 3.11 | 3.15 | 3.18 | 3.11 |
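
As a reading aid for the "Avg. Distractor Similarity" row, a minimal sketch assuming the metric is the mean cosine similarity between each distractor's embedding and the correct answer's embedding (the benchmark's exact pairing may differ):

```python
import numpy as np

def distractor_similarity(answer_vec, distractor_vecs):
    """Mean cosine similarity between distractors and the gold answer."""
    a = answer_vec / np.linalg.norm(answer_vec)
    d = distractor_vecs / np.linalg.norm(distractor_vecs, axis=1, keepdims=True)
    return float(np.mean(d @ a))
```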

Comparison with Existing Benchmarks

  • 1.4% Error Rate — Expert validation by board-certified physicians (1,500 questions)
  • 4.87 Validity Score — Highest among all evaluated benchmarks
  • 0.598 Distractor Similarity — High cosine similarity confirms effective hard negatives

Experimental Results

Comprehensive zero-shot evaluation of 21 LLMs reveals systemic performance degradation on multi-hop tasks.

Figures: Zero-Shot Accuracy by Model Category; Easy→Hard Performance Drop; RAG vs Direct Accuracy (Hard); RAG Accuracy Improvement (Hard Split).
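
The RAG ablation amounts to asking each question twice, with and without the masked evidence restored to the prompt. A minimal sketch, assuming a generic `ask_model` completion call and the hypothetical record fields sketched earlier:

```python
def evaluate_item(item, ask_model):
    """Compare direct answering against answering with restored evidence."""
    direct_prompt = f"{item['vignette']}\nOptions: {item['options']}"
    rag_prompt = ("Evidence:\n" + "\n".join(item["evidence"])
                  + "\n\n" + direct_prompt)
    direct_ok = ask_model(direct_prompt) == item["answer"]
    rag_ok = ask_model(rag_prompt) == item["answer"]
    return direct_ok, rag_ok
```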

Dataset Showcase

Explore real questions from the ShatterMed-QA benchmark; interactive examples with detailed model outputs are available on the project website.
