Under Review

ShatterMed-QA

Topology-Regularized Multi-Hop Clinical Reasoning Benchmark for Exposing Shortcut Learning in Medical LLMs

10,558 Questions · 21 LLMs Evaluated · Bilingual (EN & ZH) · 700 Expert-Validated

Abstract

While Large Language Models (LLMs) achieve expert-level performance on standard medical benchmarks through single-hop factual recall, they struggle severely with the complex, multi-hop diagnostic reasoning required in real-world clinical settings. A primary obstacle is "shortcut learning," where models exploit highly connected, generic hub nodes (e.g., "inflammation") in knowledge graphs to bypass authentic micro-pathological cascades.

To address this, we introduce ShatterMed-QA, a bilingual benchmark of 10,558 multi-hop clinical questions designed to rigorously evaluate deep diagnostic reasoning. Our framework constructs a topology-regularized medical Knowledge Graph using a novel k-Shattering algorithm, which physically prunes generic hubs to explicitly sever logical shortcuts. We synthesize the evaluation vignettes by applying implicit bridge entity masking and topology-driven hard negative sampling, forcing models to navigate biologically plausible distractors without relying on superficial elimination.

Comprehensive evaluations of 21 LLMs reveal massive performance degradation on our multi-hop tasks, particularly among domain-specific models. Crucially, restoring the masked evidence via Retrieval-Augmented Generation (RAG) triggers near-universal performance recovery, validating ShatterMed-QA's structural fidelity and proving its efficacy in diagnosing the fundamental reasoning deficits of current medical AI. Explore the dataset, interactive examples, and full leaderboards at our project website: https://shattermed-qa-web.vercel.app/

Large Language Models · Medical QA · Multi-hop Reasoning · Shortcut Learning · Knowledge Graphs · Topology Regularization

Methodology

An end-to-end framework combining topology-regularized KG construction with constrained multi-hop benchmark synthesis.


Figure 1: Overview of the ShatterMed-QA construction pipeline. Phase I builds a topology-regularized Knowledge Graph via dynamic semantic chunking, hierarchical soft clustering, and k-Shattering regularization. Phase II synthesizes multi-hop clinical questions through implicit bridge entity path mining and LLM-based vignette generation with topology-driven hard negative distractors.

Phase I: KG Construction

  • Dynamic Semantic Chunking — Preserves clinical cascades by splitting only where adjacent-sentence cosine distance exceeds the 95th-percentile threshold (see the sketch after this list)
  • Hierarchical Soft Clustering — GMM + BIC optimization for overlapping medical topics
  • k-Shattering Regularization — Physically prunes hub nodes, ensuring d_shattered(u, v) ≥ d_original(u, v)
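
The chunking rule is compact enough to sketch. A minimal sketch, assuming an `embed` callable that maps a sentence to an embedding vector (any sentence-embedding model would do); the 95th-percentile threshold comes from the item above, and everything else is illustrative:

```python
import numpy as np

def semantic_chunks(sentences, embed, percentile=95):
    """Split a sentence sequence at semantic breakpoints.

    A boundary is placed wherever the cosine distance between adjacent
    sentence embeddings exceeds the `percentile`-th percentile of all
    adjacent distances, so tightly coupled clinical cascades stay in
    one chunk.
    """
    vecs = np.stack([embed(s) for s in sentences]).astype(float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    # Cosine distance between each pair of consecutive sentences.
    dists = 1.0 - np.sum(vecs[:-1] * vecs[1:], axis=1)
    threshold = np.percentile(dists, percentile)

    chunks, current = [], [sentences[0]]
    for sent, dist in zip(sentences[1:], dists):
        if dist > threshold:  # semantic break: start a new chunk
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```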

Phase II: QA Synthesis

  • Bridge Entity Masking — The implicit bridge entity e_bridge is strictly excluded from the vignette
  • Topology-Driven Distractors — Sibling nodes sampled from the pathological hierarchy as hard negatives (see the sketch after this list)
  • Evidence-Grounded — Every reasoning chain anchored to exact sentence-level source text
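
Both constraints are straightforward to enforce over a graph. A minimal sketch, assuming the KG is a networkx DiGraph whose hierarchy is encoded as child→parent edges labeled "is_a"; the helper names are illustrative, not the paper's actual API:

```python
import random
import networkx as nx

def sample_hard_negatives(kg: nx.DiGraph, answer: str, k: int = 3):
    """Sample sibling nodes of `answer` (those sharing an 'is_a' parent)
    as hard-negative distractors."""
    parents = [p for _, p, d in kg.out_edges(answer, data=True)
               if d.get("rel") == "is_a"]
    siblings = {c for p in parents
                for c, _, d in kg.in_edges(p, data=True)
                if d.get("rel") == "is_a" and c != answer}
    return random.sample(sorted(siblings), min(k, len(siblings)))

def mask_bridge(vignette: str, e_bridge: str) -> str:
    """Enforce that the implicit bridge entity never surfaces verbatim
    in the generated vignette."""
    assert e_bridge.lower() not in vignette.lower(), \
        f"bridge entity '{e_bridge}' leaked into the vignette"
    return vignette
```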

k-Shattering: From Shortcuts to Deep Reasoning
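
A minimal sketch of the core idea, assuming generic hubs are identified by a simple degree threshold k (the paper's actual hub criterion may be more refined). Because pruning can only lengthen or disconnect surviving paths, the guarantee d_shattered(u, v) ≥ d_original(u, v) holds by construction; the check below makes that invariant explicit:

```python
import networkx as nx

def k_shatter(kg: nx.Graph, k: int) -> nx.Graph:
    """Physically prune hub nodes whose degree exceeds k, severing
    shortcut paths that route through generic concepts."""
    shattered = kg.copy()
    hubs = [n for n, deg in shattered.degree() if deg > k]
    shattered.remove_nodes_from(hubs)
    return shattered

def check_no_shortcuts(original: nx.Graph, shattered: nx.Graph, pairs):
    """Spot-check d_shattered(u, v) >= d_original(u, v) on node pairs."""
    for u, v in pairs:
        if shattered.has_node(u) and shattered.has_node(v) \
                and nx.has_path(shattered, u, v):
            d_new = nx.shortest_path_length(shattered, u, v)
            d_old = nx.shortest_path_length(original, u, v)
            assert d_new >= d_old, f"shortcut survived between {u} and {v}"
```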

Dataset Statistics

10,558 meticulously synthesized clinical questions across 5 primary clinical tasks, split along two axes: language (English vs. Chinese) and difficulty (Easy vs. Hard).

Utilizing our framework, we introduce a bilingual (English and Chinese) dataset of 10,558 multi-hop clinical QA pairs. This includes a rigorously physician-vetted Golden Subset of 264 highly complex diagnostic vignettes, establishing a pristine evaluation ground for frontier LLMs.
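
For concreteness, a single record might look like the sketch below. This is a hypothetical layout assembled only from the components described on this page; field names are illustrative, not the dataset's actual schema:

```python
# Hypothetical record layout; consult the project website for the real format.
example_item = {
    "id": "shattermed-en-hard-00001",      # illustrative identifier
    "language": "en",                      # "en" or "zh"
    "split": "hard",                       # "easy" or "hard"
    "vignette": "...",                     # bridge entity strictly excluded
    "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "A",
    "distractors": ["..."],                # topology-driven hard negatives
    "evidence": ["..."],                   # sentence-level source anchors
    "explanation": "...",                  # full reasoning chain
}
```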

Figures: Clinical Task Distribution; Task Distribution by Split.

Dataset Quality Metrics

Overall Statistics

| Metric | EN Easy | EN Hard | ZH Easy | ZH Hard | Overall |
|---|---|---|---|---|---|
| Total QA Pairs | 5,923 | 1,692 | 2,616 | 327 | 10,558 |
| Avg. Question Length | 155.9 | 199.9 | 77.5 | 107.2 | |
| Avg. Explanation Length | 805.3 | 869.8 | 629.7 | 593.9 | |
| Avg. Distractor Similarity | 0.579 | 0.606 | 0.629 | 0.658 | 0.598 |
| ROUGE-1 vs Content A | 0.096 | 0.104 | 0.081 | 0.076 | 0.093 |
| ROUGE-1 vs Content B | 0.103 | 0.113 | 0.088 | 0.082 | 0.100 |
| BLEU-1 vs Evidence | 0.704 | 0.681 | 0.412 | 0.311 | 0.609 |
| LLM-Judged Clarity (1-5) | 4.72 | 4.60 | 4.71 | 4.52 | 4.70 |
| LLM-Judged Validity (1-5) | 4.89 | 4.83 | 4.84 | 4.73 | 4.87 |
| LLM-Judged Difficulty (1-5) | 3.11 | 3.11 | 3.15 | 3.18 | 3.11 |
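
As a reading aid for the "Avg. Distractor Similarity" row, a minimal sketch assuming the metric is the mean cosine similarity between each distractor's embedding and the correct answer's embedding (the benchmark's exact pairing may differ):

```python
import numpy as np

def distractor_similarity(answer_vec, distractor_vecs):
    """Mean cosine similarity between distractors and the gold answer."""
    a = answer_vec / np.linalg.norm(answer_vec)
    d = distractor_vecs / np.linalg.norm(distractor_vecs, axis=1, keepdims=True)
    return float(np.mean(d @ a))
```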

Comparison with Existing Benchmarks

  • 1.4% Error Rate — Expert validation by board-certified physicians (1,500 questions)
  • 4.87 Validity Score — Highest among all evaluated benchmarks
  • 0.598 Distractor Similarity — High cosine similarity confirms effective hard negatives

Experimental Results

Comprehensive zero-shot evaluation of 21 LLMs reveals systemic performance degradation on multi-hop tasks.

Figures: Zero-Shot Accuracy by Model Category; Easy→Hard Performance Drop; RAG vs Direct Accuracy (Hard); RAG Accuracy Improvement (Hard Split).
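
The RAG ablation amounts to asking each question twice, with and without the masked evidence restored to the prompt. A minimal sketch, assuming a generic `ask_model` completion call and the hypothetical record fields sketched earlier:

```python
def evaluate_item(item, ask_model):
    """Compare direct answering against answering with restored evidence."""
    direct_prompt = f"{item['vignette']}\nOptions: {item['options']}"
    rag_prompt = ("Evidence:\n" + "\n".join(item["evidence"])
                  + "\n\n" + direct_prompt)
    direct_ok = ask_model(direct_prompt) == item["answer"]
    rag_ok = ask_model(rag_prompt) == item["answer"]
    return direct_ok, rag_ok
```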

Dataset Showcase

Explore real questions from the ShatterMed-QA benchmark; interactive examples with detailed model outputs are available on the project website.
