Diagnostic imaging AI is the largest category of FDA-cleared AI medical devices by a significant margin. Of the 950+ AI-enabled devices cleared by FDA through 2024, the majority are radiology AI — covering CT, MRI, X-ray, ultrasound, mammography, and pathology imaging. The European market follows a similar pattern.
This volume of cleared devices might suggest a mature, well-understood field. The engineering reality is more complicated. Published radiology AI models frequently fail at external validation. Devices cleared through regulatory submission often perform materially worse in real clinical deployment than in their validation studies. And the gap between benchmark performance and deployment performance tends to be larger for imaging AI than for most other clinical AI categories.
This article is about why that gap exists and what building imaging AI correctly requires — from the data pipeline through model architecture to regulatory submission.
This is a companion piece to our broader health AI series: the engineering reality of clinical-grade AI, for technical principles that apply across modalities; clinical validation, for the evidence package structure; the data problem, for strategies when training data is scarce; post-market surveillance, for what happens after CE marking; and FDA pathway selection, for the US-side strategy.
Why Imaging AI Fails at External Validation
The gap between internal validation performance and external validation performance for imaging AI has been documented extensively in the medical literature. Published meta-analyses of AI diagnostic studies have repeatedly found that the majority show significant performance degradation when evaluated on datasets from sites other than those used for development. The degradation is not explained by technical problems with the models. It's explained by the structure of medical imaging data.
Scanner-specific features are learned as clinical features. Medical images contain information about the disease of interest. They also contain information about the scanner that produced them: texture characteristics, noise patterns, contrast profiles, spatial resolution, and artifacts specific to manufacturer and acquisition protocol. Deep neural networks trained on images from a limited range of scanners learn both types of features. When the model is deployed on images from different scanners, the scanner-specific features no longer match the training distribution, but the model still relies on them as if they did. Performance tends to degrade as the divergence between training and deployment scanner characteristics grows.
This problem is more severe for CT and MRI, where acquisition parameters (kernel, slice thickness, reconstruction algorithm, field of view) vary enormously between sites and protocols. It is less severe for mammography, where acquisition is more standardised, and least severe for pathology images, where standardised staining and acquisition protocols reduce variability.
Site-specific clinical practice creates systematic label differences. "Ground truth" in imaging AI is typically expert radiologist annotation. Different radiologists at different institutions have different thresholds for what constitutes a reportable finding, different conventions for how to describe ambiguous cases, and different access to clinical context when annotating images. A model trained on annotations from Institution A may learn the annotation conventions of Institution A's radiologists — conventions that differ from those of Institution B.
For binary classification (finding present / absent), this creates systematic threshold differences between sites. For severity grading or measurement tasks, it creates systematic calibration differences. The model is correctly predicting what Institution A's radiologists would say — but deployed at Institution B, this doesn't match what Institution B's radiologists say.
Patient population differences between sites. Referral patterns, geographic demographics, and disease prevalence differ between institutions. A model trained at a tertiary referral centre, where referred patients have higher disease prevalence and often more severe or complex presentations, may underperform at a community hospital with a different case mix. Model calibration (the relationship between the predicted probability and the observed frequency of disease) shifts with prevalence, and nothing in the model's learned parameters corrects for that shift.
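The effect of prevalence on calibration can be illustrated with the standard prior-shift correction. This is a minimal sketch, assuming the only difference between sites is disease prevalence (the appearance of disease on the image is unchanged) and that the model's output was calibrated at the training prevalence. The function name and the example prevalences are illustrative and do not come from any specific device.

```python
def adjust_for_prevalence(p: float, train_prev: float, deploy_prev: float) -> float:
    """Re-scale a calibrated probability from training to deployment prevalence.

    Assumes pure prevalence (label) shift: the likelihood ratio of the image
    evidence is the same at both sites, so only the prior odds change.
    """
    for prev in (train_prev, deploy_prev):
        if not 0.0 < prev < 1.0:
            raise ValueError("prevalences must be strictly between 0 and 1")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Weight the positive and negative mass by the ratio of new to old priors.
    pos = p * deploy_prev / train_prev
    neg = (1.0 - p) * (1.0 - deploy_prev) / (1.0 - train_prev)
    return pos / (pos + neg)


# Hypothetical case: model trained at 30% prevalence, deployed at 5% prevalence.
community_score = adjust_for_prevalence(0.5, train_prev=0.30, deploy_prev=0.05)
```

Under these assumptions, a score of 0.5 at the referral centre corresponds to a probability of roughly 0.11 at the community hospital. A threshold tuned at one site therefore carries a different meaning at the other.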
What the Training Data Pipeline Must Include
Multi-scanner coverage in the training set. The most effective technical mitigation for scanner-specific feature learning is training on data from multiple scanners and acquisition protocols. This is generally more effective than domain adaptation techniques applied after training. If your training data comes from one or two institutions, the scanner diversity in your training set almost certainly doesn't reflect the range of equipment your device will encounter in deployment.
For a commercially deployed imaging AI, a training dataset covering at least 5–10 distinct scanner manufacturers, models, or acquisition protocols in each target imaging modality is a reasonable minimum. Regulatory submissions that rely on single-site imaging data face increasing scrutiny from notified bodies.
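A simple way to make scanner coverage visible is to audit the training manifest. The sketch below assumes a hypothetical manifest of dictionaries whose `manufacturer` and `model` fields were copied from the DICOM Manufacturer (0008,0070) and Manufacturer's Model Name (0008,1090) attributes. The field names, vendor names, and function names are placeholders.

```python
from collections import Counter


def scanner_coverage(manifest):
    """Count training studies per (manufacturer, model) pair."""
    return Counter((row["manufacturer"], row["model"]) for row in manifest)


def uncovered_scanners(training_manifest, deployment_scanners):
    """Return deployment scanners that never appear in the training manifest."""
    covered = set(scanner_coverage(training_manifest))
    return sorted(set(deployment_scanners) - covered)


# Hypothetical manifest rows; real ones would be extracted from DICOM headers.
training_manifest = [
    {"manufacturer": "VendorA", "model": "Model-1"},
    {"manufacturer": "VendorA", "model": "Model-1"},
    {"manufacturer": "VendorB", "model": "Model-7"},
]
planned_sites = [("VendorA", "Model-1"), ("VendorC", "Model-9")]
gaps = uncovered_scanners(training_manifest, planned_sites)
```

Here `gaps` lists the scanners a planned deployment would expose the model to without any training coverage. A fuller audit would also group by acquisition protocol (kernel, slice thickness), which this sketch leaves out.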
Harmonisation and normalisation. Image normalisation approaches — intensity normalisation for MRI, dose normalisation for CT, stain normalisation for pathology — reduce inter-scanner variability in the training data and in the inference pipeline. Site-wise normalisation, histogram matching, and deep learning-based domain adaptation (CycleGAN and similar) are used to reduce the scanner signature in images before the clinical feature extraction. Each approach has trade-offs: aggressive normalisation that reduces scanner-specific features may also reduce clinically relevant signal. The choice of normalisation approach should be documented and justified in the technical file.
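As one concrete instance of intensity normalisation, the sketch below clips to a percentile window and then z-scores the result. This is a common choice for MRI, where raw intensities have arbitrary units. It is an illustrative assumption rather than a recommended method: it operates on a flat list of voxel intensities for readability, uses a crude percentile lookup, and would not be appropriate for CT, where Hounsfield units are already calibrated.

```python
from statistics import fmean, pstdev


def zscore_normalise(intensities, lower_pct=1.0, upper_pct=99.0):
    """Clip to a percentile window, then rescale to zero mean and unit variance."""
    if not intensities:
        raise ValueError("need at least one intensity value")
    ordered = sorted(intensities)
    n = len(ordered)
    # Crude percentile lookup by index; enough for a sketch.
    lo = ordered[int((n - 1) * lower_pct / 100)]
    hi = ordered[int((n - 1) * upper_pct / 100)]
    clipped = [min(max(v, lo), hi) for v in intensities]
    mu = fmean(clipped)
    sigma = pstdev(clipped)
    if sigma == 0:
        return [0.0 for _ in clipped]  # constant image: nothing to scale
    return [(v - mu) / sigma for v in clipped]


# Two hypothetical scanners imaging the same tissue with different gain and offset.
scanner_a = [0.0, 10.0, 20.0, 30.0, 40.0]
scanner_b = [2.0 * v + 100.0 for v in scanner_a]
norm_a = zscore_normalise(scanner_a)
norm_b = zscore_normalise(scanner_b)
```

After normalisation the two lists agree to floating-point precision, so a linear gain and offset difference between scanners is removed. The trade-off noted above still applies: any clinically meaningful signal carried in absolute intensity is removed along with it.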
Prospectively labelled data, not opportunistic PACS pulls. Many imaging AI training datasets are assembled by pulling images from PACS (Picture Archiving and Communication System) archives and using associated radiology reports as labels. This is fast but produces lower-quality labels than prospective annotation. Radiology reports are written for clinical communication, not for ML training — they use variable terminology, they reference clinical context not visible in the image, and they may not describe all relevant findings (a radiologist may note the primary finding without documenting an incidental finding that would be relevant to the ML classifier).
Prospective labelling — having annotators review images specifically for the purpose of ML training, with a documented annotation protocol, multiple annotators, and adjudication for disagreements — produces higher-quality training data at significantly higher cost. For regulatory-grade training datasets, prospective labelling or a structured retrospective labelling process with the same quality controls is expected.
Model Architecture Considerations for Imaging AI
CNNs remain the workhorse for 2D imaging tasks. ResNet, EfficientNet, DenseNet, and their variants are mature, well-characterised architectures for 2D image classification and detection. Pre-trained weights on ImageNet (or on large medical imaging datasets like CheXpert or RadImageNet) provide strong initialisation for fine-tuning on medical imaging tasks. The performance of CNN-based architectures on standard medical imaging benchmarks is well-established and provides a defensible comparison point for regulatory evidence.
Vision Transformers (ViTs) are gaining ground. Self-attention mechanisms in Vision Transformers are better at capturing long-range spatial relationships in images — relevant for imaging tasks where the diagnostic feature may involve spatial relationships between distant regions of the image (e.g., cardiac silhouette relative to lung fields, lymph node distribution patterns). Pre-trained ViTs on medical imaging datasets are increasingly available and competitive with CNNs on many tasks.
3D architectures for volumetric data. CT and MRI are inherently 3D — they produce stacks of 2D slices that together form a 3D volume. 2D slice-level analysis loses the 3D contextual information that radiologists use when reading volumetric data. 3D CNNs (3D U-Net, for segmentation) and 3D attention mechanisms process volumetric data natively. The computational cost is substantially higher, and the training data requirements (annotated 3D volumes rather than 2D slices) are also higher.
Multi-task learning for related findings. Many imaging AI use cases involve related but distinct clinical findings — a lung nodule detection model might simultaneously detect and characterise nodules (solid, ground-glass, part-solid) and measure nodule size. Multi-task learning, where the model is trained to predict multiple outputs simultaneously, often improves performance on each individual task through shared representation learning. For regulatory documentation, multi-task architectures require careful intended use scoping — each clinical output requires its own performance validation and risk assessment.
Annotation: The Make-or-Break of Imaging AI
Inter-annotator agreement should be measured and reported. Medical image annotation is subjective. Two radiologists annotating the same dataset will not agree on every case. The degree of disagreement (quantified by Cohen's kappa, Fleiss' kappa, or the intraclass correlation coefficient, depending on the task) tells you how hard the annotation task is and puts a practical ceiling on the performance that can be meaningfully measured against those labels. If radiologists agree 85% of the time, a model that matches the reference label 90% of the time is performing above human consistency, a claim that requires careful clinical justification. If radiologists agree 70% of the time, a model that matches the reference label 75% of the time is roughly at the level of expert human performance.
Notified bodies reviewing clinical validation evidence for imaging AI increasingly expect to see inter-annotator agreement data for the ground truth labels. Without it, the clinical evaluation doesn't establish what "correct" performance looks like for the task.
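Raw percentage agreement and Cohen's kappa answer different questions, because kappa discounts the agreement two readers would reach by chance. The sketch below computes kappa for two annotators from first principles. The label lists are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: probability both annotators pick the same class independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical class throughout
    return (observed - expected) / (1.0 - expected)


# 20 hypothetical cases: 8 both-positive, 2 A-only, 1 B-only, 9 both-negative.
reader_a = [1] * 8 + [1] * 2 + [0] * 1 + [0] * 9
reader_b = [1] * 8 + [0] * 2 + [1] * 1 + [0] * 9
kappa = cohens_kappa(reader_a, reader_b)
```

In this example the readers agree on 17 of 20 cases (85%), but kappa is 0.70 because half of that agreement would be expected by chance alone. Reporting both figures avoids overstating how settled the ground truth is.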
Adjudication protocol for disagreements. When annotators disagree on a case, a defined adjudication process determines the ground truth label. Options: majority vote (where three or more annotators are used), adjudication by a designated senior clinician, or exclusion from the training/validation set for genuinely ambiguous cases. Each option has implications for the training dataset composition and the validation dataset. Document the adjudication protocol in your technical file.
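The adjudication options above can be combined into one rule: accept a strict majority when enough annotators have voted, and otherwise flag the case for senior review or exclusion. The following sketch assumes that combination. The function name, the minimum of three annotators, and the use of `None` as the escalation flag are illustrative choices.

```python
from collections import Counter


def adjudicate(votes, min_annotators=3):
    """Return the strict-majority label, or None if the case must be escalated.

    None means the protocol's fallback applies: senior-clinician adjudication
    or exclusion of the case as genuinely ambiguous.
    """
    if len(votes) < min_annotators:
        return None
    top_label, top_count = Counter(votes).most_common(1)[0]
    if top_count * 2 <= len(votes):
        return None  # tie or plurality only, no strict majority
    return top_label


# Hypothetical annotations for three cases.
clear_case = adjudicate(["present", "present", "absent"])
tied_case = adjudicate(["present", "present", "absent", "absent"])
thin_case = adjudicate(["present", "absent"])
```

Keeping escalated cases countable, for example by logging how often `None` is returned, gives the technical file a direct record of how much of the dataset depended on adjudication.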
Region-of-interest vs. whole-image annotation. For detection and localisation tasks (finding and localising a nodule, tumour, or lesion in an image), annotation requires delineation of the region of interest — typically a bounding box or a pixel-level segmentation mask. This annotation is significantly more time-consuming than binary classification labels. Annotation tools, quality controls, and the training of annotators to use the tools consistently are part of the training data development process that needs to be documented.
What Regulators Will Ask
Based on the AI diagnostic device submissions reviewed by notified bodies and FDA to date, the questions most consistently raised for imaging AI are:
On training data: How many distinct acquisition sites and scanner models are represented? What was the labelling methodology? What was the inter-annotator agreement? How was the train/validation/test split enforced at the patient level?
On performance evidence: What are the primary performance metrics, and were they pre-specified? What is performance on relevant clinical subgroups (sex, age, disease severity)? How does performance compare to radiologist performance on the same dataset? Is there external validation evidence (data from sites not involved in model development)?
On generalisation: What happens when the model receives images from a scanner not represented in training? How is out-of-distribution image quality detected and handled?
On explainability: Can the model's outputs be traced to identifiable image features? What visualisation or attribution techniques are used? How are these communicated to clinical users?
On post-market surveillance: How will performance degradation from scanner changes or population shifts be detected post-market?
On AI Act conformity (EU-bound programs): How is the training data governance documented? What are the human oversight provisions, the transparency information provided to deployers, and the AI-specific post-market monitoring under Article 72? For Class IIa+ medical device AI, the AI Act high-risk requirements apply on top of MDR/IVDR — and notified bodies are now probing for AI Act conformity alongside MDR clinical evaluation.
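The training-data question about enforcing the split at the patient level has a straightforward engineering answer: assign the split from the patient identifier, not from the study or image. The sketch below does this by hashing a hypothetical `patient_id` field. The salt, the split fractions, and the field names are assumptions.

```python
import hashlib


def assign_split(patient_id, salt="split-v1", val_frac=0.1, test_frac=0.2):
    """Deterministically map a patient to train, val, or test."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # pseudo-uniform value in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def split_studies(studies):
    """Group studies by split so that all studies of one patient stay together."""
    splits = {"train": [], "val": [], "test": []}
    for study in studies:
        splits[assign_split(study["patient_id"])].append(study)
    return splits


# Hypothetical manifest: 200 studies from 50 patients (several studies each).
studies = [{"patient_id": f"P{i % 50}", "study_uid": f"S{i}"} for i in range(200)]
splits = split_studies(studies)
```

Because the assignment depends only on the patient identifier and the salt, a patient's follow-up scans land in the same split even when they are added later. This addresses leakage within a dataset only. External validation still requires holding out entire sites.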
The External Validation Requirement
Notified bodies are increasingly requiring external validation evidence — performance data from at least one site that had no involvement in model development — for Class IIa and above imaging AI devices. For devices without this, PMCF studies providing external validation evidence are expected as a condition of initial CE marking.
Planning external validation into your development programme — not as a post-hoc addition after discovering the notified body expects it — is what allows an efficient regulatory timeline. The external validation site needs data sharing agreements, ethics approval for the validation study, and sufficient patient volume in your clinical category. These take time to establish.
For programmes planning ongoing model improvement post-launch (including expansion to new scanner families, new clinical indications, or refined performance), a Predetermined Change Control Plan (PCCP) pre-authorised at the time of initial submission is increasingly the right structural choice. FDA finalised the PCCP framework in December 2024, and EU regulators are converging on similar principles. For imaging AI specifically, where new scanner models and improved acquisition protocols emerge continuously, a well-designed PCCP turns what would otherwise be cascading conformity assessment cycles into a manageable change-management workstream.
Where Vector Labs Fits
Vector Labs builds imaging AI systems for regulated clinical applications, with experience across cardiac imaging, oncology, and women's health. We work with imaging AI teams at three points: data and annotation strategy (multi-scanner data acquisition, annotation protocols, adjudication, inter-rater documentation), model and pipeline engineering (architecture selection, normalisation, multi-site validation, calibration), and regulatory execution (clinical evaluation, technical file construction, external validation studies, PCCP design).
If you're developing a diagnostic imaging AI and want to understand the regulatory evidence requirements before you commit to a study design, get in touch at vector-labs.ai.
FAQs
For a commercially deployed imaging AI, a training dataset covering at least 5–10 distinct scanner manufacturers, models, or acquisition protocols in each target imaging modality is a reasonable minimum. Submissions with single-vendor training data face increasing notified body scrutiny. Scanner diversity matters more for CT and MRI (large acquisition parameter variability) than for mammography or pathology (more standardised acquisition).
ImageNet-pretrained weights can be used for medical imaging, and fine-tuning from them works better than training from scratch on small medical datasets. However, medical-imaging-specific pretrained models (RadImageNet, CheXpert-pretrained, MedSAM, BiomedCLIP) typically outperform ImageNet-pretrained models on medical tasks. The performance gap matters more for tasks with smaller training datasets. Document the pretraining source in the technical file regardless.
Whether 2D or 3D models are the right choice for CT and MRI depends on the task. For findings that exist in a single slice (focal lesion classification, slice-level detection), 2D models are often sufficient and computationally cheaper. For findings that require spatial context across slices (organ volume measurement, vessel tracking, 3D lesion segmentation), 3D models capture context that 2D models miss. Hybrid 2.5D approaches (using a few adjacent slices) bridge the two with moderate computational cost.
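The 2.5D idea amounts to stacking a slice with its neighbours as input channels. The sketch below shows only the indexing. It assumes edge slices are handled by repeating the boundary slice, which is one common choice among several, and the function names are illustrative.

```python
def adjacent_slice_indices(num_slices, centre, context=1):
    """Indices of the slices stacked as channels for one 2.5D input.

    Indices are clamped at the volume edges, so the first and last slices
    repeat themselves instead of reading outside the volume.
    """
    if not 0 <= centre < num_slices:
        raise IndexError("centre slice is outside the volume")
    return [
        min(max(centre + offset, 0), num_slices - 1)
        for offset in range(-context, context + 1)
    ]


def stack_2_5d(volume, centre, context=1):
    """Collect the centre slice and its neighbours from a list of 2D slices."""
    return [volume[i] for i in adjacent_slice_indices(len(volume), centre, context)]


# Hypothetical 10-slice volume; each entry stands in for a 2D pixel array.
volume = [f"slice-{i}" for i in range(10)]
mid_stack = stack_2_5d(volume, centre=5)
edge_stack = stack_2_5d(volume, centre=0)
```

With `context=1` the model sees three channels per input, so a standard 2D backbone can be reused unchanged while still receiving some through-plane context.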
For diagnostic tasks where the disease can be confirmed by a definitive test, the definitive test is the gold standard — pathology for cancer detection, cath lab for cardiac findings, follow-up imaging or outcome data for ambiguous cases. Where definitive confirmation isn't available, multi-rater consensus annotation by board-certified radiologists with documented inter-rater agreement is the next-best ground truth. Single-annotator labels are increasingly unacceptable for regulatory submission.
Each modality has its own preprocessing, normalisation, and architectural considerations. Cross-modality models (one model for multiple modalities) are typically weaker than modality-specific models. For regulatory submission, the intended use should specify the modalities the device is validated for, and validation evidence should be provided per modality. Extending to a new modality typically requires a separate clinical evaluation and may trigger a new conformity assessment.
Cohen's kappa values for radiologist agreement on common imaging tasks typically fall in the 0.6–0.8 range for clear binary findings (e.g., presence or absence of consolidation on chest X-ray) and 0.4–0.6 for grading tasks or ambiguous findings (e.g., BI-RADS assessment, nodule character). Disagreement is not a defect — it's the inherent uncertainty in the task. Reporting it transparently in the clinical evaluation is more defensible than pretending the ground truth is unambiguous.
Validating on rare findings requires targeted enrichment of the validation set. For a finding with 1% prevalence, a representative sample of 5,000 cases contains only about 50 positives, so meaningful sensitivity estimates would need 5,000+ cases. Instead, enrich the validation set with cases known to have the rare finding (with appropriate adjustment of reported metrics) and report performance separately for the rare-finding subgroup. The intended-use statement should be precise about the prevalence range in which the device is validated.
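Two calculations make the enrichment trade-off concrete: the confidence interval on sensitivity for a given number of positive cases, and the positive predictive value recomputed at the intended-use prevalence. The sketch below uses the Wilson score interval and Bayes' rule. The example counts and operating point are invented, and the adjustment assumes the enriched positives are representative of positives seen in practice.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def ppv_at_prevalence(sensitivity, specificity, prevalence):
    """Positive predictive value at a stated prevalence, via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical result: 45 of 50 rare-finding cases detected.
sens_low, sens_high = wilson_interval(45, 50)

# Same hypothetical operating point, enriched set versus 1% real-world prevalence.
ppv_enriched = ppv_at_prevalence(0.90, 0.95, prevalence=0.50)
ppv_deployed = ppv_at_prevalence(0.90, 0.95, prevalence=0.01)
```

With 50 positives, an observed sensitivity of 90% carries an interval of roughly 0.79 to 0.96, which shows why the positive count matters more than the total case count. The same operating point gives a PPV of about 0.95 in a 50% enriched set but only about 0.15 at 1% prevalence, so predictive values taken directly from an enriched set should not be reported as deployment figures.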
A model cannot be assumed to work on scanners absent from its training data without validation. Performance on new scanners is unpredictable in advance: sometimes near-equivalent, sometimes dramatically worse. For regulatory deployment, validation evidence is required for each scanner family the device is intended to be used with. Expanding the validated scanner list typically requires additional clinical evaluation data and may trigger a technical file update or notified body notification. A Predetermined Change Control Plan (PCCP) can pre-authorise expansion to additional scanner families if the validation methodology is pre-specified.

