Your radiology department receives an email from an AI vendor claiming 96% accuracy on chest X-rays. Your procurement team asks: accurate at what? For which pathologies? On which patient population?
These questions matter more than the vendor realizes.
I've spent the last five years validating AI models across hospital networks, building deployment pipelines, and reviewing the clinical studies that get cited in vendor contracts. The gap between what marketing teams claim and what hospital radiologists actually need is enormous, and it's costing healthcare systems millions in poorly performing deployments.
This is the validation checklist I'd personally use before signing any AI radiology contract.
The Four Types of Accuracy Claims (And Why Most Are Misleading)
When a vendor says their model is "96% accurate," they're almost certainly reporting one of four things. Three of them tell you almost nothing about whether the system will work in your hospital.
Metric 1: Overall accuracy on their internal test set. This is the least useful metric. Their internal test set was likely drawn from the same distribution as their training data, meaning the model has already seen close variations of these cases. In my experience deploying these models across hospital networks, internal test accuracy overstates real-world performance by 4–8 percentage points. They're measuring something real, but not what matters to your radiology department.
Metric 2: Accuracy on a public benchmark. This is better, but still limited. Public chest X-ray benchmarks like CheXpert or MIMIC-CXR are fixed datasets. A model that achieves 94% accuracy on CheXpert might perform at 89% when you deploy it in your hospital, where your X-ray systems are different models, your technicians follow different acquisition protocols, and your patient population is different. When we were validating the chest X-ray engine at Fractify, we noticed accuracy dropped 2–3 percentage points moving from CheXpert to our own multi-hospital dataset, not because the model was poor, but because real clinical data is messier than benchmarks.
Metric 3: Sensitivity alone. Many vendors report sensitivity ("true positive rate") and bury or omit specificity ("true negative rate"). For a hospital, specificity is often more important. A system that flags 95% of actual pneumothorax cases but also flags 20% of normal scans as positive will overwhelm your radiologists with false alarms. You need both numbers, and you need them for each pathology the system claims to detect.
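To see why the false alarm rate dominates the radiologist's experience, it helps to run the arithmetic. The sketch below takes the 95% sensitivity and 20% false positive rate from the example above and adds one assumption of my own: a 2% prevalence of pneumothorax among the studies read. The function name `alert_burden` is illustrative, not part of any vendor tool.

```python
def alert_burden(sensitivity, false_positive_rate, prevalence, n_studies=1000):
    """Expected true and false alerts in n_studies, plus positive predictive value."""
    positives = n_studies * prevalence
    negatives = n_studies - positives
    true_alerts = sensitivity * positives
    false_alerts = false_positive_rate * negatives
    ppv = true_alerts / (true_alerts + false_alerts)
    return true_alerts, false_alerts, ppv

# 95% sensitivity and a 20% false positive rate come from the example above.
# The 2% prevalence is an assumed, illustrative figure.
true_alerts, false_alerts, ppv = alert_burden(0.95, 0.20, 0.02)
print(f"true alerts per 1000 studies:  {true_alerts:.0f}")
print(f"false alerts per 1000 studies: {false_alerts:.0f}")
print(f"share of alerts that are real: {ppv:.1%}")
```

Under that assumed prevalence, roughly 19 real alerts are buried among about 196 false ones, so fewer than one alert in ten is a true finding. Substitute your own department's prevalence to get a figure that reflects your case mix.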
Metric 4: External, multi-site validation on a held-out test set from diverse sources. This is the metric that actually predicts deployment performance. If a vendor has validation from three independent hospitals with different equipment, different scan protocols, and different patient demographics, and the model maintains 97% sensitivity and 96% specificity across all three sites, you have real evidence of performance. This is what we pursued with Fractify's 97.9% brain MRI tumor detection accuracy and 97.7% bone fracture detection accuracy—not because marketing demanded it, but because radiologists need to trust the system they're putting into their PACS workflow.
Ask your vendor explicitly: "Was your validation dataset independent of your training data? Did you validate on multiple hospital sites with different equipment?" If the answer is no, or if they're vague, their accuracy claim is not deployment-grade evidence.
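One reason to insist on per-site numbers is that a pooled figure can hide a weak site. The following is a minimal sketch with made-up counts for three hypothetical sites; the names `SITE_COUNTS`, `pooled_rates`, and `sites_below`, and the 95% thresholds, are assumptions for illustration only.

```python
# Hypothetical per-site validation counts: (tp, fn, tn, fp).
SITE_COUNTS = {
    "site_a": (194, 6, 1920, 80),
    "site_b": (97, 3, 960, 40),
    "site_c": (43, 7, 450, 50),
}

def rates(tp, fn, tn, fp):
    """Return (sensitivity, specificity) for one set of counts."""
    return tp / (tp + fn), tn / (tn + fp)

def pooled_rates(site_counts):
    """Sum counts over all sites first, then compute the rates."""
    totals = [sum(column) for column in zip(*site_counts.values())]
    return rates(*totals)

def sites_below(site_counts, min_sensitivity, min_specificity):
    """Names of sites that miss either threshold on their own data."""
    failing = []
    for name, counts in site_counts.items():
        sensitivity, specificity = rates(*counts)
        if sensitivity < min_sensitivity or specificity < min_specificity:
            failing.append(name)
    return failing

pooled_sens, pooled_spec = pooled_rates(SITE_COUNTS)
failing = sites_below(SITE_COUNTS, min_sensitivity=0.95, min_specificity=0.95)
print(f"pooled: sensitivity={pooled_sens:.1%} specificity={pooled_spec:.1%}")
print("sites below threshold:", failing)
```

With these invented counts the pooled result clears 95% on both metrics, yet `site_c` sits at 86% sensitivity and 90% specificity on its own. If your hospital resembles the smallest site in a vendor's study, the pooled number is not the one that applies to you.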
The Specificity Question That Most Procurement Teams Don't Ask
Here's the honest truth: I haven't seen enough data to say definitively whether sensitivity or specificity drives higher clinical adoption, because it depends on the workflow. For urgent pathologies like intracranial hemorrhage or aortic dissection, radiologists want high sensitivity—they can tolerate false alarms because the cost of missing a bleed or dissection is catastrophic. For lower-stakes findings like bone fractures or certain liver lesions, false alarms waste radiologist time and erode trust.
This workflow dependency is why you need to see the confusion matrix, not just a single accuracy percentage. A confusion matrix shows you exactly how many cases the AI system misclassified, broken down by type. If the system detected 98% of tension pneumothorax cases but also called 5% of normal scans as pneumothorax, you know what you're working with. You can make an informed decision about whether that's acceptable for your department.
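If a vendor hands over case-level results rather than a finished table, you can build the confusion matrix yourself. This is a minimal sketch, assuming each study has a ground-truth label and an AI output expressed as booleans; the data is hypothetical and chosen to match the 98% / 5% example above.

```python
from collections import Counter

def confusion_counts(truth, predicted):
    """Count TP, FP, FN, TN from paired boolean labels (True = finding present)."""
    counts = Counter()
    for actual, flagged in zip(truth, predicted):
        if actual and flagged:
            counts["tp"] += 1
        elif flagged:
            counts["fp"] += 1
        elif actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

def sensitivity_specificity(counts):
    """Derive both rates from the four confusion matrix cells."""
    sensitivity = counts["tp"] / (counts["tp"] + counts["fn"])
    specificity = counts["tn"] / (counts["tn"] + counts["fp"])
    return sensitivity, specificity

# Hypothetical reads: 50 confirmed pneumothorax cases, then 1000 normal studies.
truth = [True] * 50 + [False] * 1000
predicted = [True] * 49 + [False] * 1 + [True] * 50 + [False] * 950

counts = confusion_counts(truth, predicted)
sens, spec = sensitivity_specificity(counts)
print(dict(counts))
print(f"sensitivity={sens:.1%} specificity={spec:.1%}")
```

The four raw counts are worth keeping alongside the percentages. One missed case out of 50 and 50 false alarms out of 1000 are easier for a department to reason about than "98% and 95%".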
Expert Insight: Validation Data Must Span Your Actual Equipment
Fractify's 18+ pathology detection capability on chest X-rays was validated across GE, Philips, Canon, and Fujifilm systems, the four most common vendors in North American hospitals. A model claiming 97% accuracy on chest X-rays that was trained only on GE equipment may degrade significantly when deployed on your Philips system. Ask for validation results broken down by equipment manufacturer.
Prior-Study Comparison Reveals Whether the AI Actually Understands Imaging
The best AI systems don't just look at a single scan in isolation. They compare the current study to prior scans from the same patient. This is how radiologists work. A finding might be innocuous on its own but critical if it's grown since the prior study six months ago.
Ask your vendor: "Does your system support prior-study comparison? Can it flag interval changes?" If the answer is no, you're deploying a system that's blindfolded compared to your current workflow. Fractify's architecture integrates DICOM-native prior comparison, meaning the system automatically retrieves and processes prior scans from your PACS, and its output flags interval changes with urgency scoring.
Equally important: can you see why the AI made its prediction? Request access to Grad-CAM heatmaps or other explainability outputs. Radiologists need to understand the model's reasoning, especially when the AI disagrees with the initial interpretation. A system that shows you which image regions drove its decision earns radiologist trust. A black box does not.
The Deployment Reality That Benchmark Accuracy Doesn't Capture
Benchmark studies measure accuracy under ideal conditions. Real deployment adds friction.
Can the system handle your DICOM variant? Does it integrate natively with your PACS over DICOM and with your reporting systems via HL7/FHIR, or does it require manual image export and re-import? Will radiologists need to log into a separate web portal, or does the AI alert appear in their existing workflow? Does the system support role-based access control (RBAC) so you can manage who sees AI outputs in different departments?
I've seen hospitals deploy high-accuracy AI systems that ended up unused because radiologists had to step out of their PACS workflow to use them. A 98% accurate system that requires five extra clicks per case will be abandoned by shift two. A 94% accurate system integrated seamlessly into PACS gets used because it adds value without adding friction.
Before signing, request a technical integration assessment: deployment on your actual PACS, with your actual data, integrated into your radiologist workflows. Have your IT team run this assessment. The vendor's accuracy metrics matter only if the system actually gets used.
Six Critical Conditions Where Accuracy Varies Most
Different pathologies have dramatically different detection accuracies, and vendors often hide this variation behind an overall percentage.
| Pathology | Typical Benchmark Sensitivity | Real-World Variability | Clinical Consequence of Miss |
|---|---|---|---|
| Intracranial Hemorrhage | 94–98% | High variability; subtypes (epidural, subdural, subarachnoid) perform differently | Catastrophic; urgent neurosurgery |
| Acute Stroke (early ischemic changes) | 72–88% | High variability; depends on scan timing and patient factors | Severe disability; limited treatment window |
| Aortic Dissection | 90–96% | Moderate variability; depends on imaging protocol | Catastrophic; immediate cardiothoracic surgery |
| Tension Pneumothorax | 96–99% | Low variability; anatomically clear | Severe; requires immediate decompression |
| Bone Fractures | 94–98% | Moderate variability; hairline fractures harder to detect | Varies; most are not surgical emergencies |
| Pulmonary Nodules | 85–92% | Highest variability; small nodules (<6mm) frequently missed | Cancer detection delay; depends on nodule type |
Notice the pattern: life-threatening findings have higher accuracy requirements because the cost of a miss is so high. A system that detects acute stroke at 81% sensitivity might be appropriate for initial screening, but not as a replacement for radiologist review.
Your contract should specify minimum acceptable sensitivity and specificity for each pathology you're deploying the AI system to detect. If Fractify detects intracranial hemorrhage at 97.9% sensitivity and breaks that out across six ICH categories (epidural, subdural, subarachnoid, traumatic, nontraumatic, and microhemorrhage), you have subtype-level detail, backed by high sensitivity, that protects against missed life-threatening bleeds. Ask for the matching specificity figure for each category as well.
The Honest Caveat: When You Shouldn't Deploy AI Radiology
My take: AI radiology is powerful, but it's not universally appropriate for every hospital right now. If your radiologist workforce is stretched and you're deploying AI to replace radiologist review, you're making a mistake. If your department doesn't have the IT infrastructure to integrate DICOM and HL7/FHIR, deployment will be painful and adoption will be low. If your radiologists haven't had prior exposure to AI outputs and you're not providing training before launch, they'll resist the system or ignore its outputs, and both waste money.
AI radiology works best when it's deployed to augment, not replace. It works best when your IT infrastructure is solid. It works best when radiologists are trained and given time to build trust in the system's outputs. Before signing a contract, honestly assess whether your hospital meets these conditions.
The Contract Itself: What to Demand in Writing
Independent Validation Report
Multi-site external validation with sensitivity/specificity for each pathology. Published peer-reviewed study preferred. Databoost Sdn Bhd (Fractify's parent company) publishes validation results in medical journals, not just marketing docs.
DICOM and HL7/FHIR Integration Specification
Technical documentation showing PACS integration approach, data flow, and whether prior-study comparison is supported natively or requires external API calls.
Deployment Success Metrics
Agreed-upon accuracy targets for your specific use case, with measurable KPIs tracked in production. Example: "Detect intracranial hemorrhage at ≥97% sensitivity, ≥95% specificity on our imaging equipment within 30 days of deployment."
Training and Change Management Plan
How radiologists will be trained, how long the "learning curve" period lasts, and how the vendor supports adoption. Your radiologists need to trust the system, and that takes time.
Performance Guarantee or SLA
What happens if real-world performance falls below agreed targets? Does the vendor offer a refund, a discount, or continued support until performance targets are met?
Data Governance and Compliance Addendum
HIPAA compliance (if US-based), data residency (where your scans are processed), and an explicit statement of whether anonymized data is used for model improvement. Radiologists need assurance that patient data isn't being used to train competing vendors' models.
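One practical note on the Deployment Success Metrics item above: a target like "≥97% sensitivity within 30 days" can only be verified if enough positive cases arrive in that window. The sketch below uses the Wilson score interval, which is my choice of method and not something the contract language prescribes, to show the lowest sensitivity that is still statistically consistent with an observed result. The case counts are hypothetical.

```python
import math

def wilson_lower_bound(successes, total, z=1.96):
    """Lower end of the Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if total == 0:
        return 0.0
    p = successes / total
    denominator = 1 + z**2 / total
    center = p + z**2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - margin) / denominator

# Hypothetical 30-day tallies: both detect 97.5% of confirmed hemorrhages.
small_window = wilson_lower_bound(successes=39, total=40)
large_window = wilson_lower_bound(successes=390, total=400)
print(f"40 positive cases:  lower bound {small_window:.1%}")
print(f"400 positive cases: lower bound {large_window:.1%}")
```

Both tallies show the same observed sensitivity, but with only 40 positive cases the data cannot rule out a true sensitivity well below the 97% target. It is worth agreeing in the contract on a minimum number of confirmed positive cases, or a longer measurement window, before the SLA is judged.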
Questions to Ask in Your Pre-Contract Conversation
Before your procurement team signs, run through this set of questions with the vendor's clinical and technical leads. Pay attention to vagueness or deflection—it's often a sign that the vendor doesn't have rigorous validation to back up their claims.
"On which datasets was your model trained? How many studies?" They should know exact numbers. If they say "proprietary," ask what peer-reviewed publications validate the training approach.
"Can you show us the sensitivity and specificity for each pathology you claim to detect?" They should have this in a table format. If they only give overall accuracy, ask why pathology-specific metrics aren't available.
"How did performance change when you tested on equipment different from training data?" A rigorously validated model will show results across vendors. A model that only performs well on one vendor's equipment will degrade in your hospital.
"What's your false positive rate on truly normal studies?" This is the complement of specificity (false positive rate = 1 − specificity). High sensitivity paired with a high false positive rate destroys radiologist adoption.
"Can radiologists see the AI's reasoning? Are Grad-CAM heatmaps available?" Explainability is non-negotiable for clinical trust. Radiologists need to understand why the AI flagged something.
"How is prior-study comparison handled? Is it DICOM-native or API-based?" DICOM-native integration is faster and more reliable in a clinical PACS.
"What's your integration timeline with our specific PACS vendor and version?" Get a written commitment. I've seen integrations promised in weeks and delivered in months.
"What training does your team provide to our radiologists?" And how long does the initial learning period typically last before radiologists are comfortable with the system?
"What happens to our data? Is it anonymized immediately? Is it used to improve your model?" Data governance is a compliance and trust issue.
"What's your performance guarantee? If real-world accuracy is lower than validation results, what recourse do we have?" If the vendor won't commit to performance SLAs in writing, the validation results don't mean much.
The Evidence Trail: Where to Look Beyond Marketing
Don't rely on the vendor's slides. Here's where to find independent evidence:
Peer-reviewed medical journals: Radiology, European Radiology, JAMA Radiology, American Journal of Neuroradiology. Search PubMed for the vendor's clinical validation studies. If they claim 97.9% accuracy on brain MRI tumors, there should be a published study showing the methodology and results. Ask the vendor: "Which of your accuracy claims are published in peer-reviewed journals?"
DICOM standards documentation: Understanding DICOM integration requirements is technical, but critical. The DICOM Standard website lists all official attributes and how image data must be structured. If your vendor claims DICOM compliance but you find they're actually exporting images and re-importing them, that's a red flag for integration complexity.
Healthcare IT publications: Journal of the American Medical Informatics Association (JAMIA) and Healthcare IT News often publish vendor implementations and real-world case studies. These are far more honest than vendor press releases.
WHO workforce reports: The WHO's reports on health professions education and capacity-building describe radiology workforce gaps and show that most countries lack sufficient radiologists. This context matters: AI is filling a real need, but that need shouldn't pressure you into deploying systems with unproven real-world performance.
Pull the validation study cited in the contract and read the methodology section. Did they use an independent test set? Did they validate on multiple sites? Did they report both sensitivity and specificity? How did they handle borderline cases? A rigorous validation study will answer all these questions explicitly.
The Path Forward: From Vendor Claims to Confidence
The goal isn't to distrust AI vendors. The goal is to move from marketing claims to evidence. Fractify exists because radiologists deserve AI systems they can trust, built on rigorous validation, integrated seamlessly into their workflows, and backed by transparent accuracy metrics.
When you see a vendor's accuracy claim, the question isn't "Is this number real?" It's "What does this number actually measure, and does it predict how the system will perform in my hospital?" The checklist above answers that question.
Run the validation assessment. Examine the technical integration. Talk to radiologists at hospitals using the system. Get performance SLAs in writing. Insist on multi-site, equipment-diverse validation. And remember: the most expensive AI system is one that doesn't get used because it doesn't fit your workflow or radiologists don't trust its outputs.
What's the difference between sensitivity and specificity, and why does it matter for AI radiology?
Sensitivity is the percentage of actual abnormal cases the AI detects (true positive rate). Specificity is the percentage of normal cases correctly identified as normal (true negative rate). High sensitivity without high specificity causes false alarms and radiologist distrust. High specificity without high sensitivity causes missed findings. You need both metrics for each pathology to evaluate deployment readiness.
Should we demand validation on our specific PACS system before contract signing?
Ideally yes, but realistically, request a pre-deployment technical assessment on your actual PACS and equipment. Vendors should provide integration documentation specifying how their system connects to your PACS version, what DICOM variants they support, and how prior studies are retrieved. Demand written timelines for integration completion.
What does "external validation" mean, and why is it better than internal test set accuracy?
External validation tests the AI model on data it has never seen, ideally from different hospitals with different equipment. This exposes overfitting (the model memorizing quirks of its training data) and predicts real-world performance more accurately. Internal test accuracy often overstates real-world performance by 4–8 percentage points.
How can we verify a vendor's published accuracy claims?
Check whether the validation study is peer-reviewed and published in a reputable medical journal. Read the methodology to confirm: Was the test set truly independent? Were multiple hospital sites included? Were both sensitivity and specificity reported? What was the study's sample size? Peer review adds a level of methodological scrutiny that marketing claims never receive.
What performance guarantees should be in our contract with the AI vendor?
Include specific, measurable SLAs: minimum sensitivity and specificity for each pathology, measured on your equipment within 30 days of deployment. Add a clause specifying vendor responsibility if real-world performance falls below agreed targets: refund, discount, or continued support at no cost until targets are met.
Does AI radiology accuracy improve over time with deployment?
Possibly, depending on the vendor's model update strategy. Ask whether the vendor improves their model using data from your hospital (with your consent and proper anonymization). Some vendors freeze their model at deployment; others continuously improve. Understand which model you're contracting for, and whether updates are included in your licensing agreement.
How do we evaluate whether AI radiology is appropriate for our specific hospital?
Assess your IT infrastructure (PACS integration capability), radiologist readiness (have they worked with AI systems before?), staffing pressure (is AI meant to augment or replace?), and clinical priorities (which pathologies matter most to your department?). AI works best as augmentation in well-staffed, well-integrated departments, not as a solution to staffing shortages.
What should we ask about data privacy and how our scans are used?
Require clarity on when and how patient data is anonymized, where scans are processed (on-premise or cloud), whether anonymized data is used for model improvement, and how the vendor complies with HIPAA (US) or GDPR (EU). Data governance must be contractually explicit. Patient trust depends on knowing their imaging data isn't being shared without consent.