AI Analysis: Diagnostic Accuracy & Reproducibility
Exploratory Evaluation of Diagnostic Accuracy and Temporal Reproducibility of Multimodal Large Language Models in the Image-Based Assessment of Oral Mucosal Lesions
This analysis investigates the performance and consistency of leading multimodal large language models (LLMs) in diagnosing oral mucosal lesions from clinical images. We evaluate their diagnostic accuracy across various lesion types and assess the temporal stability of their outputs over multiple testing cycles. Our findings provide critical insights into the readiness of LLMs for clinical integration, highlighting both their potential and the imperative for robust reproducibility.
Executive Impact: Enhancing Clinical Diagnostics with AI
Understanding the capabilities and limitations of AI in medical imaging is crucial for enterprise adoption. This study reveals significant performance disparities and temporal variability among LLMs, underscoring the need for careful validation and strategic integration to ensure reliable diagnostic support.
Deep Analysis & Enterprise Applications
The sections below present the study's specific findings, reframed as enterprise-focused modules.
Gemini (78% overall accuracy) significantly outperformed ChatGPT (55-57%) and Perplexity (28-31%) in image-based oral lesion diagnosis, demonstrating strong visual interpretation capabilities for specific lesion categories, particularly normal anatomy/variations (88-92%) and oral cancer (68-84%).
| Feature | ChatGPT | Gemini | Perplexity |
|---|---|---|---|
| Overall Accuracy (Cycles 2 & 3) | 55-57% | 78% | 28-31% |
| Intra-Model Agreement (Fleiss' κ) | Moderate (κ=0.525) | Fair (κ=0.338) | Fair (κ=0.409) |
| Responses Correct Across All 3 Cycles | 39% | 51% | 16% |
| Strengths in Subgroups | Moderate for Benign Lesions (S2), Substantial for OPMDs (S3), Fair for Oral Cancer (S4) | Highest in Normal Anatomy (S1), OPMDs (S3), Oral Cancer (S4); Similar to ChatGPT in Benign Lesions (S2) | Lowest accuracy across all subgroups |
| Weaknesses | Lower overall accuracy compared to Gemini, lower consistency for Oral Cancer | Lower overall consistency compared to ChatGPT | Very low accuracy and consistency across all measures |
Despite achieving the highest diagnostic accuracy, Gemini's outputs showed only fair reproducibility over time (κ=0.338). This temporal variability, common across all models (ranging from fair to moderate agreement), highlights a significant challenge for consistent clinical application in real-world scenarios.
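Fleiss' κ, reported above for intra-model agreement, can be computed directly from a subjects-by-categories count matrix. A minimal NumPy sketch, treating each of the three cycles as one "rater" per image (the function and the toy matrix below are illustrative, not data from the study):

```python
import numpy as np

def fleiss_kappa(ratings: np.ndarray) -> float:
    """Fleiss' kappa for an N-subjects x K-categories count matrix,
    where ratings[i, j] is the number of raters (here: cycles) that
    assigned subject i to category j. Each row must sum to n raters."""
    n = ratings.sum(axis=1)[0]           # raters (cycles) per subject
    N = ratings.shape[0]                 # number of subjects (images)
    p_j = ratings.sum(axis=0) / (N * n)  # overall category proportions
    P_i = ((ratings ** 2).sum(axis=1) - n) / (n * (n - 1))
    P_bar = P_i.mean()                   # mean observed agreement
    P_e = (p_j ** 2).sum()               # expected chance agreement
    return float((P_bar - P_e) / (1 - P_e))

# Toy example: 4 images, 2 diagnostic categories, 3 cycles each;
# every image received the same diagnosis in all three cycles.
perfect = np.array([[3, 0], [0, 3], [3, 0], [0, 3]])
kappa = fleiss_kappa(perfect)  # perfect agreement yields kappa = 1.0
```

By the common Landis-Koch convention, κ in 0.21-0.40 reads as "fair" and 0.41-0.60 as "moderate", which matches the labels in the table above.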
Navigating AI in Oral Medicine: From Support to Validation
Multimodal LLMs offer promising diagnostic support for oral mucosal lesions, particularly for visually distinct conditions like normal variations and advanced cancers. However, their observed temporal variability and inconsistent performance across different cycles, even with identical inputs, mean they should be viewed as supportive tools rather than independent diagnostic systems. Clinicians must interpret LLM outputs in conjunction with their own expertise. The stochastic nature of LLMs and continuous model updates contribute to this variability, necessitating ongoing validation and transparency for safe and effective clinical integration. Further research is crucial to enhance stability and consistency in real-world settings.
Enterprise Process Flow: Temporal Reproducibility Evaluation
The study protocol involved re-evaluating 100 anonymized clinical images of oral lesions across three commercially available LLMs (ChatGPT-5.1 Plus, Gemini 3 Pro, Perplexity Pro) using a standardized prompt. This entire process was repeated in three independent cycles, each one month apart, to rigorously assess temporal reproducibility and diagnostic stability.
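The protocol above can be sketched as a small evaluation harness. Everything here is illustrative: the prompt text, the stub model, and scoring by exact label match stand in for the study's actual standardized prompt and expert reference diagnoses:

```python
from typing import Callable, Dict, List

PROMPT = "Provide the single most likely diagnosis for this oral lesion image."

def run_cycle(models: Dict[str, Callable[[str, bytes], str]],
              images: List[bytes], truth: List[str]) -> Dict[str, float]:
    """One evaluation cycle: query every model on every image with the
    same standardized prompt and score against the reference diagnosis."""
    accuracy = {}
    for name, ask in models.items():
        correct = sum(ask(PROMPT, img) == label
                      for img, label in zip(images, truth))
        accuracy[name] = correct / len(images)
    return accuracy

# The study repeated this in three independent cycles, one month apart.
# A constant stub stands in for a real multimodal API call here.
stub = lambda prompt, img: "leukoplakia"
images = [b"img1", b"img2"]
truth = ["leukoplakia", "lichen planus"]
results = [run_cycle({"stub-model": stub}, images, truth) for _ in range(3)]
```

Keeping the prompt, image set, and scoring fixed across cycles is what isolates temporal variability: any change in per-cycle accuracy is attributable to the model, not the protocol.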
Advancing Robust AI for Clinical Diagnostics
This study's findings, while significant, are subject to certain limitations that guide future research. The dataset size (100 images) from a single center, though adequate for detecting moderate-to-large differences post-hoc, limits generalizability. The use of varied image acquisition conditions, while reflective of real-world scenarios, may have influenced diagnostic accuracy, but not temporal variability. Crucially, the evaluation focused solely on visual features, intentionally omitting clinical history to isolate LLM image interpretation. Future studies should incorporate larger, multicenter datasets, standardize imaging conditions where appropriate, and integrate structured clinical information to evaluate combined diagnostic performance, ensuring a more comprehensive and robust AI solution for oral medicine.
Calculate Your Potential AI Impact
Estimate the ROI your enterprise could achieve by integrating advanced AI for diagnostic support and operational efficiency.
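As a rough illustration of the kind of estimate such a calculator performs, the sketch below values clinician time saved against annual AI cost. All parameter names and figures are assumptions to be replaced with your own data:

```python
def diagnostic_ai_roi(cases_per_year: int,
                      minutes_saved_per_case: float,
                      clinician_hourly_cost: float,
                      annual_ai_cost: float) -> float:
    """Illustrative ROI estimate: clinician time saved, valued at the
    clinician's hourly cost, net of the annual AI spend, expressed as a
    multiple of that spend (1.0 = 100% return)."""
    savings = (cases_per_year * (minutes_saved_per_case / 60)
               * clinician_hourly_cost)
    return (savings - annual_ai_cost) / annual_ai_cost

# Example assumptions: 5,000 cases/year, 4 minutes saved per case,
# $120/hour clinician time, $20,000/year AI cost.
roi = diagnostic_ai_roi(5000, 4, 120, 20000)  # -> 1.0 (100% return)
```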
Your AI Implementation Roadmap
A phased approach ensures seamless integration and maximum value from your AI investment.
Discovery & Strategy
In-depth analysis of your current diagnostic workflows, data infrastructure, and specific clinical challenges. Define clear objectives and success metrics for AI integration in oral medicine.
Pilot & Validation
Develop and deploy a pilot AI solution using your clinical images. Rigorous testing against expert diagnoses and evaluation of reproducibility across different datasets and models.
Integration & Training
Seamlessly integrate the validated AI models into your existing PACS or EMR systems. Provide comprehensive training for your clinical team on AI-assisted diagnostics.
Monitoring & Optimization
Continuous monitoring of AI performance and user feedback. Iterative improvements and model updates to adapt to evolving clinical needs and data patterns, ensuring long-term reliability.
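The monitoring phase above could be backed by something as simple as a rolling-window accuracy tracker that flags when agreement with expert diagnoses degrades. This is an illustrative sketch; the window size and threshold are arbitrary choices, not values from the study:

```python
from collections import deque

class AccuracyMonitor:
    """Rolling-window tracker of model/expert diagnostic agreement,
    for post-deployment monitoring of an AI diagnostic aid."""
    def __init__(self, window: int = 100, threshold: float = 0.70):
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, model_dx: str, expert_dx: str) -> None:
        self.results.append(model_dx == expert_dx)

    def accuracy(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def needs_review(self) -> bool:
        # Trigger re-validation once the window is full and rolling
        # accuracy has dropped below the configured threshold.
        return (len(self.results) == self.results.maxlen
                and self.accuracy() < self.threshold)
```

Given the temporal variability documented in this study, a drop in rolling accuracy after a vendor-side model update is exactly the signal this kind of monitor is meant to catch.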
Ready to Transform Your Diagnostic Capabilities?
Our experts are ready to guide you through the complexities of AI adoption. Book a personalized strategy session to explore how our solutions can enhance precision and efficiency in your enterprise.