First, do NOHARM: towards clinically safe large language models
Revolutionizing Medical AI Safety: The NOHARM Framework
Our analysis of 'First, do NOHARM' reveals critical insights into the clinical safety profiles of Large Language Models (LLMs). The NOHARM benchmark, encompassing 100 real primary-care-to-specialist consultation cases across 10 specialties, uncovers significant findings on harm frequency, severity, and mitigation strategies for AI-generated medical recommendations.
Executive Summary: Navigating AI Risks in Healthcare
This study is a foundational step towards understanding and mitigating harm from AI in clinical decision support. It establishes that existing benchmarks are insufficient for measuring safety, and that a multi-agent approach significantly reduces error. This has profound implications for AI deployment strategies in healthcare.
Deep Analysis & Enterprise Applications
Select a topic to dive deeper, then explore the specific findings from the research, rebuilt as interactive, enterprise-focused modules.
The NOHARM benchmark found that severe harm occurs in up to 22.2% of cases across 31 LLMs, highlighting a significant safety gap not captured by traditional AI and medical knowledge benchmarks. Performance on NOHARM's safety metric was only moderately correlated (r = 0.61–0.64) with existing evaluations, underscoring the need for explicit safety measurement.
Remarkably, the best LLMs outperformed generalist physicians on safety (mean difference 9.7%), suggesting AI's potential for safer clinical decision support when properly evaluated.
A critical finding is that errors of omission account for 76.6% of severely harmful errors. This means LLMs are more likely to cause harm by failing to recommend necessary actions (e.g., critical tests, follow-up) rather than by recommending inappropriate ones.
Analysis of intervention categories showed that top models' performance advantage came from reducing severe diagnostic and counseling errors of omission, further emphasizing the importance of comprehensive recommendation generation.
Multi-agent orchestration, where models review and revise each other's outputs, was found to be a highly effective strategy. These configurations had 5.9-fold higher odds of achieving top-quartile Safety performance than solo models.
The study also revealed an inverted-U relationship between Safety and Restraint (precision). Models that were either too precise (too few recommendations) or too permissive (too many, some inappropriate) performed worse on safety, with optimal safety achieved at intermediate levels of restraint.
Enterprise Process Flow
| Feature | Traditional Benchmarks | NOHARM Benchmark |
|---|---|---|
| Focus |
|
|
| Data Source |
|
|
| Evaluation |
|
|
Case Study: Urinary Tract Infection Management
A 25-year-old woman presents with urinary urgency and burning. An LLM might recommend only 'reassurance'. NOHARM identifies this as a Moderate Harm of Omission, as it fails to recommend crucial steps like urinalysis with reflex culture and appropriate antibiotics (nitrofurantoin or TMP/SMX). Conversely, an LLM recommending a CT abdomen pelvis with contrast for this case would be a Mild Harm of Commission.
The benchmark highlights that failing to act can be as harmful, if not more, than inappropriate actions.
Calculate Your Potential AI Safety ROI
Estimate the impact of improved AI clinical safety on your organization. By minimizing harmful errors and optimizing clinical recommendations, you can achieve significant savings in operational costs and reallocate physician time more effectively.
Your Journey to Safer AI Clinical Integration
Our phased approach ensures a secure, compliant, and impactful integration of AI into your clinical workflows, prioritizing patient safety and operational efficiency.
Phase 1: Safety Assessment & Gap Analysis
Leverage NOHARM-like evaluations to identify current AI safety risks and establish a baseline. This involves expert review of AI outputs in simulated clinical scenarios specific to your practice.
Phase 2: Multi-Agent Orchestration Design
Architect multi-agent systems using diverse LLMs to review and refine clinical recommendations, significantly reducing errors of both commission and omission.
Phase 3: Pilot Deployment & Continuous Monitoring
Implement AI-powered clinical decision support in a controlled pilot, continuously monitoring safety metrics and integrating feedback for iterative improvement and scaling.
Ready to Ensure AI Safety in Your Practice?
Don't let unquantified AI risks compromise patient care. Partner with us to implement robust safety benchmarks and multi-agent solutions that protect your patients and empower your clinicians.