
© 2026 Dr Ezekiel Dinama. Code & Cure. All rights reserved.

May 25, 2026 · 6 min read

Clinician-written evidence makes clinical LLMs safer as prostate AI moves into care

A newsletter on the latest in AI for healthcare.

Welcome back,

A new clinical large language model study delivers a blunt warning: safety does not scale neatly with accuracy. Clean, clinician-written evidence did more than bigger models or extra compute to cut risky answers.

Tempus launched the ArteraAI Prostate Test for metastatic patients, bringing prostate digital pathology into its clinical ecosystem.

Also inside: the MONAI repo for medical imaging AI, plus health AI and healthcare data moves from Kordata Dynamics, Century Health, and Qualtrics.

Here is what you need to know,

SUMMARY

Top Research Paper

  • Clinical LLM safety and accuracy respond differently to scaling, with clean clinician-written evidence delivering the largest gains.

Top AI News

  • Tempus clinically launched the ArteraAI Prostate Test, an AI-powered prostate digital pathology test that analyses clinical data and biopsy images to estimate prostate cancer-specific mortality risk in metastatic hormone-sensitive prostate cancer.

Top Model

  • MONAI is a PyTorch-based open-source toolkit for medical imaging AI, with ready-made components for building, training, validating, and sharing imaging models more easily.

Bedside Bets

Startup rounds, deals, and moves.

  • Kordata Dynamics builds AI-powered clinical trial infrastructure and emerged from stealth with pre-seed backing. It uses BIOS Health’s neural biomarker technology for faster precision medicine studies. Deal value not disclosed.

  • Century Health turns clinical records into research-ready data and raised $5m to scale its AI abstraction platform. Its CHARM model reports 97% accuracy against expert review.

  • Qualtrics uses experience data AI to predict patient and workforce needs and bought Press Ganey Forsta for $6.75bn, adding healthcare experience data to its AI platform.

Pulse Check

Quick reads across health AI.

  • A multitask AI system supports basal cell carcinoma diagnosis with dual explanations.

  • Cedars-Sinai is deploying OpenEvidence, an AI-enabled clinical reference tool that links medical evidence to patient electronic health record context.

  • NEJM asks whether AI can say “I don’t know”, a key safety question for clinical decision support.

  • Microsoft Agent Framework helps healthcare AI builders orchestrate production agents in .NET and Python, with use cases like EHR summarisation, prior authorisation workflows, and clinical trial screening.

TOP PAPER

⚖️ Safety and accuracy follow different scaling laws in clinical large language models

Source: arXiv · 5 May 2026

The study challenges the common assumption that scaling clinical large language models for higher accuracy automatically improves safety. It introduces a dedicated evaluation framework to measure how safety metrics behave across practical deployment variables in radiology question-answering tasks.

Question

  • How do accuracy and clinically relevant safety metrics, including high-risk errors, evidence contradictions, and dangerous overconfidence, scale with model size, evidence quality, retrieval strategy, context length, and inference-time compute in clinical large language models?

Approach

  • Developed SaFE-Scale and RadSaFE-200, a benchmark of 200 multiple-choice radiology questions with clinician-curated clean and conflict evidence plus option-level safety labels.

  • Evaluated 34 locally deployed large language models across Qwen, Llama, Gemma/MedGemma, DeepSeek, Mistral, and OpenAI-OSS families.

  • Tested six deployment conditions: closed-book, clean evidence, conflict evidence, standard retrieval-augmented generation, agentic retrieval-augmented generation, and max-context prompting.

  • Secondary tests included self-consistency and fixed three-model ensembles. Outcomes tracked accuracy, high-risk error rate, unsafe answers, contradictions, and dangerous overconfidence.
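A minimal sketch of how option-level safety labels can be turned into the outcome metrics listed above. This is not the paper's scoring code. The `Answer` record, the `high_risk` label format, and the 0.8 confidence threshold are all assumptions for illustration, and "dangerous overconfidence" is defined here as a high-risk answer given with confidence at or above that threshold.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    chosen: str           # option the model picked, e.g. "B"
    correct: str          # reference answer for the question
    high_risk: frozenset  # options labelled high-risk (hypothetical label format)
    confidence: float     # model's self-reported confidence in [0, 1]


def safety_metrics(answers, confidence_threshold=0.8):
    """Score accuracy and two safety rates over a list of answers.

    The threshold and the overconfidence definition are assumptions,
    not values taken from the study.
    """
    n = len(answers)
    if n == 0:
        raise ValueError("no answers to score")
    correct = sum(a.chosen == a.correct for a in answers)
    # A high-risk error is any answer that lands on a high-risk option.
    high_risk = [a for a in answers if a.chosen in a.high_risk]
    # Overconfidence is counted only among the high-risk errors.
    overconfident = [a for a in high_risk if a.confidence >= confidence_threshold]
    return {
        "accuracy": correct / n,
        "high_risk_error_rate": len(high_risk) / n,
        "dangerous_overconfidence_rate": len(overconfident) / n,
    }


# Four made-up answers, purely to exercise the function.
answers = [
    Answer("A", "A", frozenset({"C"}), 0.90),       # correct
    Answer("C", "A", frozenset({"C"}), 0.95),       # high-risk and confident
    Answer("B", "A", frozenset({"C"}), 0.60),       # wrong but not high-risk
    Answer("C", "B", frozenset({"C", "D"}), 0.50),  # high-risk, low confidence
]
metrics = safety_metrics(answers)
print(metrics)
```

Scoring the safety rates separately from accuracy is the point: two of the three wrong answers here are high-risk, which an accuracy figure alone would not reveal.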

Source: Wind, Nguyen et al.

Results

  • Clean evidence raised mean accuracy from 73.5% to 94.1% and cut high-risk error from 12.0% to 2.6%.

  • Contradictions fell from 12.7% to 2.3%, while dangerous overconfidence dropped from 8.0% to 1.6%.

  • Standard and agentic retrieval-augmented generation improved accuracy modestly, from 76.0% to 78.1%, and reduced some contradictions, but left high-risk error and overconfidence elevated versus clean evidence.

  • Max-context prompting increased latency without closing safety gaps. Self-consistency gave small gains. Ensembles improved aggregate scores but preserved synchronised failures on hard cases.
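The ensemble finding can be illustrated with a toy majority vote. The sketch below uses made-up answers from three hypothetical models, not data from the study, and assumes simple plurality voting rather than whatever aggregation the authors used.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common answer among the votes."""
    return Counter(votes).most_common(1)[0][0]


def accuracy(predictions, reference):
    return sum(p == r for p, r in zip(predictions, reference)) / len(reference)


# Hypothetical reference answers for four questions.
reference = ["A", "B", "C", "D"]

# Hypothetical model outputs. Each model is wrong on two questions,
# and all three are wrong in the same way on the last one.
model_answers = {
    "model_1": ["A", "B", "D", "C"],
    "model_2": ["A", "C", "C", "C"],
    "model_3": ["B", "B", "C", "C"],
}

# Vote question by question across the three models.
ensemble = [majority_vote(votes) for votes in zip(*model_answers.values())]

individual = {name: accuracy(preds, reference) for name, preds in model_answers.items()}
ensemble_accuracy = accuracy(ensemble, reference)

# Questions where every model gave the same wrong answer.
synchronised_failures = [
    i for i, votes in enumerate(zip(*model_answers.values()))
    if len(set(votes)) == 1 and votes[0] != reference[i]
]

print(individual, ensemble_accuracy, synchronised_failures)
```

Voting corrects the errors the models make independently, so the aggregate score rises, but it cannot fix a question on which every model agrees on the same wrong answer.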

Caveat

  • The benchmark focuses on radiology question-answering, so results may not transfer cleanly to other clinical settings.

Potential impact: Clinical large language model deployment should measure safety directly under the target evidence and retrieval conditions, rather than relying on accuracy benchmarks as a proxy.
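For a sense of scale, the reported clean-evidence figures can be restated as absolute and relative changes. The sketch below uses only the percentages quoted in the results. It assumes the contradiction and overconfidence figures describe the same before-and-after comparison as the high-risk error figure, which this summary does not state explicitly.

```python
# (before, after) percentages as quoted in the results above.
reported = {
    "high-risk error": (12.0, 2.6),
    "contradictions": (12.7, 2.3),
    "dangerous overconfidence": (8.0, 1.6),
}


def relative_reduction(before, after):
    """Percentage drop relative to the starting value."""
    return (before - after) / before * 100


summary = {
    name: {
        "absolute_points": round(before - after, 1),
        "relative_percent": round(relative_reduction(before, after), 1),
    }
    for name, (before, after) in reported.items()
}

# Accuracy moved in the other direction, so report it as a gain in points.
accuracy_gain_points = round(94.1 - 73.5, 1)

for name, change in summary.items():
    print(f"{name}: -{change['absolute_points']} points, "
          f"-{change['relative_percent']}% relative")
print(f"accuracy: +{accuracy_gain_points} points")
```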


TOP NEWS

Tempus brings prostate digital pathology into its clinical ecosystem

Source: Tempus via Business Wire · 21 May 2026

Tempus has made the ArteraAI Prostate Test for metastatic hormone-sensitive prostate cancer clinically available. It is the first externally developed digital pathology algorithm in the Tempus ecosystem.

The CLIA-certified, CAP-accredited test combines patient clinical data and histopathology images to generate personalised risk estimates of prostate cancer-specific mortality.

  • Size: Roughly 25,000 US patients are newly diagnosed with metastatic prostate cancer each year.

  • Scope: Metastatic hormone-sensitive prostate cancer risk estimation using clinical data and digital pathology.

  • Opportunity: Tempus can pair the test with its next-generation sequencing assays, giving clinicians a more complete view of tumour biology and risk.

Why it matters: Therapy intensity decisions in metastatic prostate cancer are hard, and risk estimation can be fragmented across genomics, pathology, and clinical features. Tempus is pushing toward a more multimodal commercial model, where tests, data, and algorithms sit inside one clinical ordering ecosystem.


TOP REPO FOR BUILDERS

MONAI gives medical imaging teams a production-ready PyTorch foundation

MONAI is an open-source PyTorch framework for deep learning in healthcare imaging. It provides domain-specific tools that bridge research and clinical deployment.

Useful details

  • Flexible multi-dimensional pre-processing, compositional APIs, and domain-specific networks, losses, and metrics.

  • Supports multi-GPU and multi-node parallelism, bundles for reproducible workflows, and a model zoo.

  • Includes tutorials, Docker images, and integration with the broader PyTorch ecosystem.

Why it matters: MONAI gives imaging AI teams a standardised foundation for building, evaluating, and sharing medical imaging models across academic and clinical environments.




NEWSLETTER BY:
Dr Ezekiel Dinama

MD and PhD Researcher at Cambridge University applying physics-informed ML/AI to neurophysiological research.
