Medical Benchmark Leadership

Orinn consistently outperforms leading AI models across HealthBench, MedQA, PubMedQA, and MMLU (Medical), delivering superior clinical reasoning, diagnostic accuracy, and real-world healthcare intelligence. Built with an advanced medical agentic system, Orinn is designed to handle complex clinical workflows, decision support, and healthcare operations with precision and reliability. From diagnosis assistance to medical documentation and workflow automation, Orinn is optimized for high-stakes healthcare environments where accuracy matters most. Its benchmark performance reflects not just strong academic results, but practical readiness for enterprise-scale healthcare deployment.

Model           HealthBench   HealthBench Hard   Hallucination
Orinn-1.7       75.3          48.1               2.1
Baichuan-M3     65.1          44.4               3.5
GPT-5.2-High    63.0          42.0               3.8
Gemini-3-Pro    46.2          15.2               7.1

Open Access

Update 17-05-2026

HealthBench Professional

HealthBench Professional (from OpenAI) is a benchmark of 525 physician-authored tasks testing LLMs on real clinician workflows across care consult, documentation, and medical research, graded by multiple physicians.

Agentic / EHR

525 samples

Text-only

Models                    Score
Orinn-1.7                 64.8
GPT-5.4 for Clinicians
GPT-5.4                   48.1
Opus 4.7                  46.2
GPT-5                     46.2


Open Access

Update 17-05-2026

MedAgentBench

MedAgentBench (from Stanford) is a benchmark of 300 cases testing LLMs as autonomous clinical agents across 10 EHR tasks, evaluating FHIR API use, clinical decisions, and protocol adherence.

Agentic / EHR

300 samples

Text-only

Models             Score
Orinn-1.7          99.67
Gemini 3.1 Pro     91.3
GPT-5.5            89.4
Claude Opus 4.7
Gemini 3 Flash
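For readers unfamiliar with the FHIR API use that MedAgentBench evaluates, the following is a minimal sketch of the kind of request an agent might need to construct: a standard FHIR search for a patient's most recent lab observation. It is not taken from the MedAgentBench harness. The base URL, patient ID, and the helper name `observation_search_url` are hypothetical, chosen only for illustration, and no network call is made.

```python
from urllib.parse import urlencode

# Hypothetical FHIR server base URL, for illustration only.
FHIR_BASE = "https://fhir.example.org/R4"


def observation_search_url(patient_id: str, loinc_code: str, count: int = 1) -> str:
    """Build a FHIR search URL for a patient's most recent observations."""
    params = {
        "patient": patient_id,  # standard Observation search parameter
        "code": loinc_code,     # code of the observation of interest (e.g. LOINC)
        "_sort": "-date",       # newest results first
        "_count": count,        # cap the number of returned entries
    }
    return f"{FHIR_BASE}/Observation?{urlencode(params)}"


# Hypothetical patient ID; 2345-7 is the LOINC code for serum/plasma glucose.
print(observation_search_url("example-patient-1", "2345-7"))
```

An agent would typically send this URL as a GET request and read the returned Bundle. The sketch stops at URL construction so that it runs without a server.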


Open Access

Update 17-05-2026

Medical Coding

Orinn-1.7 outperformed GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6, and Corti Symphony on real-world medical coding tasks drawn from our internal evaluation datasets, delivering higher accuracy and stronger clinical performance.

Others

180 samples

Text-only

Models

Orinn-1.7

GPT-5.2

Gemini-3.1-Pro
