Are we evaluating LLMs properly, especially for women’s health?
Large language models now pass medical exams and are used by clinicians and patients. If we’re building the next generation of clinical AI systems, we need to ensure women’s health isn’t left behind.
Both patients and clinicians are increasingly using large language models (LLMs) to ask medical questions. Over the past two years, we’ve repeatedly heard that these models are performing at the level of physicians (or better), passing medical exams, and demonstrating impressive clinical knowledge.
But this raises an important question: how are we actually measuring their performance in medicine? Do we have high-quality tools to evaluate clinical use of LLMs? Are we measuring their clinical usefulness, their impact on decision-making, or their effect on patient outcomes?
A systematic review of LLMs in clinical medicine published this month identified 4,609 studies (yep, 4,609!) studying LLM applications in healthcare. They found that 1,048 studies used real-world patient data, and only 19 were prospective randomized trials. Most research instead evaluated models using simulated scenarios (n = 1,857) or exam-style tasks (n = 1,704). Across 1,046 head-to-head comparisons between humans and LLMs, models outperformed clinicians in 33% of comparisons. The authors concluded that despite the rapid growth of LLM research in medicine, rigorous patient-centered evidence remains scarce.
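To put those counts in proportion, here is a quick back-of-the-envelope sketch in Python. It uses only the numbers quoted above; the variable names are mine, not the review's. One thing it makes visible is that the three data-source categories add up to exactly 4,609, so they appear to partition the full set of studies, while the 19 randomized trials are a small subset.

```python
# Counts quoted from the systematic review discussed above.
TOTAL_STUDIES = 4609
counts = {
    "real-world patient data": 1048,
    "simulated scenarios": 1857,
    "exam-style tasks": 1704,
    "prospective randomized trials": 19,
}


def share(count: int, total: int = TOTAL_STUDIES) -> float:
    """Return count as a percentage of total."""
    return 100.0 * count / total


shares = {label: share(n) for label, n in counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}% of {TOTAL_STUDIES:,} studies")
```

In other words, fewer than one in four studies touched real patient data, and randomized trials make up well under 1% of the literature.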
But what comes to my mind (of course it does) as I’m scanning this literature… is anyone evaluating how well LLMs perform specifically in women’s health?
TL;DR
Thousands of studies now evaluate LLMs in medicine, but most rely on exam-style benchmarks or simulated clinical scenarios. Only a small fraction use real patient data, and randomized trials are rare (and of course expensive).
Women’s hormone health topics are rarely evaluated explicitly within these benchmarks.
Even sophisticated evaluation frameworks often fail to report sex-specific performance or test for bias.
Early targeted studies suggest that domains such as menopause care and hormone therapy may be under-tested and prone to incomplete or inaccurate responses on general frontier models.
As clinical AI systems become more widely deployed, it’s critical to develop rigorous evaluation frameworks and specialized tools - we want to address the gaps and challenges in healthcare, not scale biases.
Benchmarks: how we evaluate LLMs in medicine
There has been a surge of research attempting to measure how well LLMs perform in medical contexts. Researchers have developed a wide range of benchmarks and evaluation metrics to assess the accuracy and safety of these systems. Many of these benchmarks focus on general medical knowledge, diagnostic reasoning, or questions derived from medical licensing exams.
Figure 1: LLM performance across 31 healthcare benchmarks from Feb 2025 (it’s already an “old” evaluation, but it illustrates the numerous benchmarks developed). Source
Exam-style benchmarks are particularly common because they are easy to score: a model selects the correct answer from a multiple-choice set, and accuracy can be calculated instantly. Unsurprisingly, LLMs tend to perform very well in this setting. But medical exams represent a very specific kind of task. They reward pattern recognition within a clearly defined problem, where the question is well-posed, the context is limited, and the possible answers are pre-specified.
As many clinicians would attest, real-world clinical practice bears little resemblance to textbook exams. Evidence increasingly shows that LLMs most often outperform humans on knowledge-based evaluations built on synthetic data, but their performance drops significantly when they move from structured exam questions to real clinical data. This suggests that success on licensing-exam-style benchmarks does not necessarily translate into reliable clinical performance (source). Clinical reality is messier and more ambiguous.
Moving toward more “real-world” evaluations
Recognizing these limitations, researchers have begun developing more advanced evaluation frameworks intended to better reflect how LLMs might actually be used in healthcare. “A good assessment tool should accurately represent the concept it’s intended to measure.
If you want to measure temperature, you use a thermometer. If you want to measure depression or intelligence, you need something much more complex” [Ahmed Alaa et al., March 2025].
LLMs in healthcare are not simple classifiers producing yes/no outputs. Increasingly, they behave more like open-ended reasoning agents, generating explanations, synthesizing long medical records and evidence, and assisting clinicians with decisions. To measure their ‘medical intelligence’, we need more sophisticated measurement tools. Given this, there have been major efforts to create broader, more realistic benchmarks and evaluations beyond USMLE-style benchmarks.
Some major frameworks include:
MultiMedQA - expanding beyond a single exam
It’s a composite benchmark that aggregates multiple existing medical question-answering datasets - including professional exams (MedQA, MedMCQA), research questions (PubMedQA), and clinical knowledge questions (MMLU). This means it covers exams, consumer health questions, and research queries. It pairs automated scoring with human expert evaluations where physicians rate model responses on dimensions such as factual accuracy, reasoning quality, potential harm, and bias. Despite this broader scope, the framework still largely evaluates question-answering tasks, rather than the multi-step workflows that characterize real clinical care.
MedHELM - a framework for real clinical tasks
MedHELM was developed explicitly to address the limitations of exam-based evaluation. It proposes a clinician‑validated taxonomy spanning five categories, 22 subcategories, and 121 tasks, with a benchmark suite of 35 benchmarks. The task set reflects real roles and day-to-day activities clinicians perform (e.g., note generation, patient communication, research assistance, decision support). It measures performance across quality, reasoning, communication, documentation, and workflow categories. To scale evaluation of open-ended responses, MedHELM combines automated scoring with LLM-as-a-judge methods, in which other language models help evaluate outputs. These automated scores are partially validated against clinician ratings. It’s designed as a living framework with a leaderboard so new models and tasks can be added. Very impressive work from the team at Stanford.
HealthBench - a rubric-based benchmark from OpenAI
HealthBench is another ambitious (and resource-intensive) attempt to evaluate LLMs in more realistic clinical interactions. The benchmark was created with 262 physicians from more than 60 countries and consists of 5,000 multi-turn conversations simulating interactions between patients, clinicians, and caregivers. Each conversation is paired with a detailed evaluation rubric describing what an ideal answer should include. In total, the benchmark contains 48,562 rubric criteria, covering safety, clarity, clinical appropriateness, and risk. Model responses are graded against these rubrics, often using automated scoring systems aligned with physician-defined criteria.
Figure 2: OpenAI’s custom rubrics that grade model responses developed by physicians (source)
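To make rubric-based grading concrete, here is a minimal sketch of how a single response could be scored against a weighted rubric. Everything in it is an assumption for illustration: the `Criterion` class, the example criteria, the point values, and the normalization rule (earned points divided by the maximum achievable positive points, clipped to the 0 to 1 range). It shows the general idea of this kind of scoring and should not be read as OpenAI's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    points: int  # positive = desirable behaviour, negative = harmful behaviour


def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Earned points over maximum achievable points, clipped to [0, 1]."""
    if len(criteria) != len(met):
        raise ValueError("need one met/not-met judgment per criterion")
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c, is_met in zip(criteria, met) if is_met)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric; descriptions and weights are invented for this example.
rubric = [
    Criterion("Asks a relevant clarifying question", 5),
    Criterion("Explains when to seek urgent care", 7),
    Criterion("Uses plain, patient-friendly language", 3),
    Criterion("Describes a contraindicated treatment as safe", -8),
]

good = rubric_score(rubric, [True, True, False, False])
bad = rubric_score(rubric, [True, False, False, True])
print(f"good response: {good:.2f}, harmful response: {bad:.2f}")
```

The negative weight is the interesting design choice: a response that does one helpful thing but also says something unsafe ends up at zero rather than earning partial credit.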
Taken together, these frameworks illustrate a shift in how LLMs are evaluated in the clinical context. Benchmarks and evaluations are clearly moving:
from exam performance to workflow performance and clinical usefulness,
from accuracy only to safety, reasoning, and completeness,
from single-turn questions to multi-turn, context-aware conversations.
Even so, it is still difficult to identify a clear gold-standard evaluation method for LLMs in medicine. In traditional clinical research, the gold standard would be a randomized controlled trial evaluating the effect of a tool on real clinical outcomes. Yet these studies remain rare and, of course, expensive and slow to conduct, compared with the speed of AI development. As the systematic review mentioned earlier showed, out of more than 4,600 published studies on LLMs in clinical medicine, only 19 were randomized trials.
But there is another gap that becomes clear when reviewing these evaluation frameworks closely:
Women’s health is rarely evaluated explicitly.
The blind spot: women’s health in LLM evaluations
Women’s health scenarios are likely present somewhere within broader benchmark categories. For example, obstetrics and gynecology questions may appear within exam-style datasets, and menopause-related questions might occasionally surface within family medicine case studies. However, performance on these topics is rarely evaluated or reported on its own.
Most studies simply report overall performance across internal medicine tasks. A close look at the LLM evaluation studies shows that women’s health represents only a small fraction of the clinical specialties examined.
Figure 3 – Table from the systematic review published in March showing the percentage of studies evaluating each specialty (source)
Another important limitation is that these benchmarks rarely report results stratified by patient sex or gender.
A model achieving a high overall score on a general medical benchmark does not necessarily perform equally well for women and men. A “more realistic” evaluation is not automatically a sex-aware evaluation.
Without explicit women’s health datasets and without publishing sex- or gender-stratified results, even a methodologically sophisticated benchmark may fail to detect systematic performance gaps or harms that disproportionately affect women.
For example, a model might perform well on general cardiovascular risk assessment, but if pregnancy, postpartum physiology, or perimenopause are underrepresented in training data or evaluation techniques, the system may still generate inappropriate recommendations in these contexts.
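A small sketch shows how an aggregate score can hide this kind of gap. The records below are made up purely for illustration, and `accuracy_by_group` is a hypothetical helper rather than part of any benchmark's tooling.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group_label, is_correct) pairs.

    Returns {group_label: (number_of_items, accuracy)}.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, is_correct in records:
        totals[group] += 1
        correct[group] += int(is_correct)
    return {g: (totals[g], correct[g] / totals[g]) for g in totals}


# Toy, invented results: 50 cases about male patients, 50 about female patients.
records = (
    [("male", True)] * 45 + [("male", False)] * 5
    + [("female", True)] * 35 + [("female", False)] * 15
)

overall = sum(ok for _, ok in records) / len(records)
by_sex = accuracy_by_group(records)

print(f"overall accuracy: {overall:.0%}")
for group, (n, acc) in by_sex.items():
    print(f"{group}: {acc:.0%} (n={n})")
```

In this toy case the headline number is 80%, which looks fine until it is split into 90% and 70%. Reporting only the first figure is exactly the blind spot described above.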
How accurate are LLMs in women’s health?
Is it possible that LLMs perform differently when answering questions about male-specific versus female-specific conditions? How can we design evaluations that capture women’s health better, and what mechanisms should we use to detect and mitigate bias? Read on.
We all know that the gap in women’s health evaluation has deeper roots. Historically, women have been underrepresented in biomedical research, which has resulted in datasets and clinical evidence biased toward male populations. Women also experience unique physiological states and health needs - including pregnancy, menstrual health, PCOS, endometriosis, and menopause - and may experience certain conditions differently or more frequently than men, such as autoimmune diseases (source).
These structural gaps limit the quality and representation of women’s health data available for training AI systems. They also hold back the development of new drugs, diagnostic tests, and more.
At the same time, AI has the potential to help close these gaps. Models capable of analyzing large datasets may surface both intergroup differences (for example, sex-specific disease presentations) and intragroup variation (subtypes within gynecologic conditions and variability in response to hormone treatments). If used effectively, they could enable more personalized care (source).
The critical question, however, is whether current models are correcting these inaccuracies and biases in women’s health, or quietly reproducing them.
Emerging women’s health–specific evaluations
A small number of studies have begun examining LLM performance specifically within women’s health domains.
One notable example is the Women’s Health Benchmark (WHB) published in December 2025. The benchmark represents one of the first attempts to systematically evaluate LLMs using women’s-health-specific prompts.
It’s worth diving a bit deeper into the methodology of that benchmark (as we currently don’t have many others specific to women’s health).
Researchers assembled an international group of women’s health experts, including physicians, pharmacists, researchers, and nurse practitioners, who generated 345 prompts covering a range of women’s health topics. Each prompt was sent to one randomly assigned model from a pool of 13 LLMs. Experts evaluated the responses for accuracy, safety, and completeness. If a response was rejected, the expert was required to provide a brief explanation and supporting citation. Only prompts that produced an incorrect response were included in the final benchmark as “model stumps.” Out of the original 345 prompts, experts labeled 96 responses as incorrect (27.8%), producing a dataset of challenging cases across five specialties:
Obstetrics & Gynecology (41)
Emergency medicine (24)
Primary care (17)
Oncology (11)
Neurology (3)
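The arithmetic behind these figures is easy to check. The sketch below uses only the counts listed above (variable names are mine) and confirms that the five specialties add up to the 96 stumps and that 96 of 345 is 27.8%.

```python
# Counts quoted from the WHB description above.
prompts_total = 345
stumps_by_specialty = {
    "Obstetrics & Gynecology": 41,
    "Emergency medicine": 24,
    "Primary care": 17,
    "Oncology": 11,
    "Neurology": 3,
}

stumps_total = sum(stumps_by_specialty.values())
rejection_rate = 100 * stumps_total / prompts_total
specialty_share = {
    name: 100 * n / stumps_total for name, n in stumps_by_specialty.items()
}

print(f"{stumps_total} stumps out of {prompts_total} prompts ({rejection_rate:.1f}%)")
for name, pct in specialty_share.items():
    print(f"{name}: {pct:.1f}% of stumps")
```

Obstetrics and gynecology accounts for a bit over 40% of the stumps, while neurology contributes only three cases, so per-specialty conclusions from this dataset rest on very different amounts of evidence.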
In the second phase of the benchmark, all 13 models were asked to answer each of the 96 stumps. A single evaluator with a PhD in clinical sciences reviewed the outputs. Importantly, responses were marked incorrect only if they reproduced the exact same error identified in the first round. This means an answer could still contain other clinical issues and be labeled correct, as long as it did not repeat the specific pre-identified mistake. The authors reported that current models showed a failure rate of approximately 60% on WHB prompts.
However, this statistic requires careful interpretation.
The 60% failure rate applies to the WHB benchmark, which is not the same as how often LLMs fail on women’s health questions in general. Some methodological limitations explain why the failure rate cannot be generalized:
Selection effect (“conditioned on failure”): WHB stumps exist because a model previously got them wrong. By design, this concentrates hard, high-risk cases, which is useful for stress testing but not representative of everyday questions.
Single-model discovery bias: During the first round, each prompt was tested on only one randomly selected model. A question that one model struggled with might have been answered correctly by others, and a question that the assigned model handled well might have tripped up another.
Narrow failure definition: Because evaluators only checked for a specific predefined error, other clinically meaningful problems (such as incomplete explanations or incorrect differentials) may not have been counted.
Experts creating patient prompts: The prompts were written by clinicians rather than patients, which may not fully capture how real users interact with conversational AI systems.
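The first two limitations can be illustrated with a toy simulation. Everything here is invented: three hypothetical models, twelve prompts, and a fixed table of which model fails which prompt. Prompts are assigned round-robin rather than randomly so the example is reproducible. It is a sketch of the selection mechanism, not a reconstruction of the WHB data.

```python
# Hypothetical failure table: which prompt ids each model gets wrong.
FAILS = {
    "model_a": {0, 3, 7},
    "model_b": {1, 3, 4, 10},
    "model_c": {2, 3, 8},
}
N_PROMPTS = 12
MODELS = list(FAILS)


def failure_rate(prompt_ids) -> float:
    """Share of (model, prompt) pairs that fail over the given prompts."""
    prompt_ids = list(prompt_ids)
    failures = sum(p in FAILS[m] for m in MODELS for p in prompt_ids)
    return failures / (len(MODELS) * len(prompt_ids))


# Phase 1: each prompt is shown to exactly one model; it becomes a "stump"
# only if that one model fails it.
stumps = [p for p in range(N_PROMPTS) if p in FAILS[MODELS[p % len(MODELS)]]]

# Phase 2: every model answers every stump.
rate_all = failure_rate(range(N_PROMPTS))
rate_stumps = failure_rate(stumps)

# Prompts that some model fails but that never became stumps.
missed = sorted(set().union(*FAILS.values()) - set(stumps))

print(f"failure rate on all prompts: {rate_all:.1%}")
print(f"failure rate on stumps only: {rate_stumps:.1%}")
print(f"failing prompts never discovered: {missed}")
```

In this toy setup the failure rate measured on the stumps is noticeably higher than the failure rate across all prompts, simply because of how the stumps were selected. At the same time, one prompt that a model does fail never enters the benchmark because it happened to be shown to a different model.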
Sample of the dataset available [here]. Despite these limitations, WHB is an important step toward women’s-health-specific evaluation. It highlights the kinds of errors that models can make and begins to map areas where LLM performance may be less reliable:
Figure 4: Table from the WHB study, showing types of errors on the WHB benchmark (source)
Menopause and hormone therapy: a targeted evaluation
Another interesting study focused specifically on menopause and hormone therapy questions. This is precisely the sort of targeted evaluation that general clinician-copilot benchmarks rarely provide.
Researchers evaluated several LLMs using a set of 35 questions, including both patient-facing questions (20) and clinician-level questions (15). The study also included OpenEvidence, a more specialized clinical decision-support tool. Four blinded expert reviewers evaluated each response, categorizing them as:
accurate and complete
accurate but incomplete
inaccurate
The researchers also assessed readability using the Flesch Reading Ease Score (FRES) and word count to determine whether answers were understandable for patients. They concluded that across all models, responses were rated difficult or very difficult to read for patients.
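For readers unfamiliar with the metric, the Flesch Reading Ease Score is computed from average sentence length and average syllables per word. The sketch below implements the standard published formula; the word, sentence, and syllable counts in the example are invented, and the band labels follow the conventional interpretation table (below 30 very difficult, 30 to 50 difficult). How the study authors counted syllables is not stated here, so the function takes the counts as inputs rather than guessing.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores are easier to read."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def difficulty_band(score: float) -> str:
    """Coarse labels based on the conventional Flesch interpretation table."""
    if score < 30:
        return "very difficult"
    if score < 50:
        return "difficult"
    return "fairly difficult or easier"


# Invented example: a 120-word answer in 5 sentences with 216 syllables.
score = flesch_reading_ease(words=120, sentences=5, syllables=216)
band = difficulty_band(score)
print(f"FRES = {score:.1f} ({band})")
```

Long sentences and polysyllabic medical vocabulary both push the score down, which is why otherwise accurate answers can still land in the difficult range for patients.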
Accuracy for patient-level questions:
ChatGPT-3.5 achieved 70% accuracy
ChatGPT-4.0 achieved 60% accuracy
Gemini achieved 30% accuracy
Accuracy for clinician-level questions:
ChatGPT-4.0 achieved the highest accuracy (67%)
ChatGPT-3.5 and OpenEvidence both achieved approximately 60%
Gemini achieved 47%
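It helps to translate these percentages back into counts, because the question sets are small. The sketch below does that and attaches a 95% Wilson score interval to each figure. The counts are my inference from the rounded percentages and the stated set sizes (15 clinician-level and 20 patient-level questions), not numbers reported by the authors, and the interval is a standard textbook formula rather than an analysis from the study.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# (label, reported accuracy, number of questions) from the summary above.
reported = [
    ("ChatGPT-4.0, clinician-level", 0.67, 15),
    ("ChatGPT-3.5, clinician-level", 0.60, 15),
    ("Gemini, clinician-level", 0.47, 15),
    ("ChatGPT-3.5, patient-level", 0.70, 20),
]

intervals = {}
for label, accuracy, n in reported:
    correct = round(accuracy * n)  # ASSUMPTION: count inferred from the rounded %
    low, high = wilson_interval(correct, n)
    intervals[label] = (correct, n, (low, high))
    print(f"{label}: {correct}/{n} correct, 95% CI {low:.0%} to {high:.0%}")
```

On 15 questions, 67% versus 60% is a difference of a single answer, and the interval around the top score runs from roughly 42% to 85%. The rankings between models are therefore tentative, even though the overall picture of modest accuracy holds.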
Scores of 67% and 60% are not impressive next to what LLMs achieve on internal medicine and USMLE-style exams. The researchers concluded that menopause care remains an under-tested domain where models frequently produce incomplete or inaccurate responses. Without explicit evaluation, these gaps could easily remain undetected.
Informally, this matches what I hear from clinicians: many dislike using tools like ChatGPT and OpenEvidence for menopause care because the answers can lack nuance and currency, often lumping “hormone therapy” into one category or citing the WHI study as the main source rather than incorporating more recent clinical evidence.
Improving women’s health performance with domain knowledge
Some research has explored how LLM performance in women’s health can be improved through domain-specific knowledge retrieval. One team of researchers at Oxford evaluated models using datasets derived from the UK Royal College of Obstetricians and Gynaecologists (RCOG) MRCOG examinations.
They concluded that future research should explore the development of domain-specific LLMs tailored to women’s health, enabling more reliable responses.
Similarly, another group of researchers noted the lack of rigorous evaluations of LLMs specifically in obstetrics and gynaecology. They conducted a study to see which strategies (e.g., augmentation with RAG) best improve LLM performance on women’s health Q&A. They assessed:
General-purpose models (OpenAI, Llama, Claude, Gemini, DeepSeek)
Domain-specialized medical models (e.g., MedLM, MMed-Llama)
The same models with and without retrieval-augmented generation (RAG), using a curated knowledge base of approximately 600 women’s health documents, including MRCOG reading lists and national clinical guidelines.
The best model was a reasoning-first general model + RAG. OpenAI o1-preview + RAG achieved: 72% accuracy on MRCOG Part 2 and 92.3% accuracy on MedQA.
Their results suggest that retrieving guideline-based knowledge significantly improves performance on women’s health tasks. However, performance varied widely across subdomains. The highest-performing category was clinical governance and research, while the lowest-performing category was fetal medicine, labour, delivery, and postpartum care.
This highlights another important lesson for evaluation: overall benchmark scores can hide dangerous weak areas, so subdomain performance needs to be examined as well. For example, were menopause management and hormone therapy, which are often underrepresented in training data, covered by this RAG knowledge base and evaluation?
So, going back to my initial question, despite the numerous research studies in the field, it’s sadly still early to conclude what the actual performance of LLMs is in women’s health and how it compares to other areas of medicine. We need to do better on the evaluations and specialized tools.
The challenge of expert evaluation
Evaluating LLMs in medicine is not easy. Most medical AI benchmarks rely on expert judgment as the reference standard and ‘source of truth’. In practice, this is often the most feasible approach, but it comes with its own challenges.
Experts can have biases, too, and importantly, experts can legitimately disagree. Imagine asking experts to assess the output of the LLM as “factually accurate, safe, complete, up-to-date”.
They might have different opinions depending on factors such as:
What counts as “up to date” for them (guideline vs anecdotal clinical practice),
Their regional standards (e.g., differences in guidelines and formularies between countries),
Their risk tolerance and urgency thresholds (when to escalate to ED),
and differing views on what constitutes sufficient evidence.
One clinician may view a guideline-based answer as definitive, while another may argue that the guideline itself is outdated. Some clinicians are comfortable incorporating strong observational evidence into practice, while others may require randomized controlled trials before accepting a recommendation. These differences reflect the reality that medicine is not purely algorithmic. It’s nuanced.
Evaluating AI systems in medicine inevitably involves navigating areas of legitimate clinical disagreement.
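One common way to quantify how much two reviewers agree beyond chance is Cohen's kappa. The sketch below implements the standard two-rater formula from scratch; the ratings are invented, and the three labels simply mirror the kind of categories an expert panel might use. With more than two reviewers, a multi-rater statistic such as Fleiss' kappa would be the usual choice.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented ratings of ten model responses by two hypothetical clinicians.
C, I, X = "complete", "incomplete", "inaccurate"
clinician_1 = [C, C, C, C, I, I, I, X, X, X]
clinician_2 = [C, C, C, I, I, I, C, X, X, I]

raw_agreement = sum(a == b for a, b in zip(clinician_1, clinician_2)) / len(clinician_1)
kappa = cohens_kappa(clinician_1, clinician_2)
print(f"raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```

In this toy example the two clinicians agree on 7 of 10 responses, yet kappa is only about 0.55 once chance agreement is removed. Reporting a statistic like this alongside benchmark scores makes it clear how firm the “source of truth” really is.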
Equity and bias
Another critical issue is the potential for LLMs to reproduce or amplify existing health inequities if biases or harmful assumptions appear in their outputs.
Recent research shows that language models can encode gender and racial biases present in their training data. These biases may manifest as differences in diagnostic reasoning, treatment recommendations, or the framing of clinical scenarios.
For example, one study found that improvements in LLM performance when generating synthetic electronic health records were accompanied by increased gender and racial biases in the generated data. (source)
Several researchers have argued that existing medical AI benchmarks often fail to detect these harms.
Pfohl, S.R., Cole-Lewis, H., Sayres, R., and colleagues published a framework (source) to evaluate bias in medical AI responses. Through consultations with physicians and health equity experts, they identified six dimensions of bias that should be assessed during evaluations:
Inaccurate information for specific demographic groups
Lack of inclusion of relevant populations
Use of stereotypes or biased language
Failure to acknowledge structural causes of health inequities
Failure to challenge biased premises in questions
Responses that may restrict opportunities or resources for certain groups
Another proposed solution is EquityGuard, a framework designed to detect and mitigate the risk of health inequities in LLM-based medical applications (source). The researchers test whether adding demographic attributes, such as race, sex, socioeconomic status, or disability, to a prompt changes the model’s recommendation when those attributes should not influence the clinical decision. Their results suggest that even frontier models can exhibit demographic leakage, meaning that irrelevant demographic information can alter the model’s output.
If such systems were deployed at scale without appropriate safeguards, they could exacerbate existing health disparities.
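The basic idea behind this kind of counterfactual test can be sketched in a few lines. This is my own simplified illustration, not the EquityGuard implementation: `toy_model` is a deliberately biased stand-in for a real LLM call, the prompt template and attribute list are invented, and comparing answers by exact string match is a simplification (real outputs would need a semantic or rubric-based comparison).

```python
from typing import Callable


def demographic_variants(template: str, attributes: list[str]) -> dict[str, str]:
    """Fill the {patient} slot with a neutral description plus each attribute."""
    variants = {"baseline": template.format(patient="patient")}
    for attr in attributes:
        variants[attr] = template.format(patient=f"{attr} patient")
    return variants


def find_leakage(
    model: Callable[[str], str], template: str, attributes: list[str]
) -> list[str]:
    """Return the attributes whose presence changes the model's answer."""
    variants = demographic_variants(template, attributes)
    baseline_answer = model(variants["baseline"])
    return [a for a in attributes if model(variants[a]) != baseline_answer]


def toy_model(prompt: str) -> str:
    """Hypothetical, deliberately biased stand-in for an LLM."""
    if "low-income" in prompt:
        return "watch and wait"
    return "order the standard work-up"


TEMPLATE = (
    "A 52-year-old {patient} has the same symptoms and test results "
    "as the reference case. What is the next step?"
)
ATTRIBUTES = ["low-income", "disabled", "uninsured"]

leaked = find_leakage(toy_model, TEMPLATE, ATTRIBUTES)
print(f"attributes that changed the recommendation: {leaked}")
```

The clinical picture is identical across all variants, so any attribute that appears in the output list is a red flag. The hard part in practice is choosing scenarios where the attribute genuinely should not matter, since sex and age are sometimes clinically relevant.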
Final thoughts
When we hear that an algorithm is 90% accurate, an important question follows: accurate for whom?
It looks like current evaluation frameworks for LLMs in medicine often overlook domains such as women’s health, hormone therapy, menopause care, and sex-specific prescribing. If evaluation frameworks do not require sex-aware analysis, there is no systematic way to detect sex-specific errors and biases.
The promise of AI in healthcare, which is more personalized, evidence-based, and accessible care, will not materialize on its own. Realizing that potential will require building models that are both clinically rigorous and equity-aware (source).
That means designing solid evaluation methods that explicitly include women’s health, report subgroup performance, and actively test for bias.
How we think of evaluating LLMs in women’s health at Dama Health
As someone working at the intersection of AI and hormone health, this is how I’m currently thinking about specialized clinical AI systems and useful evaluations that reflect medical reality:
Realistic prompts: Evaluation datasets should include real patient and clinician questions, or high-quality proxies that reflect real consultations and interactions.
Rubric-based scoring of the whole answer: Responses should be assessed against structured criteria, similar to clinical exam marking schemes, covering correctness, safety, reasoning, and completeness.
Multiple independent clinicians per item: Having several independent expert clinicians evaluate each response allows disagreements to be identified and resolved systematically.
Iterative interaction: Many clinical conversations involve follow-up questions. Models should be allowed to ask clarifying questions, and the evaluation should consider the final output after clarification.
Transparency around uncertainty: Models should be able to communicate when evidence is limited or recommendations are uncertain.
LLM-assisted evaluation at scale: As evaluation datasets grow larger, LLM-as-a-judge approaches can help scale assessment of open-ended responses.
Explicit bias and fairness testing: Include sex-specific variables in training and evaluation, and report subgroup analyses whenever possible. Actively test for demographic bias.
Domain-specific testing: Benchmarks should evaluate performance across specific subdomains in women’s health, including menopause and hormone therapy, pregnancy and postpartum care, menstrual disorders, fertility and reproductive endocrinology, and gynecologic oncology.
Clinical outcome studies: Ultimately, randomized trials examining the impact of AI tools on clinical decision-making and patient outcomes will be necessary.
If women’s health is not explicitly measured in AI evaluation frameworks, how do we know these models work for us? As we build the next generation of clinical AI systems, let’s ensure women’s health is not an afterthought.