Study Finds 50% of AI Health Recommendations Are Misleading, So Why Are Hospitals Aggressively Adding Chatbots?
There's a big gap between consumer and specialized AI, but implementation is still bumpy.
Healthcare might be the most dysfunctional market in the US. A random stomach pain can mean a month's wait to see a primary care doctor, then a referral to a specialist six months out. That specialist says you need a different specialist. Available in three months. This can go on for years.
AI seems like a natural fit to make this system more efficient. So how's it going? The answer depends entirely on which AI you're talking about.
The 50% Problem
A study published in BMJ Open evaluated five general-purpose consumer chatbots (ChatGPT, Gemini, Meta AI, Grok, and DeepSeek) across 50 medical questions. About half the responses were deemed problematic, with nearly 20% rated highly problematic. Accuracy cratered on open-ended questions about nutrition, stem cells, and athletic performance. Every chatbot delivered its answers with confidence, rarely adding disclaimers.
That stat is alarming, and it's getting a lot of headlines this week. But it's also a bit of a red herring if you're trying to understand what's actually happening inside hospitals. The tools being deployed in clinical settings are a different animal entirely.
The Clinical AI That's Actually Working
K Health, the company behind Hartford HealthCare's new PatientGPT chatbot, published a study in Annals of Internal Medicine conducted with Cedars-Sinai researchers. Their AI matched physicians' clinical decisions in two-thirds of real patient cases and was rated higher quality in the remaining third. Potentially harmful recommendations occurred 2.8% of the time with AI versus 4.6% with human doctors. A separate benchmark showed their system hallucinates 41% less than general-purpose LLMs.
STAT News reports that health systems including Hartford HealthCare, Sutter Health, and Reid Health are now rolling branded chatbots into patient portals. These aren't raw consumer LLMs. They're purpose-built tools trained on clinical data, integrated with patient records, and designed to funnel patients toward appropriate care.
For anyone tracking AI and jobs, this is the part that matters. If hospital chatbots were unreliable junk, they'd eventually get pulled. The fact that purpose-built clinical AI is producing peer-reviewed results competitive with physicians means it's not going away. The triage nurses, patient navigators, and call-center staff whose work these bots are designed to replace should be paying close attention.
The Scribe Paradox: 16 Minutes Saved, Costs Inflated
The other big AI story in healthcare is the "ambient scribe," software that listens to patient visits and generates clinical notes. A JAMA study of 1,800 clinicians, covered by STAT News, found these tools save doctors 16 minutes of documentation time per eight-hour shift and add one extra patient seen every two weeks. Only 32% of clinicians used them frequently enough to get the full benefit. The efficiency gains are modest, but clinicians report meaningful drops in burnout.
Here's the wrinkle. A separate STAT News report found that insurers and health systems privately agree these same scribes are driving up costs through increased "coding intensity." AI captures every billable detail that burned-out doctors used to skip, pushing visits into higher complexity billing tiers. One system saw a 5% jump in top-tier codes.
For medical coders and billing specialists, this means the job is mutating from data entry into compliance auditing. Someone has to make sure aggressive AI documentation doesn't trigger Medicare fraud investigations.
What This Means for Workers
The picture that emerges is more nuanced than either the optimists or the doomers want to admit. AI is not failing in healthcare. In some cases it's performing at or above physician level. But even where it works, it's creating new problems.
Physicians aren't being replaced, but their role is shifting. They are increasingly the oversight layer, reviewing AI-generated notes, managing patients who arrive with chatbot-informed expectations, and carrying liability when automated systems get it wrong.
Administrative and triage staff face a more direct threat. Purpose-built clinical AI that performs at physician level is a much stronger argument for workforce reduction than a chatbot that fails half the time. And the regulatory framework hasn't caught up. The FDA regulates AI that independently drives clinical decisions, but chatbots positioned as informational tools can fall outside device regulation entirely. There's no universal accuracy standard these tools must meet before going live in a patient portal.
The Takeaway
The real story of AI in healthcare isn't that the technology is broken. It's that the working version is here, and the systems around it (regulation, billing, workforce planning) haven't adapted. AI scribes save a few minutes while inflating costs. Clinical chatbots rival doctors while facing no regulatory floor. The technology is outpacing the institutions meant to manage it.
For more stories like these, make sure to sign up for our newsletter here.