Modern speech-to-text fails on patient names, and the solution isn't a better transcription vendor — it's pairing decades-old fuzzy matching algorithms with the metadata you already have.
Speech-to-text has gotten remarkably good. We've benchmarked most of what's available — Deepgram, Whisper, Speechmatics, AssemblyAI, Azure Speech, Google's Gemini-based transcription, AWS Transcribe. For general English, word error rates have dropped into the low single digits across the board. If you're building a voice product today, you'd be forgiven for assuming transcription is solved.
It isn't. Not for healthcare. The moment a patient says their name, every benchmark you trusted falls apart.
Why names break voice AI
General speech-to-text models are trained on conversational English. They're optimized for the words people actually say. Names violate every assumption that follows from that.
Start with phonetic ambiguity. "Catherine" and "Katherine" are acoustically identical. "Bob" and "Rob" sit one phoneme apart. "Sean" and "Shawn." "Erin" and "Aaron." The model produces a best guess, but the underlying signal genuinely doesn't distinguish them.
Then there's the long tail. Surnames like "Nguyen," "Oyelaran," or "Zaytsev" don't appear in training data with anything close to the frequency of "Smith." Accuracy on rare names drops accordingly.
The Latino naming convention exposes a different class of problem. A patient's legal name might be "María Isabel Jiménez López" — two given names, two surnames. She's not going to say all four on the phone. She'll say "Maria Lopez." Meanwhile, the EHR might have her as "Maria Lopez," "Maria Jimenez," or "Maria Jimenez Lopez," depending on who entered the record. The accent on "María" may or may not be preserved. The transcription strips it regardless.
So the matching problem isn't just fuzzy spelling. It's structural. The patient said two names. The record might have one, two, three, or four. None might match exactly.
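To make the mismatch concrete, here's a minimal sketch in Python of the first step that has to happen before any matching: comparing the spoken name and the stored name as normalized token sets rather than strings. The names come from the example above; the helper is illustrative, not our production code.

```python
import unicodedata


def normalize_tokens(name: str) -> set[str]:
    """Strip diacritics, lowercase, and split a name into a set of tokens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    )
    return set(ascii_name.lower().split())


spoken = normalize_tokens("Maria Lopez")                   # what the patient said
on_file = normalize_tokens("María Isabel Jiménez López")   # what the EHR stored

print(spoken == on_file)   # False -- exact comparison fails outright
print(spoken <= on_file)   # True  -- every spoken token appears in the record
```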
You don't fix this by switching transcription vendors. We've run the same patient name corpus through Deepgram, Whisper, Speechmatics, AssemblyAI, Azure, Gemini, and AWS. The vendors trade places at the margins — Speechmatics outperforms on European names, Deepgram on common American surnames — but they all converge to roughly the same accuracy floor on names overall, because they're all trained on roughly the same kind of data. Switching providers buys you a few percentage points. It doesn't change the shape of the problem.
The wrong way to solve it
The naive approach is to ask the patient to spell their name. This works, technically, and it destroys the conversation. Patients don't want to feel like they're talking to a phone tree from 2003. Spelling is also error-prone over the phone — "M" and "N" sound identical; "B," "D," "P," and "T" are constantly confused.
The second wrong approach is to lower the confidence threshold and hope. Take whatever transcription produces, match against the database, and if you get a hit, run with it. This works most of the time, which is exactly what makes it dangerous. The failure mode is booking the wrong patient, and "most of the time" is not a standard healthcare operations teams will accept.
The approach that works
Name recognition doesn't have to happen in isolation.
When a patient calls, we already know things about them. We have the phone number. The database has records of who that number belongs to. We can ask for date of birth — much easier for speech-to-text, because dates have a constrained vocabulary.
Suddenly the problem changes shape. We're not transcribing an arbitrary name from acoustic signal. We're matching a noisy transcription against a small candidate set the metadata has already narrowed down. That's a fuzzy matching problem, not a transcription problem.
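Here's a sketch of what that reshaping looks like in code. The lookup helpers, record fields, and threshold are hypothetical placeholders, not our actual API; the point is the order of operations.

```python
def resolve_caller(transcribed_name: str, caller_phone: str, stated_dob: str):
    """Match a noisy name transcription against a metadata-narrowed candidate set."""
    # The caller's phone number usually maps to a handful of records, not millions.
    candidates = find_patients_by_phone(caller_phone)       # hypothetical lookup
    candidates = [c for c in candidates if c.dob == stated_dob]

    if not candidates:
        return None   # fall back to asking another identifying question

    # The name only has to rank a tiny candidate set, not be recovered
    # perfectly from audio.
    scored = [(name_similarity(transcribed_name, c.full_name), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best if best_score >= 0.8 else None              # threshold is illustrative
```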
The pipeline is a hybrid by necessity — no single algorithm catches every failure mode.
We normalize first: strip diacritics, lowercase, tokenize compound names so "María Isabel Jiménez López" becomes a structured set rather than a single string. Each token gets a Double Metaphone encoding, which collapses Catherine and Katherine, or Sean and Shawn, into the same phonetic code. On top of that, we run Levenshtein distance against both the raw tokens and the phonetic codes — Double Metaphone handles "sounds the same, spelled differently"; Levenshtein handles "transcription dropped a letter." Surname tokens get weighted more heavily than given names. The name score then combines with non-name signals — phone, DOB, sometimes ZIP — into a final identity confidence.
The name on its own rarely has to carry the decision. It just has to be close enough that the other signals can confirm it.
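A stripped-down sketch of that scoring, assuming tokens have already been normalized as in the earlier snippet. It leans on the doublemetaphone function from the third-party metaphone package and a hand-rolled Levenshtein; the weights and helpers are illustrative, not our production values.

```python
from metaphone import doublemetaphone  # third-party; any Double Metaphone port works


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def token_score(spoken: str, stored: str) -> float:
    """Score one spoken token against one stored token."""
    # Phonetic pass: Catherine/Katherine and Sean/Shawn collapse to the same code.
    if doublemetaphone(spoken)[0] == doublemetaphone(stored)[0]:
        return 1.0
    # Edit-distance pass: catches a dropped or swapped letter the phonetics miss.
    dist = levenshtein(spoken, stored)
    return max(0.0, 1.0 - dist / max(len(spoken), len(stored)))


def name_score(spoken_tokens, given_names, surnames, surname_weight=2.0):
    """Best-match each spoken token against the record, weighting surnames higher."""
    total = weight = 0.0
    for tok in spoken_tokens:
        best_given = max((token_score(tok, g) for g in given_names), default=0.0)
        best_surname = max((token_score(tok, s) for s in surnames), default=0.0)
        if best_surname >= best_given:
            total += surname_weight * best_surname
            weight += surname_weight
        else:
            total += best_given
            weight += 1.0
    return total / weight if weight else 0.0


def identity_confidence(name, dob_matches, phone_matches):
    """Blend the name score with non-name signals. Weights are illustrative."""
    return 0.4 * name + 0.35 * float(dob_matches) + 0.25 * float(phone_matches)
```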
There's plenty of newer work — transformer-based entity matching, learned name embeddings, LLM-assisted resolution for ambiguous cases. We use those techniques where they earn their keep, typically as a fallback for the hard cases classical algorithms can't resolve confidently. But the bulk of the matching, the part that has to run in under a second on every call, still rides on algorithms older than most of the engineers deploying them. Double Metaphone has been around since 2000. Levenshtein since 1965.
What's novel isn't the algorithms. It's the recognition that a voice product in healthcare has to assemble them at all.
Why the boring problem matters
When we describe our voice AI to other engineers, the conversation goes to the interesting parts — conversational reasoning, EMR integration, the scheduling rules engine. Nobody asks about name matching.
But name matching is load-bearing. If we get it wrong, none of the downstream sophistication matters. We've booked the wrong patient, exposed PHI to the wrong caller, or failed to verify the caller's identity and dropped the call.
This is the pattern I keep coming back to in healthcare AI: the boring, hard problems are where reliability is won or lost. The flashy capabilities get the attention. But they sit on top of an infrastructure of unsexy problems that have to work before any of the interesting stuff can.
If you're building voice products in any high-stakes domain, find the equivalent problem in your space. It's the one your users will never thank you for solving, because they'll never notice you solved it.
That's how you know it's the right problem.

