I've stopped trying to eliminate hallucinations. I design around them.
That sentence used to feel like giving up. It does not anymore. The models I shipped against in 2023 hallucinated constantly, the ones I shipped against last quarter hallucinate less, and the ones I will ship against next quarter will hallucinate less than that. The curve is real and it is moving in the right direction. The asymptote, though, is not zero. It is some small, stubborn number that will follow us for a long time, and the product design question I actually care about is not how we get rid of that number. It is what the experience looks like when the model is wrong.
The framing matters because it changes who is responsible. If hallucinations are a model problem, the answer is to wait. Sit on your hands until the next checkpoint, then the next, then the next. If hallucinations are a UX problem, the answer is to build something today. Treat the wrong answer as a known mode of the system, the same way you treat a 500 from your backend or a dropped connection on a mobile network. You do not pretend those do not happen. You design the loading state, the retry, the empty state, the recovery. You make the failure survivable.
Once you sit with that framing for a week or two, the worst answer in any AI product becomes obvious. It is the confident wrong answer with no recourse. A clean paragraph that sounds right, contains a factual error, and offers the user no way to push back, no link to verify, no signal that the model itself is unsure. That is the shape of an answer that goes into a slide deck, into an email, into a contract. That is the shape that erodes trust slowly and then all at once.
The best answer is the opposite shape. Here is my best guess. Here is where I got it. Let me know if it's off. That is not a weaker product. That is a more honest one, and in my experience users reward honesty with a lot more patience than we expect.
Three patterns get me most of the way there.
1. Show the source
Every fact the model states should be clickable through to the source document. Every one. Not a footer link to a citations panel, not a hover card that shows a snippet, but an inline affordance that lets the user jump to the paragraph in the PDF, the row in the table, the message in the thread. If the model cannot point to a source, the model says so. Out loud, in the same sentence. "I don't have a source for this, I'm reasoning from context."
The cost of building this is real. You need retrieval that actually returns citations, a UI that renders them without cluttering the answer, and a model prompt that punishes uncited claims. The payoff is that the user stops treating the model as an oracle and starts treating it as a research assistant who shows their work. Wrong answers become discoverable in seconds instead of going unnoticed for a week.
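To make the rule concrete, here is a minimal sketch of how an answer could carry its sources claim by claim. The names (Claim, Source, render_claim, uncited_ratio) and the link format are my own assumptions for illustration, not any particular library's API, and a real UI would render a tappable link rather than bracketed text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Source:
    doc_id: str   # hypothetical document identifier
    locator: str  # hypothetical anchor the UI can jump to (page, row, message)


@dataclass
class Claim:
    text: str
    source: Optional[Source] = None  # None means the model could not cite it


UNCITED_NOTE = "(I don't have a source for this, I'm reasoning from context.)"


def render_claim(claim: Claim) -> str:
    """Render one claim with an inline source link, or an explicit no-source note."""
    if claim.source is None:
        return f"{claim.text} {UNCITED_NOTE}"
    return f"{claim.text} [source: {claim.source.doc_id}#{claim.source.locator}]"


def render_answer(claims: List[Claim]) -> str:
    """Join the rendered claims so every sentence carries its own provenance."""
    return " ".join(render_claim(c) for c in claims)


def uncited_ratio(claims: List[Claim]) -> float:
    """Share of claims with no source; a rough number a team could watch."""
    if not claims:
        return 0.0
    return sum(1 for c in claims if c.source is None) / len(claims)
```

The point of the structure is that the no-source case is a first-class value, not a missing field, so the renderer cannot quietly drop it.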
2. Invite challenge
Put a UI affordance next to every answer that says "this doesn't look right." Make it one tap. When the user taps it, the model gets a structured signal that something is off and a chance to recover, re-retrieve, ask a clarifying question, or escalate to a human if you have one in the loop.
The interesting part is what happens upstream of that button. Knowing the button exists changes how the model is prompted, because now the system has a recovery path and does not need to bet everything on the first answer being right. It changes how the user reads the answer, because the affordance gives them permission to push back without feeling rude. It changes how the team measures the product, because tap rates on that button are a much better signal than thumbs up and thumbs down ever were.
The wrong way to do this is to hide the feedback in a settings menu or behind a small icon. The right way is to make pushing back as easy as accepting. Two buttons, same weight, same affordance. The product is telling the user that disagreement is a first-class action, not a complaint.
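A rough sketch of what the structured signal and the recovery path might look like. Everything here is hypothetical: the Challenge fields, the Recovery options, and especially the policy in choose_recovery, which is one plausible ordering and not a recommendation. Tune it against your own tap data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Recovery(Enum):
    RE_RETRIEVE = "re_retrieve"
    ASK_CLARIFYING_QUESTION = "ask_clarifying_question"
    ESCALATE_TO_HUMAN = "escalate_to_human"


@dataclass
class Challenge:
    answer_id: str
    user_note: Optional[str] = None  # optional free text; the tap alone is enough
    prior_challenges: int = 0        # earlier taps on this same answer thread


def choose_recovery(challenge: Challenge, human_available: bool) -> Recovery:
    """Pick a recovery step for one 'this doesn't look right' tap (assumed policy)."""
    # Repeated pushback with a human in the loop: stop guessing and hand off.
    if challenge.prior_challenges >= 2 and human_available:
        return Recovery.ESCALATE_TO_HUMAN
    # First tap, or the user said what is wrong: try retrieval again.
    if challenge.prior_challenges == 0 or challenge.user_note:
        return Recovery.RE_RETRIEVE
    # Otherwise the system is missing context, so ask for it.
    return Recovery.ASK_CLARIFYING_QUESTION


def challenge_rate(answers_shown: int, challenges: int) -> float:
    """Taps on the challenge button per answer shown."""
    return challenges / answers_shown if answers_shown else 0.0
```

For example, choose_recovery(Challenge("a1"), human_available=False) returns Recovery.RE_RETRIEVE, and a second tap with no note moves to a clarifying question.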
3. Confidence in plain English
Stop showing users a 0.7. Numbers like that are precise about something the model is not precise about, and they leak the math into the experience for no benefit. Use language.
"I'm pretty sure." "I'm guessing." "I found this in two sources that agree." "I found this in one source and it could be outdated." "I don't know, but here is what I would check." These are sentences a careful human would say to a colleague. They communicate uncertainty without pretending to be a calibrated probability, which is good, because the underlying probability is not calibrated either.
The technical side of this is less hard than it looks. You do not need the model to output a true confidence score. You need a small set of plain-English bands and a prompt that asks the model to choose one based on retrieval recall, source agreement, and how much of the answer is grounded versus generated. It will not be perfect. It will be far better than no signal at all, and it will save your users from the failure mode where they read a fluent paragraph and assume it is true because nothing in the UI suggested otherwise.
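Here is a minimal sketch of the bands-plus-prompt idea. The band keys, the Signals fields, and the prompt wording are all assumptions I made up for illustration, and the model call itself is left out. The one design choice worth noting is the fallback: if the model replies with anything that is not a known band, the user sees the most cautious sentence.

```python
from dataclasses import dataclass

# Plain-English bands, keyed by hypothetical identifiers the model replies with.
BANDS = {
    "pretty_sure": "I'm pretty sure.",
    "two_sources": "I found this in two sources that agree.",
    "one_source": "I found this in one source and it could be outdated.",
    "guessing": "I'm guessing.",
    "unknown": "I don't know, but here is what I would check.",
}


@dataclass
class Signals:
    sources_found: int        # how many documents retrieval returned
    sources_agree: bool       # whether those documents say the same thing
    grounded_fraction: float  # share of the answer tied to retrieved text, 0.0 to 1.0


def build_band_prompt(signals: Signals) -> str:
    """Build the instruction that asks the model to choose exactly one band."""
    options = "\n".join(f"- {key}: {text}" for key, text in BANDS.items())
    return (
        "Pick exactly one confidence band for your answer and reply with its key only.\n"
        f"Sources retrieved: {signals.sources_found}\n"
        f"Sources agree: {signals.sources_agree}\n"
        f"Grounded fraction: {signals.grounded_fraction:.2f}\n"
        f"Bands:\n{options}"
    )


def band_sentence(model_reply: str) -> str:
    """Map the model's reply to a band sentence, falling back to the most cautious one."""
    key = model_reply.strip().lower()
    return BANDS.get(key, BANDS["unknown"])
```

So a reply of "two_sources" renders as "I found this in two sources that agree.", and a stray "0.7" renders as "I don't know, but here is what I would check."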
Design for the bad day
Most AI demos are demos of the good day. The model knows the answer, the retrieval works, the user phrases their question the way the system likes. The product looks magical, the screenshots get retweeted, the round closes. Then real users show up with weird questions, your retrieval misses, the model fills the gap with something plausible, and the product looks very different.
The teams that ship durable AI products are the ones that designed for the bad day from the start. They assumed the model would be wrong sometimes, they built sources and challenge and honest uncertainty into the core experience, and when the bad day arrived, the product handled it. Users saw the system acknowledge the limit, point to where it got confused, and offer a way forward. Trust went up, not down, because the product told the truth.
Design for the bad day, not just the demo. The bad day reveals the product.