tl;dr - LLMs hallucinate because they have to guess. Here are four ways to reduce the guessing:
- Prompt engineering: Give the model clear instructions and a narrow scope. Tell it to say "I don't know" when it's unsure.
- RAG (Retrieval-Augmented Generation): Search your own documents for relevant information and hand it to the model before it answers.
- Tool use: Let the model query databases, call APIs, and access real data instead of relying on memory.
- Structured outputs: Force the response into a predictable format so it's easier to verify.
Think back to school for a second. Remember the difference between a closed-book test and an open-book test?
On a closed-book test, you're relying entirely on memory. If you studied hard and the material stuck, great. But for the stuff that didn't stick? You guess. You write something that sounds right and hope for partial credit. We've all been there.
An open-book test is a completely different experience. You still need to understand the material, but when it comes to specific facts, dates, or formulas, you can just look them up. You don't have to guess, so you don't.
Right now, most people use AI like a closed-book test. They type a question into ChatGPT and hope the model remembers the right answer from its training data. Sometimes it does. Sometimes it confidently makes something up.
The good news: we don't have to use AI that way. There are well-established techniques for turning that closed-book test into an open-book one. And when you do, the hallucination problem gets a lot more manageable.
The core idea
Before we get into specifics, it helps to understand the common thread across every technique we'll cover.
Hallucination happens when the model has to guess. The less it has to guess, the less it hallucinates.
That's really it. Every technique in this post is a different way of reducing the amount of guessing the model has to do. Some techniques narrow the scope of what the model should even attempt to answer. Others give the model access to real information so it doesn't have to rely on memory. Still others force the output into a format that's easy to verify.
Different tools for different situations, but the same underlying principle. Let's walk through them.
Technique 1: Prompt engineering
The simplest and most accessible technique is telling the model what to do (and what not to do) through its instructions. This is called prompt engineering, and it's the equivalent of giving a new employee a clear job description on their first day.
Without guidance, an LLM will try to be helpful about anything you ask it. That's how you end up with a chatbot confidently citing fake statute numbers. But if you tell the model upfront, "You are an assistant that answers questions about the City of Zacville's municipal code. Only answer questions that relate to the municipal code. If you don't know the answer, say so," you've dramatically reduced the surface area for hallucination.
The model still might get things wrong. It's still working from memory, after all. But giving it a narrow lane to operate in, and explicit permission to say "I don't know," goes a surprisingly long way.
If you've ever written instructions for a new intern, you already know the basics of prompt engineering. Be specific. Anticipate the ways they might go off track. And tell them it's OK to ask for help instead of making something up.
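To make this concrete, here's a minimal sketch of what that kind of scoped instruction looks like in code. The city name, the wording of the rules, and the `build_messages` helper are all illustrative; the role/content message format is the shape most chat APIs accept.

```python
# A hallucination-reducing system prompt: narrow scope, explicit
# permission to say "I don't know." Wording here is just an example.
SYSTEM_PROMPT = """\
You are an assistant that answers questions about the City of Zacville's
municipal code.

Rules:
- Only answer questions that relate to the municipal code.
- If you are not sure of the answer, say "I don't know."
- Never invent section numbers, dates, or fee amounts.
"""

def build_messages(user_question: str) -> list[dict]:
    """Pair the guardrail prompt with the user's question in the
    system/user message format most chat APIs expect."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ]
```

The point isn't the exact wording; it's that the scope and the "I don't know" escape hatch are stated up front, before the model sees any question.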
Technique 2: Retrieval-Augmented Generation (RAG)
Prompt engineering constrains the model's behavior, but it doesn't give the model new information to work with. That's where retrieval-augmented generation comes in. The name is a mouthful, but the concept is straightforward.
Instead of asking the model to answer a question from memory, you first search your own documents for relevant information, then hand that information to the model along with the question. The model's job shifts from "recall the answer" to "read these documents and answer based on what they say."
If you read our earlier post on semantic search, this might sound familiar. It's the same idea: find the most relevant content based on meaning, not just keywords. RAG just adds a second step: after finding the relevant documents, pass them to an LLM to generate a natural language answer.
Here's a city example. Imagine a staff member asks, "What's our policy on comp time for exempt employees?" Without RAG, the model would generate an answer based on what it learned during training, which almost certainly doesn't include your city's specific HR policies. With RAG, the system first searches your policy manual, finds the relevant section, and the model answers based on your actual policy. Big difference.
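Here's a deliberately simple sketch of the retrieve-then-answer pattern. Real RAG systems use embedding-based semantic search; this one scores documents by word overlap just to show where each piece fits. Both function names are made up for illustration.

```python
def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query and return the best
    matches. A production system would use embeddings, not word overlap."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_rag_prompt(question: str, documents: list[str]) -> str:
    """Hand the retrieved text to the model alongside the question, so its
    job is reading, not recalling."""
    context = "\n\n".join(retrieve(question, documents))
    return (
        "Answer using ONLY the context below. If the context does not "
        'contain the answer, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Notice that the prompt itself reuses the earlier trick: the model is told to stick to the provided context and to say "I don't know" otherwise. RAG and prompt engineering work together.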
Technique 3: Tool use
RAG is powerful when your source of truth lives in documents. But what about data that lives in databases, spreadsheets, or external systems? That's where tool use comes in.
Tool use means giving the model the ability to take actions: query a database, call an API, read a file, or pull data from an external system. Instead of the model guessing that your city's general fund balance is "approximately $4.2 million" based on a vague pattern it learned during training, it can look it up in your finance system and give you the actual number.
This is a big conceptual leap from how most people think about AI. The model isn't just generating text anymore. It's deciding what information it needs, going and getting that information, and then generating a response based on real data.
Think of it this way: RAG is like handing someone a stack of relevant documents before they answer your question. Tool use is like giving them a phone and a login so they can look things up themselves.
You might be wondering: if tool use is more flexible than RAG, why bother with RAG at all? In practice, they're complementary. RAG is great for unstructured information (policies, meeting minutes, memos). Tool use is great for structured data (budgets, permit records, tax rolls). Most production AI systems use both.
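At its core, tool use is a dispatch step: the model names a tool and its arguments, and your code runs the real lookup. Here's a minimal sketch of that dispatch; the `get_fund_balance` function and its numbers are invented stand-ins for a real finance-system API.

```python
def get_fund_balance(fund: str) -> float:
    """Stand-in for a live finance-system query. The figure below is
    made up; a real implementation would hit your actual database."""
    balances = {"general": 4_183_250.00}
    return balances[fund]

# Registry of tools the model is allowed to call.
TOOLS = {"get_fund_balance": get_fund_balance}

def execute_tool_call(name: str, arguments: dict):
    """Run a model-requested tool call against real code, so the answer
    comes from live data rather than the model's memory."""
    return TOOLS[name](**arguments)
```

In a real system, the model's API returns the tool name and arguments as structured data, your code executes the call, and the result is fed back to the model to write the final answer.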
Technique 4: Structured outputs
The first three techniques are about giving the model better information. Structured outputs take a different angle: they constrain the model's response format so that what it produces is easier to verify.
Instead of asking the model to write a free-form paragraph, you tell it to fill in specific fields. Think of it like the difference between asking someone to "write up their thoughts on this permit application" vs. giving them a form with labeled boxes to fill in: applicant name, address, zoning classification, recommended action.
When the output has a predictable structure, you can check it programmatically. Does the zoning classification match a valid code? Is the address in your jurisdiction? Does the recommended action match one of the allowed options? A free-form paragraph can hide errors in fluent prose. A structured output puts every claim in a box where it can be inspected.
This won't prevent the model from filling in a wrong value. But it makes wrong values much easier to catch, both for humans reviewing the output and for automated systems that validate it before it goes anywhere.
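The permit-review form above translates directly into a validation check. This sketch assumes a made-up set of field names and allowed actions; the idea is simply that every box can be inspected before the output goes anywhere.

```python
ALLOWED_ACTIONS = {"approve", "deny", "request_more_info"}

def validate_permit_review(output: dict) -> list[str]:
    """Check a structured permit-review response against the form.
    Returns a list of problems; an empty list means it passed."""
    problems = []
    for field in ("applicant_name", "address", "zoning", "recommended_action"):
        if field not in output:
            problems.append(f"missing field: {field}")
    if output.get("recommended_action") not in ALLOWED_ACTIONS:
        problems.append("recommended_action is not one of the allowed options")
    return problems
```

A free-form paragraph offers nothing like this: there's no programmatic way to ask "did it pick a valid action?" when the action is buried in prose.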
Putting it all together
These four techniques aren't mutually exclusive. In fact, the most reliable AI systems layer several of them together. A well-designed system might use prompt engineering to define the model's role, RAG to pull in relevant policy documents, tool use to fetch live data from a database, and structured outputs to ensure the response can be validated before anyone acts on it.
The spectrum from "I typed something into ChatGPT and hoped for the best" to "we built a reliable AI tool for our staff" is largely about how many of these guardrails you put in place. Each one reduces the amount of guessing the model has to do, and less guessing means fewer hallucinations.
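Here's what that layering looks like as one pipeline, in rough sketch form: a scoped prompt, crude retrieval, and a structured-output check, with `call_model` standing in for a real LLM API call. (A tool-use step would slot in before the model call; it's omitted here for brevity.)

```python
def answer_with_guardrails(question: str, documents: list[str], call_model) -> dict:
    """Layer the guardrails: retrieved context, a scoped prompt, and
    validation of a structured response. call_model is a stand-in for
    a real LLM API call."""
    # Crude retrieval: keep documents that share a word with the question.
    words = set(question.lower().split())
    context = "\n\n".join(
        doc for doc in documents if words & set(doc.lower().split())
    )
    prompt = (
        "Answer ONLY from the context below. If the context does not "
        'contain the answer, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\n\n"
        "Reply as JSON with the fields 'answer' and 'source'."
    )
    response = call_model(prompt)
    # Structured-output check before anyone acts on the response.
    missing = {"answer", "source"} - set(response)
    if missing:
        raise ValueError(f"model response missing fields: {missing}")
    return response
```

Each layer catches a different failure mode, which is why production systems stack them rather than picking one.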
There are also two more important pieces of this puzzle that we haven't touched on yet: human oversight and evaluation. Guardrails are great, but someone still needs to be in the loop, and you need a way to measure whether the whole system is actually working. We'll get to those in future posts.
We'll be doing a deep dive on each of these techniques in upcoming posts.
What you don't need to worry about
You don't need to understand the technical implementation of any of these techniques to use them effectively. If you're evaluating an AI product for your city, what matters is whether the vendor is using these approaches (or something similar) and whether they can explain how their system handles accuracy. If a vendor can't tell you how their product deals with hallucination, that's a red flag worth paying attention to.
Next up in the series, we'll take a closer look at prompt engineering: what makes a good prompt, what makes a bad one, and how a few well-chosen sentences can meaningfully change the quality of the output you get.