No matter how much time you spend tweaking the prompts or fine-tuning the models, LLMs may still hallucinate. Guardrails give us one more chance to catch the hallucinations before we show incorrect information to the user. Or we become like the lawyers who presented hallucinated cases to the court and got sanctioned to pay a penalty of 5,500 USD and attend a training session about AI.

Table of Contents

  1. What are guardrails for LLMs?
  2. Which guardrails library should you use?
  3. How to build your guardrails?
    1. Input guardrails
      1. Personal information
      2. Forbidden topics and words
      3. Prompt Injection
    2. Output guardrails
      1. Hallucinations
      2. Bias
      3. Forbidden words and answers
  4. What’s the point?

What are guardrails for LLMs?

We use the term “guardrails” to describe the tools we use to provide the safety and control layers for language models.

Which guardrails library should you use?

None. Seriously. I have checked several of such libraries, and the more I use them, the more I realize you should retain full control over the guardrails code. After all, this is the thing that controls the quality of your AI system. Don’t outsource it.

You can build your guardrails for the cases you need in under a day. There is no need to use a clunky, generic library that tries to do everything for everyone. When you borrow code, you still own the responsibility. Use open-source libraries as an inspiration, but don’t rely on them. See what checks they offer and how they implemented them.

How to build your guardrails?

Input guardrails

Often, we want to filter out certain types of requests before they reach the model, or at least modify those requests to exclude specific information. The input checks may detect personal information, detect forbidden topics, prevent the usage of AI to impersonate someone, etc.

Personal information

We don’t want to send someone’s personal information to OpenAI or Anthropic, right? That’s why we should replace personal information with a placeholder such as <PERSON_NAME> or <PERSON_EMAIL>. The Guardrails library does PII detection by using the Presidio library internally. You can also train your own small model to detect personal information. It’s basically Named Entity Recognition with fewer data types.

Whatever you decide to do, do not use an external AI service to detect personal information. You would be surprised at how many times I have seen implementations that send the input to OpenAI to detect PII before sending the “censored” input again to the same OpenAI model. I wish I were joking.

Forbidden topics and words

Sometimes we want to ban certain words or topics. For example, the model isn’t supposed to mention the competition’s name or discuss specific products. You don’t want to deploy a new model on Friday and get a call from the legal department on Monday asking why the chatbot tells the customers their warranties are void, do you? It may be easier to avoid certain topics until you perfect the model.

The Guardrails library implements the competition check by using a Named Entity Recognition model to find the names mentioned in the text and then checking if the name is in the list of banned names.

For banning topics, the Guardrails library uses either a request to an OpenAI model with a prompt asking the model to check whether the text is about any of the given topics or a zero-shot classification model from HuggingFace (or both as an ensemble method).

Similarly, the NVidia NeMo Guardrails library checks the outputs by sending a request to any configured AI model. Note that you need to write your own prompts and model configuration in a lengthy YAML file.

Prompt Injection

For the prompt injection check, ensure that users cannot inject malicious instructions into the prompt and cannot retrieve the entire system prompt.

For example, in the Rebuff library, they use a “canary word” to detect a successful attempt to retrieve the system prompt. The library adds a random word to the system prompt, and if the word ever appears in the output, it indicates that the prompt was injected. Their entire generate_canary_word function looks like this: return secrets.token_hex(length // 2). As I said, you can implement those things in under a day.

Their prompt injection detection is more complicated. The library uses an LLM trained to detect injections, and they have a vector database of known jailbreaks. You will need data to train the model and populate the database. The source code of the Gorak library is full of examples you can use.

Newsletter

Output guardrails

The output guardrail is our last chance to catch problems before our AI system becomes a meme on LinkedIn and X. You absolutely must implement the output check. Otherwise, your AI may end up on a list like the “AI Hallucination Cases”, which tracks legal decisions in cases where generative AI produced hallucinated content. New cases are added almost every week. You don’t want to be on that list or an equivalent list for your industry.

In the output guardrails, we verify that the answer is based on the provided context, detects hallucinations, checks for bias, and ensures that the answer does not contain forbidden words or information that we don’t want to share using AI.

Hallucinations

If the AI’s response is supposed to be based on source documents, we occasionally observe the model misquoting the sources, ignoring them entirely, or confusing information from different sources. Picture this: an airline chatbot apologizes for a 12-hour delay that never happened, and 40,000 passengers demand vouchers by morning. Source data matters. Checking whether the answers are based on the source data matters even more.

Most guardrails libraries take the same approach: they send the AI’s response and the supporting documents to a model and ask if the answer is grounded in the evidence. Whether it’s the NVidia NeMo Guardrails fact-checking module, the LlamaIndex faithfulness check, or the Guardrails AI “Grounded AI Hallucination” validator, they all follow this pattern. Sometimes using an external LLM, sometimes a pre-trained HuggingFace model, but always asking: “Is this answer actually based on the provided evidence?”

Also, if you use Perplexity or ChatGPT Deep Research and get annoyed by how often they quote an irrelevant “source,” check out my free, open-source project for automated verification of citations used by AI: Laoshu. In Laoshu, I use the LlamaIndex faithfulness check for hallucination detection and a custom prompt to classify the type of hallucination.

Bias

The AI’s response should be unbiased. The answer should not be based on the user’s gender, race, religion, or any other personal characteristic. Once again, we can rely on an open-source model. The Guardrails AI library uses the d4data/bias-detection-model model from the HuggingFace repository to detect bias in the given text. The model was trained on the MBIC dataset.

Forbidden words and answers

To check for forbidden words or toxic content, you have a few straightforward options. The Ban List in Guardrails AI uses fuzzysearch to flag banned words. For more nuanced detection, the Toxic Language check relies on the detoxify model. If you want to ensure your LLM follows content guidelines, the LlamaIndex Guideline Evaluator lets you use an LLM to review its own outputs.

What’s the point?

Guardrails are airbags for language models — invisible until the crash. But you don’t need yet another library. You are responsible for the output quality anyway, so why not use something you can understand and control?

Also, the guardrails need their evaluation dataset and metrics. If you don’t measure, you are just vibe-coding. Vibe-coding itself (generating code with AI) isn’t bad, but lack of tests and metrics should be a crime.

Newsletter
Older post

Stop LLM Hallucinations in Fintech Apps: A CTO’s Guide to Risk-Proof AI Evaluation

A step-by-step guide to align AI with human expectations

Engineering leaders: Is your AI failing in production? Take the 10-minute assessment
>
×
Newsletter