Protect Your LLM Chatbot with Guardrails

Published March 11, 2024

Introduction

In the constantly evolving realm of Generative AI Chatbots, maintaining conversations safety and relevance presents an ongoing challenge. You may have experimented with employing "System Prompt" as a means to safeguard against off-topic or harmful conversations, only to encounter their limitations, particularly when using small LLMs.

In this article, we'll delve into implementing Guardrails for both the inputs and outputs of a LangChain Chatbot.

Guardrails

Guardrails for LLMs, unlike basic system messages, treat the LLM as a black box component. By fortifying the inputs and outputs with shields, these guardrails ensure that responses adhere to the intended context, effectively discouraging any deviations or undesirable content. Various methods exist for implementing these guards, ranging from utilizing libraries to leveraging managed cloud services.

We'll explore the implementation using pure LangChain code. Each message will involve invoking a small or low-cost LLM and presenting it with either the input or output, prompting it to decide whether the message or the response should be permitted or blocked.

LangChain Chatbot

Use this code to setup a simple chatbot with LangChain. For this demonstration, we'll utilize Anthropic Claude 3 via AWS Bedrock as our LLM, though you can opt for a different LLM or provider if desired.

For the sake of simplicity, we'll omit the inclusion of a retriever or history in this chain. We will create a Tesla car seller chatbot.

LLM = "anthropic.claude-3-sonnet-20240229-v1:0"
BEDROCK_REGION = "us-east-1"

prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content="Your role is to act as a Tesla car seller."
            + " Answer the user question provided in the <question> tag."
        ),
        HumanMessagePromptTemplate(
            prompt=PromptTemplate(input_variables=['input'], template="<question>{input}</question>")
        ),
    ]
)

client = boto3.client(service_name="bedrock-runtime", region_name=BEDROCK_REGION)
llm = BedrockChat(model_id=LLM, client=client)

chain = prompt | llm | chain

# Main loop
while True:
    user_input = input("Enter a message (or 'exit' to quit): ").lower().strip()

    if user_input == "exit":
        break

    response = chain.invoke({"input": user_input})
    print(response)

Adding Inputs Guards

Now that our Chatbot is in place, let's implement a guard to check whether the user input is potentially harmful or unrelated to Tesla.

For this purpose, we'll employ Anthropic Claude v1 Instant via AWS Bedrock as our guard. LangChain operates on the concept of Runnables. We'll create a function that returns a Runnable to seamlessly integrate it into the chain.

Upon receiving each message, this runnable will be invoked. In the provided code, you'll find instructions to query the LLM to check whether the message should be blocked, based on adherence to company rules. If the LLM affirms the need for blocking, we'll return a static message; otherwise, we'll continue with the chain.

Feel free to incorporate as many guards as necessary. Typically, the more concise the instructions in the prompt, the better the results. You might consider implementing separate guards for off-topic messages and harmful content, for instance.

LLM = "anthropic.claude-instant-v1"
BEDROCK_REGION = "us-east-1"

prompt = PromptTemplate(
    template="""
Your task is to check if the user message below complies with the following policy for talking with a bot.
Respond only with yes or no without any additional explanation or text.
Company policy for the user messages:
    - should not contain harmful data
    - Conversation should be related to Tesla. Any non-Tesla topics are not allowed

User: {input}

Question: Should the user message be blocked (Yes or No)?""",  # noqa: E501
    input_variables=["input"],
)

client = boto3.client(service_name="bedrock-runtime", region_name=BEDROCK_REGION)
llm = Bedrock(
    model_id=LLM,
    client=client,
    model_kwargs={
        "max_tokens_to_sample": 1,
        "temperature": 0,
    },
)

fail_message = "Your message doesn't adheres to our company rules"

def input_guard(runnable: Runnable) -> Runnable:

    chain = prompt | llm | StrOutputParser()

    def execute_guardrail(input: str, **kwargs: Any):  # noqa
        res = chain.invoke({"input": input}).lower().strip()

        if "yes" in res:
            return fail_message
        else:
            return runnable

    return RunnableLambda(lambda params: execute_guardrail(**params))

To integrate this guard, update the following code of the previous chatbot:

chain = prompt | llm | StrOutputParser()
chain = input_guard(runnable=chain)

Adding Outputs Guards

What if the user found a way to trick our inputs guards and was able to manipulate the LLM into generating a harmful messages, This could seriously damage the company's image, To prevent such occurrences, let's implement a final output guard to scrutinize the LLM's response.

LLM = "anthropic.claude-instant-v1"
BEDROCK_REGION = "us-east-1"

prompt = PromptTemplate(
    template="""
Is this output illegal, unethical, or harmful to any person?
Respond only with yes or no without any additional explanation or text.

User: {answer}""",  # noqa: E501
    input_variables=["answer"],
)

client = boto3.client(service_name="bedrock-runtime", region_name=BEDROCK_REGION)
llm = Bedrock(
    model_id=LLM,
    client=client,
    model_kwargs={
        "max_tokens_to_sample": 1,
        "temperature": 0,
    },
)

fail_message = "I am unable to respond to your message"


def output_guard(runnable: Runnable) -> Runnable:

    chain = prompt | llm | StrOutputParser()

    def execute_guardrail(answer: str):  # noqa
        res = chain.invoke({"answer": answer}).lower().strip()

        if "yes" in res:
            return fail_message
        else:
            return runnable

    return RunnableLambda(lambda answer: execute_guardrail(answer))

To integrate this guard, update the following code:

chain = output_guard(runnable=StrOutputParser())
chain = prompt | llm | chain
chain = input_guard(runnable=chain)

Conclusion

By leveraging the capabilities of a small model such as Anthropic Claude v1 Instant and implementing a simple integration with LangChain, we've seen in this article an effective strategy to boost the safety of conversations with chatbots. With these Guardrails in place, companies can ensure safer interactions between their chatbots and customers.

If you need assistance in the integration of Guardrails with LLM chatbot, feel free to reach me out. I'm here to help!

Disclaimer

The information presented in this article is intended for informational purposes only. I assume no responsibility or liability for any use, misuse, or interpretation of the information contained herein.