Overview

This policy safeguards the integrity and security of AI systems by preventing attempts to manipulate or override system instructions through user prompts. The policy applies to all user inputs, ensuring that AI behavior remains consistent with its intended purpose and security requirements. It works by monitoring and enforcing restrictions on prompt injection attempts, helping maintain the reliability and trustworthiness of AI interactions.

The Prompt Injection Policy is designed to detect and manage attempts to manipulate or override system instructions through user prompts. Prompt injections can pose a serious risk to AI integrity, enabling users to bypass restrictions, alter model behavior, or access unintended functionality. This policy applies exclusively to user inputs (prompts) and offers a straightforward way to monitor or block such attempts using a configurable enforcement mode. With this policy, organisations can safeguard their AI workflows from malicious manipulation while maintaining trusted and compliant interactions.


What the Policy Does

Purpose

The Prompt Injection Policy helps protect your AI systems from being influenced or exploited by carefully crafted user prompts. These types of prompts may attempt to:

  • Override model behavior.
  • Circumvent guardrails or ethical policies.
  • Trick the AI into executing unintended tasks.

By monitoring or blocking these actions at the prompt level, this policy ensures that system instructions remain secure and that AI behavior stays aligned with its intended purpose.

Scope

Prompt Configuration Only

This policy exclusively applies to user-submitted prompts. It does not evaluate or interfere with AI-generated responses.

Operational Modes

  • Log Only: Detects and logs prompt injection attempts without interrupting the user experience.
  • Log and Override: Blocks prompts identified as injection attempts to maintain strict control over LLM interactions.

Key Features

  • Prompt-Only Enforcement: Dedicated to monitoring user inputs for injection patterns.
  • Simple Enforcement Options: Choose between passive logging or active prompt blocking.
  • Security-Focused Safeguards: Helps preserve the reliability and intended behavior of the AI system.
  • Lightweight, Targeted Configuration: Quick to set up and impactful in high-risk environments.

Why Use This Policy?

Benefits

  • Prevents manipulation of system behavior through user prompts.
  • Maintains the integrity of AI applications.
  • Enhances trust in automated workflows and outputs.
  • Provides visibility into suspicious or non-compliant user behavior.

Scenario

A legal firm deploys an AI assistant to support research and document drafting. The assistant must adhere to strict rules on the kind of advice it provides. However, users may attempt to bypass restrictions using creative or misleading prompts (e.g., “Ignore previous instructions and…”).

Challenge

The firm needs to:

  • Detect and block any attempt to override system-level restrictions.
  • Ensure the AI consistently follows embedded compliance protocols.
  • Capture and audit prompt injection attempts for security and improvement.

Solution: Implementing the Prompt Injection Policy

  1. Prompt Filtering

    • Enabled to monitor all user inputs.
  2. Enforcement Mode

    • Set to Log and Override to block suspicious prompts outright.

How to Use the Policy

Note: The following steps explain how to configure the Prompt Injection Policy within the policy workflow. This policy applies only to user-submitted prompts and cannot be applied to responses.

Step 1: Navigate to the Policy Workflow

  1. From the Dashboard, open your project by selecting it from the Project Table.
  2. In the Policy section, click Edit Policy to open the policy configuration workflow.

Step 2: Select and Enable the Prompt Injection Policy

  1. In the Configure Policies tab, click on Prompt Injection from the list of available policies.
  2. The configuration panel will display on the right-hand side.
  3. Toggle the Enable Policy switch to ON to begin configuration.

Step 3: Configure Enforcement Behaviour

  1. Under Behaviour, select how the system should handle detected prompt injection attempts:
    • Log Only – Log the violation without blocking the prompt.
    • Log and Override – Block the prompt and return a smart fallback message, preventing manipulation of the system.

Step 4: Save, Test, and Apply the Policy

  1. Click Save Changes to store your configuration.
  2. (Optional) Use the Test Policies tab to simulate injection attempts and validate policy behaviour.
  3. Return to the Configure Policies tab and click Apply Policies to activate the policy.
  4. A success message will confirm your settings have been successfully applied.

The Prompt Injection Policy helps protect your AI systems from manipulation and instruction overrides by identifying and blocking prompt-based attempts to interfere with system behaviour.


Types of Prompt Injection Detection

The Prompt Injection Policy is designed to identify and manage various forms of prompt manipulation attempts. Below is an overview of the primary categories our system monitors:

CategoryDescriptionBusiness Impact
Content InjectionDetects attempts to embed unauthorized instructions or commands within user inputs, such as requests to prioritise specific products or services in responses.Helps maintain the integrity of AI interactions and prevents unauthorized influence on business recommendations or decisions.
System OverrideIdentifies attempts to bypass or override established system guidelines and safety measures, ensuring user inputs remain aligned with organisational policies and values.Protects against potential misuse that could lead to reputational damage or compliance violations.

Each category is monitored with configurable sensitivity, allowing organisations to maintain appropriate security measures while ensuring smooth user interactions. The policy helps safeguard your AI systems from manipulation while preserving the intended functionality and trustworthiness of your automated workflows.