Overview

This policy safeguards the quality and professionalism of AI interactions by preventing the generation or processing of toxic, abusive, or offensive content. It applies to both user inputs and AI-generated responses, monitoring and enforcing restrictions on harmful language patterns to help maintain appropriate standards of communication and create a safe, respectful environment.

The Toxicity Policy allows organisations to identify and manage toxic language in both user inputs and AI-generated responses. Designed to promote respectful, inclusive, and professional AI interactions, this policy helps ensure that toxic, abusive, or offensive content is either flagged or blocked before it reaches end users or internal systems. Unlike other policies that offer content-specific controls, the Toxicity Policy focuses purely on detecting general language toxicity. It offers a simple yet powerful configuration model that allows teams to monitor and prevent harmful communication patterns in real time.


What the Policy Does

Purpose

The primary goal of the Toxicity Policy is to safeguard users from exposure to harmful language while maintaining the integrity and ethical use of AI in professional environments. By enabling this policy, organisations can reduce reputational risk, protect staff and customers, and uphold responsible communication standards across AI-powered interactions.

Scope

Prompt & Response Configuration

The Toxicity Policy applies to both sides of an interaction:

  • Prompts: Filters user-submitted content before it reaches the LLM.
  • Responses: Filters LLM-generated output before it is displayed to users.

Each can be enabled or disabled independently, giving teams control over where monitoring and enforcement are applied.
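
As a rough sketch, the two scopes can be thought of as independent flags. The structure and field names below are illustrative assumptions, not the product's actual configuration schema:

```python
# Hypothetical representation of the Toxicity Policy scope settings.
# Field names are illustrative and do not reflect the product's actual schema.
toxicity_policy_scope = {
    "apply_to_prompts": True,    # filter user-submitted content before it reaches the LLM
    "apply_to_responses": True,  # filter LLM output before it is shown to users
}

# Either side can be switched off independently, e.g. monitor responses only:
toxicity_policy_scope["apply_to_prompts"] = False
```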

Operational Modes

  • Log Only: Records the presence of toxic language in prompts or responses, but does not block the content.
  • Log and Override: Automatically prevents toxic prompts from being processed or toxic responses from being shown.
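
The difference between the two modes can be sketched as follows; the function and field names are hypothetical stand-ins for the product's internal enforcement logic:

```python
# Hypothetical sketch of how the two modes differ: both record the event,
# but only Log and Override withholds the content.
def enforce(content: str, is_toxic: bool, mode: str) -> tuple[bool, str]:
    """Return (allowed, content_to_deliver) for one piece of content."""
    if is_toxic:
        log_event("toxicity_detected", content)  # both modes record the event
        if mode == "log_and_override":
            return False, "This content was blocked by the Toxicity Policy."
    return True, content  # Log Only never blocks

def log_event(event_type: str, content: str) -> None:
    print(f"[{event_type}] {content!r}")

print(enforce("example text", is_toxic=True, mode="log_only"))          # logged, still delivered
print(enforce("example text", is_toxic=True, mode="log_and_override"))  # logged and blocked
```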

Threshold Sensitivity

A configurable detection threshold (between 0.2 and 0.9) allows organisations to calibrate how strictly toxicity is detected:

  • Lower thresholds (e.g., 0.2) make detection more sensitive, flagging a broader range of language.
  • Higher thresholds (e.g., 0.9) flag only content scored as clearly toxic, targeting severe or explicit toxicity.
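
Assuming the detector returns a toxicity score between 0 and 1 and content is flagged when the score meets or exceeds the threshold, the effect of the setting can be illustrated like this (the score and function name are hypothetical):

```python
# Hypothetical illustration of the detection threshold: a toxicity score in [0, 1]
# is compared against the configured threshold, and content is flagged when the
# score meets or exceeds it.
def is_flagged(toxicity_score: float, threshold: float) -> bool:
    return toxicity_score >= threshold

score = 0.55  # example score on an assumed 0-1 scale
print(is_flagged(score, threshold=0.2))  # True  - a low threshold flags broader language
print(is_flagged(score, threshold=0.9))  # False - a high threshold targets only severe toxicity
```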

Key Features

  • Toxicity Detection in Prompts and Responses: Covers both ends of user-LLM interactions.
  • Customisable Sensitivity Threshold: Fine-tune how sensitive the model is to different degrees of toxicity.
  • Flexible Enforcement Options: Choose to log toxic content for review or block it outright.
  • Simple, Targeted Configuration: Streamlined setup for focused use cases.

Why Use This Policy?

Benefits

  • Helps prevent offensive, discriminatory, or inappropriate language.
  • Creates safer and more inclusive environments for AI interaction.
  • Reduces risk of reputational damage due to harmful content exposure.
  • Provides accountability through logging and visibility.

Use Case: HR Chatbot in a Global Enterprise

Scenario

An international company deploys an AI assistant to support HR-related queries and internal communications. It’s critical that the chatbot maintains a professional tone and does not return or respond to toxic, discriminatory, or offensive language.

Challenge

The organisation must ensure that:

  • Users cannot engage the AI in toxic dialogue.
  • The AI does not produce or echo toxic language in its responses.
  • All flagged instances are logged for compliance and investigation.

Solution: Implementing the Toxicity Policy

  1. Prompt & Response Filtering

    • Enabled for both directions of interaction.
  2. Enforcement Mode

    • Set to Log and Override to actively block toxic language.
  3. Threshold Sensitivity

    • Calibrated to 0.75 to capture clear instances of toxicity while reducing false positives.
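
A configuration matching this scenario might look like the sketch below; the keys and values are illustrative only, since the policy is actually configured through the workflow described in the next section:

```python
# Hypothetical configuration sketch for the HR chatbot scenario.
hr_chatbot_toxicity_policy = {
    "enabled": True,
    "apply_to": "both",               # filter prompts and responses
    "behaviour": "log_and_override",  # block toxic content rather than only recording it
    "threshold": 0.75,                # catch clear toxicity while limiting false positives
}
```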

How to Use the Policy

Note: The following steps walk you through configuring the Toxicity Detection Policy in the policy workflow.

Step 1: Navigate to the Policy Workflow

  1. From the Dashboard, select your project to open the Project Overview.
  2. In the Policy section, click Edit Policy to enter the policy configuration workflow.

Step 2: Select and Enable the Toxicity Detection Policy

  1. In the Configure Policies tab, click Toxicity from the list of policies.
  2. The configuration panel will appear on the right.
  3. Toggle Enable Policy to ON.

Step 3: Set Application Scope

  1. Under Apply Policy To, choose one of the following:
    • Prompt – Monitor user inputs only.
    • Response – Monitor AI outputs only.
    • Both – Monitor both inputs and outputs.

Step 4: Configure Enforcement Behaviour

  1. Under Behaviour, select:
    • Log Only – Record instances of toxic content without blocking.
    • Log and Override – Block toxic prompts/responses and replace with a smart fallback message.
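
To make the two behaviours concrete, the sketch below shows what a user would receive for a flagged response under each option. The fallback text is a placeholder; the product supplies its own smart fallback message:

```python
# Hypothetical sketch: what the end user receives for a flagged response.
# The fallback text below is a placeholder, not the product's actual message.
FALLBACK_MESSAGE = "I'm unable to share that response. Let's keep the conversation respectful."

def deliver(response: str, flagged: bool, behaviour: str) -> str:
    if flagged and behaviour == "log_and_override":
        return FALLBACK_MESSAGE  # flagged content is replaced before it reaches the user
    return response              # "log_only" records the event but delivers the content unchanged
```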

Step 5: Adjust Detection Threshold

  1. Use the Threshold Slider to set detection sensitivity:
    • Lower thresholds (e.g., 0.2) catch broader cases, reducing false negatives.
    • Higher thresholds (e.g., 0.9) flag only high-confidence cases, reducing false positives.
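
The trade-off can be illustrated with a few made-up scores, assuming content is flagged when its score meets or exceeds the threshold:

```python
# Hypothetical illustration of how the slider setting changes what gets flagged.
# The scores are made-up examples on an assumed 0-1 scale.
sample_scores = {"mild sarcasm": 0.25, "insulting remark": 0.6, "explicit threat": 0.95}

for threshold in (0.2, 0.5, 0.9):
    flagged = [text for text, score in sample_scores.items() if score >= threshold]
    print(f"threshold={threshold}: flags {flagged}")
# threshold=0.2 flags all three examples (fewer false negatives);
# threshold=0.9 flags only the explicit threat (fewer false positives).
```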

Step 6: Save, Test, and Apply

  1. Click Save Changes to store your configuration.
  2. (Optional) Test your policy under the Test Policies tab.
  3. Return to Configure Policies and click Apply Policies to activate it.
  4. A success message will confirm that the policy is live.

The Toxicity Detection Policy helps maintain professional and respectful AI interactions by detecting and managing harmful language with configurable precision.
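
Putting the pieces together, the sketch below shows how a single exchange might flow through the policy once it is applied. The scoring and LLM functions are toy stand-ins, not real product APIs:

```python
# Hypothetical end-to-end flow for one exchange with the policy applied to both sides.
# score_toxicity() and call_llm() are placeholder stubs for the detection model and
# the LLM endpoint; they are not real product APIs.
def score_toxicity(text: str) -> float:
    return 1.0 if "idiot" in text.lower() else 0.0   # toy stand-in for a real model

def call_llm(prompt: str) -> str:
    return f"Here is some HR guidance about: {prompt}"  # toy stand-in for the LLM

def handle_exchange(prompt: str, config: dict) -> str:
    if config["apply_to"] in ("prompt", "both") and score_toxicity(prompt) >= config["threshold"]:
        print("logged: toxic prompt detected")
        if config["behaviour"] == "log_and_override":
            return "Your message was blocked by the Toxicity Policy."
    response = call_llm(prompt)
    if config["apply_to"] in ("response", "both") and score_toxicity(response) >= config["threshold"]:
        print("logged: toxic response detected")
        if config["behaviour"] == "log_and_override":
            return "The generated response was withheld by the Toxicity Policy."
    return response

config = {"apply_to": "both", "behaviour": "log_and_override", "threshold": 0.75}
print(handle_exchange("How do I request parental leave?", config))
```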


Types of Toxicity Detection

The Toxicity Policy is designed to identify and manage various forms of inappropriate content. Below is an overview of the primary categories our system monitors:

  • General Toxicity: Detects overall harmful or negative language patterns, including severe forms of toxicity that could impact workplace culture and professional communication.
  • Insults & Identity-Based Attacks: Identifies language that demeans individuals or groups based on personal characteristics, helping maintain respectful and inclusive communication standards.
  • Sexual Content: Monitors for explicit sexual language or references that are inappropriate for professional environments.
  • Obscenity: Detects vulgar or offensive language that could create a hostile work environment or damage professional relationships.
  • Threats: Identifies language that suggests harm, intimidation, or coercion, helping prevent potential workplace harassment or security concerns.
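
As an illustration, detectors of this kind typically return a score per category; assuming the configured threshold is applied to each category, a flagged result might look like this (the category names and scores below are hypothetical):

```python
# Hypothetical per-category scores from a toxicity detector (0-1 scale, made-up values).
# Assumes the policy flags the content if any category meets or exceeds the threshold.
category_scores = {
    "general_toxicity": 0.81,
    "insult_or_identity_attack": 0.12,
    "sexual_content": 0.03,
    "obscenity": 0.44,
    "threat": 0.02,
}

threshold = 0.75
flagged_categories = [name for name, score in category_scores.items() if score >= threshold]
print(flagged_categories)  # ['general_toxicity']
```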