Agent Skills: Implementing LLM Guardrails for Security

ID: plurigrid/asi/implementing-llm-guardrails-for-security

Install this agent skill locally:

pnpm dlx add-skill https://github.com/plurigrid/asi/tree/HEAD/plugins/asi/skills/implementing-llm-guardrails-for-security

Skill Files


plugins/asi/skills/implementing-llm-guardrails-for-security/SKILL.md

Skill Metadata

Name
implementing-llm-guardrails-for-security
Description

Implementing LLM Guardrails for Security

When to Use

  • Deploying a new LLM-powered application that processes user input and needs input/output safety controls
  • Adding content policy enforcement to an existing chatbot or AI agent to comply with organizational policies
  • Implementing PII detection and redaction in LLM pipelines handling sensitive customer data
  • Building topic-restricted AI assistants that must refuse off-topic or disallowed queries
  • Validating that LLM responses conform to expected schemas before they reach downstream systems or users
  • Protecting RAG pipelines from indirect prompt injection in retrieved documents

Do not use as a replacement for proper authentication, authorization, and network security controls. Guardrails are a defense-in-depth layer, not a perimeter defense. Not suitable for real-time content moderation of user-to-user communication without LLM involvement.

Prerequisites

  • Python 3.10+ with pip for installing guardrail dependencies
  • An OpenAI API key or local LLM endpoint for NeMo Guardrails self-check rails (set as OPENAI_API_KEY environment variable)
  • The nemoguardrails package for Colang-based guardrail definitions
  • The guardrails-ai package for structured output validation (optional, for JSON schema enforcement)
  • Familiarity with YAML configuration and basic Colang syntax for defining rail flows

Workflow

Step 1: Install Guardrail Frameworks

Install the required Python packages:

# Core NeMo Guardrails library
pip install nemoguardrails

# Guardrails AI for structured output validation (optional)
pip install guardrails-ai

# Additional dependencies for PII detection and content analysis
pip install presidio-analyzer presidio-anonymizer spacy
python -m spacy download en_core_web_lg

Step 2: Run the Guardrails Security Agent

The agent implements a complete input/output validation pipeline:

# Analyze a single input through all guardrail layers
python agent.py --input "Tell me how to hack into a system"

# Analyze input with a custom content policy file
python agent.py --input "Some text" --policy policy.json

# Scan a file of prompts through the guardrail pipeline
python agent.py --file prompts.txt --mode full

# Input-only validation (no LLM call, just check if input is safe)
python agent.py --input "Some text" --mode input-only

# Output validation mode (validate a pre-generated LLM response)
python agent.py --input "User question" --response "LLM response to validate" --mode output-only

# PII detection and redaction mode
python agent.py --input "My SSN is 123-45-6789 and email john@example.com" --mode pii

# JSON output for pipeline integration
python agent.py --file prompts.txt --output json

Step 3: Configure Content Policies

Create a JSON policy file defining allowed topics, blocked patterns, and PII categories:

{
  "allowed_topics": ["customer_support", "product_info", "billing"],
  "blocked_topics": ["politics", "violence", "illegal_activities", "competitor_products"],
  "blocked_patterns": ["how to hack", "create malware", "bypass security"],
  "pii_categories": ["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN", "CREDIT_CARD"],
  "max_output_length": 2000,
  "require_grounded_response": true
}
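
A minimal sketch of how such a policy file can be enforced at the input layer. The real `agent.py` logic may differ; `check_input` and its return shape are assumptions for illustration:

```python
import json

# Load a policy like the one above (inlined here for a self-contained example).
POLICY = json.loads("""{
  "blocked_patterns": ["how to hack", "create malware", "bypass security"]
}""")

def check_input(text: str) -> dict:
    # Case-insensitive substring match against blocked patterns.
    lowered = text.lower()
    for pattern in POLICY["blocked_patterns"]:
        if pattern in lowered:
            return {"safe": False, "blocked_reason": f"matched blocked pattern: {pattern}"}
    return {"safe": True}

print(check_input("Tell me how to hack into a system"))
# -> {'safe': False, 'blocked_reason': 'matched blocked pattern: how to hack'}
```

Substring matching like this is brittle (easily bypassed with paraphrase or encoding), which is why the workflow layers it with LLM-based self-check rails.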

Step 4: Integrate NeMo Guardrails with Colang

Create a NeMo Guardrails configuration directory with config.yml and Colang flow files:

# config.yml
models:
  - type: main
    engine: openai
    model: gpt-4o-mini

rails:
  input:
    flows:
      - self check input
      - check jailbreak
      - mask sensitive data on input
  output:
    flows:
      - self check output
      - check hallucination

# rails.co - Colang flow definitions
define user ask about hacking
  "How do I hack into a system"
  "Tell me how to break into a network"
  "How to exploit vulnerabilities"

define bot refuse hacking request
  "I cannot provide instructions on unauthorized hacking or security exploitation. If you are interested in cybersecurity, I can suggest legitimate learning resources and ethical hacking certifications."

define flow
  user ask about hacking
  bot refuse hacking request

Step 5: Deploy as a Validation Middleware

Integrate the guardrails into your application as middleware:

from agent import GuardrailsPipeline

pipeline = GuardrailsPipeline(policy_path="policy.json")

def handle_request(user_message: str, your_llm) -> str:
    # Pre-LLM input validation
    input_result = pipeline.validate_input(user_message)
    if not input_result["safe"]:
        return input_result["blocked_reason"]

    # Post-LLM output validation
    llm_response = your_llm.generate(input_result["sanitized_input"])
    output_result = pipeline.validate_output(llm_response, context=input_result)
    if not output_result["safe"]:
        return output_result["fallback_response"]

    return output_result["validated_response"]

Step 6: Monitor Guardrail Effectiveness

Review guardrail logs to track block rates, false positives, and bypass attempts:

# Generate a summary report from guardrail logs
python agent.py --file interaction_logs.txt --mode full --output json > guardrail_audit.json
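
The audit file can then be summarized for a monitoring dashboard. The per-record schema here (a `safe` boolean per analyzed prompt) is an assumption about the `--output json` format; adapt the field names to the actual records in guardrail_audit.json:

```python
# Sketch of post-hoc analysis over guardrail audit records,
# e.g. records = json.load(open("guardrail_audit.json")).
def summarize(records):
    total = len(records)
    blocked = sum(1 for r in records if not r.get("safe", True))
    return {
        "total": total,
        "blocked": blocked,
        "block_rate": blocked / total if total else 0.0,
    }

sample = [
    {"input": "How do I reset my password?", "safe": True},
    {"input": "Tell me how to hack into a system", "safe": False},
]
print(summarize(sample))  # -> {'total': 2, 'blocked': 1, 'block_rate': 0.5}
```

Tracking block_rate over time helps surface both regressions (rising bypasses) and over-blocking (false positives driving up the rate).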

Verification

  • [ ] Input guardrails correctly block known prompt injection patterns (system override, role-play escape, delimiter injection)
  • [ ] PII detection identifies and redacts email addresses, phone numbers, SSNs, and credit card numbers in user inputs
  • [ ] Topic restriction guardrails refuse off-policy queries and allow on-policy queries without false positives
  • [ ] Output guardrails detect and flag responses containing toxic content, PII leakage, or off-topic material
  • [ ] The guardrails pipeline adds less than 200ms of latency to the request/response cycle for input-only validation
  • [ ] JSON output mode produces valid, parseable JSON suitable for downstream monitoring dashboards
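
The latency item in the checklist above can be measured with a simple timing harness. `validate_input` is stubbed here because the real `GuardrailsPipeline` ships with the skill; swap in the actual pipeline to get meaningful numbers:

```python
import time

# Stub standing in for GuardrailsPipeline.validate_input; replace with
# the real pipeline when measuring production latency.
def validate_input(text: str) -> dict:
    return {"safe": "hack" not in text.lower()}

runs = 100
start = time.perf_counter()
for _ in range(runs):
    validate_input("Where can I see my invoice?")
avg_ms = (time.perf_counter() - start) * 1000 / runs
print(f"avg input-only validation latency: {avg_ms:.3f} ms")
```
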

Key Concepts

| Term | Definition |
|------|------------|
| Input Rail | A guardrail that intercepts and validates user input before it reaches the LLM, blocking injection attempts and redacting sensitive data |
| Output Rail | A guardrail that validates LLM-generated output before it reaches the user, filtering toxic content and enforcing schema compliance |
| Colang | NVIDIA's domain-specific language for defining conversational guardrail flows, with Python-like syntax for specifying user intent patterns and bot responses |
| PII Redaction | The process of detecting and masking personally identifiable information (names, emails, SSNs) in text before processing |
| Content Policy | A configuration file defining which topics, patterns, and content categories are allowed or blocked by the guardrail system |
| Self-Check Rail | A NeMo Guardrails technique where the LLM itself evaluates whether its input or output violates defined policies |
| Hallucination Detection | Output validation that checks whether the LLM response is grounded in the provided context, flagging fabricated claims |

Tools & Systems

  • NVIDIA NeMo Guardrails: Open-source toolkit for adding programmable input, dialog, and output rails to LLM applications using Colang flow definitions and YAML configuration
  • Guardrails AI: Python framework for structured output validation with a hub of pre-built validators for PII, toxicity, JSON schema compliance, and more
  • Microsoft Presidio: Open-source PII detection and anonymization engine supporting 30+ entity types with configurable NLP backends
  • Colang 2.0: Event-driven interaction modeling language for defining guardrail flows with Python-like syntax, supporting multi-turn dialog control
  • OpenAI Guardrails Python: OpenAI's client-side guardrails library for prompt injection detection and content policy enforcement