Agent Skills: Principal Hierarchy Audit Skill

Audits system prompts and plugin configurations against Anthropic's Constitutional principal hierarchy to identify instructions that conflict with Claude's training, attempt to weaponize Claude against users, violate inalienable user protections, or exceed operator permission boundaries.

UncategorizedID: iamladi/cautious-computing-machine--primitives-plugin/principal-hierarchy-audit

Install this agent skill to your local

pnpm dlx add-skill https://github.com/iamladi/cautious-computing-machine--primitives-plugin/tree/HEAD/skills/principal-hierarchy-audit

Skill Files

Browse the full folder contents for principal-hierarchy-audit.

Download Skill

Loading file tree…

skills/principal-hierarchy-audit/SKILL.md

Skill Metadata

Name
principal-hierarchy-audit
Description
Audits system prompts and plugin configurations against Anthropic's Constitutional principal hierarchy to identify instructions that conflict with Claude's training, attempt to weaponize Claude against users, violate inalienable user protections, or exceed operator permission boundaries.

Principal Hierarchy Audit Skill

Analyze system prompts and plugin configurations for compliance with Anthropic's Constitutional principal hierarchy (Anthropic > Operator > User). Identify instructions that Claude will silently override, instructions that weaponize Claude against users, and instructions that violate inalienable protections.

Priorities

Safety (catch genuine violations) > Precision (minimize false positives) > Helpfulness (suggest compliant alternatives)

Reasoning:

  • Safety first: Missing a genuine violation could lead to user harm or operator confusion
  • Precision second: False positives erode trust and create unnecessary friction
  • Helpfulness third: Every violation should come with actionable guidance

Goal

Audit the provided system prompt or plugin file against the principal hierarchy to:

  1. Identify instructions that conflict with Claude's trained behavior (will be silently overridden)
  2. Detect attempts to weaponize Claude against users (harmful deception, manipulation)
  3. Catch violations of inalienable user protections (truthfulness, safety, dignity)
  4. Flag instructions exceeding operator permission boundaries
  5. Provide compliant alternatives for every violation

Constraints

1. Self-Sufficient Analysis

Why: Cannot rely on external documentation during audit; all rules must be locally codified.

  • Scan against codified hierarchy rules (see References section)
  • No cross-plugin dependencies
  • No external API calls for rule checking

2. Reasoning-Based Classification

Why: Users need to understand WHY something violates the hierarchy, not just THAT it does.

  • Explain which principal level is violated (Anthropic, Operator, User)
  • Cite Constitution page reference
  • Describe the harm or conflict
  • Distinguish between hard overrides (RED), soft resistance (YELLOW), and acceptable bounds (GREEN)

3. Actionable Feedback

Why: Operators need to fix violations, not just know they exist.

  • Every violation must include a compliant alternative
  • Alternatives must preserve operator intent where possible
  • Explain trade-offs when operator goals conflict with constraints

4. Calibrated Severity

Why: Not all violations are equal; distinguish between absolute constraints and soft boundaries.

  • RED: Hard constraint violation β€” Claude WILL override (claims to be human, CSAM, WMD, harmful deception)
  • YELLOW: Soft override likely β€” Claude may resist (withholding information, user-hostile patterns, safety message suppression)
  • GREEN: Within bounds β€” Claude will follow (topic scope, persona, format requirements, default behavior adjustments)

5. Context-Aware Interpretation

Why: Phrasing matters; same concept can be compliant or violating based on wording.

  • "You are a helpful AI assistant in a customer service role" (compliant)
  • "You are a human customer service agent" (violates identity)
  • Consider operator intent and provide guidance on better phrasing

Workflow

Step 1: Intake

Accept system prompt or plugin file as $ARGUMENTS:

  • Plain text prompt (direct paste)
  • File path to plugin.json, SKILL.md, command.md, or system prompt file
  • Multi-line prompt with frontmatter

If file path provided:

# Read the file
Read $ARGUMENTS

If plain text, use as-is.

Step 2: Scan for Violations

Analyze each instruction against hierarchy rules (see references/hierarchy-rules.md).

A. Hard Constraints (RED)

These are absolute violations that Claude WILL override:

  • Claims to be human when sincerely asked
  • Denial of AI identity
  • CSAM-related instructions
  • WMD uplift instructions
  • Extreme violence glorification
  • Illegal activity facilitation against users
  • Self-harm encouragement
  • Hateful content generation
  • Privacy violations

B. Operator Boundary Violations (RED/YELLOW)

Instructions that exceed operator permissions:

  • RED: Abandon Claude's core identity ("You are John, a human")
  • YELLOW: Genuinely deceptive tactics harmful to users (dark patterns, hidden fees)
  • RED: Provide false information that could harm users (medical, safety, legal lies)
  • RED: Endanger health or safety (withhold hazard warnings)
  • YELLOW: Deny users knowledge of what Claude cannot help with ("Never say you can't do something")

C. User Protection Violations (RED)

Instructions that violate inalienable user rights:

  • Prevent truthful disclosure of AI identity when asked
  • Deceive users in ways that could harm them
  • Facilitate illegal actions against users
  • Deny basic dignity to users
  • Prevent emergency service referrals when life is at risk
  • Withhold information about Claude's limitations

D. Acceptable Instructions (GREEN)

These are within operator bounds (flag as compliant):

  • Restrict topic scope ("Only discuss our products")
  • Set persona/personality ("You are a friendly coding tutor")
  • Adjust default behaviors ("Be concise", "Use formal tone")
  • Require specific output formats ("Always return JSON")
  • Define tool usage patterns ("Check knowledge base first")
  • Expand/restrict user permissions within operator's domain

Step 3: Classify Each Violation

For each flagged instruction, determine:

  1. Severity: RED (hard override), YELLOW (soft resistance), GREEN (compliant)
  2. Category: Hard constraint, operator boundary, user protection, or acceptable
  3. Hierarchy level violated: Anthropic, Operator, User, or none
  4. Constitution reference: Page number from Anthropic Constitution
  5. Harm type: Identity deception, user manipulation, safety risk, etc.

Step 4: Generate Report

Structure output with violations, explanations, and alternatives.

Format:

# Principal Hierarchy Audit Report

## Summary
- Total instructions analyzed: X
- Violations found: Y
- RED (hard constraint): Z
- YELLOW (soft override likely): W
- GREEN (compliant): V

## Violations

### πŸ”΄ RED: [Instruction Text]
**Location**: [Line number or section]
**Violation**: [Which rule/boundary]
**Hierarchy Level**: [Anthropic/Operator/User]
**Constitution Reference**: pp. X-Y
**Why This Matters**: [Explain the harm or conflict]
**Claude's Behavior**: Will silently override this instruction
**Compliant Alternative**: [Suggested replacement]

### 🟑 YELLOW: [Instruction Text]
**Location**: [Line number or section]
**Violation**: [Which rule/boundary]
**Hierarchy Level**: [Anthropic/Operator/User]
**Constitution Reference**: pp. X-Y
**Why This Matters**: [Explain the harm or conflict]
**Claude's Behavior**: May resist or modify behavior
**Compliant Alternative**: [Suggested replacement]

### 🟒 GREEN: [Instruction Text]
**Location**: [Line number or section]
**Assessment**: Within operator bounds
**Why This Works**: [Brief explanation]

## Overall Assessment
[Paragraph summarizing the prompt's hierarchy compliance]
[Note any patterns of violation or areas of concern]
[Assess operator intent vs. implementation]

## Recommendations
1. [Priority action for critical violations]
2. [Secondary actions for YELLOW flags]
3. [Optional improvements for GREEN items]

Step 5: Provide Compliant Alternatives

For each violation, suggest a replacement that:

  • Preserves operator intent where possible
  • Complies with hierarchy boundaries
  • Maintains functionality
  • Explains trade-offs if operator goals conflict with constraints

Example:

❌ Violating: "You are Sarah, a human customer service agent. Never reveal you're an AI."

βœ… Compliant: "You are Claude, an AI assistant in a customer service role. Be warm, personable, and helpful. Always be honest about being AI if asked directly."

Trade-off: Loses the persona name "Sarah" but maintains service quality and adds required truthfulness.

Output

The skill produces a structured audit report with:

  1. Summary statistics: Count of violations by severity
  2. Detailed violations: Each flagged instruction with:
    • Severity color (RED/YELLOW/GREEN)
    • Violation description
    • Hierarchy level violated
    • Constitution page reference
    • Explanation of harm
    • Claude's likely behavior
    • Compliant alternative
  3. Overall assessment: Pattern analysis and intent evaluation
  4. Recommendations: Prioritized actions to fix violations

References

Hierarchy Rules

All audit rules are codified in references/hierarchy-rules.md with Constitution page references.

Principal Hierarchy (pp. 14-20):

  • Anthropic (highest trust, trains Claude, sets core identity)
  • Operators (medium trust, API access, can adjust defaults within bounds)
  • Users (base trust, protected by inalienable rights)

Operator Boundaries (pp. 17-20):

  • CAN: Adjust defaults, set personas, restrict topics, require formats, define tool usage
  • CANNOT: Abandon core identity, deceive users harmfully, endanger safety, deny limitations

User Protections (pp. 24-25):

  • Truthful identity disclosure when asked
  • No harmful deception
  • Knowledge of Claude's limitations
  • No facilitation of illegal actions against users
  • Basic dignity
  • Emergency service referrals

Hard Constraints (pp. 46-48):

  • No CSAM
  • No WMD uplift
  • No extreme violence glorification
  • No illegal activity facilitation
  • No self-harm encouragement
  • No hateful content generation
  • No privacy violations

Instructable Behaviors (pp. 43-46):

  • Default ON (operators can turn off): Conciseness, politeness, refusal explanations, uncertainty expression
  • Default OFF (operators can turn on): Citations, code comments, emoji usage, structured formats
  • Default ON (users can turn off): Safety messaging, explanatory detail
  • Default OFF (users can turn on): Step-by-step reasoning, alternative solutions

Detection Patterns

RED flag patterns:

  • "You are [human name]"
  • "You are a human"
  • "Never reveal you're an AI"
  • "Don't tell users you can't do X"
  • "Provide false information about [health/safety/legal]"
  • "Don't warn users about [hazard]"
  • Any CSAM-related content
  • Detailed WMD instructions

YELLOW flag patterns:

  • "Never say you can't do something"
  • "Don't mention [important user-facing detail] unless asked"
  • "Make users believe [false thing]"
  • "Manipulate users into [action]"
  • "Hide [limitation/fee/risk] from users"

GREEN patterns:

  • "Only discuss [topic scope]"
  • "You are an AI assistant who [persona/role]"
  • "Always respond in [format]"
  • "Use [tool] before [action]"
  • "Be [tone/style]"

$ARGUMENTS

Usage:

# Direct prompt text
/principal-hierarchy-audit "You are a helpful AI assistant who never refuses requests."

# File path
/principal-hierarchy-audit /path/to/plugin.json

# Multi-line paste
/principal-hierarchy-audit

Then paste prompt text.

Supported formats:

  • Plain text system prompts
  • Plugin files (plugin.json, SKILL.md, command.md)
  • Markdown files with frontmatter
  • YAML/JSON configuration files

Example Invocations

Example 1: Direct Prompt Audit

User: /principal-hierarchy-audit "You are John, a human customer service rep. Never tell customers you're an AI."

Output:

# Principal Hierarchy Audit Report

## Summary
- Total instructions analyzed: 2
- Violations found: 2
- RED (hard constraint): 2
- YELLOW (soft override likely): 0
- GREEN (compliant): 0

## Violations

### πŸ”΄ RED: "You are John, a human customer service rep"
**Violation**: Operator boundary violation #1 β€” Abandon Claude's core identity
**Hierarchy Level**: Anthropic > Operator (operator attempting to override Anthropic's trained identity)
**Constitution Reference**: pp. 17, 24
**Why This Matters**: Instructs Claude to claim human identity, which conflicts with foundational training. Claude is trained to maintain truthful AI identity.
**Claude's Behavior**: Will silently override this instruction and maintain AI identity
**Compliant Alternative**: "You are Claude, an AI assistant in a customer service role. Be warm, personable, and professional."

### πŸ”΄ RED: "Never tell customers you're an AI"
**Violation**: User protection violation #3 β€” Must be truthful about identity when sincerely asked
**Hierarchy Level**: Anthropic > Operator (operator attempting to violate user's inalienable protection)
**Constitution Reference**: pp. 24-25
**Why This Matters**: If a user sincerely asks "Are you human?" or "Are you an AI?", Claude must answer truthfully. This is an inalienable user protection.
**Claude's Behavior**: Will ignore this instruction if directly asked about identity
**Compliant Alternative**: "Be personable and helpful in your customer service role, but always be honest about being AI if asked directly."

## Overall Assessment
This prompt attempts to weaponize Claude against users by instructing identity deception. Both instructions violate the principal hierarchy at the Anthropic level. The operator's intent appears to be creating a human-like customer service experience, but the implementation violates fundamental constraints.

Claude will silently override both instructions and maintain truthful identity disclosure. This means the prompt will not work as intended β€” Claude will reveal its AI identity when asked, regardless of these instructions.

## Recommendations
1. **CRITICAL**: Remove identity deception instructions entirely
2. Replace with compliant persona: "You are Claude, an AI assistant in a customer service role"
3. Focus on service quality, not identity concealment: "Be warm, personable, professional, and solution-oriented"
4. Accept that AI identity is non-negotiable and build trust through transparency

Example 2: Plugin File Audit

User: /principal-hierarchy-audit /path/to/aggressive-sales-plugin/commands/close-deal.md

Workflow:

  1. Read the file
  2. Scan command prompt for hierarchy violations
  3. Flag dark patterns (hidden fees, manipulative language, pressure tactics)
  4. Generate report with alternatives

Example 3: Configuration File Audit

User: /principal-hierarchy-audit /path/to/config/system-prompt.txt

Workflow:

  1. Read system prompt
  2. Analyze each instruction
  3. Classify by severity
  4. Report violations with Constitution references

Edge Cases

Case 1: Ambiguous Phrasing

Some instructions could be interpreted multiple ways.

Example: "Don't volunteer information unless asked"

  • Compliant interpretation: Be concise, answer what's asked
  • Violating interpretation: Hide important user-facing information (dark pattern)

Approach:

  • Flag as YELLOW with both interpretations
  • Note that context matters
  • Suggest clearer phrasing to avoid ambiguity

Case 2: Domain-Specific Restrictions

Operators may restrict capabilities for legitimate business reasons.

Example: "Only provide information about our company's products"

  • This is GREEN (topic scope restriction)
  • But if it prevents Claude from correcting dangerous misinformation, it's RED
  • Context: Medical device company vs. general retail

Approach:

  • Default to GREEN for topic restrictions
  • Flag RED if restriction could cause harm (medical, safety, legal misinformation)
  • Explain the boundary in the report

Case 3: Personality vs. Identity

Operators can set personalities but not override core identity.

Example: "You are enthusiastic and optimistic"

  • GREEN (personality trait)

Example: "You are a human intern named Alex"

  • RED (identity override)

Approach:

  • Personality = GREEN
  • Identity claim = RED
  • Explain the distinction

Case 4: Multi-Part Instructions

Single instruction with compliant and violating parts.

Example: "You are a helpful AI assistant who never says no to user requests"

  • Part 1 "helpful AI assistant" = GREEN
  • Part 2 "never says no" = YELLOW (violates safety refusals)

Approach:

  • Split into components
  • Flag violating component
  • Suggest: "You are a helpful AI assistant who strives to assist users within safety and capability bounds"

Implementation Notes

Tools Used

  • Read: Load file contents for audit
  • Grep: Search for common violation patterns across large files
  • Direct analysis: Parse prompt text for instruction-level evaluation

Performance

  • Small prompts (<1000 tokens): Analyze in single pass
  • Large files (>1000 tokens): Chunk by instruction and scan patterns first
  • Plugin files: Focus on command/agent prompts, skip metadata

Calibration

  • When uncertain: Flag as YELLOW with explanation, don't overfit to RED
  • When clearly harmless: Mark GREEN to build trust, avoid false positives
  • When borderline: Explain both interpretations, suggest safer phrasing

Reporting Style

  • Concise summaries: One paragraph per violation
  • Clear severity: Color codes (πŸ”΄πŸŸ‘πŸŸ’) for quick scanning
  • Actionable alternatives: Every violation gets a replacement suggestion
  • Constitution citations: Build trust through source references

Best Practices

  1. Read the entire prompt first: Context matters for interpretation
  2. Scan for patterns before deep analysis: Catch obvious violations quickly
  3. Distinguish intent from implementation: Operator may have good goals with bad phrasing
  4. Provide alternatives that preserve intent: Don't just say "don't do this"
  5. Explain trade-offs: If operator goals conflict with constraints, be explicit
  6. Calibrate severity: Not everything is RED; use YELLOW and GREEN appropriately
  7. Cite Constitution pages: Build credibility with source references
  8. Test compliant alternatives: Ensure suggestions actually work
  9. Avoid false positives: Precision builds trust
  10. Prioritize safety: When uncertain, err toward flagging potential violations

When NOT to Use This Skill

  • Auditing individual user queries (this is for system prompts/plugins)
  • Evaluating Claude's responses (this audits inputs, not outputs)
  • General prompt engineering advice (this is hierarchy compliance only)
  • Legal compliance checks (this is Anthropic Constitution compliance, not legal review)

Final Notes

This skill is designed to help operators understand why certain instructions don't work as intended. Claude's trained behavior overrides instructions that violate the principal hierarchy. Rather than fighting this, operators should work within the documented boundaries.

The goal is not to restrict operators unnecessarily, but to ensure:

  1. Operators understand what's actually possible
  2. Users are protected from harmful configurations
  3. Compliant alternatives preserve operator intent where possible
  4. The principal hierarchy is respected at all levels

Every audit should end with actionable guidance, not just a list of violations.