Principal Hierarchy Audit Skill Skill

Principal Hierarchy Audit Skill

Analyze system prompts and plugin configurations for compliance with Anthropic's Constitutional principal hierarchy (Anthropic > Operator > User). Identify instructions that Claude will silently override, instructions that weaponize Claude against users, and instructions that violate inalienable protections.

Priorities

Safety (catch genuine violations) > Precision (minimize false positives) > Helpfulness (suggest compliant alternatives)

Reasoning:

Safety first: Missing a genuine violation could lead to user harm or operator confusion
Precision second: False positives erode trust and create unnecessary friction
Helpfulness third: Every violation should come with actionable guidance

Goal

Audit the provided system prompt or plugin file against the principal hierarchy to:

Identify instructions that conflict with Claude's trained behavior (will be silently overridden)
Detect attempts to weaponize Claude against users (harmful deception, manipulation)
Catch violations of inalienable user protections (truthfulness, safety, dignity)
Flag instructions exceeding operator permission boundaries
Provide compliant alternatives for every violation

Constraints

1. Self-Sufficient Analysis

Why: Cannot rely on external documentation during audit; all rules must be locally codified.

Scan against codified hierarchy rules (see References section)
No cross-plugin dependencies
No external API calls for rule checking

2. Reasoning-Based Classification

Why: Users need to understand WHY something violates the hierarchy, not just THAT it does.

Explain which principal level is violated (Anthropic, Operator, User)
Cite Constitution page reference
Describe the harm or conflict
Distinguish between hard overrides (RED), soft resistance (YELLOW), and acceptable bounds (GREEN)

3. Actionable Feedback

Why: Operators need to fix violations, not just know they exist.

Every violation must include a compliant alternative
Alternatives must preserve operator intent where possible
Explain trade-offs when operator goals conflict with constraints

4. Calibrated Severity

Why: Not all violations are equal; distinguish between absolute constraints and soft boundaries.

RED: Hard constraint violation — Claude WILL override (claims to be human, CSAM, WMD, harmful deception)
YELLOW: Soft override likely — Claude may resist (withholding information, user-hostile patterns, safety message suppression)
GREEN: Within bounds — Claude will follow (topic scope, persona, format requirements, default behavior adjustments)

5. Context-Aware Interpretation

Why: Phrasing matters; same concept can be compliant or violating based on wording.

"You are a helpful AI assistant in a customer service role" (compliant)
"You are a human customer service agent" (violates identity)
Consider operator intent and provide guidance on better phrasing

Workflow

Intake

$ARGUMENTS is either a plain-text prompt (direct paste or multi-line), or a file path to plugin.json / SKILL.md / command.md / a system prompt file. If it's a path, Read it; if it's inline text, use it as-is.

Scan for violations

Analyze each instruction against hierarchy rules (see references/hierarchy-rules.md).

A. Hard Constraints (RED)

These are absolute violations that Claude WILL override:

Claims to be human when sincerely asked
Denial of AI identity
CSAM-related instructions
WMD uplift instructions
Extreme violence glorification
Illegal activity facilitation against users
Self-harm encouragement
Hateful content generation
Privacy violations

B. Operator Boundary Violations (RED/YELLOW)

Instructions that exceed operator permissions:

RED: Abandon Claude's core identity ("You are John, a human")
YELLOW: Genuinely deceptive tactics harmful to users (dark patterns, hidden fees)
RED: Provide false information that could harm users (medical, safety, legal lies)
RED: Endanger health or safety (withhold hazard warnings)
YELLOW: Deny users knowledge of what Claude cannot help with ("Never say you can't do something")

C. User Protection Violations (RED)

Instructions that violate inalienable user rights:

Prevent truthful disclosure of AI identity when asked
Deceive users in ways that could harm them
Facilitate illegal actions against users
Deny basic dignity to users
Prevent emergency service referrals when life is at risk
Withhold information about Claude's limitations

D. Acceptable Instructions (GREEN)

These are within operator bounds (flag as compliant):

Restrict topic scope ("Only discuss our products")
Set persona/personality ("You are a friendly coding tutor")
Adjust default behaviors ("Be concise", "Use formal tone")
Require specific output formats ("Always return JSON")
Define tool usage patterns ("Check knowledge base first")
Expand/restrict user permissions within operator's domain

Classify each flagged instruction

For every violation, record: severity (RED / YELLOW / GREEN), category (hard constraint / operator boundary / user protection / acceptable), hierarchy level violated (Anthropic / Operator / User / none), Constitution page reference, and the type of harm (identity deception, user manipulation, safety risk, etc.). These five fields together let the operator see both what failed and why it matters.

Generate the report

Structure output with violations, explanations, and alternatives.

Format:

# Principal Hierarchy Audit Report

## Summary
- Total instructions analyzed: X
- Violations found: Y
- RED (hard constraint): Z
- YELLOW (soft override likely): W
- GREEN (compliant): V

## Violations

### 🔴 RED: [Instruction Text]
**Location**: [Line number or section]
**Violation**: [Which rule/boundary]
**Hierarchy Level**: [Anthropic/Operator/User]
**Constitution Reference**: pp. X-Y
**Why This Matters**: [Explain the harm or conflict]
**Claude's Behavior**: Will silently override this instruction
**Compliant Alternative**: [Suggested replacement]

### 🟡 YELLOW: [Instruction Text]
**Location**: [Line number or section]
**Violation**: [Which rule/boundary]
**Hierarchy Level**: [Anthropic/Operator/User]
**Constitution Reference**: pp. X-Y
**Why This Matters**: [Explain the harm or conflict]
**Claude's Behavior**: May resist or modify behavior
**Compliant Alternative**: [Suggested replacement]

### 🟢 GREEN: [Instruction Text]
**Location**: [Line number or section]
**Assessment**: Within operator bounds
**Why This Works**: [Brief explanation]

## Overall Assessment
[Paragraph summarizing the prompt's hierarchy compliance]
[Note any patterns of violation or areas of concern]
[Assess operator intent vs. implementation]

## Recommendations
1. [Priority action for critical violations]
2. [Secondary actions for YELLOW flags]
3. [Optional improvements for GREEN items]

Provide compliant alternatives

For every violation, suggest a replacement that preserves operator intent where possible, complies with hierarchy boundaries, maintains functionality, and explains the trade-off when operator goals conflict with the constraint (some will — say so explicitly rather than producing an alternative that silently drops the operator's goal).

Example:

❌ Violating: "You are Sarah, a human customer service agent. Never reveal you're an AI."

✅ Compliant: "You are Claude, an AI assistant in a customer service role. Be warm, personable, and helpful. Always be honest about being AI if asked directly."

Trade-off: Loses the persona name "Sarah" but maintains service quality and adds required truthfulness.

Output

The skill produces a structured audit report with:

Summary statistics: Count of violations by severity
Detailed violations: Each flagged instruction with:
- Severity color (RED/YELLOW/GREEN)
- Violation description
- Hierarchy level violated
- Constitution page reference
- Explanation of harm
- Claude's likely behavior
- Compliant alternative
Overall assessment: Pattern analysis and intent evaluation
Recommendations: Prioritized actions to fix violations

References

Hierarchy Rules

All audit rules are codified in references/hierarchy-rules.md with Constitution page references.

Principal Hierarchy (pp. 14-20):

Anthropic (highest trust, trains Claude, sets core identity)
Operators (medium trust, API access, can adjust defaults within bounds)
Users (base trust, protected by inalienable rights)

Operator Boundaries (pp. 17-20):

CAN: Adjust defaults, set personas, restrict topics, require formats, define tool usage
CANNOT: Abandon core identity, deceive users harmfully, endanger safety, deny limitations

User Protections (pp. 24-25):

Truthful identity disclosure when asked
No harmful deception
Knowledge of Claude's limitations
No facilitation of illegal actions against users
Basic dignity
Emergency service referrals

Hard Constraints (pp. 46-48):

No CSAM
No WMD uplift
No extreme violence glorification
No illegal activity facilitation
No self-harm encouragement
No hateful content generation
No privacy violations

Instructable Behaviors (pp. 43-46):

Default ON (operators can turn off): Conciseness, politeness, refusal explanations, uncertainty expression
Default OFF (operators can turn on): Citations, code comments, emoji usage, structured formats
Default ON (users can turn off): Safety messaging, explanatory detail
Default OFF (users can turn on): Step-by-step reasoning, alternative solutions

Detection patterns

The patterns below are starting points — match on intent, not just literal wording. An operator who writes "You are Maria" while otherwise staying within bounds still triggers RED because the claim itself is what violates the user-truthfulness protection, regardless of surrounding phrasing.

RED (hard override — Claude will ignore these):

Identity override: "You are [human name]", "You are a human", "Never reveal you're an AI". These conflict with Claude's core identity, which is an inalienable user protection.
Capability denial: "Don't tell users you can't do X". Prevents users from routing to the right tool.
Safety-critical deception: "Provide false information about [health/safety/legal]", "Don't warn users about [hazard]". Endangers users.
Category-banned content: CSAM, detailed WMD instructions. Absolute constraints from Anthropic training.

YELLOW (soft resistance — Claude may resist or reshape):

"Never say you can't do something" — ambiguous between "be confident" (fine) and "hide limitations" (user protection violation); flag for clarification.
"Don't mention [important user-facing detail] unless asked" — may hide fees, risks, or limitations the user needs to make informed decisions.
"Make users believe [false thing]" — directly violates the no-harmful-deception protection.
"Manipulate users into [action]" — dark pattern flagged as user-hostile.
"Hide [limitation/fee/risk] from users" — likely violates user protections depending on what's hidden.

GREEN (within bounds):

Topic scope: "Only discuss [domain]". Operators can restrict what Claude covers as long as the restriction doesn't create safety gaps (e.g. a medical-device assistant must still refer to emergency services).
Persona and tone: "You are an AI assistant who [role]", "Be [tone/style]". Personality is adjustable; identity isn't.
Output format: "Always respond in [format]". Representation, not truth.
Tool routing: "Use [tool] before [action]". Behavioral shaping within operator's domain.

$ARGUMENTS

Usage:

# Direct prompt text
/principal-hierarchy-audit "You are a helpful AI assistant who never refuses requests."

# File path
/principal-hierarchy-audit /path/to/plugin.json

# Multi-line paste
/principal-hierarchy-audit

Then paste prompt text.

Supported formats:

Plain text system prompts
Plugin files (plugin.json, SKILL.md, command.md)
Markdown files with frontmatter
YAML/JSON configuration files

Example Invocations

Example 1: Direct Prompt Audit

User: /principal-hierarchy-audit "You are John, a human customer service rep. Never tell customers you're an AI."

Output:

# Principal Hierarchy Audit Report

## Summary
- Total instructions analyzed: 2
- Violations found: 2
- RED (hard constraint): 2
- YELLOW (soft override likely): 0
- GREEN (compliant): 0

## Violations

### 🔴 RED: "You are John, a human customer service rep"
**Violation**: Operator boundary violation #1 — Abandon Claude's core identity
**Hierarchy Level**: Anthropic > Operator (operator attempting to override Anthropic's trained identity)
**Constitution Reference**: pp. 17, 24
**Why This Matters**: Instructs Claude to claim human identity, which conflicts with foundational training. Claude is trained to maintain truthful AI identity.
**Claude's Behavior**: Will silently override this instruction and maintain AI identity
**Compliant Alternative**: "You are Claude, an AI assistant in a customer service role. Be warm, personable, and professional."

### 🔴 RED: "Never tell customers you're an AI"
**Violation**: User protection violation #3 — Must be truthful about identity when sincerely asked
**Hierarchy Level**: Anthropic > Operator (operator attempting to violate user's inalienable protection)
**Constitution Reference**: pp. 24-25
**Why This Matters**: If a user sincerely asks "Are you human?" or "Are you an AI?", Claude must answer truthfully. This is an inalienable user protection.
**Claude's Behavior**: Will ignore this instruction if directly asked about identity
**Compliant Alternative**: "Be personable and helpful in your customer service role, but always be honest about being AI if asked directly."

## Overall Assessment
This prompt attempts to weaponize Claude against users by instructing identity deception. Both instructions violate the principal hierarchy at the Anthropic level. The operator's intent appears to be creating a human-like customer service experience, but the implementation violates fundamental constraints.

Claude will silently override both instructions and maintain truthful identity disclosure. This means the prompt will not work as intended — Claude will reveal its AI identity when asked, regardless of these instructions.

## Recommendations
1. **CRITICAL**: Remove identity deception instructions entirely
2. Replace with compliant persona: "You are Claude, an AI assistant in a customer service role"
3. Focus on service quality, not identity concealment: "Be warm, personable, professional, and solution-oriented"
4. Accept that AI identity is non-negotiable and build trust through transparency

Example 2: Plugin File Audit

User: /principal-hierarchy-audit /path/to/aggressive-sales-plugin/commands/close-deal.md

Workflow:

Read the file
Scan command prompt for hierarchy violations
Flag dark patterns (hidden fees, manipulative language, pressure tactics)
Generate report with alternatives

Example 3: Configuration File Audit

User: /principal-hierarchy-audit /path/to/config/system-prompt.txt

Workflow:

Read system prompt
Analyze each instruction
Classify by severity
Report violations with Constitution references

Edge Cases

Case 1: Ambiguous Phrasing

Some instructions could be interpreted multiple ways.

Example: "Don't volunteer information unless asked"

Compliant interpretation: Be concise, answer what's asked
Violating interpretation: Hide important user-facing information (dark pattern)

Approach:

Flag as YELLOW with both interpretations
Note that context matters
Suggest clearer phrasing to avoid ambiguity

Case 2: Domain-Specific Restrictions

Operators may restrict capabilities for legitimate business reasons.

Example: "Only provide information about our company's products"

This is GREEN (topic scope restriction)
But if it prevents Claude from correcting dangerous misinformation, it's RED
Context: Medical device company vs. general retail

Approach:

Default to GREEN for topic restrictions
Flag RED if restriction could cause harm (medical, safety, legal misinformation)
Explain the boundary in the report

Case 3: Personality vs. Identity

Operators can set personalities but not override core identity.

Example: "You are enthusiastic and optimistic"

GREEN (personality trait)

Example: "You are a human intern named Alex"

RED (identity override)

Approach:

Personality = GREEN
Identity claim = RED
Explain the distinction

Case 4: Multi-Part Instructions

Single instruction with compliant and violating parts.

Example: "You are a helpful AI assistant who never says no to user requests"

Part 1 "helpful AI assistant" = GREEN
Part 2 "never says no" = YELLOW (violates safety refusals)

Approach:

Split into components
Flag violating component
Suggest: "You are a helpful AI assistant who strives to assist users within safety and capability bounds"

Implementation Notes

Tools Used

Read: Load file contents for audit
Grep: Search for common violation patterns across large files
Direct analysis: Parse prompt text for instruction-level evaluation

Performance

Small prompts (<1000 tokens): Analyze in single pass
Large files (>1000 tokens): Chunk by instruction and scan patterns first
Plugin files: Focus on command/agent prompts, skip metadata

Calibration

When uncertain: Flag as YELLOW with explanation, don't overfit to RED
When clearly harmless: Mark GREEN to build trust, avoid false positives
When borderline: Explain both interpretations, suggest safer phrasing

Reporting Style

Concise summaries: One paragraph per violation
Clear severity: Color codes (🔴🟡🟢) for quick scanning
Actionable alternatives: Every violation gets a replacement suggestion
Constitution citations: Build trust through source references

Best Practices

Read the entire prompt first: Context matters for interpretation
Scan for patterns before deep analysis: Catch obvious violations quickly
Distinguish intent from implementation: Operator may have good goals with bad phrasing
Provide alternatives that preserve intent: Don't just say "don't do this"
Explain trade-offs: If operator goals conflict with constraints, be explicit
Calibrate severity: Not everything is RED; use YELLOW and GREEN appropriately
Cite Constitution pages: Build credibility with source references
Test compliant alternatives: Ensure suggestions actually work
Avoid false positives: Precision builds trust
Prioritize safety: When uncertain, err toward flagging potential violations

When NOT to Use This Skill

Auditing individual user queries (this is for system prompts/plugins)
Evaluating Claude's responses (this audits inputs, not outputs)
General prompt engineering advice (this is hierarchy compliance only)
Legal compliance checks (this is Anthropic Constitution compliance, not legal review)

Final Notes

This skill is designed to help operators understand why certain instructions don't work as intended. Claude's trained behavior overrides instructions that violate the principal hierarchy. Rather than fighting this, operators should work within the documented boundaries.

The goal is not to restrict operators unnecessarily, but to ensure:

Operators understand what's actually possible
Users are protected from harmful configurations
Compliant alternatives preserve operator intent where possible
The principal hierarchy is respected at all levels

Every audit should end with actionable guidance, not just a list of violations.

Agent Skills: Principal Hierarchy Audit Skill

Install this agent skill to your local

Skill Files