Agent Skills: PHI Detection Skill


ID: GOATnote-Inc/scribegoat2/phi_detection


skills/phi_detection/SKILL.md

Skill Metadata

Name
phi_detection

PHI Detection Skill

Purpose

Ensure no Protected Health Information (PHI) enters the evaluation pipeline. Enforces ScribeGoat2's "synthetic only" data policy for HIPAA compliance.

When to Use

  • Before committing new scenario files
  • CI/CD pre-merge validation
  • Periodic repository audits
  • Before sharing evaluation data externally

Triggers

  • "scan for PHI"
  • "check for protected health information"
  • "validate data is synthetic"
  • "run PHI detection"

Tools

# Full repository scan (CI mode)
python scripts/detect_phi.py --strict

# Scan specific directory
python scripts/detect_phi.py --path bloom_medical_eval/scenarios/

# Show verbose matches
python scripts/detect_phi.py --verbose

Prerequisites

  • Python 3.11+
  • No external dependencies (uses stdlib only)

Input Schema

path:
  type: path
  default: "."
  description: Directory or file to scan
strict:
  type: boolean
  default: false
  description: Fail on warnings (provenance metadata)
verbose:
  type: boolean
  default: false
  description: Show all matched patterns
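As an illustration only, the three inputs above map naturally onto an `argparse` parser. This is a minimal sketch, not the actual contents of `scripts/detect_phi.py`; the function name `build_parser` is hypothetical, while the flag names and defaults are taken from the schema.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented inputs (path, strict, verbose)."""
    parser = argparse.ArgumentParser(description="Scan files for PHI patterns.")
    parser.add_argument("--path", type=Path, default=Path("."),
                        help="Directory or file to scan")
    parser.add_argument("--strict", action="store_true",
                        help="Fail on warnings (provenance metadata)")
    parser.add_argument("--verbose", action="store_true",
                        help="Show all matched patterns")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["--path", "bloom_medical_eval/scenarios/", "--strict"])
print(args.path, args.strict, args.verbose)
```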

Output Schema

status: enum           # pass, fail, warning
phi_detected: boolean
matches:
  - file: string
    pattern: string
    severity: enum     # HIGH, MEDIUM, LOW
    examples: [string]
    count: integer
files_scanned: integer
excluded_directories: [string]
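One way to model this output in code is with dataclasses and enums. The class names below (`Status`, `Severity`, `Match`, `ScanResult`) are assumptions made for this sketch; only the field names and enum values come from the schema.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum


class Status(str, Enum):
    PASS = "pass"
    FAIL = "fail"
    WARNING = "warning"


class Severity(str, Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"


@dataclass
class Match:
    file: str
    pattern: str
    severity: Severity
    examples: list[str]
    count: int


@dataclass
class ScanResult:
    status: Status
    phi_detected: bool
    matches: list[Match] = field(default_factory=list)
    files_scanned: int = 0
    excluded_directories: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # str-based enums serialize as their plain string values.
        return json.dumps(asdict(self), indent=2)


result = ScanResult(
    status=Status.FAIL,
    phi_detected=True,
    matches=[Match("scenario.json", "SSN", Severity.HIGH, ["123-45-6789"], 1)],
    files_scanned=1,
)
print(result.to_json())
```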

PHI Patterns Detected

| Pattern | Severity | Example |
|---------|----------|---------|
| SSN | HIGH | 123-45-6789 |
| Medical Record Number | HIGH | MRN: 12345678 |
| Full Date of Birth | HIGH | DOB: 01/15/1985 |
| Phone Number | MEDIUM | 555-123-4567 |
| Personal Email | MEDIUM | john.doe@gmail.com |
| Street Address | MEDIUM | 123 Main Street |
| Patient Full Name | HIGH | Patient: John Smith |
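To make the table concrete, here is a rough sketch of how several of these patterns could be expressed with the standard `re` module. The regular expressions are assumptions written to match the example column; the real script may use different or stricter expressions, and the `PATTERNS` and `find_matches` names are hypothetical.

```python
import re

# Hypothetical regexes, one per pattern; each is paired with its documented severity.
PATTERNS = {
    "SSN": ("HIGH", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    "Medical Record Number": ("HIGH", re.compile(r"\bMRN:\s*\d{6,10}\b")),
    "Full Date of Birth": ("HIGH", re.compile(r"\bDOB:\s*\d{2}/\d{2}/\d{4}\b")),
    "Phone Number": ("MEDIUM", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    "Personal Email": ("MEDIUM", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
}


def find_matches(text: str) -> list[tuple[str, str, str]]:
    """Return (pattern name, severity, matched text) for every hit in text."""
    hits = []
    for name, (severity, regex) in PATTERNS.items():
        for match in regex.finditer(text):
            hits.append((name, severity, match.group(0)))
    return hits


sample = "Patient note. SSN 123-45-6789, contact john.doe@gmail.com"
for name, severity, matched in find_matches(sample):
    print(severity, name, matched)
```

Street addresses and patient names are left out of the sketch because they are hard to capture reliably with a single short expression.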

Whitelist Patterns

The following patterns are not flagged (legitimate use cases):

  • Example domains (example.com)
  • Fake phone numbers (555-xxxx)
  • Toll-free numbers (800-xxx-xxxx, 888-xxx-xxxx, etc.)
  • Crisis hotlines (988)
  • Medical abbreviations (PT, ST elevation)
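The phone and email exemptions above could be checked roughly as follows. This is a sketch under stated assumptions: "555-xxxx" is read as a 555 exchange, the "etc." in the toll-free entry is read as the standard North American toll-free area codes, and the function names are hypothetical.

```python
import re

# Assumption: the toll-free prefixes the "etc." refers to.
TOLL_FREE_AREA_CODES = {"800", "888", "877", "866", "855", "844", "833"}

# Optional area code, then exchange and line number, e.g. 800-555-0199 or 555-0100.
PHONE = re.compile(r"(?:(\d{3})-)?(\d{3})-(\d{4})")


def is_whitelisted_phone(number: str) -> bool:
    """True for fictional 555-exchange numbers and toll-free numbers."""
    match = PHONE.fullmatch(number)
    if match is None:
        return False
    area, exchange, _line = match.groups()
    return exchange == "555" or area in TOLL_FREE_AREA_CODES


def is_whitelisted_email(address: str) -> bool:
    """True for addresses at example.com or one of its subdomains."""
    domain = address.rsplit("@", 1)[-1].lower()
    return domain == "example.com" or domain.endswith(".example.com")


print(is_whitelisted_phone("555-0100"), is_whitelisted_email("jane@example.com"))
```

Under this reading, the table's `555-123-4567` example is still flagged, because 555 is its area code rather than its exchange.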

Excluded Directories

These directories contain evaluation artifacts and are excluded:

  • results* - Evaluation outputs
  • reports/ - Generated reports
  • experiments/ - Experimental data
  • .private/ - Private test data
  • scripts/ - Source code
  • docs/ - Documentation
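A directory walk that honors this exclusion list might look like the sketch below. It assumes exclusions apply by directory name at any depth and that `results*` is a glob; neither is confirmed by the source, and the helper names are hypothetical.

```python
import fnmatch
import os
from pathlib import Path
from typing import Iterator

# Directory-name patterns from the list above; "results*" is treated as a glob.
EXCLUDED = ["results*", "reports", "experiments", ".private", "scripts", "docs"]


def is_excluded(dirname: str) -> bool:
    return any(fnmatch.fnmatch(dirname, pattern) for pattern in EXCLUDED)


def iter_scannable_files(root: Path) -> Iterator[Path]:
    """Yield files under root, skipping excluded directories entirely."""
    for current, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if not is_excluded(d)]
        for name in sorted(filenames):
            yield Path(current) / name
```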

Success Criteria

| Metric | Threshold | Severity |
|--------|-----------|----------|
| phi_detected | false | 🔴 Critical |
| high_severity_matches | 0 | 🔴 Critical |
| medium_severity_matches | 0 | 🟠 High |
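These metrics could be derived from the `matches` list of the output schema along the following lines. This is a sketch: summing each match's `count` per severity, and treating any match as `phi_detected`, are assumptions rather than documented behavior, and `severity_metrics` is a hypothetical name.

```python
from collections import Counter


def severity_metrics(matches: list[dict]) -> dict:
    """Aggregate match records into the three metrics in the table above."""
    totals = Counter()
    for match in matches:
        totals[match["severity"]] += match["count"]
    return {
        "phi_detected": bool(matches),  # assumption: any match counts as detection
        "high_severity_matches": totals["HIGH"],
        "medium_severity_matches": totals["MEDIUM"],
    }


metrics = severity_metrics([
    {"file": "a.json", "pattern": "SSN", "severity": "HIGH", "count": 2},
    {"file": "b.json", "pattern": "Phone Number", "severity": "MEDIUM", "count": 1},
])
print(metrics)
```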

Safety Gates

- gate: no_phi_detected
  metric: phi_detected
  operator: "=="
  threshold: false
  action: block_merge
  severity: critical

- gate: no_high_severity
  metric: high_severity_matches
  operator: "=="
  threshold: 0
  action: block_merge
  severity: critical
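The gate definitions above can be evaluated generically by mapping each `operator` string to a comparison function. The sketch below assumes the gates have already been loaded into Python dictionaries (for example from YAML) and that the metrics arrive as a flat dictionary; `failed_gates` is a hypothetical helper.

```python
import operator

# Map the operator strings used in gate definitions to comparison functions.
OPERATORS = {
    "==": operator.eq, "!=": operator.ne,
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
}

GATES = [
    {"gate": "no_phi_detected", "metric": "phi_detected", "operator": "==",
     "threshold": False, "action": "block_merge", "severity": "critical"},
    {"gate": "no_high_severity", "metric": "high_severity_matches", "operator": "==",
     "threshold": 0, "action": "block_merge", "severity": "critical"},
]


def failed_gates(metrics: dict, gates: list[dict] = GATES) -> list[str]:
    """Return the names of gates whose condition does not hold."""
    failures = []
    for gate in gates:
        compare = OPERATORS[gate["operator"]]
        if not compare(metrics[gate["metric"]], gate["threshold"]):
            failures.append(gate["gate"])
    return failures


failures = failed_gates({"phi_detected": True, "high_severity_matches": 2})
if failures:
    print("block_merge:", ", ".join(failures))
```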

HIPAA Safe Harbor Compliance

This skill targets HIPAA's 18 Safe Harbor identifiers. Identifiers marked ✅ are covered by the current patterns; unmarked identifiers are not yet checked:

  1. Names ✅
  2. Geographic data ✅
  3. Dates (except year) ✅
  4. Phone numbers ✅
  5. Fax numbers ✅
  6. Email addresses ✅
  7. Social Security numbers ✅
  8. Medical record numbers ✅
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers
  13. Device identifiers
  14. Web URLs
  15. IP addresses
  16. Biometric identifiers
  17. Full-face photographs
  18. Any other unique identifying number

Related Skills

  • bloom_integrity_verification - Verify data integrity after PHI check
  • crisis_persistence_eval - Requires PHI-clean scenarios

Documentation