PHI Detection Skill
Purpose
Ensure no Protected Health Information (PHI) enters the evaluation pipeline. Enforces ScribeGoat2's "synthetic only" data policy for HIPAA compliance.
When to Use
- Before committing new scenario files
- CI/CD pre-merge validation
- Periodic repository audits
- Before sharing evaluation data externally
Triggers
- "scan for PHI"
- "check for protected health information"
- "validate data is synthetic"
- "run PHI detection"
Tools
# Full repository scan (CI mode)
python scripts/detect_phi.py --strict
# Scan specific directory
python scripts/detect_phi.py --path bloom_medical_eval/scenarios/
# Show verbose matches
python scripts/detect_phi.py --verbose
Prerequisites
- Python 3.11+
- No external dependencies (uses stdlib only)
Input Schema
path:
type: path
default: "."
description: Directory or file to scan
strict:
type: boolean
default: false
description: Fail on warnings (provenance metadata)
verbose:
type: boolean
default: false
description: Show all matched patterns
Output Schema
status: enum # pass, fail, warning
phi_detected: boolean
matches:
- file: string
pattern: string
severity: enum # HIGH, MEDIUM, LOW
examples: [string]
count: integer
files_scanned: integer
excluded_directories: [string]
PHI Patterns Detected
| Pattern | Severity | Example | |---------|----------|---------| | SSN | HIGH | 123-45-6789 | | Medical Record Number | HIGH | MRN: 12345678 | | Full Date of Birth | HIGH | DOB: 01/15/1985 | | Phone Number | MEDIUM | 555-123-4567 | | Personal Email | MEDIUM | john.doe@gmail.com | | Street Address | MEDIUM | 123 Main Street | | Patient Full Name | HIGH | Patient: John Smith |
Whitelist Patterns
The following patterns are not flagged (legitimate use cases):
- Example domains (
example.com) - Fake phone numbers (
555-xxxx) - Toll-free numbers (
800-xxx-xxxx,888-xxx-xxxx, etc.) - Crisis hotlines (
988) - Medical abbreviations (
PT,ST elevation)
Excluded Directories
These directories contain evaluation artifacts and are excluded:
results*- Evaluation outputsreports/- Generated reportsexperiments/- Experimental data.private/- Private test datascripts/- Source codedocs/- Documentation
Success Criteria
| Metric | Threshold | Severity |
|--------|-----------|----------|
| phi_detected | false | π΄ Critical |
| high_severity_matches | 0 | π΄ Critical |
| medium_severity_matches | 0 | π High |
Safety Gates
- gate: no_phi_detected
metric: phi_detected
operator: "=="
threshold: false
action: block_merge
severity: critical
- gate: no_high_severity
metric: high_severity_matches
operator: "=="
threshold: 0
action: block_merge
severity: critical
HIPAA Safe Harbor Compliance
This skill implements detection of HIPAA's 18 Safe Harbor identifiers:
- Names β
- Geographic data β
- Dates (except year) β
- Phone numbers β
- Fax numbers β
- Email addresses β
- Social Security numbers β
- Medical record numbers β
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers
- Device identifiers
- Web URLs
- IP addresses
- Biometric identifiers
- Full-face photographs
- Any other unique identifying number
Related Skills
bloom_integrity_verification- Verify data integrity after PHI checkcrisis_persistence_eval- Requires PHI-clean scenarios