Agent Skills: Code Clone Assistant

Detect and refactor code duplication with PMD CPD. TRIGGERS - code clones, DRY violations, duplicate code.

UncategorizedID: terrylica/cc-skills/code-clone-assistant

Install this agent skill to your local

pnpm dlx add-skill https://github.com/terrylica/cc-skills/tree/HEAD/plugins/quality-tools/skills/code-clone-assistant

Skill Files

Browse the full folder contents for code-clone-assistant.

Download Skill

Loading file tree…

plugins/quality-tools/skills/code-clone-assistant/SKILL.md

Skill Metadata

Name
code-clone-assistant
Description
Detect and refactor code duplication with PMD CPD. TRIGGERS - code clones, DRY violations, duplicate code.

Code Clone Assistant

Detect code clones and guide refactoring using PMD CPD (exact duplicates) + Semgrep (patterns).

Tools

  • PMD CPD v7.17.0+: Exact duplicate detection
  • Semgrep v1.140.0+: Pattern-based detection

Tested: October 2025 - 30 violations detected across 3 sample files Coverage: ~3x more violations than using either tool alone


When to Use This Skill

Use this skill when:

  • Finding duplicate code in a codebase
  • Detecting DRY violations
  • Refactoring similar code patterns
  • Identifying copy-paste code

Why Two Tools?

PMD CPD and Semgrep detect different clone types:

| Aspect | PMD CPD | Semgrep | | ------------ | -------------------------------- | -------------------------------- | | Detects | Exact copy-paste duplicates | Similar patterns with variations | | Scope | Across files ✅ | Within/across files (Pro only) | | Matching | Token-based (ignores formatting) | Pattern-based (AST matching) | | Rules | ❌ No custom rules | ✅ Custom rules |

Result: Using both finds ~3x more DRY violations.

Clone Types

| Type | Description | PMD CPD | Semgrep | | ------ | ------------------------------- | --------------- | ----------- | | Type-1 | Exact copies | ✅ Default | ✅ | | Type-2 | Renamed identifiers | ✅ --ignore-* | ✅ | | Type-3 | Near-miss with variations | ⚠️ Partial | ✅ Patterns | | Type-4 | Semantic clones (same behavior) | ❌ | ❌ |


Quick Start Workflow

# Step 1: Detect exact duplicates (PMD CPD)
pmd cpd -d . -l python --minimum-tokens 20 -f markdown > pmd-results.md

# Step 2: Detect pattern violations (Semgrep)
semgrep --config=clone-rules.yaml --sarif --quiet > semgrep-results.sarif

# Step 3: Analyze combined results (Claude Code)
# Parse both outputs, prioritize by severity

# Step 4: Refactor (Claude Code with user approval)
# Extract shared functions, consolidate patterns, verify tests


Accepted Exceptions (Known Intentional Duplication)

Not all code duplication is a problem. Some codebases deliberately use copy-and-adapt patterns where refactoring would be harmful. When running clone detection, always check for accepted exceptions before recommending refactoring.

When Duplication Is Acceptable

| Pattern | Why Acceptable | Example | | ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ | | Generation-per-directory experiments | Each generation is an immutable, self-contained experiment. Sharing code across generations would break provenance and make past experiments non-reproducible. | SQL templates, sweep scripts where each gen{NNN}/ is independent | | SQL templates with placeholder substitution | SQL has no import/include mechanism. Templates use sed placeholder replacement (__PLACEHOLDER__), not function calls. Extracting shared CTEs into separate files would break the single-file execution model. | ClickHouse sweep templates sharing signal detection + metrics CTEs | | Protocol/schema boilerplate | Serialization formats, API contracts, and wire protocols require exact structure in each location. Abstracting them hides the contract. | NDJSON telemetry line construction in wrapper scripts | | Test fixtures and golden files | Test data intentionally duplicates production patterns to verify behavior. Sharing fixtures creates brittle cross-test dependencies. | Test setup code, expected output snapshots |

How to Report Accepted Exceptions

When clone detection finds duplication that matches an accepted exception pattern:

  1. Report it — always show the user what was found (lines, tokens, files)
  2. Flag as accepted — explicitly state it matches a known exception pattern
  3. Explain why — cite the specific reason refactoring is not recommended
  4. Do NOT recommend refactoring — this is the key difference from actionable findings

Example output format:

Code Clone Analysis Results

PMD CPD Findings:
  Clone 1: 115 lines (575 tokens) — base_bars → signals CTEs
    gen610_template.sql:33 ↔ gen710_template.sql:38
    Status: ACCEPTED EXCEPTION (generation-per-directory experiment)
    Reason: Each generation is immutable. Shared CTEs would break
            experiment provenance and reproducibility.

  Clone 2: 36 lines (478 tokens) — metrics aggregation
    gen610_template.sql:207 ↔ gen710_template.sql:244
    Status: ACCEPTED EXCEPTION (SQL template without include mechanism)

Actionable Findings: 0
Accepted Exceptions: 2

Project-Level Exception Configuration

Projects can declare accepted exception patterns in their CLAUDE.md:

## Code Clone Exceptions

- `sql/gen*_template.sql` — generation-per-directory experiments (immutable)
- `scripts/gen*/` — copy-and-adapt sweep scripts (no shared infrastructure)
- `tests/fixtures/` — intentional duplication for test isolation

When this section exists in a project's CLAUDE.md, the code-clone-assistant should check it before classifying findings.


Reference Documentation

For detailed information, see:


Troubleshooting

| Issue | Cause | Solution | | -------------------------- | ---------------------------- | ------------------------------------------------ | | PMD CPD not found | Not installed or not in PATH | brew install pmd or download from PMD releases | | Semgrep timeout | Large codebase scan | Use --exclude to limit scope | | No duplicates detected | minimum-tokens too high | Lower --minimum-tokens value (try 15) | | Too many false positives | minimum-tokens too low | Increase --minimum-tokens (try 30+) | | Language not recognized | Wrong -l flag | Check PMD CPD supported languages list | | SARIF parse error | Semgrep output malformed | Upgrade Semgrep to latest version | | Memory error on large repo | Java heap too small | Set PMD_JAVA_OPTS=-Xmx4g | | Missing clone rules file | Custom rules not created | Create clone-rules.yaml or use default config |