Agent Skills: Performing AI-Driven OSINT Correlation

>-

UncategorizedID: plurigrid/asi/performing-ai-driven-osint-correlation

Install this agent skill to your local

pnpm dlx add-skill https://github.com/plurigrid/asi/tree/HEAD/plugins/asi/skills/performing-ai-driven-osint-correlation

Skill Files

Browse the full folder contents for performing-ai-driven-osint-correlation.

Download Skill

Loading file tree…

plugins/asi/skills/performing-ai-driven-osint-correlation/SKILL.md

Skill Metadata

Name
performing-ai-driven-osint-correlation
Description
>-

Performing AI-Driven OSINT Correlation

When to Use

  • You have collected raw OSINT data from multiple tools and sources but need to identify connections, contradictions, and patterns across them.
  • You need to build a unified intelligence profile for a target entity (person, organization, or infrastructure) from fragmented data.
  • Traditional manual correlation is too slow or error-prone for the volume of data collected.
  • You want confidence-scored assessments of identity linkage across platforms rather than simple keyword matching.

Prerequisites

  • Python 3.10+ with requests, json, and csv libraries
  • Sherlock installed (pip install sherlock-project)
  • theHarvester installed (pip install theHarvester)
  • SpiderFoot 4.0+ running on localhost:5001
  • Access to an LLM API (OpenAI, Anthropic, or local model via Ollama)
  • Optional: Maltego CE for graph visualization of correlation results
  • Optional: API keys for Shodan, VirusTotal, HaveIBeenPwned, Hunter.io

Workflow

Legal & Ethical Requirements

  • Obtain documented written authorization before any investigation
  • Establish lawful basis for data processing (law enforcement, corporate policy, etc.)
  • Define PII retention limits and data handling procedures
  • Comply with local privacy regulations (GDPR, CCPA, etc.)

Phase 1 — Multi-Source OSINT Collection

  1. Create the working directory for all OSINT outputs:

    mkdir -p /tmp/osint
    
  2. Enumerate usernames across platforms with Sherlock:

    sherlock "targetusername" --output /tmp/osint/sherlock-results.txt --csv
    
  3. Harvest emails, subdomains, and hosts with theHarvester:

    theHarvester -d targetdomain.com -b all -f /tmp/osint/harvester-results.json
    
  4. Run a SpiderFoot passive scan via REST API:

    curl -s http://localhost:5001/api/scan/start \
      -d "scanname=target-recon&scantarget=targetdomain.com&usecase=passive" \
      | jq '.scanid'
    
  5. Export SpiderFoot results when scan completes:

    SCAN_ID="<scanid_from_step_3>"
    curl -s "http://localhost:5001/api/scan/${SCAN_ID}/results?type=all" \
      -o /tmp/osint/spiderfoot-results.json
    
  6. Query breach databases for email exposure (example with HIBP API):

    curl -s -H "hibp-api-key: ${HIBP_KEY}" \
      -H "User-Agent: OSINT-Correlation-Skill" \
      "https://haveibeenpwned.com/api/v3/breachedaccount/target@example.com" \
      -o /tmp/osint/breach-results.json
    

Phase 2 — Data Normalization

  1. Normalize all collected data into a common schema. Create a unified JSON structure that tags each finding with its source, timestamp, and data type:

    cat > /tmp/osint/normalize.py << 'EOF'
    import json, csv, sys, os
    from datetime import datetime
    
    findings = []
    
    # Normalize Sherlock CSV results
    sherlock_path = "/tmp/osint/sherlock-results.txt"
    if os.path.exists(sherlock_path):
        with open(sherlock_path) as f:
            for row in csv.DictReader(f):
                findings.append({
                    "source": "sherlock",
                    "type": "social_profile",
                    "platform": row.get("name", ""),
                    "url": row.get("url_user", ""),
                    "username": row.get("username", ""),
                    "status": row.get("status", ""),
                    "collected_at": datetime.utcnow().isoformat()
                })
    
    # Normalize theHarvester JSON results
    harvester_path = "/tmp/osint/harvester-results.json"
    if os.path.exists(harvester_path):
        with open(harvester_path) as f:
            data = json.load(f)
            for email in data.get("emails", []):
                findings.append({
                    "source": "theHarvester",
                    "type": "email",
                    "value": email,
                    "collected_at": datetime.utcnow().isoformat()
                })
            for host in data.get("hosts", []):
                findings.append({
                    "source": "theHarvester",
                    "type": "hostname",
                    "value": host,
                    "collected_at": datetime.utcnow().isoformat()
                })
    
    # Normalize SpiderFoot results
    sf_path = "/tmp/osint/spiderfoot-results.json"
    if os.path.exists(sf_path):
        with open(sf_path) as f:
            for item in json.load(f):
                findings.append({
                    "source": "spiderfoot",
                    "type": item.get("type", "unknown"),
                    "value": item.get("data", ""),
                    "module": item.get("module", ""),
                    "collected_at": datetime.utcnow().isoformat()
                })
    
    with open("/tmp/osint/normalized-findings.json", "w") as f:
        json.dump(findings, f, indent=2)
    
    print(f"Normalized {len(findings)} findings from {len(set(f['source'] for f in findings))} sources")
    EOF
    python3 /tmp/osint/normalize.py
    

Phase 3 — AI-Driven Correlation

  1. Send normalized findings to an LLM for cross-source correlation analysis:

    cat > /tmp/osint/correlate.py << 'PYEOF'
    import json, os
    from openai import OpenAI  # or anthropic, ollama, etc.
    
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    
    with open("/tmp/osint/normalized-findings.json") as f:
        findings = json.load(f)
    
    correlation_prompt = f"""You are an OSINT analyst. Analyze these findings collected
    from multiple sources and produce a correlation report.
    
    For each identity or entity you detect:
    1. List all linked accounts/profiles with the evidence connecting them.
    2. Assign a confidence score (0.0-1.0) for each linkage based on:
       - Exact username match across platforms (high)
       - Similar usernames with shared metadata (medium)
       - Same email in breach data and registration (high)
       - Co-occurring infrastructure (IP, domain) (medium)
       - Temporal correlation of account creation dates (low-medium)
    3. Identify contradictions or potential false positives.
    4. Flag high-risk exposures (breached credentials, PII leaks, infrastructure overlaps).
    5. Produce a structured JSON report.
    
    Raw findings:
    {json.dumps(findings[:500], indent=2)}
    """
    
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert OSINT analyst specializing in identity correlation and link analysis."},
            {"role": "user", "content": correlation_prompt}
        ],
        temperature=0.1,
        response_format={"type": "json_object"}
    )
    
    report = json.loads(response.choices[0].message.content)
    
    with open("/tmp/osint/correlation-report.json", "w") as f:
        json.dump(report, f, indent=2)
    
    print(json.dumps(report, indent=2))
    PYEOF
    python3 /tmp/osint/correlate.py
    
  2. Perform entity resolution — deduplicate and merge related identities:

    cat > /tmp/osint/resolve.py << 'PYEOF'
    import json
    
    with open("/tmp/osint/correlation-report.json") as f:
        report = json.load(f)
    
    # Extract entities and build a link graph
    entities = report.get("entities", [])
    print(f"Identified {len(entities)} distinct entities")
    for entity in entities:
        name = entity.get("identifier", "unknown")
        confidence = entity.get("confidence", 0)
        links = entity.get("linked_accounts", [])
        risk = entity.get("risk_level", "unknown")
        print(f"  [{confidence:.0%}] {name} — {len(links)} linked accounts — risk: {risk}")
    PYEOF
    python3 /tmp/osint/resolve.py
    

Phase 4 — Reporting and Visualization

  1. Generate a final intelligence profile in Markdown:

    cat > /tmp/osint/report.py << 'PYEOF'
    import json
    from datetime import datetime
    
    with open("/tmp/osint/correlation-report.json") as f:
        report = json.load(f)
    
    md = f"# OSINT Correlation Report\n\n"
    md += f"**Generated:** {datetime.utcnow().isoformat()}Z\n\n"
    md += "## Entity Profiles\n\n"
    
    for entity in report.get("entities", []):
        eid = entity.get("identifier", "Unknown")
        conf = entity.get("confidence", 0)
        md += f"### {eid} (Confidence: {conf:.0%})\n\n"
        md += "| Source | Platform | Evidence |\n|--------|----------|----------|\n"
        for link in entity.get("linked_accounts", []):
            md += f"| {link.get('source','')} | {link.get('platform','')} | {link.get('evidence','')} |\n"
        md += f"\n**Risk Level:** {entity.get('risk_level', 'N/A')}\n\n"
        for flag in entity.get("flags", []):
            md += f"- ⚠️ {flag}\n"
        md += "\n"
    
    with open("/tmp/osint/intelligence-profile.md", "w") as f:
        f.write(md)
    
    print("Report written to /tmp/osint/intelligence-profile.md")
    PYEOF
    python3 /tmp/osint/report.py
    
  2. Optional — Import correlation graph into Maltego for visualization:

    # Export entities as Maltego-compatible CSV for manual import
    cat > /tmp/osint/maltego_export.py << 'PYEOF'
    import json, csv
    
    with open("/tmp/osint/correlation-report.json") as f:
        report = json.load(f)
    
    with open("/tmp/osint/maltego-import.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Entity Type", "Value", "Linked To", "Link Label", "Confidence"])
        for entity in report.get("entities", []):
            for link in entity.get("linked_accounts", []):
                writer.writerow([
                    link.get("type", "Alias"),
                    link.get("value", ""),
                    entity.get("identifier", ""),
                    link.get("evidence", ""),
                    link.get("confidence", "")
                ])
    
    print("Maltego CSV exported to /tmp/osint/maltego-import.csv")
    PYEOF
    python3 /tmp/osint/maltego_export.py
    

Key Concepts

| Concept | Description | |---------|-------------| | Cross-Source Correlation | Matching identifiers (usernames, emails, IPs) across independent OSINT sources to establish entity linkage | | Confidence Scoring | Assigning probabilistic confidence (0.0–1.0) to each linkage based on evidence strength and corroboration | | Entity Resolution | Deduplicating and merging records that refer to the same real-world entity across fragmented datasets | | False Positive Detection | Using AI reasoning to identify coincidental matches versus genuine identity links | | Multi-Vector Intelligence | Combining findings from social media, DNS, breach data, and infrastructure into a single threat picture | | Link Analysis | Graph-based examination of relationships between entities, accounts, and infrastructure |

Tools & Systems

| Tool | Role in Workflow | |------|-----------------| | Sherlock | Username enumeration across 400+ social platforms | | theHarvester | Email, subdomain, and host discovery from public sources | | SpiderFoot | Automated OSINT collection across 200+ modules | | Maltego | Graph-based visualization of entity relationships | | LLM API (GPT-4, Claude, Ollama) | Cross-source reasoning, pattern detection, and confidence scoring | | HaveIBeenPwned | Breach exposure and credential leak detection |

Common Scenarios

  • Threat Actor Attribution: Correlate a suspicious username found in a phishing campaign with social media profiles, domain registrations, and breach data to build an attribution profile.
  • Attack Surface Mapping: Link discovered subdomains, emails, and employee social accounts to understand an organization's full external exposure.
  • Insider Threat Investigation: Cross-reference an employee's known accounts with dark web marketplace activity and breach databases.
  • Brand Impersonation Detection: Identify accounts across platforms mimicking a target brand by correlating registration patterns, naming conventions, and temporal signals.

Output Format

The final output is a structured JSON correlation report and a Markdown intelligence profile containing:

{
  "meta": {
    "target": "targetdomain.com",
    "sources_used": ["sherlock", "theHarvester", "spiderfoot", "hibp"],
    "total_findings": 247,
    "generated_at": "2025-01-15T14:30:00Z"
  },
  "entities": [
    {
      "identifier": "john.target",
      "confidence": 0.92,
      "linked_accounts": [
        {
          "source": "sherlock",
          "platform": "GitHub",
          "value": "john.target",
          "evidence": "Exact username match, bio references targetdomain.com",
          "confidence": 0.95
        }
      ],
      "risk_level": "high",
      "flags": [
        "Credentials exposed in 2 breaches (2022, 2023)",
        "Admin email for targetdomain.com found in public WHOIS"
      ]
    }
  ],
  "contradictions": [],
  "recommendations": []
}

Verification

  • Confirm that each linked account has been independently verified against at least two sources before assigning confidence > 0.8.
  • Cross-check AI-generated correlations manually for a random sample (10–20%) to validate accuracy.
  • Verify that no false positives from common usernames (e.g., "admin", "test") inflated entity profiles.
  • Ensure breach data timestamps are current and from reputable aggregators.
  • Validate that the final report does not include stale or retracted OSINT data.