Agent Skills: HubSpot Contact Deduplication

|

UncategorizedID: jeremylongshore/claude-code-plugins/hubspot-contact-dedup

Install this agent skill to your local

pnpm dlx add-skill https://github.com/jeremylongshore/claude-code-plugins-plus-skills/tree/HEAD/plugins/saas-packs/hubspot-pack/skills/hubspot-contact-dedup

Skill Files

Browse the full folder contents for hubspot-contact-dedup.

Download Skill

Loading file tree…

plugins/saas-packs/hubspot-pack/skills/hubspot-contact-dedup/SKILL.md

Skill Metadata

Name
hubspot-contact-dedup
Description
|

HubSpot Contact Deduplication

Overview

Merge duplicate contacts in HubSpot and operate that process in production, at scale, without data loss. This is not a one-click cleanup guide — it is the logic your pipeline runs when a sales ops team imports 80,000 leads from a tradeshow CSV that already exist in the CRM, when a merge destroys the "winner" contact's email history, when a fuzzy match on "Jon" vs "John" leaves a six-figure deal associated to a ghost record, and when on-call discovers that 40,000 contacts were merged without checking opt-out flags.

The six production failures this skill prevents:

  1. Import storms creating thousands of exact duplicates — HubSpot enforces email uniqueness only at the property level; the merge API has no dedup-all-at-once endpoint. A 100K-row CSV import where 60% of rows already exist creates 60,000 duplicates that must be found and merged one pair at a time within a 100 req/10s rate envelope.
  2. Merge destroying the wrong timelinePOST /crm/v3/objects/contacts/merge requires a primaryObjectId. Picking the wrong one demotes the older contact's full activity timeline — calls, emails, form submissions — to the discarded record's history.
  3. Property-based dedup missing fuzzy matches — Email-exact dedup leaves "john@gmail.com" and "jon.smith@googlemail.com" as separate records. Phone dedup leaves "+1 (512) 867-5309" and "5128675309" as separate records. Without normalization your CRM accumulates a shadow population of semantically identical but technically distinct contacts.
  4. Post-merge association orphans — When a secondary contact has deals, tickets, or company associations, HubSpot re-parents most automatically — but not all. Custom object associations and some third-party-integration links may not follow.
  5. Rate-limit exhaustion on large catalogs — A 1-million-contact dedup scan requires 10,000 batch reads (2.7 hours at full throughput, before merge calls). Naive single-threaded loops exhaust the 500K daily quota before the search phase finishes.
  6. Silent merge failures on conflicting lifecycle or opt-out status — The merge API returns 200 even when the resulting contact has hs_email_optout=true overriding the primary's opted-in status. HubSpot's "most recently updated value wins" rule is wrong for compliance flags.

Auth

Authenticate with a private app token (pat-na1-*) or OAuth access token. Pass it on every request:

Authorization: Bearer {your-token}

Required scopes: crm.objects.contacts.read, crm.objects.contacts.write, crm.associations.read, crm.associations.write. See the hubspot-auth skill for token caching, OAuth refresh, and scope-drift detection.

Prerequisites

  • Python 3.10+ (requests, phonenumbers, rapidfuzz) for the full pipeline
  • HubSpot Professional or Enterprise account (batch merge at scale)
  • Private app token with required scopes (above)
  • jq for shell examples
  • For catalogs >500K contacts: confirm daily quota with HubSpot support

Instructions

Step 1. Discover duplicates with search

Find exact duplicates by email using the search API. Never pull all contacts into memory for comparison — use the search endpoint with specific filter values.

# Find all contacts sharing a normalized email
curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/search" \
  -H "Authorization: Bearer {your-token}" \
  -H "Content-Type: application/json" \
  -d '{
    "filterGroups": [{"filters": [
      {"propertyName":"email","operator":"EQ","value":"jane.doe@example.com"}
    ]}],
    "properties": ["email","firstname","lastname","hs_object_id","createdate",
                   "lifecyclestage","hs_email_optout","hs_email_hard_bounce_reason_enum"],
    "sorts": [{"propertyName":"createdate","direction":"ASCENDING"}],
    "limit": 10
  }' | jq '[.results[] | {id, created:.properties.createdate}]'

For full-portal scans across millions of contacts use the four-stage Python pipeline in implementation-guide.md. The pipeline writes a local SQLite checkpoint so rate-limit interruptions do not require starting over.

Step 2. Select the primary (winner) contact

The oldest contact by createdate is the primary — its timeline is most historically complete. Two overrides apply:

  • If the oldest contact has hs_email_optout=true and the newer one does not, prefer the opted-in record as primary to avoid propagating unsubscribe status.
  • If the oldest contact has a test-domain email (@mailinator.com, @example.com, @test.com), always make the real-address contact the primary.
from datetime import datetime

def pick_primary(contacts: list[dict]) -> tuple[dict, list[dict]]:
    """Return (primary, secondaries). contacts is a list of HubSpot result dicts."""
    TEST_DOMAINS = {"mailinator.com","example.com","test.com","yopmail.com"}

    def is_test(email: str) -> bool:
        return (email or "").split("@")[-1].lower() in TEST_DOMAINS

    # Sort oldest first (default primary)
    sorted_c = sorted(contacts, key=lambda c: c["properties"]["createdate"])
    primary = sorted_c[0]

    # Opt-out override
    if primary["properties"].get("hs_email_optout") == "true":
        opted_in = next((c for c in sorted_c[1:] if c["properties"].get("hs_email_optout") != "true"), None)
        if opted_in:
            primary = opted_in

    # Test email override
    if is_test(primary["properties"].get("email", "")):
        real = next((c for c in sorted_c if not is_test(c["properties"].get("email", ""))), None)
        if real:
            primary = real

    secondaries = [c for c in contacts if c["id"] != primary["id"]]
    return primary, secondaries

Step 3. Normalize emails and phones for fuzzy matching

Exact-email dedup leaves a shadow population. Normalize before comparing:

import phonenumbers

def normalize_email(raw: str) -> str:
    lower = (raw or "").strip().lower().replace("@googlemail.com", "@gmail.com")
    local, _, domain = lower.partition("@")
    if domain == "gmail.com":
        local = local.split("+")[0].replace(".", "")
    return f"{local}@{domain}" if domain else lower

def normalize_phone(raw: str, region: str = "US") -> str | None:
    try:
        p = phonenumbers.parse((raw or "").strip(), region)
        if phonenumbers.is_valid_number(p):
            return phonenumbers.format_number(p, phonenumbers.PhoneNumberFormat.E164)
    except Exception:
        pass
    return None

For name similarity and the full confidence-scoring matrix, see implementation-guide.md § Stage 2.

Step 4. Pre-merge compliance check

Before merging, verify neither contact has blocking compliance flags:

def pre_merge_check(a: dict, b: dict) -> tuple[bool, str]:
    """Returns (can_merge, reason). False = queue for human review."""
    pa, pb = a["properties"], b["properties"]
    if pa.get("hs_email_hard_bounce_reason_enum") or pb.get("hs_email_hard_bounce_reason_enum"):
        return False, "hard_bounce_present"
    # Asymmetric GDPR legal basis requires human review
    a_gdpr = bool(pa.get("hs_legal_basis"))
    b_gdpr = bool(pb.get("hs_legal_basis"))
    if a_gdpr != b_gdpr:
        return False, "gdpr_basis_asymmetry"
    return True, "ok"

# Expected post-merge opt-out: conservative — opted out if either contact is opted out
def resolve_optout(a: dict, b: dict) -> bool:
    return (a["properties"].get("hs_email_optout") == "true" or
            b["properties"].get("hs_email_optout") == "true")

Step 5. Execute merge with rate limiting

import time, requests

MERGE_URL = "https://api.hubapi.com/crm/v3/objects/contacts/merge"
_window_start = time.monotonic()
_window_calls = 0

def rate_gate(burst_limit: int = 90) -> None:
    """Enforce burst limit (90/10s — leaves buffer below HubSpot's 100/10s cap)."""
    global _window_start, _window_calls
    elapsed_ms = (time.monotonic() - _window_start) * 1000
    if elapsed_ms >= 10_000:
        _window_start = time.monotonic()
        _window_calls = 0
    if _window_calls >= burst_limit:
        time.sleep((10_000 - elapsed_ms) / 1000 + 0.05)
        _window_start = time.monotonic()
        _window_calls = 0
    _window_calls += 1

def merge_contacts(token: str, primary_id: str, secondary_id: str) -> bool:
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    for attempt in range(3):
        rate_gate()
        resp = requests.post(MERGE_URL, headers=headers,
                             json={"primaryObjectId": primary_id, "objectIdToMerge": secondary_id},
                             timeout=30)
        if resp.status_code == 200:
            return True
        if resp.status_code == 429:
            time.sleep(int(resp.headers.get("Retry-After", "10")))
            continue
        if resp.status_code >= 500:
            time.sleep(min(60, 5 * 2 ** attempt))
            continue
        # Non-retryable (400, 404, 409)
        print(f"Merge failed {resp.status_code}: {resp.text}")
        return False
    return False

Stop the pipeline before hitting the daily quota:

DAILY_STOP_AT = 480_000  # Stop at 96% of 500K quota

def check_quota(resp: requests.Response) -> None:
    remaining = int(resp.headers.get("X-HubSpot-RateLimit-Daily-Remaining", 500_000))
    if (500_000 - remaining) >= DAILY_STOP_AT:
        raise SystemExit("Daily quota near limit — stopping. Resume after midnight UTC reset.")

Step 6. Post-merge verification and association repair

After merging, verify that the surviving contact's hs_email_optout matches the expected value (Step 4) and patch it if it drifted. Then audit associations that may not have transferred automatically:

# Check associations on surviving contact (replace 12345 with actual primary contact ID)
curl -s "https://api.hubapi.com/crm/v4/objects/contacts/12345/associations/deals" \
  -H "Authorization: Bearer {your-token}" | jq '[.results[].toObjectId]'

# Manually create a missing association (replace 12345 with primary ID, 67890 with deal ID)
curl -s -X PUT \
  "https://api.hubapi.com/crm/v4/objects/contacts/12345/associations/deals/67890" \
  -H "Authorization: Bearer {your-token}" \
  -H "Content-Type: application/json" \
  -d '[{"associationCategory":"HUBSPOT_DEFINED","associationTypeId":3}]'

The full four-stage Python pipeline (scan → pair → qualify → execute) with automatic association repair is in implementation-guide.md.

Error Handling

| HTTP Status | Error | Root Cause | Action | |---|---|---|---| | 400 | CONTACT_ALREADY_MERGED | Secondary was already merged into another record | Re-fetch secondary; check hs_merged_object_ids for surviving primary ID | | 400 | SAME_OBJECT_MERGE | Both IDs are identical | Remove self-merge pairs from candidate list before executing | | 400 | INVALID_OBJECT_TYPE | One ID belongs to a different CRM object type | Verify via GET /crm/v3/objects/contacts/{id} before merging | | 404 | OBJECT_NOT_FOUND | Contact was deleted between discovery and merge | Re-fetch to confirm existence; skip if deleted | | 409 | MERGE_IN_PROGRESS | A concurrent merge is already running for this contact | Retry after 30 seconds | | 429 | Rate limit | Burst or daily quota exceeded | Honor Retry-After header; check X-HubSpot-RateLimit-Daily-Remaining | | 500 | INTERNAL_ERROR | Transient HubSpot platform fault | Exponential back-off, max 3 retries; log X-HubSpot-Correlation-Id for support | | 200 (silent) | Opt-out propagated incorrectly | "Most recently updated wins" resolved compliance flag wrong | Run post-merge hs_email_optout verification; patch via PATCH endpoint |

Examples

Merge two contacts via curl

# Step 1: find the duplicate pair sorted oldest-first
SEARCH=$(curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/search" \
  -H "Authorization: Bearer {your-token}" -H "Content-Type: application/json" \
  -d '{"filterGroups":[{"filters":[{"propertyName":"email","operator":"EQ","value":"jane.doe@example.com"}]}],
       "properties":["email","createdate"],"sorts":[{"propertyName":"createdate","direction":"ASCENDING"}],"limit":5}')

PRIMARY_ID=$(echo "$SEARCH" | jq -r '.results[0].id')
SECONDARY_ID=$(echo "$SEARCH" | jq -r '.results[1].id')
echo "primary=$PRIMARY_ID secondary=$SECONDARY_ID"

# Step 2: merge
curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/merge" \
  -H "Authorization: Bearer {your-token}" -H "Content-Type: application/json" \
  -d "{\"primaryObjectId\":\"$PRIMARY_ID\",\"objectIdToMerge\":\"$SECONDARY_ID\"}" \
  | jq '{id, email: .properties.email}'

Dry-run dedup report

python3 - <<'EOF'
import json, subprocess, sys

TOKEN = "{your-token}"
EMAIL = "jane.doe@example.com"

out = subprocess.run([
    "curl","-s","-X","POST","https://api.hubapi.com/crm/v3/objects/contacts/search",
    "-H",f"Authorization: Bearer {TOKEN}","-H","Content-Type: application/json",
    "-d", json.dumps({"filterGroups":[{"filters":[{"propertyName":"email","operator":"EQ","value":EMAIL}]}],
                      "properties":["email","firstname","lastname","createdate","lifecyclestage"],
                      "sorts":[{"propertyName":"createdate","direction":"ASCENDING"}],"limit":10}),
], capture_output=True, text=True).stdout

data = json.loads(out)
contacts = data["results"]
if len(contacts) < 2:
    print("No duplicates found"); sys.exit(0)
print(f"Found {len(contacts)} contacts for {EMAIL}:")
for c in contacts:
    p = c["properties"]
    print(f"  ID {c['id']} | created {p['createdate']} | stage {p.get('lifecyclestage')}")
print(f"\nWould merge: primary={contacts[0]['id']}, secondaries={[c['id'] for c in contacts[1:]]}")
EOF

Batch read to pre-fetch properties before deciding primary

curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/batch/read" \
  -H "Authorization: Bearer {your-token}" -H "Content-Type: application/json" \
  -d '{
    "inputs": [{"id":"101"},{"id":"202"},{"id":"303"}],
    "properties": ["email","phone","firstname","lastname","createdate",
                   "lifecyclestage","hs_email_optout","hs_email_hard_bounce_reason_enum"]
  }' | jq '[.results[] | {id, email:.properties.email, created:.properties.createdate}]'

Output

  • Candidate list grouped by normalized email, phone, or name similarity with confidence scores
  • Winner selection rationale per merge pair (oldest contact, opt-out override, test-email override)
  • Compliance pre-check table per pair (opt-out status, lifecycle, GDPR basis, hard-bounce flag)
  • Association audit report — which transferred automatically and which required manual re-parenting
  • Merge execution log with rate-limit headers and daily quota burn rate
  • Post-merge verification confirming hs_email_optout on surviving records matches expected value
  • Human review queue for pairs flagged with conflicting compliance flags or 0.70–0.84 confidence

Resources