HubSpot Contact Deduplication
Overview
Merge duplicate contacts in HubSpot and operate that process in production, at scale, without data loss. This is not a one-click cleanup guide — it is the logic your pipeline runs when a sales ops team imports 80,000 leads from a tradeshow CSV that already exist in the CRM, when a merge destroys the "winner" contact's email history, when a fuzzy match on "Jon" vs "John" leaves a six-figure deal associated to a ghost record, and when on-call discovers that 40,000 contacts were merged without checking opt-out flags.
The six production failures this skill prevents:
- Import storms creating thousands of exact duplicates: HubSpot enforces email uniqueness only at the property level; the merge API has no dedup-all-at-once endpoint. A 100K-row CSV import where 60% of rows already exist creates 60,000 duplicates that must be found and merged one pair at a time within a 100 req/10s rate envelope.
- Merge destroying the wrong timeline: POST /crm/v3/objects/contacts/merge requires a primaryObjectId. Picking the wrong one demotes the older contact's full activity timeline (calls, emails, form submissions) to the discarded record's history.
- Property-based dedup missing fuzzy matches: Email-exact dedup leaves "john.smith@gmail.com" and "johnsmith@googlemail.com" as separate records. Phone dedup leaves "+1 (512) 867-5309" and "5128675309" as separate records. Without normalization your CRM accumulates a shadow population of semantically identical but technically distinct contacts.
- Post-merge association orphans: When a secondary contact has deals, tickets, or company associations, HubSpot re-parents most automatically, but not all. Custom object associations and some third-party-integration links may not follow.
- Rate-limit exhaustion on large catalogs: A 1-million-contact dedup scan requires 10,000 batch reads of 100 contacts each. That is roughly 17 minutes at the full 100 req/10s burst rate, or about 2.7 hours at one request per second, before any merge calls. Naive loops that read one contact per request exhaust the 500K daily quota before the scan phase finishes.
- Silent merge failures on conflicting lifecycle or opt-out status: The merge API returns 200 even when the resulting contact has hs_email_optout=true overriding the primary's opted-in status. HubSpot's "most recently updated value wins" rule is wrong for compliance flags.
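The request arithmetic behind the rate-limit failure can be checked with a few lines. This is a back-of-the-envelope sketch, assuming 100 contacts per batch read and the 100 requests per 10 seconds envelope quoted above; `scan_budget` is a hypothetical helper name, and the limits on your portal may differ.

```python
import math

def scan_budget(total_contacts: int, batch_size: int = 100,
                burst_limit: int = 100, window_s: float = 10.0) -> dict:
    """Estimate the request count and best-case wall-clock time for a full scan."""
    requests_needed = math.ceil(total_contacts / batch_size)
    # Best case: every 10-second window is filled right up to the burst limit.
    min_seconds = requests_needed / burst_limit * window_s
    return {"requests": requests_needed, "min_minutes": round(min_seconds / 60, 1)}

# Batched scan of one million contacts.
print(scan_budget(1_000_000))
# Same scan reading one contact per request: one million requests.
print(scan_budget(1_000_000, batch_size=1))
```

The time it reports is a lower bound. Real scans run slower once retries, search-specific limits, and safety headroom are included, so treat the request count as the number to compare against the daily quota.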
Auth
Authenticate with a private app token (pat-na1-*) or OAuth access token. Pass it on every request:
Authorization: Bearer {your-token}
Required scopes: crm.objects.contacts.read, crm.objects.contacts.write, crm.associations.read, crm.associations.write. See the hubspot-auth skill for token caching, OAuth refresh, and scope-drift detection.
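In Python, the same header can be built once and reused for every call. A minimal sketch, assuming the token is supplied directly or through an environment variable; `auth_headers` and `HUBSPOT_TOKEN` are hypothetical names chosen for illustration, not part of any HubSpot SDK.

```python
import os
from typing import Optional

def auth_headers(token: Optional[str] = None) -> dict:
    """Build request headers from an explicit token or the environment."""
    # HUBSPOT_TOKEN is an assumed variable name; use whatever your secret store exposes.
    token = token or os.environ.get("HUBSPOT_TOKEN")
    if not token:
        raise RuntimeError("No HubSpot token supplied")
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
```

Failing fast on a missing token keeps a misconfigured job from burning quota on requests that can only return 401.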
Prerequisites
- Python 3.10+ (requests, phonenumbers, rapidfuzz) for the full pipeline
- HubSpot Professional or Enterprise account (batch merge at scale)
- Private app token with required scopes (above)
- jq for shell examples
- For catalogs >500K contacts: confirm daily quota with HubSpot support
Instructions
Step 1. Discover duplicates with search
Find exact duplicates by email using the search API. Never pull all contacts into memory for comparison — use the search endpoint with specific filter values.
# Find all contacts sharing an exact email value
curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/search" \
  -H "Authorization: Bearer {your-token}" \
  -H "Content-Type: application/json" \
  -d '{
    "filterGroups": [{"filters": [
      {"propertyName":"email","operator":"EQ","value":"jane.doe@example.com"}
    ]}],
    "properties": ["email","firstname","lastname","hs_object_id","createdate",
                   "lifecyclestage","hs_email_optout","hs_email_hard_bounce_reason_enum",
                   "hs_legal_basis"],
    "sorts": [{"propertyName":"createdate","direction":"ASCENDING"}],
    "limit": 10
  }' | jq '[.results[] | {id, created:.properties.createdate}]'
For full-portal scans across millions of contacts use the four-stage Python pipeline in implementation-guide.md. The pipeline writes a local SQLite checkpoint so rate-limit interruptions do not require starting over.
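When one email value matches more contacts than fit in a single page, the search response carries a cursor under paging.next.after that is sent back as `after` on the next request. The sketch below follows that cursor; it takes the HTTP call as a parameter so it can be exercised without network access. `search_by_email` and the `post` callable are hypothetical names, and the property list is trimmed for brevity.

```python
from typing import Callable, Iterator

SEARCH_URL = "https://api.hubapi.com/crm/v3/objects/contacts/search"

def search_by_email(email: str, post: Callable[[str, dict], dict],
                    page_size: int = 100) -> Iterator[dict]:
    """Yield every contact whose email equals `email`, following paging cursors."""
    body = {
        "filterGroups": [{"filters": [
            {"propertyName": "email", "operator": "EQ", "value": email}]}],
        "properties": ["email", "createdate", "hs_email_optout"],
        "sorts": [{"propertyName": "createdate", "direction": "ASCENDING"}],
        "limit": page_size,
    }
    while True:
        page = post(SEARCH_URL, body)
        yield from page.get("results", [])
        # The cursor is absent on the last page.
        after = page.get("paging", {}).get("next", {}).get("after")
        if not after:
            return
        body = {**body, "after": after}
```

In production `post` could be something like `lambda url, body: requests.post(url, headers=headers, json=body, timeout=30).json()`, wrapped in whatever rate limiting the pipeline already uses.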
Step 2. Select the primary (winner) contact
The oldest contact by createdate is the primary — its timeline is most historically complete. Two overrides apply:
- If the oldest contact has hs_email_optout=true and the newer one does not, prefer the opted-in record as primary to avoid propagating unsubscribe status.
- If the oldest contact has a test-domain email (@mailinator.com, @example.com, @test.com), always make the real-address contact the primary.
def pick_primary(contacts: list[dict]) -> tuple[dict, list[dict]]:
    """Return (primary, secondaries). contacts is a list of HubSpot result dicts."""
    TEST_DOMAINS = {"mailinator.com", "example.com", "test.com", "yopmail.com"}

    def is_test(email: str) -> bool:
        return (email or "").split("@")[-1].lower() in TEST_DOMAINS

    # Sort oldest first (default primary)
    sorted_c = sorted(contacts, key=lambda c: c["properties"]["createdate"])
    primary = sorted_c[0]
    # Opt-out override
    if primary["properties"].get("hs_email_optout") == "true":
        opted_in = next((c for c in sorted_c[1:]
                         if c["properties"].get("hs_email_optout") != "true"), None)
        if opted_in:
            primary = opted_in
    # Test email override
    if is_test(primary["properties"].get("email", "")):
        real = next((c for c in sorted_c
                     if not is_test(c["properties"].get("email", ""))), None)
        if real:
            primary = real
    secondaries = [c for c in contacts if c["id"] != primary["id"]]
    return primary, secondaries
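pick_primary sorts createdate as a plain string, which is only safe while every value uses exactly the same layout. The sketch below parses the value into a datetime first. It assumes createdate arrives as an ISO 8601 UTC string ending in Z, with or without fractional seconds; `createdate_key` is a hypothetical helper name.

```python
from datetime import datetime, timezone

def createdate_key(contact: dict) -> datetime:
    """Parse a contact's createdate into an aware UTC datetime for sorting."""
    raw = contact["properties"]["createdate"]
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized createdate: {raw!r}")

contacts = [
    {"id": "B", "properties": {"createdate": "2023-05-01T09:00:00.500Z"}},
    {"id": "A", "properties": {"createdate": "2023-05-01T09:00:00Z"}},
]
by_string = sorted(contacts, key=lambda c: c["properties"]["createdate"])
by_time = sorted(contacts, key=createdate_key)
print([c["id"] for c in by_string])  # ['B', 'A'] because "." sorts before "Z"
print([c["id"] for c in by_time])    # ['A', 'B'], the true chronological order
```

Passing `key=createdate_key` to the sort inside pick_primary would be a drop-in change if mixed layouts ever show up in your data.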
Step 3. Normalize emails and phones for fuzzy matching
Exact-email dedup leaves a shadow population. Normalize before comparing:
import phonenumbers

def normalize_email(raw: str) -> str:
    # googlemail.com is an alias of gmail.com
    lower = (raw or "").strip().lower().replace("@googlemail.com", "@gmail.com")
    local, _, domain = lower.partition("@")
    if domain == "gmail.com":
        # Gmail ignores dots and anything after "+" in the local part
        local = local.split("+")[0].replace(".", "")
    return f"{local}@{domain}" if domain else lower

def normalize_phone(raw: str, region: str = "US") -> str | None:
    """Return the number in E.164 form, or None if it cannot be parsed or is invalid."""
    try:
        p = phonenumbers.parse((raw or "").strip(), region)
    except phonenumbers.NumberParseException:
        return None
    if phonenumbers.is_valid_number(p):
        return phonenumbers.format_number(p, phonenumbers.PhoneNumberFormat.E164)
    return None
For name similarity and the full confidence-scoring matrix, see implementation-guide.md § Stage 2.
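The review band listed under Output (0.70 to 0.84) implies a three-way routing by confidence score. The sketch below illustrates that routing for names only. It uses the standard library's difflib.SequenceMatcher as a stand-in for rapidfuzz so it runs without extra dependencies, and the 0.85 auto-merge threshold is an assumption inferred from the review band, not a value taken from implementation-guide.md.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two names, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def route(score: float, auto_merge_at: float = 0.85, review_at: float = 0.70) -> str:
    """Map a confidence score to a pipeline action (both thresholds are assumptions)."""
    if score >= auto_merge_at:
        return "merge"
    if score >= review_at:
        return "review"
    return "skip"

for left, right in [("Jon Smith", "John Smith"), ("Jon Smith", "Jane Doe")]:
    score = name_similarity(left, right)
    print(f"{left!r} vs {right!r}: {score:.2f} -> {route(score)}")
```

Name similarity on its own is weak evidence. In practice it would be one input to a combined score alongside the normalized email and phone matches from this step.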
Step 4. Pre-merge compliance check
Before merging, verify neither contact has blocking compliance flags:
def pre_merge_check(a: dict, b: dict) -> tuple[bool, str]:
    """Returns (can_merge, reason). False = queue for human review."""
    pa, pb = a["properties"], b["properties"]
    if pa.get("hs_email_hard_bounce_reason_enum") or pb.get("hs_email_hard_bounce_reason_enum"):
        return False, "hard_bounce_present"
    # Asymmetric GDPR legal basis requires human review
    a_gdpr = bool(pa.get("hs_legal_basis"))
    b_gdpr = bool(pb.get("hs_legal_basis"))
    if a_gdpr != b_gdpr:
        return False, "gdpr_basis_asymmetry"
    return True, "ok"

# Expected post-merge opt-out is conservative: opted out if either contact is opted out
def resolve_optout(a: dict, b: dict) -> bool:
    return (a["properties"].get("hs_email_optout") == "true" or
            b["properties"].get("hs_email_optout") == "true")
Step 5. Execute merge with rate limiting
import time
import requests

MERGE_URL = "https://api.hubapi.com/crm/v3/objects/contacts/merge"
_window_start = time.monotonic()
_window_calls = 0

def rate_gate(burst_limit: int = 90) -> None:
    """Enforce a burst limit (90/10s leaves a buffer below HubSpot's 100/10s cap)."""
    global _window_start, _window_calls
    elapsed_ms = (time.monotonic() - _window_start) * 1000
    if elapsed_ms >= 10_000:
        # Window expired: start a fresh one
        _window_start = time.monotonic()
        _window_calls = 0
    if _window_calls >= burst_limit:
        # Window is full: wait out the remainder, then start a fresh one
        time.sleep((10_000 - elapsed_ms) / 1000 + 0.05)
        _window_start = time.monotonic()
        _window_calls = 0
    _window_calls += 1

def merge_contacts(token: str, primary_id: str, secondary_id: str) -> bool:
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    for attempt in range(3):
        rate_gate()
        resp = requests.post(MERGE_URL, headers=headers,
                             json={"primaryObjectId": primary_id, "objectIdToMerge": secondary_id},
                             timeout=30)
        if resp.status_code == 200:
            return True
        if resp.status_code == 429:
            time.sleep(int(resp.headers.get("Retry-After", "10")))
            continue
        if resp.status_code == 409:
            # Concurrent merge in progress: retry after 30 seconds (see Error Handling)
            time.sleep(30)
            continue
        if resp.status_code >= 500:
            time.sleep(min(60, 5 * 2 ** attempt))
            continue
        # Non-retryable (400, 404)
        print(f"Merge failed {resp.status_code}: {resp.text}")
        return False
    return False
Stop the pipeline before hitting the daily quota:
DAILY_STOP_AT = 480_000  # Stop at 96% of the 500K quota

def check_quota(resp: requests.Response) -> None:
    """Call on every response; exits the pipeline once daily usage reaches DAILY_STOP_AT."""
    remaining = int(resp.headers.get("X-HubSpot-RateLimit-Daily-Remaining", 500_000))
    if (500_000 - remaining) >= DAILY_STOP_AT:
        raise SystemExit("Daily quota near limit; stopping. Resume after the daily reset "
                         "(midnight in the account's time zone).")
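rate_gate uses a fixed window: its counter resets every 10 seconds, so a burst at the very end of one window followed by a burst at the start of the next can briefly exceed the per-10-second cap. A rolling window avoids that edge case. The class below is an illustrative alternative, not part of the pipeline in implementation-guide.md; `SlidingWindowGate` is a hypothetical name, and the clock and sleep functions are injectable only so the behaviour can be tested without waiting.

```python
import time
from collections import deque
from typing import Callable

class SlidingWindowGate:
    """Allow at most `limit` calls in any rolling `window_s`-second span."""

    def __init__(self, limit: int = 90, window_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.limit, self.window_s = limit, window_s
        self.clock, self.sleep = clock, sleep
        self.stamps = deque()  # timestamps of calls still inside the window

    def wait(self) -> None:
        """Block until one more call fits inside the rolling window."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.stamps and now - self.stamps[0] >= self.window_s:
            self.stamps.popleft()
        if len(self.stamps) >= self.limit:
            # Sleep until the oldest call leaves the window, then drop it.
            self.sleep(self.window_s - (now - self.stamps[0]))
            now = self.clock()
            self.stamps.popleft()
        self.stamps.append(now)
```

Calling `gate.wait()` in place of `rate_gate()` before each request would be the only change needed in merge_contacts. The cost is memory for up to `limit` timestamps, which is negligible at these sizes.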
Step 6. Post-merge verification and association repair
After merging, verify that the surviving contact's hs_email_optout matches the expected value (Step 4) and patch it if it drifted. Then audit associations that may not have transferred automatically:
# Check associations on surviving contact (replace 12345 with actual primary contact ID)
curl -s "https://api.hubapi.com/crm/v4/objects/contacts/12345/associations/deals" \
  -H "Authorization: Bearer {your-token}" | jq '[.results[].toObjectId]'
# Manually create a missing association (replace 12345 with primary ID, 67890 with deal ID)
# associationTypeId 4 is contact-to-deal; 3 is the reverse, deal-to-contact, direction
curl -s -X PUT \
  "https://api.hubapi.com/crm/v4/objects/contacts/12345/associations/deals/67890" \
  -H "Authorization: Bearer {your-token}" \
  -H "Content-Type: application/json" \
  -d '[{"associationCategory":"HUBSPOT_DEFINED","associationTypeId":4}]'
The full four-stage Python pipeline (scan → pair → qualify → execute) with automatic association repair is in implementation-guide.md.
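The opt-out verification described at the start of this step can be sketched as a pure function that decides whether a corrective PATCH is needed. It assumes the conservative rule from Step 4 (opted out if either source record was) and that property values arrive as the strings "true" and "false". `optout_patch` is a hypothetical helper; the body it returns would be sent to PATCH /crm/v3/objects/contacts/{id}, and whether your portal allows this property to be written through the API is worth confirming before relying on it.

```python
from typing import Optional

def optout_patch(expected_optout: bool, survivor_props: dict) -> Optional[dict]:
    """Return a PATCH body if the survivor's opt-out flag drifted, else None."""
    actual = survivor_props.get("hs_email_optout") == "true"
    if actual == expected_optout:
        return None
    return {"properties": {"hs_email_optout": "true" if expected_optout else "false"}}

# Survivor lost the opt-out that one of the merged records carried: patch needed.
print(optout_patch(True, {"hs_email_optout": "false"}))
# Survivor already matches the expected value: nothing to do.
print(optout_patch(True, {"hs_email_optout": "true"}))
```

Keeping the decision separate from the HTTP call makes it easy to log every drift case for the compliance audit trail, even in a dry run.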
Error Handling
| HTTP Status | Error | Root Cause | Action |
|---|---|---|---|
| 400 | CONTACT_ALREADY_MERGED | Secondary was already merged into another record | Re-fetch secondary; check hs_merged_object_ids for surviving primary ID |
| 400 | SAME_OBJECT_MERGE | Both IDs are identical | Remove self-merge pairs from candidate list before executing |
| 400 | INVALID_OBJECT_TYPE | One ID belongs to a different CRM object type | Verify via GET /crm/v3/objects/contacts/{id} before merging |
| 404 | OBJECT_NOT_FOUND | Contact was deleted between discovery and merge | Re-fetch to confirm existence; skip if deleted |
| 409 | MERGE_IN_PROGRESS | A concurrent merge is already running for this contact | Retry after 30 seconds |
| 429 | Rate limit | Burst or daily quota exceeded | Honor Retry-After header; check X-HubSpot-RateLimit-Daily-Remaining |
| 500 | INTERNAL_ERROR | Transient HubSpot platform fault | Exponential back-off, max 3 retries; log X-HubSpot-Correlation-Id for support |
| 200 (silent) | Opt-out propagated incorrectly | "Most recently updated wins" resolved compliance flag wrong | Run post-merge hs_email_optout verification; patch via PATCH endpoint |
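The table above can be condensed into a small dispatcher so every caller handles failures the same way. This is a sketch of one possible mapping; `classify` and its return labels are hypothetical, and the error names are used exactly as they appear in the table, without assuming where they sit in the response body.

```python
def classify(status: int, error_name: str = "") -> str:
    """Map an HTTP status and error name to 'verify', 'retry', 'skip', or 'fix'."""
    if status == 200:
        # Success still needs the post-merge opt-out verification from Step 6.
        return "verify"
    if status in (409, 429) or status >= 500:
        return "retry"
    if status == 404 or error_name == "CONTACT_ALREADY_MERGED":
        # The secondary no longer exists as a separate record.
        return "skip"
    # Remaining 400s point at a bad candidate pair that needs correcting upstream.
    return "fix"

for status, name in [(200, ""), (429, ""), (400, "CONTACT_ALREADY_MERGED"),
                     (400, "SAME_OBJECT_MERGE"), (404, "OBJECT_NOT_FOUND")]:
    print(status, name or "-", "->", classify(status, name))
```

Returning "verify" rather than "done" for a 200 reflects the silent-failure row: a successful merge is only complete once the compliance flags have been checked.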
Examples
Merge two contacts via curl
# Step 1: find the duplicate pair sorted oldest-first
SEARCH=$(curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/search" \
  -H "Authorization: Bearer {your-token}" -H "Content-Type: application/json" \
  -d '{"filterGroups":[{"filters":[{"propertyName":"email","operator":"EQ","value":"jane.doe@example.com"}]}],
       "properties":["email","createdate"],"sorts":[{"propertyName":"createdate","direction":"ASCENDING"}],"limit":5}')
PRIMARY_ID=$(echo "$SEARCH" | jq -r '.results[0].id')
SECONDARY_ID=$(echo "$SEARCH" | jq -r '.results[1].id')
echo "primary=$PRIMARY_ID secondary=$SECONDARY_ID"
# Step 2: merge (skipped when the search returned fewer than two contacts and jq printed "null")
if [ "$SECONDARY_ID" != "null" ]; then
  curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/merge" \
    -H "Authorization: Bearer {your-token}" -H "Content-Type: application/json" \
    -d "{\"primaryObjectId\":\"$PRIMARY_ID\",\"objectIdToMerge\":\"$SECONDARY_ID\"}" \
    | jq '{id, email: .properties.email}'
fi
Dry-run dedup report
python3 - <<'EOF'
import json, subprocess, sys

TOKEN = "{your-token}"
EMAIL = "jane.doe@example.com"
out = subprocess.run([
    "curl", "-s", "-X", "POST", "https://api.hubapi.com/crm/v3/objects/contacts/search",
    "-H", f"Authorization: Bearer {TOKEN}", "-H", "Content-Type: application/json",
    "-d", json.dumps({
        "filterGroups": [{"filters": [{"propertyName": "email", "operator": "EQ", "value": EMAIL}]}],
        "properties": ["email", "firstname", "lastname", "createdate", "lifecyclestage"],
        "sorts": [{"propertyName": "createdate", "direction": "ASCENDING"}],
        "limit": 10}),
], capture_output=True, text=True).stdout
data = json.loads(out)
contacts = data["results"]
if len(contacts) < 2:
    print("No duplicates found")
    sys.exit(0)
print(f"Found {len(contacts)} contacts for {EMAIL}:")
for c in contacts:
    p = c["properties"]
    print(f"  ID {c['id']} | created {p['createdate']} | stage {p.get('lifecyclestage')}")
print(f"\nWould merge: primary={contacts[0]['id']}, secondaries={[c['id'] for c in contacts[1:]]}")
EOF
Batch read to pre-fetch properties before deciding primary
curl -s -X POST "https://api.hubapi.com/crm/v3/objects/contacts/batch/read" \
  -H "Authorization: Bearer {your-token}" -H "Content-Type: application/json" \
  -d '{
    "inputs": [{"id":"101"},{"id":"202"},{"id":"303"}],
    "properties": ["email","phone","firstname","lastname","createdate",
                   "lifecyclestage","hs_email_optout","hs_email_hard_bounce_reason_enum"]
  }' | jq '[.results[] | {id, email:.properties.email, created:.properties.createdate}]'
Output
- Candidate list grouped by normalized email, phone, or name similarity with confidence scores
- Winner selection rationale per merge pair (oldest contact, opt-out override, test-email override)
- Compliance pre-check table per pair (opt-out status, lifecycle, GDPR basis, hard-bounce flag)
- Association audit report — which transferred automatically and which required manual re-parenting
- Merge execution log with rate-limit headers and daily quota burn rate
- Post-merge verification confirming hs_email_optout on surviving records matches expected value
- Human review queue for pairs flagged with conflicting compliance flags or 0.70–0.84 confidence
Resources
- HubSpot Contacts Merge API
- CRM Search API Reference
- CRM Batch Read API
- Associations API v4
- Rate Limits Reference
- GDPR and Marketing Emails in HubSpot
- API_REFERENCE.md — merge endpoint shapes, search filter syntax, error codes, property uniqueness enforcement
- implementation-guide.md — full four-stage Python pipeline, fuzzy matching, post-merge association cleanup runbook