Client Emails Extractor Skill
Thesis: Extract and organize client email communications from ClickHouse CS messages table, with context validation to ensure only client-related emails are included, grouped by thread, filtered from noise, and saved as markdown files.
Overview
graph LR
A[Client Name] --> B[ClickHouse Query]
B --> C[Filter Noise]
C --> D[Validate Context]
D --> E[Group by Thread]
E --> F[Save .md Files]
F --> G[client_cases/CLIENT/emails/]
This skill extracts all email communications with a specific client from internal_analytics_src.src_mv_google_gmail_cs_messages (RAW source with full body), applies noise filtering (auto-replies, calendar invites, Read AI, etc.), validates that emails actually relate to the client (not just shared domains), groups messages by thread, and saves each thread as a markdown file.
1.0 Usage
1.1 Trigger Phrases
- "extract emails for [CLIENT]"
- "get client emails [CLIENT]"
- "download [CLIENT] email threads"
- "скачай письма [CLIENT]"
1.2 Command Line Usage
python extract_client_emails.py "Kaiser Permanente" \
--primary "kp.org" \
--associated "omc.com,hearts-science.com" \
--keywords "kaiser,kp,hearts,permanente,pmdi" \
--output "/path/to/output"
1.3 Parameters
| Parameter | Flag | Description | Example |
|-----------|------|-------------|---------|
| client_name | positional | Client name to search | "Kaiser Permanente" |
| --primary | -p | Primary client domains (always trusted) | "kp.org" |
| --associated | -a | Agency domains (require context validation) | "omc.com,hearts-science.com" |
| --keywords | -k | Context keywords for validation | "kaiser,kp,hearts" |
| --output | -o | Output directory | Auto-detected from client_cases |
| --months | -m | Months back to search (default: 6) | 6 |
2.0 Context Validation (v2.0.0)
2.1 Problem Solved
Domains like omc.com (Omnicom) are used by multiple clients (Kaiser/Hearts & Science, TBWA Canada, DDB, etc.). Without context validation, emails from other clients would be incorrectly included.
2.2 Validation Strategy
graph TD
A[Email] --> B{Primary Domain?}
B -->|Yes kp.org| C[✅ VALID]
B -->|No| D{Context Keywords?}
D -->|Yes kaiser/kp/hearts| E[✅ VALID]
D -->|No| F{Associated Domain?}
F -->|Yes omc.com| G[❌ REJECTED]
F -->|No| H[❌ REJECTED]
- Primary domains (e.g.,
kp.org): Always trusted, no additional validation - Associated domains (e.g.,
omc.com): Require context keywords in subject/body - Context keywords: Client-specific terms that validate relevance
2.3 Example
Email from: contact@example.com
Subject: "Open ticket ISD-19334"
Body: "SFTP connection for TBWA Canada..."
→ Rejected: omc.com is associated domain, but no Kaiser context keywords found
3.0 Data Source (v2.1.0)
3.1 Primary Table - RAW Source
internal_analytics_src.src_mv_google_gmail_cs_messages
v2.1.0 improvement: Uses RAW source table instead of staging view. RAW table contains full email body for ~57% of emails (vs empty body in stg view). Falls back to snippet (~200 chars) when body not available.
Note: The staging view internal_analytics.stg_gmail_cs_messages has empty gmail_msg_body due to Airbyte mapping issues. Always use RAW source for full body.
3.2 RAW Source Fields
| RAW Field | Mapped As | Description |
|-----------|-----------|-------------|
| id | gmail_msg_id | Unique message ID |
| date | gmail_msg_date | Message timestamp |
| threadId | gmail_thread_id | Thread grouping ID |
| accountEmail | gmail_account_email | CS account email |
| `from` | gmail_msg_from_field | Full "Name <email>" |
| (parsed) | gmail_msg_from_email | Extracted email only |
| `to` | gmail_msg_to_field | Recipients |
| subject | gmail_msg_subject | Subject line |
| body | gmail_msg_body | FULL body (~57% available) |
| snippet | gmail_msg_snippet | First ~200 chars |
4.0 Noise Filtering
4.1 Excluded Subject Patterns
EXCLUDE_SUBJECT_PATTERNS = [
r'pre-read for your upcoming', # Read AI
r'please read this message', # Read AI spam
r'your meeting recap', # Fireflies
r'read meeting report', # Read AI
r'\booo\b.*re:', # OOO auto-replies
r'out of office',
r'automatic reply',
r'^invitation\b', # Calendar
r'^updated invitation',
r'^accepted:',
r'^declined:',
r'^canceled:',
r'^tentative:',
r'invoice.*is due', # Billing auto
r'payment reminder',
r'you have a new invoice',
r'^\[jira\]', # JIRA notifications
r'early payment available', # Taulia
r'a new opportunity has been created', # Salesforce
]
4.2 Excluded From Patterns
EXCLUDE_FROM_PATTERNS = [
r'noreply',
r'no-reply',
r'donotreply',
r'notifications@',
r'calendar-notification',
r'fred@fireflies', # Fireflies bot
r'@e\.read\.ai', # Read AI
r'@read\.ai',
r'billing@improvado', # Billing auto
r'@taulia\.com', # Payment portal
r'portal@',
]
5.0 Output Format
5.1 File Structure
client_cases/[CLIENT]/emails/
├── _metadata.json # Extraction stats
├── 2025_06_13_re_brand_dashboard.md
├── 2025_07_26_status_report.md
└── ...
5.2 Thread File Format
# [Subject]
**Thread:** X messages
**Period:** YYYY-MM-DD → YYYY-MM-DD
**Participants:** email1, email2, ...
---
## Message 1
**From:** Name <email>
**Date:** YYYY-MM-DD HH:MM:SS
[Body or "[Snippet only] snippet text"]
---
## Message 2
...
5.3 Metadata JSON
{
"client": "Kaiser Permanente",
"primary_domains": ["kp.org"],
"associated_domains": ["omc.com", "hearts-science.com"],
"context_keywords": ["kaiser", "kp", "hearts"],
"total_raw": 652,
"after_noise_filter": 518,
"total_clean": 488,
"total_threads": 143,
"excluded_noise": 134,
"excluded_context": 30,
"version": "2.0.0"
}
6.0 Limitations
| Limitation | Impact | Workaround |
|------------|--------|------------|
| Body ~57% available | ~43% emails have only snippet | Airbyte sync limitation - cannot be fixed without pipeline changes |
| CS accounts only | Daniel's inbox not included | Use gmail-operations skill for personal |
| 6-month default | Older emails not extracted | Adjust --months parameter |
| Shared domains | Agency emails mixed | Use --associated + --keywords for context |
Ground Truth
- Source table:
internal_analytics_src.src_mv_google_gmail_cs_messages(RAW) - Shard: Palantir (NON-DEFAULT)
- Sync: Airbyte from Google Workspace
- Body availability: ~57% have full body, ~43% fallback to snippet
- Script:
.claude/skills/client-emails-extractor/extract_client_emails.py - Version: 2.1.0
- Updated: 2025-12-01