---
name: client-emails-extractor
description: Extract and organize client email communications from ClickHouse CS messages, filter non-client noise, and output clean markdown threads. Use when user says "extract client emails", "email threads for client", or requests client email communication analysis.
---

# Client Emails Extractor Skill

**Thesis:** Extract and organize client email communications from ClickHouse CS messages table, with context validation to ensure only client-related emails are included, grouped by thread, filtered from noise, and saved as markdown files.

## Overview

```mermaid
graph LR
    A[Client Name] --> B[ClickHouse Query]
    B --> C[Filter Noise]
    C --> D[Validate Context]
    D --> E[Group by Thread]
    E --> F[Save .md Files]
    F --> G[client_cases/CLIENT/emails/]
```

This skill extracts all email communications with a specific client from `internal_analytics_src.src_mv_google_gmail_cs_messages` (RAW source with full body), applies noise filtering (auto-replies, calendar invites, Read AI, etc.), validates that emails actually relate to the client (not just shared domains), groups messages by thread, and saves each thread as a markdown file.

---

## 1.0 Usage

### 1.1 Trigger Phrases

- "extract emails for [CLIENT]"
- "get client emails [CLIENT]"
- "download [CLIENT] email threads"
- "скачай письма [CLIENT]"

### 1.2 Command Line Usage

```bash
python extract_client_emails.py "Kaiser Permanente" \
  --primary "kp.org" \
  --associated "omc.com,hearts-science.com" \
  --keywords "kaiser,kp,hearts,permanente,pmdi" \
  --output "/path/to/output"
```

### 1.3 Parameters

| Parameter | Flag | Description | Example |
|-----------|------|-------------|---------|
| `client_name` | positional | Client name to search | "Kaiser Permanente" |
| `--primary` | `-p` | Primary client domains (always trusted) | "kp.org" |
| `--associated` | `-a` | Agency domains (require context validation) | "omc.com,hearts-science.com" |
| `--keywords` | `-k` | Context keywords for validation | "kaiser,kp,hearts" |
| `--output` | `-o` | Output directory | Auto-detected from client_cases |
| `--months` | `-m` | Months back to search (default: 6) | 6 |

---

## 2.0 Context Validation (v2.0.0)

### 2.1 Problem Solved

Domains like `omc.com` (Omnicom) are used by multiple clients (Kaiser/Hearts & Science, TBWA Canada, DDB, etc.). Without context validation, emails from other clients would be incorrectly included.

### 2.2 Validation Strategy

```mermaid
graph TD
    A[Email] --> B{Primary Domain?}
    B -->|Yes kp.org| C[✅ VALID]
    B -->|No| D{Context Keywords?}
    D -->|Yes kaiser/kp/hearts| E[✅ VALID]
    D -->|No| F{Associated Domain?}
    F -->|Yes omc.com| G[❌ REJECTED]
    F -->|No| H[❌ REJECTED]
```

1. **Primary domains** (e.g., `kp.org`): Always trusted, no additional validation
2. **Associated domains** (e.g., `omc.com`): Require context keywords in subject/body
3. **Context keywords**: Client-specific terms that validate relevance

### 2.3 Example

```
Email from: contact@example.com
Subject: "Open ticket ISD-19334"
Body: "SFTP connection for TBWA Canada..."

→ Rejected: omc.com is associated domain, but no Kaiser context keywords found
```

---

## 3.0 Data Source (v2.1.0)

### 3.1 Primary Table - RAW Source

```sql
internal_analytics_src.src_mv_google_gmail_cs_messages
```

**v2.1.0 improvement:** Uses RAW source table instead of staging view. RAW table contains **full email body** for ~57% of emails (vs empty body in stg view). Falls back to snippet (~200 chars) when body not available.

**Note:** The staging view `internal_analytics.stg_gmail_cs_messages` has empty `gmail_msg_body` due to Airbyte mapping issues. Always use RAW source for full body.

### 3.2 RAW Source Fields

| RAW Field | Mapped As | Description |
|-----------|-----------|-------------|
| `id` | `gmail_msg_id` | Unique message ID |
| `date` | `gmail_msg_date` | Message timestamp |
| `threadId` | `gmail_thread_id` | Thread grouping ID |
| `accountEmail` | `gmail_account_email` | CS account email |
| `` `from` `` | `gmail_msg_from_field` | Full "Name <email>" |
| (parsed) | `gmail_msg_from_email` | Extracted email only |
| `` `to` `` | `gmail_msg_to_field` | Recipients |
| `subject` | `gmail_msg_subject` | Subject line |
| `body` | `gmail_msg_body` | **FULL body** (~57% available) |
| `snippet` | `gmail_msg_snippet` | First ~200 chars |

---

## 4.0 Noise Filtering

### 4.1 Excluded Subject Patterns

```python
EXCLUDE_SUBJECT_PATTERNS = [
    r'pre-read for your upcoming',      # Read AI
    r'please read this message',        # Read AI spam
    r'your meeting recap',              # Fireflies
    r'read meeting report',             # Read AI
    r'\booo\b.*re:',                    # OOO auto-replies
    r'out of office',
    r'automatic reply',
    r'^invitation\b',                   # Calendar
    r'^updated invitation',
    r'^accepted:',
    r'^declined:',
    r'^canceled:',
    r'^tentative:',
    r'invoice.*is due',                 # Billing auto
    r'payment reminder',
    r'you have a new invoice',
    r'^\[jira\]',                       # JIRA notifications
    r'early payment available',         # Taulia
    r'a new opportunity has been created', # Salesforce
]
```

### 4.2 Excluded From Patterns

```python
EXCLUDE_FROM_PATTERNS = [
    r'noreply',
    r'no-reply',
    r'donotreply',
    r'notifications@',
    r'calendar-notification',
    r'fred@fireflies',                  # Fireflies bot
    r'@e\.read\.ai',                    # Read AI
    r'@read\.ai',
    r'billing@improvado',               # Billing auto
    r'@taulia\.com',                    # Payment portal
    r'portal@',
]
```

---

## 5.0 Output Format

### 5.1 File Structure

```
client_cases/[CLIENT]/emails/
├── _metadata.json          # Extraction stats
├── 2025_06_13_re_brand_dashboard.md
├── 2025_07_26_status_report.md
└── ...
```

### 5.2 Thread File Format

```markdown
# [Subject]

**Thread:** X messages
**Period:** YYYY-MM-DD → YYYY-MM-DD
**Participants:** email1, email2, ...

---

## Message 1

**From:** Name <email>
**Date:** YYYY-MM-DD HH:MM:SS

[Body or "[Snippet only] snippet text"]

---

## Message 2
...
```

### 5.3 Metadata JSON

```json
{
  "client": "Kaiser Permanente",
  "primary_domains": ["kp.org"],
  "associated_domains": ["omc.com", "hearts-science.com"],
  "context_keywords": ["kaiser", "kp", "hearts"],
  "total_raw": 652,
  "after_noise_filter": 518,
  "total_clean": 488,
  "total_threads": 143,
  "excluded_noise": 134,
  "excluded_context": 30,
  "version": "2.0.0"
}
```

---

## 6.0 Limitations

| Limitation | Impact | Workaround |
|------------|--------|------------|
| Body ~57% available | ~43% emails have only snippet | Airbyte sync limitation - cannot be fixed without pipeline changes |
| CS accounts only | Daniel's inbox not included | Use gmail-operations skill for personal |
| 6-month default | Older emails not extracted | Adjust `--months` parameter |
| Shared domains | Agency emails mixed | Use `--associated` + `--keywords` for context |

---

## Ground Truth

- **Source table:** `internal_analytics_src.src_mv_google_gmail_cs_messages` (RAW)
- **Shard:** Palantir (NON-DEFAULT)
- **Sync:** Airbyte from Google Workspace
- **Body availability:** ~57% have full body, ~43% fallback to snippet
- **Script:** `.claude/skills/client-emails-extractor/extract_client_emails.py`
- **Version:** 2.1.0
- **Updated:** 2025-12-01
