Agent Skills: Client Emails Extractor Skill

Extract and organize client email communications from ClickHouse CS messages, filter non-client noise, and output clean markdown threads. Use when user says "extract client emails", "email threads for client", or requests client email communication analysis.

UncategorizedID: tekliner/improvado-agentic-frameworks-and-skills/client-emails-extractor

Install this agent skill to your local

pnpm dlx add-skill https://github.com/tekliner/improvado-agentic-frameworks-and-skills/tree/HEAD/skills/client-emails-extractor

Skill Files

Browse the full folder contents for client-emails-extractor.

Download Skill

Loading file tree…

skills/client-emails-extractor/SKILL.md

Skill Metadata

Name
client-emails-extractor
Description
Extract and organize client email communications from ClickHouse CS messages, filter non-client noise, and output clean markdown threads. Use when user says "extract client emails", "email threads for client", or requests client email communication analysis.

Client Emails Extractor Skill

Thesis: Extract and organize client email communications from ClickHouse CS messages table, with context validation to ensure only client-related emails are included, grouped by thread, filtered from noise, and saved as markdown files.

Overview

graph LR
    A[Client Name] --> B[ClickHouse Query]
    B --> C[Filter Noise]
    C --> D[Validate Context]
    D --> E[Group by Thread]
    E --> F[Save .md Files]
    F --> G[client_cases/CLIENT/emails/]

This skill extracts all email communications with a specific client from internal_analytics_src.src_mv_google_gmail_cs_messages (RAW source with full body), applies noise filtering (auto-replies, calendar invites, Read AI, etc.), validates that emails actually relate to the client (not just shared domains), groups messages by thread, and saves each thread as a markdown file.


1.0 Usage

1.1 Trigger Phrases

  • "extract emails for [CLIENT]"
  • "get client emails [CLIENT]"
  • "download [CLIENT] email threads"
  • "скачай письма [CLIENT]"

1.2 Command Line Usage

python extract_client_emails.py "Kaiser Permanente" \
  --primary "kp.org" \
  --associated "omc.com,hearts-science.com" \
  --keywords "kaiser,kp,hearts,permanente,pmdi" \
  --output "/path/to/output"

1.3 Parameters

| Parameter | Flag | Description | Example | |-----------|------|-------------|---------| | client_name | positional | Client name to search | "Kaiser Permanente" | | --primary | -p | Primary client domains (always trusted) | "kp.org" | | --associated | -a | Agency domains (require context validation) | "omc.com,hearts-science.com" | | --keywords | -k | Context keywords for validation | "kaiser,kp,hearts" | | --output | -o | Output directory | Auto-detected from client_cases | | --months | -m | Months back to search (default: 6) | 6 |


2.0 Context Validation (v2.0.0)

2.1 Problem Solved

Domains like omc.com (Omnicom) are used by multiple clients (Kaiser/Hearts & Science, TBWA Canada, DDB, etc.). Without context validation, emails from other clients would be incorrectly included.

2.2 Validation Strategy

graph TD
    A[Email] --> B{Primary Domain?}
    B -->|Yes kp.org| C[✅ VALID]
    B -->|No| D{Context Keywords?}
    D -->|Yes kaiser/kp/hearts| E[✅ VALID]
    D -->|No| F{Associated Domain?}
    F -->|Yes omc.com| G[❌ REJECTED]
    F -->|No| H[❌ REJECTED]
  1. Primary domains (e.g., kp.org): Always trusted, no additional validation
  2. Associated domains (e.g., omc.com): Require context keywords in subject/body
  3. Context keywords: Client-specific terms that validate relevance

2.3 Example

Email from: contact@example.com
Subject: "Open ticket ISD-19334"
Body: "SFTP connection for TBWA Canada..."

→ Rejected: omc.com is associated domain, but no Kaiser context keywords found

3.0 Data Source (v2.1.0)

3.1 Primary Table - RAW Source

internal_analytics_src.src_mv_google_gmail_cs_messages

v2.1.0 improvement: Uses RAW source table instead of staging view. RAW table contains full email body for ~57% of emails (vs empty body in stg view). Falls back to snippet (~200 chars) when body not available.

Note: The staging view internal_analytics.stg_gmail_cs_messages has empty gmail_msg_body due to Airbyte mapping issues. Always use RAW source for full body.

3.2 RAW Source Fields

| RAW Field | Mapped As | Description | |-----------|-----------|-------------| | id | gmail_msg_id | Unique message ID | | date | gmail_msg_date | Message timestamp | | threadId | gmail_thread_id | Thread grouping ID | | accountEmail | gmail_account_email | CS account email | | `from` | gmail_msg_from_field | Full "Name <email>" | | (parsed) | gmail_msg_from_email | Extracted email only | | `to` | gmail_msg_to_field | Recipients | | subject | gmail_msg_subject | Subject line | | body | gmail_msg_body | FULL body (~57% available) | | snippet | gmail_msg_snippet | First ~200 chars |


4.0 Noise Filtering

4.1 Excluded Subject Patterns

EXCLUDE_SUBJECT_PATTERNS = [
    r'pre-read for your upcoming',      # Read AI
    r'please read this message',        # Read AI spam
    r'your meeting recap',              # Fireflies
    r'read meeting report',             # Read AI
    r'\booo\b.*re:',                    # OOO auto-replies
    r'out of office',
    r'automatic reply',
    r'^invitation\b',                   # Calendar
    r'^updated invitation',
    r'^accepted:',
    r'^declined:',
    r'^canceled:',
    r'^tentative:',
    r'invoice.*is due',                 # Billing auto
    r'payment reminder',
    r'you have a new invoice',
    r'^\[jira\]',                       # JIRA notifications
    r'early payment available',         # Taulia
    r'a new opportunity has been created', # Salesforce
]

4.2 Excluded From Patterns

EXCLUDE_FROM_PATTERNS = [
    r'noreply',
    r'no-reply',
    r'donotreply',
    r'notifications@',
    r'calendar-notification',
    r'fred@fireflies',                  # Fireflies bot
    r'@e\.read\.ai',                    # Read AI
    r'@read\.ai',
    r'billing@improvado',               # Billing auto
    r'@taulia\.com',                    # Payment portal
    r'portal@',
]

5.0 Output Format

5.1 File Structure

client_cases/[CLIENT]/emails/
├── _metadata.json          # Extraction stats
├── 2025_06_13_re_brand_dashboard.md
├── 2025_07_26_status_report.md
└── ...

5.2 Thread File Format

# [Subject]

**Thread:** X messages
**Period:** YYYY-MM-DD → YYYY-MM-DD
**Participants:** email1, email2, ...

---

## Message 1

**From:** Name <email>
**Date:** YYYY-MM-DD HH:MM:SS

[Body or "[Snippet only] snippet text"]

---

## Message 2
...

5.3 Metadata JSON

{
  "client": "Kaiser Permanente",
  "primary_domains": ["kp.org"],
  "associated_domains": ["omc.com", "hearts-science.com"],
  "context_keywords": ["kaiser", "kp", "hearts"],
  "total_raw": 652,
  "after_noise_filter": 518,
  "total_clean": 488,
  "total_threads": 143,
  "excluded_noise": 134,
  "excluded_context": 30,
  "version": "2.0.0"
}

6.0 Limitations

| Limitation | Impact | Workaround | |------------|--------|------------| | Body ~57% available | ~43% emails have only snippet | Airbyte sync limitation - cannot be fixed without pipeline changes | | CS accounts only | Daniel's inbox not included | Use gmail-operations skill for personal | | 6-month default | Older emails not extracted | Adjust --months parameter | | Shared domains | Agency emails mixed | Use --associated + --keywords for context |


Ground Truth

  • Source table: internal_analytics_src.src_mv_google_gmail_cs_messages (RAW)
  • Shard: Palantir (NON-DEFAULT)
  • Sync: Airbyte from Google Workspace
  • Body availability: ~57% have full body, ~43% fallback to snippet
  • Script: .claude/skills/client-emails-extractor/extract_client_emails.py
  • Version: 2.1.0
  • Updated: 2025-12-01