CSV/Excel Merger
Intelligently merge multiple CSV or Excel files with automatic column matching and data deduplication.
Instructions
When a user needs to merge CSV or Excel files:
-
Analyze Input Files:
- How many files need to be merged?
- What format (CSV, Excel, TSV)?
- Are the files provided or need to be read from disk?
- Do columns have the same names across files?
- What is the primary key (unique identifier)?
-
Inspect File Structures:
- Read headers from each file
- Identify column names and data types
- Detect encoding (UTF-8, Latin-1, etc.)
- Check for missing columns
- Look for duplicate column names
-
Create Merge Strategy:
Column Matching:
- Exact name match: "email" = "email"
- Case-insensitive: "Email" = "email"
- Fuzzy match: "E-mail" ≈ "email"
- Common patterns:
- "first_name", "firstname", "First Name" → "first_name"
- "phone", "phone_number", "tel" → "phone"
- "email", "e-mail", "email_address" → "email"
Conflict Resolution (when same record appears in multiple files):
- Keep first: Use value from first file
- Keep last: Use value from last file (most recent)
- Keep longest: Use most complete value
- Manual review: Flag conflicts for user review
- Merge: Combine non-conflicting fields
Deduplication:
- Identify duplicate rows based on primary key
- Options: keep first, keep last, keep all, merge values
- Track source file for each row
-
Perform Merge:
# Example merge logic import pandas as pd # Read files df1 = pd.read_csv('file1.csv') df2 = pd.read_csv('file2.csv') # Normalize column names df1.columns = df1.columns.str.lower().str.strip() df2.columns = df2.columns.str.lower().str.strip() # Map similar columns column_mapping = { 'firstname': 'first_name', 'e_mail': 'email', # ... } df2 = df2.rename(columns=column_mapping) # Merge merged = pd.concat([df1, df2], ignore_index=True) # Deduplicate merged = merged.drop_duplicates(subset=['email'], keep='last') # Save merged.to_csv('merged_output.csv', index=False) -
Format Output:
📊 CSV/EXCEL MERGER REPORT ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 📁 INPUT FILES ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ File 1: contacts_jan.csv Rows: 1,245 Columns: 8 (name, email, phone, company, ...) File 2: contacts_feb.csv Rows: 987 Columns: 9 (firstname, lastname, email, mobile, ...) File 3: leads_export.xlsx Rows: 2,103 Columns: 12 (full_name, email_address, phone, ...) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🔄 COLUMN MAPPING ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Unified Schema: • first_name ← [firstname, first name, fname] • last_name ← [lastname, last name, lname] • email ← [email, e-mail, email_address] • phone ← [phone, mobile, phone_number, tel] • company ← [company, organization, org] • title ← [title, job_title, position] • source ← [file origin tracking] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 🔍 MERGE ANALYSIS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Total rows before merge: 4,335 Duplicate records found: 892 Conflicts detected: 47 Deduplication Strategy: Keep most recent (by source file date) Primary Key: email ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ⚠️ CONFLICTS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Record: john.doe@example.com File 1 phone: (555) 123-4567 File 2 phone: (555) 987-6543 Resolution: Kept most recent (File 2) [List top 10 conflicts] ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ ✅ MERGE RESULTS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ Output File: merged_contacts.csv Total Rows: 3,443 Columns: 7 Duplicates Removed: 892 Breakdown by Source: • contacts_jan.csv: 1,245 rows (398 unique) • contacts_feb.csv: 987 rows (521 unique) • leads_export.xlsx: 2,103 rows (2,524 unique) Data Quality: • Email completeness: 98.2% • Phone completeness: 87.5% • Company completeness: 91.3% ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 💡 RECOMMENDATIONS ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ • Review 47 conflict records manually • Standardize phone number format • Fill missing company names (8.7% incomplete) • Export conflicts to: conflicts_review.csv -
Handle Special Cases:
Multiple Primary Keys:
- Use compound keys: (email + company)
- Offer options when ambiguous
Different Data Types:
- Convert dates to standard format
- Normalize phone numbers
- Standardize country codes
- Clean whitespace and casing
Missing Columns:
- Fill with empty values
- Flag missing data
- Offer to create new columns
Large Files:
- Use chunking for files > 100MB
- Show progress indicator
- Estimate memory usage
-
Generate Code: Provide Python/pandas script that:
- Reads all files
- Performs intelligent column matching
- Deduplicates based on strategy
- Resolves conflicts
- Saves merged output
- Generates detailed report
-
Export Options:
- CSV (UTF-8)
- Excel (.xlsx)
- JSON
- SQL INSERT statements
- Parquet (for large datasets)
Example Triggers
- "Merge these three CSV files"
- "Combine multiple Excel sheets into one file"
- "Deduplicate and merge customer data"
- "Join spreadsheets with different column names"
- "Consolidate contact lists from different sources"
Best Practices
Column Matching:
- Use fuzzy matching for similar names
- Maintain original column name mapping report
- Allow manual override of auto-matching
Data Quality:
- Trim whitespace
- Standardize formats (phone, email, dates)
- Detect and flag invalid data
- Preserve data types
Performance:
- Use chunking for large files
- Process in batches
- Show progress for long operations
- Optimize memory usage
Transparency:
- Log all merge decisions
- Track source file for each row
- Report conflicts and resolutions
- Generate detailed merge report
Output Quality
Ensure merges:
- Intelligently match columns
- Handle different schemas
- Deduplicate properly
- Preserve data integrity
- Flag conflicts for review
- Generate comprehensive report
- Maintain data quality
- Track data lineage (source)
- Handle edge cases gracefully
- Provide validation statistics
Generate clean, deduplicated merged files with full transparency and data quality checks.