# eGPU Skill
External GPU knowledge for hardware detection, bandwidth analysis, hotplug recovery, and Basin's GPU infrastructure.
## Trigger Conditions
- User asks about eGPU, external GPU, or Thunderbolt GPU
- Working on GPU hotplug, device lifecycle, or workload migration
- Questions about Thunderbolt bandwidth, PCIe tunneling, or GPU enclosures
- Multi-GPU routing, failover, or compute offloading
- Apple Silicon eGPU limitations
- Basin's `basin-display/src/egpu.rs` or `basin-gpu/src/workload_migrator.rs`
## What Is an eGPU?
An external GPU (eGPU) is a desktop-class GPU housed in an external enclosure, connected to a computer via Thunderbolt. It tunnels PCIe over the Thunderbolt cable, giving laptops and compact machines access to high-end GPU compute.
### Physical Stack

```
┌────────────────────┐   Thunderbolt Cable       ┌──────────────────────┐
│ Host Computer      │ ◄═══════════════════════► │ eGPU Enclosure       │
│ (TB Controller)    │   PCIe tunneled over TB   │ (TB Controller)      │
│                    │                           │ ┌────────────────┐   │
│ CPU ◄── PCIe ──►   │                           │ │ Desktop GPU    │   │
│ iGPU               │                           │ │ (RTX 4090,     │   │
│                    │                           │ │  RX 7900 XTX,  │   │
│                    │                           │ │  etc.)         │   │
└────────────────────┘                           │ └────────────────┘   │
                                                 │ PSU (500-700W)       │
                                                 └──────────────────────┘
```
## Thunderbolt Bandwidth

| Version | Total BW | PCIe Tunneled | Effective GPU BW | PCIe Equivalent |
|---------|----------|---------------|------------------|-----------------|
| TB1 | 10 Gbps | ~8 Gbps | ~1.0 GB/s | ~PCIe 2.0 x1 |
| TB2 | 20 Gbps | ~16 Gbps | ~2.0 GB/s | ~PCIe 2.0 x2 |
| TB3 | 40 Gbps | ~22 Gbps | ~2.8 GB/s | ~PCIe 3.0 x2 |
| TB4 | 40 Gbps | ~22 Gbps | ~2.8 GB/s | ~PCIe 3.0 x2 |
| TB5 | 80 Gbps | ~48 Gbps | ~6.0 GB/s | ~PCIe 4.0 x4 |
Internal PCIe 4.0 x16 is ~32 GB/s, so even TB5 delivers roughly 5x less bandwidth than a direct slot.
## Performance Implications
Compute-bound workloads (shader-heavy, large VRAM working sets):
- eGPU performs at 85-95% of internal GPU
- Hash joins, e-graph saturation, matrix multiply, neural net inference
Bandwidth-bound workloads (constant CPU↔GPU data transfer):
- eGPU performs at 15-40% of internal GPU
- Small batch sizes, frequent readback, streaming data
Rule of thumb: If the GPU kernel runs >1ms per dispatch, the TB overhead is negligible.
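The rule of thumb follows from a little arithmetic. A sketch that estimates how much of a dispatch goes to link transfer (the helper name and the ~55% tunneling efficiency are assumptions for illustration, not Basin API):

```rust
/// Fraction of a dispatch spent moving data over the Thunderbolt link.
/// Illustrative helper, not Basin API; assumes ~55% of the raw link rate
/// survives PCIe tunneling (TB3: 40 Gbps -> ~22 Gbps -> ~2.75 GB/s).
fn tb_overhead_fraction(bytes: u64, kernel_us: f64, link_gbps: f64) -> f64 {
    let bytes_per_us = link_gbps * 0.55 * 125.0; // 1 Gbps = 125 bytes/us
    let transfer_us = bytes as f64 / bytes_per_us;
    transfer_us / (transfer_us + kernel_us)
}

fn main() {
    // 1 MiB upload + 1 ms kernel on TB3: transfer is ~28% of the dispatch.
    println!("{:.2}", tb_overhead_fraction(1 << 20, 1000.0, 40.0));
    // Same upload + 50 us kernel: transfer dominates (~88%).
    println!("{:.2}", tb_overhead_fraction(1 << 20, 50.0, 40.0));
}
```

Long kernels amortize the transfer; short, chatty dispatches are where the eGPU penalty lives.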
## OS-Level Detection

### macOS (IOKit)

IORegistry hierarchy:

```
AppleThunderboltNHIType4 (or Type3/Type5)
└── IOThunderboltPort
    └── IOPCIDevice (GPU)
        └── AGXAccelerator or IOGPUDevice
```
- `is_egpu()` checks for a `Thunderbolt` parent service in the IORegistry
- TB version detected via the `AppleThunderboltNHIType{1..5}` class name
- Metal API: `MTLDevice.isRemovable` (Intel Macs only)
Apple Silicon note: macOS officially dropped eGPU support on Apple Silicon. The TB hardware can still tunnel PCIe, but macOS won't render graphics to an external AMD/NVIDIA GPU. Compute-only use requires custom drivers.
### Linux (udev/DRM)

```bash
# Detect Thunderbolt devices
cat /sys/bus/thunderbolt/devices/*/device_name

# Check if GPU is on Thunderbolt bus
udevadm info /sys/class/drm/card1 | grep THUNDERBOLT

# DRM hotplug events
inotifywait /sys/class/drm/
```
- udev rules trigger on TB device add/remove
- DRM subsystem emits hotplug events
- sysfs: `/sys/bus/thunderbolt/devices/*/authorized` gates whether a TB device is authorized
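The `udevadm | grep` check above can be mirrored in code. A sketch that classifies a card from captured `udevadm info` output (the sample property names are hypothetical; real output varies by distro and kernel version):

```rust
/// True if the udevadm property dump mentions a Thunderbolt attachment.
/// Mirrors the `grep THUNDERBOLT` check above; a sketch, not Basin's
/// actual udev integration.
fn is_thunderbolt_gpu(udevadm_output: &str) -> bool {
    udevadm_output.lines().any(|line| line.contains("THUNDERBOLT"))
}

fn main() {
    // Hypothetical property dumps for an eGPU and an internal card.
    let egpu = "E: ID_PATH=pci-0000:05:00.0\nE: CONNECTED_TO_THUNDERBOLT=1";
    let igpu = "E: ID_PATH=pci-0000:01:00.0";
    println!("{} {}", is_thunderbolt_gpu(egpu), is_thunderbolt_gpu(igpu));
}
```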
### wgpu (Cross-Platform)

```rust
// wgpu (0.19+) reports device loss via a dedicated callback; uncaptured
// validation/OOM errors go through Device::on_uncaptured_error instead.
device.set_device_lost_callback(|reason, message| {
    // GPU disconnected - trigger migration
    eprintln!("device lost ({reason:?}): {message}");
});
```
## Basin eGPU Infrastructure

### Key Files

| File | Purpose |
|------|---------|
| crates/basin-display/src/egpu.rs | EGpuManager, EGpuDevice, ThunderboltMesh, AppleSiliconEGpuInfo |
| crates/basin-gpu/src/device_lifecycle.rs | GpuDeviceRegistry, GpuDeviceState, health scoring, simulated GPU |
| crates/basin-gpu/src/workload_migrator.rs | WorkloadMigrator, MigrationConfig, MigrationStrategy (Eager/Lazy/Preemptive/LoadBalanced) |
| crates/basin-gpu/src/workload_checkpoint.rs | CheckpointManager, Checkpointable trait for GPU state snapshots |
| foundation/basin-hardware/src/iokit.rs | IOKit FFI for TB speed detection, is_egpu() |
| docs/design/EGPU_HOTPLUG_ROADMAP.md | 5-phase implementation plan |
### Core Types

```rust
// EGpuDevice - full device info
pub struct EGpuDevice {
    pub gpu: GpuDevice,
    pub enclosure_name: Option<String>,
    pub thunderbolt_port: u8,
    pub thunderbolt_speed: ThunderboltSpeed, // TB1..TB5
    pub power_delivery_watts: Option<u32>,
    pub power_state: EGpuPowerState, // Active/Idle/Suspended/Disconnected
    pub has_displays: bool,
    pub display_ids: Vec<DisplayId>,
}

// EGpuEvent - hotplug lifecycle
pub enum EGpuEvent {
    Connected { device, thunderbolt_port },
    Disconnected { device_id, thunderbolt_port, graceful },
    LinkSpeedChanged { device_id, old_speed, new_speed },
    PowerStateChanged { device_id, state },
    MigrationStarted { device_id, target_gpu },
    MigrationCompleted { device_id, target_gpu, duration_ms },
}

// ThunderboltSpeed - link speed enum
pub enum ThunderboltSpeed { TB1, TB2, TB3, TB4, TB5 }
```
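To connect the link-speed enum back to the bandwidth table, a sketch of a lookup helper (the enum is re-declared locally so the example is self-contained; the method name is an assumption, not necessarily Basin's API):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum ThunderboltSpeed { TB1, TB2, TB3, TB4, TB5 }

impl ThunderboltSpeed {
    /// Effective PCIe-tunneled bandwidth in GB/s, per the table above.
    fn effective_gb_per_s(self) -> f64 {
        match self {
            ThunderboltSpeed::TB1 => 1.0,
            ThunderboltSpeed::TB2 => 2.0,
            // TB4 mandates the TB3 feature set; same tunneled bandwidth.
            ThunderboltSpeed::TB3 | ThunderboltSpeed::TB4 => 2.8,
            ThunderboltSpeed::TB5 => 6.0,
        }
    }
}

fn main() {
    println!("{}", ThunderboltSpeed::TB5.effective_gb_per_s()); // 6
}
```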
### EGpuManager Usage

```rust
use basin_display::egpu::EGpuManager;

let manager = EGpuManager::new(); // Auto-enumerates
manager.start_monitoring()?;      // Hotplug events

// Register callback
manager.on_event(Box::new(|event| match event {
    EGpuEvent::Disconnected { device_id, graceful, .. } => {
        if !graceful {
            // Surprise disconnect - trigger migration
        }
    }
    _ => {}
}));

// Safe disconnect (migrates workloads first)
manager.safe_disconnect(gpu_id)?;
```
### Workload Migration

```rust
use std::time::Duration;

use basin_gpu::workload_migrator::{WorkloadMigrator, MigrationConfig, MigrationStrategy};

let config = MigrationConfig {
    strategy: MigrationStrategy::Eager,
    max_concurrent_migrations: 4,
    migration_timeout: Duration::from_secs(30),
    prefer_cpu_fallback: true,
    ..Default::default()
};

// Migration flow:
// 1. GPU disconnect detected (device_lifecycle callback)
// 2. WorkloadMigrator::on_device_failure called
// 3. Checkpoint GPU state (buffers, compute state)
// 4. Select target device (integrated GPU, another eGPU, or CPU)
// 5. Restore state on target
// 6. Resume execution from checkpoint
```
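Steps 3-5 of the flow can be sketched with an in-memory checkpoint. `Checkpointable` here is a stand-in for the trait in `workload_checkpoint.rs`; the shapes and names are illustrative, not Basin's actual signatures:

```rust
// Stand-in for the Checkpointable trait; a real GPU snapshot would cover
// buffers, bind groups, and compute pipeline state, not one Vec<u8>.
trait Checkpointable {
    fn checkpoint(&self) -> Vec<u8>;
    fn restore(&mut self, snapshot: &[u8]);
}

struct Workload { buffer: Vec<u8> }

impl Checkpointable for Workload {
    fn checkpoint(&self) -> Vec<u8> { self.buffer.clone() }
    fn restore(&mut self, snapshot: &[u8]) { self.buffer = snapshot.to_vec(); }
}

/// Snapshot on the failed device, restore on the target.
fn migrate(failed: &impl Checkpointable, target: &mut impl Checkpointable) {
    let snapshot = failed.checkpoint();
    target.restore(&snapshot);
}

fn main() {
    let failed = Workload { buffer: vec![1, 2, 3] };
    let mut target = Workload { buffer: Vec::new() };
    migrate(&failed, &mut target);
    println!("{:?}", target.buffer); // [1, 2, 3]
}
```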
### Migration Strategies

| Strategy | When | Tradeoff |
|----------|------|----------|
| Eager | Migrate on first warning (high temp, low memory) | More migrations, faster recovery |
| Lazy | Only migrate on device failure | Fewer migrations, risk of data loss |
| Preemptive | Predict failures, migrate in advance | Best UX, complex implementation |
| LoadBalanced | Distribute across all available devices | Best throughput, more complexity |
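The four strategies differ mainly in what triggers a migration. A sketch of that gating logic under assumed names (`DeviceHealth` and its fields are illustrative, not Basin's implementation):

```rust
enum MigrationStrategy { Eager, Lazy, Preemptive, LoadBalanced }

/// Hypothetical per-device signals feeding the decision.
struct DeviceHealth { failed: bool, warning: bool, predicted_failure: bool, load: f32 }

fn should_migrate(strategy: &MigrationStrategy, h: &DeviceHealth) -> bool {
    match strategy {
        // Only act once the device is actually gone.
        MigrationStrategy::Lazy => h.failed,
        // Act on the first warning sign (high temp, low memory).
        MigrationStrategy::Eager => h.failed || h.warning,
        // Also act on predicted failures.
        MigrationStrategy::Preemptive => h.failed || h.warning || h.predicted_failure,
        // Rebalance when a device is overloaded.
        MigrationStrategy::LoadBalanced => h.failed || h.load > 0.8,
    }
}

fn main() {
    let h = DeviceHealth { failed: false, warning: true, predicted_failure: false, load: 0.3 };
    println!("{}", should_migrate(&MigrationStrategy::Eager, &h)); // true
    println!("{}", should_migrate(&MigrationStrategy::Lazy, &h));  // false
}
```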
## Thunderbolt Mesh Networking

Basin supports TB mesh networking for GPU cluster communication (IP over Thunderbolt):

```rust
use basin_display::egpu::ThunderboltMesh;

let mesh = ThunderboltMesh::discover()?;
println!("Active peers: {}", mesh.active_peer_count());
println!("Aggregate bandwidth: {} Gbps", mesh.total_bandwidth_gbps);

if let Some(best) = mesh.best_peer() {
    // Lowest latency peer for frame streaming
    println!("{} @ {}us latency", best.hostname, best.latency_us);
}
```
## Implementation Status (Basin)
| Component | Status |
|-----------|--------|
| Hardware detection (IOKit/udev) | Complete |
| TB speed detection (TB1-TB5) | Complete |
| eGPU detection (is_egpu()) | Complete |
| GPU vendor routing (Metal/wgpu/CUDA) | Complete |
| EGpuManager + events | Complete |
| Device lifecycle registry | Skeleton |
| Workload migration | Missing (types exist, execution not wired) |
| State checkpointing | Missing |
| GPU passthrough for VMs | Missing |
See docs/design/EGPU_HOTPLUG_ROADMAP.md for the 5-phase plan.
## Common eGPU Enclosures

| Enclosure | GPU Support | TB Version | Power | Notes |
|-----------|------------|------------|-------|-------|
| Razer Core X | Full-length, 3-slot | TB3 | 650W PSU | Most popular |
| Sonnet Breakaway | Full-length, 2-slot | TB3 | 550W PSU | Mac-optimized |
| Mantiz Saturn Pro | Full-length, 2-slot | TB3 | 550W PSU | Built-in hub |
| ASUS XG Station Pro | Full-length, 2.5-slot | TB3 | 330W PSU | Compact |
## When to Use eGPU vs Cloud GPU

| Factor | eGPU | Cloud GPU (A100/H100) |
|--------|------|----------------------|
| Latency | <1ms (local TB) | 1-50ms (network) |
| Bandwidth | 2.8-6.0 GB/s | Network-limited |
| Cost | One-time ($300-2000) | Per-hour ($1-30/hr) |
| Availability | Always on | Capacity-dependent |
| VRAM | Consumer (8-24GB) | Enterprise (40-80GB) |
| Best for | Dev/test, small models | Training, large models |
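One way to read the cost row: compute how many cloud-hours equal a one-time eGPU purchase. A back-of-envelope sketch using figures from the ranges in the table:

```rust
/// Hours of cloud GPU rental that equal a one-time eGPU purchase.
/// Planning arithmetic only; ignores power, depreciation, and resale.
fn break_even_hours(egpu_cost_usd: f64, cloud_usd_per_hr: f64) -> f64 {
    egpu_cost_usd / cloud_usd_per_hr
}

fn main() {
    // A $1500 enclosure + GPU vs a $2/hr cloud instance: 750 hours.
    println!("{}", break_even_hours(1500.0, 2.0));
}
```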