Agent Skills: eGPU Skill

External GPU technology fundamentals, Thunderbolt bandwidth math, hotplug detection, workload migration, and Basin's eGPU infrastructure (basin-display, basin-gpu). Use when working with eGPU detection, GPU failover, Thunderbolt networking, or multi-GPU workload routing.

ID: plurigrid/asi/egpu

Install this agent skill locally:

pnpm dlx add-skill https://github.com/plurigrid/asi/tree/HEAD/skills/egpu

Skill Files


skills/egpu/SKILL.md

Skill Metadata

Name
egpu
Description
"External GPU technology fundamentals, Thunderbolt bandwidth math, hotplug detection, workload migration, and Basin's eGPU infrastructure (basin-display, basin-gpu). Use when working with eGPU detection, GPU failover, Thunderbolt networking, or multi-GPU workload routing."

eGPU Skill

External GPU knowledge for hardware detection, bandwidth analysis, hotplug recovery, and Basin's GPU infrastructure.

Trigger Conditions

  • User asks about eGPU, external GPU, or Thunderbolt GPU
  • Working on GPU hotplug, device lifecycle, or workload migration
  • Questions about Thunderbolt bandwidth, PCIe tunneling, or GPU enclosures
  • Multi-GPU routing, failover, or compute offloading
  • Apple Silicon eGPU limitations
  • Basin's basin-display/src/egpu.rs or basin-gpu/src/workload_migrator.rs

What Is an eGPU?

An external GPU (eGPU) is a desktop-class GPU housed in an external enclosure, connected to a computer via Thunderbolt. It tunnels PCIe over the Thunderbolt cable, giving laptops and compact machines access to high-end GPU compute.

Physical Stack

```
┌────────────────────┐     Thunderbolt Cable      ┌──────────────────────┐
│  Host Computer     │  ◄═══════════════════════► │  eGPU Enclosure      │
│  (TB Controller)   │   PCIe tunneled over TB    │  (TB Controller)     │
│                    │                            │  ┌────────────────┐  │
│  CPU ◄── PCIe ──►  │                            │  │ Desktop GPU    │  │
│  iGPU              │                            │  │ (RTX 4090,     │  │
│                    │                            │  │  RX 7900 XTX,  │  │
│                    │                            │  │  etc.)         │  │
└────────────────────┘                            │  └────────────────┘  │
                                                  │  PSU (500-700W)      │
                                                  └──────────────────────┘
```

Thunderbolt Bandwidth

| Version | Total BW | PCIe Tunneled | Effective GPU BW | PCIe Equivalent |
|---------|----------|---------------|------------------|-----------------|
| TB1 | 10 Gbps | ~8 Gbps | ~1.0 GB/s | ~PCIe 2.0 x1 |
| TB2 | 20 Gbps | ~16 Gbps | ~2.0 GB/s | ~PCIe 2.0 x2 |
| TB3 | 40 Gbps | ~22 Gbps | ~2.8 GB/s | ~PCIe 3.0 x2 |
| TB4 | 40 Gbps | ~22 Gbps | ~2.8 GB/s | ~PCIe 3.0 x2 |
| TB5 | 80 Gbps | ~48 Gbps | ~6.0 GB/s | ~PCIe 4.0 x4 |

Internal PCIe 4.0 x16 is ~32 GB/s, so even TB5 delivers roughly one-fifth the bandwidth of a direct slot.
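The effective-bandwidth column is just the tunneled PCIe rate divided by 8 bits per byte. A minimal sketch of that conversion (the helper name is illustrative, not Basin's API):

```rust
/// Effective GPU bandwidth in GB/s from the PCIe rate tunneled over the link.
/// Illustrative helper for back-of-envelope math, not part of Basin.
fn effective_gb_per_s(pcie_tunneled_gbps: f64) -> f64 {
    pcie_tunneled_gbps / 8.0 // 8 bits per byte
}

fn main() {
    // TB3/TB4 tunnel ~22 Gbps of PCIe traffic:
    println!("TB3: {:.2} GB/s", effective_gb_per_s(22.0)); // 2.75 GB/s
    // TB5 tunnels ~48 Gbps:
    println!("TB5: {:.2} GB/s", effective_gb_per_s(48.0)); // 6.00 GB/s
}
```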

Performance Implications

Compute-bound workloads (shader-heavy, large VRAM working sets):

  • eGPU performs at 85-95% of internal GPU
  • Hash joins, e-graph saturation, matrix multiply, neural net inference

Bandwidth-bound workloads (constant CPU↔GPU data transfer):

  • eGPU performs at 15-40% of internal GPU
  • Small batch sizes, frequent readback, streaming data

Rule of thumb: If the GPU kernel runs >1ms per dispatch, the TB overhead is negligible.
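One way to apply that rule is to compare per-dispatch transfer time over the link to kernel time. A sketch with an illustrative helper (not Basin's API), using the TB3 figure from the table:

```rust
/// Milliseconds to move `bytes` over a link of `link_gb_per_s` (decimal GB/s).
/// Illustrative helper for sizing eGPU workloads.
fn transfer_ms(bytes: u64, link_gb_per_s: f64) -> f64 {
    bytes as f64 / (link_gb_per_s * 1e9) * 1e3
}

fn main() {
    // Streaming 64 MiB per dispatch over TB3 (~2.8 GB/s):
    let t = transfer_ms(64 * 1024 * 1024, 2.8);
    println!("{t:.1} ms per transfer"); // ~24 ms: bandwidth-bound next to a 1 ms kernel
    // A 64 KiB uniform upload is negligible by comparison:
    println!("{:.3} ms", transfer_ms(64 * 1024, 2.8));
}
```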

OS-Level Detection

macOS (IOKit)

```
IORegistry hierarchy:
  AppleThunderboltNHIType4 (or Type3/Type5)
    └── IOThunderboltPort
        └── IOPCIDevice (GPU)
            └── AGXAccelerator or IOGPUDevice
```
  • is_egpu() checks for Thunderbolt parent service in IORegistry
  • TB version detected via AppleThunderboltNHIType{1..5} class name
  • Metal API: MTLDevice.isRemovable (Intel Macs only)

Apple Silicon note: macOS officially dropped eGPU support on Apple Silicon. The TB hardware can still tunnel PCIe, but macOS won't render graphics to an external AMD/NVIDIA GPU. Compute-only use requires custom drivers.

Linux (udev/DRM)

```bash
# Detect Thunderbolt devices
cat /sys/bus/thunderbolt/devices/*/device_name

# Check if GPU is on Thunderbolt bus
udevadm info /sys/class/drm/card1 | grep THUNDERBOLT

# DRM hotplug events
inotifywait /sys/class/drm/
```
  • udev rules trigger on TB device add/remove
  • DRM subsystem emits hotplug events
  • sysfs: /sys/bus/thunderbolt/devices/*/authorized
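The udevadm check above boils down to scanning the property dump for a Thunderbolt marker. A hedged sketch of that classification (the sample string is illustrative, not captured output):

```rust
/// Returns true if a udevadm property dump mentions the Thunderbolt bus.
/// Sketch only; feed it the output of `udevadm info /sys/class/drm/cardN`.
fn is_thunderbolt_gpu(udev_props: &str) -> bool {
    udev_props.lines().any(|line| line.contains("THUNDERBOLT"))
}

fn main() {
    // Illustrative property dump, not real udev output:
    let sample = "E: ID_PATH=pci-0000:05:00.0\n\
                  E: ID_PCI_CLASS_FROM_DATABASE=Display controller\n\
                  E: TAGS=:seat:THUNDERBOLT:";
    println!("eGPU candidate: {}", is_thunderbolt_gpu(sample)); // true
}
```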

wgpu (Cross-Platform)

```rust
// wgpu device-lost callback (exact signature varies by wgpu version;
// recent releases expose Device::set_device_lost_callback rather than
// a DeviceLost variant on wgpu::Error)
device.set_device_lost_callback(|reason, message| {
    // GPU disconnected - trigger migration
    eprintln!("device lost ({reason:?}): {message}");
});
```


Basin eGPU Infrastructure

Key Files

| File | Purpose |
|------|---------|
| crates/basin-display/src/egpu.rs | EGpuManager, EGpuDevice, ThunderboltMesh, AppleSiliconEGpuInfo |
| crates/basin-gpu/src/device_lifecycle.rs | GpuDeviceRegistry, GpuDeviceState, health scoring, simulated GPU |
| crates/basin-gpu/src/workload_migrator.rs | WorkloadMigrator, MigrationConfig, MigrationStrategy (Eager/Lazy/Preemptive/LoadBalanced) |
| crates/basin-gpu/src/workload_checkpoint.rs | CheckpointManager, Checkpointable trait for GPU state snapshots |
| foundation/basin-hardware/src/iokit.rs | IOKit FFI for TB speed detection, is_egpu() |
| docs/design/EGPU_HOTPLUG_ROADMAP.md | 5-phase implementation plan |

Core Types

```rust
// EGpuDevice - full device info
pub struct EGpuDevice {
    pub gpu: GpuDevice,
    pub enclosure_name: Option<String>,
    pub thunderbolt_port: u8,
    pub thunderbolt_speed: ThunderboltSpeed,  // TB1..TB5
    pub power_delivery_watts: Option<u32>,
    pub power_state: EGpuPowerState,          // Active/Idle/Suspended/Disconnected
    pub has_displays: bool,
    pub display_ids: Vec<DisplayId>,
}

// EGpuEvent - hotplug lifecycle
pub enum EGpuEvent {
    Connected { device, thunderbolt_port },
    Disconnected { device_id, thunderbolt_port, graceful },
    LinkSpeedChanged { device_id, old_speed, new_speed },
    PowerStateChanged { device_id, state },
    MigrationStarted { device_id, target_gpu },
    MigrationCompleted { device_id, target_gpu, duration_ms },
}

// ThunderboltSpeed - link speed enum
pub enum ThunderboltSpeed { TB1, TB2, TB3, TB4, TB5 }
```
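The ThunderboltSpeed variants map directly onto the bandwidth table earlier in this document. A self-contained sketch of that mapping (the real type lives in basin-display; this mirror is for illustration):

```rust
// Self-contained mirror of ThunderboltSpeed, mapped to the nominal
// total link bandwidth from the Thunderbolt Bandwidth table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ThunderboltSpeed { TB1, TB2, TB3, TB4, TB5 }

impl ThunderboltSpeed {
    /// Nominal total link bandwidth in Gbps.
    pub fn total_gbps(self) -> u32 {
        match self {
            ThunderboltSpeed::TB1 => 10,
            ThunderboltSpeed::TB2 => 20,
            ThunderboltSpeed::TB3 | ThunderboltSpeed::TB4 => 40,
            ThunderboltSpeed::TB5 => 80,
        }
    }
}

fn main() {
    println!("TB5 link: {} Gbps", ThunderboltSpeed::TB5.total_gbps()); // 80 Gbps
}
```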

EGpuManager Usage

```rust
use basin_display::egpu::EGpuManager;

let manager = EGpuManager::new();  // Auto-enumerates
manager.start_monitoring()?;       // Hotplug events

// Register callback
manager.on_event(Box::new(|event| match event {
    EGpuEvent::Disconnected { device_id, graceful, .. } => {
        if !graceful {
            // Surprise disconnect - trigger migration
        }
    }
    _ => {}
}));

// Safe disconnect (migrates workloads first)
manager.safe_disconnect(gpu_id)?;
```

Workload Migration

```rust
use std::time::Duration;

use basin_gpu::workload_migrator::{WorkloadMigrator, MigrationConfig, MigrationStrategy};

let config = MigrationConfig {
    strategy: MigrationStrategy::Eager,
    max_concurrent_migrations: 4,
    migration_timeout: Duration::from_secs(30),
    prefer_cpu_fallback: true,
    ..Default::default()
};

// Migration flow:
// 1. GPU disconnect detected (device_lifecycle callback)
// 2. WorkloadMigrator::on_device_failure called
// 3. Checkpoint GPU state (buffers, compute state)
// 4. Select target device (integrated GPU, another eGPU, or CPU)
// 5. Restore state on target
// 6. Resume execution from checkpoint
```

Migration Strategies

| Strategy | When | Tradeoff |
|----------|------|----------|
| Eager | Migrate on first warning (high temp, low memory) | More migrations, faster recovery |
| Lazy | Only migrate on device failure | Fewer migrations, risk of data loss |
| Preemptive | Predict failures, migrate in advance | Best UX, complex implementation |
| LoadBalanced | Distribute across all available devices | Best throughput, more complexity |
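The strategies differ mainly in which device signals trigger a migration. A hedged sketch of that dispatch (the signal names are illustrative, not Basin's exact event types):

```rust
#[derive(Debug, Clone, Copy)]
enum MigrationStrategy { Eager, Lazy, Preemptive, LoadBalanced }

/// Illustrative device signals; Basin's real event set may differ.
#[derive(Debug, Clone, Copy)]
enum DeviceSignal { Warning, PredictedFailure, Failure, Rebalance }

/// Should this strategy start a migration for this signal?
fn should_migrate(strategy: MigrationStrategy, signal: DeviceSignal) -> bool {
    use {DeviceSignal::*, MigrationStrategy::*};
    match (strategy, signal) {
        (_, Failure) => true,                   // every strategy reacts to failure
        (Eager, Warning) => true,               // migrate on first warning
        (Preemptive, PredictedFailure) => true, // migrate ahead of predicted failure
        (LoadBalanced, Rebalance) => true,      // redistribute on load signals
        _ => false,                             // Lazy ignores everything but failure
    }
}

fn main() {
    let lazy = should_migrate(MigrationStrategy::Lazy, DeviceSignal::Warning);
    println!("Lazy migrates on warning: {lazy}"); // false
}
```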

Thunderbolt Mesh Networking

Basin supports TB mesh networking for GPU cluster communication (IP over Thunderbolt):

```rust
use basin_display::egpu::ThunderboltMesh;

let mesh = ThunderboltMesh::discover()?;
println!("Active peers: {}", mesh.active_peer_count());
println!("Aggregate bandwidth: {} Gbps", mesh.total_bandwidth_gbps);

if let Some(best) = mesh.best_peer() {
    // Lowest latency peer for frame streaming
    println!("{} @ {}us latency", best.hostname, best.latency_us);
}
```


Implementation Status (Basin)

| Component | Status |
|-----------|--------|
| Hardware detection (IOKit/udev) | Complete |
| TB speed detection (TB1-TB5) | Complete |
| eGPU detection (is_egpu()) | Complete |
| GPU vendor routing (Metal/wgpu/CUDA) | Complete |
| EGpuManager + events | Complete |
| Device lifecycle registry | Skeleton |
| Workload migration | Missing (types exist, execution not wired) |
| State checkpointing | Missing |
| GPU passthrough for VMs | Missing |

See docs/design/EGPU_HOTPLUG_ROADMAP.md for the 5-phase plan.

Common eGPU Enclosures

| Enclosure | GPU Support | TB Version | Power | Notes |
|-----------|------------|------------|-------|-------|
| Razer Core X | Full-length, 3-slot | TB3 | 650W PSU | Most popular |
| Sonnet Breakaway | Full-length, 2-slot | TB3 | 550W PSU | Mac-optimized |
| Mantiz Saturn Pro | Full-length, 2-slot | TB3 | 550W PSU | Built-in hub |
| ASUS XG Station Pro | Full-length, 2.5-slot | TB3 | 330W PSU | Compact |

When to Use eGPU vs Cloud GPU

| Factor | eGPU | Cloud GPU (A100/H100) |
|--------|------|----------------------|
| Latency | <1ms (local TB) | 1-50ms (network) |
| Bandwidth | 2.8-6.0 GB/s | Network-limited |
| Cost | One-time ($300-2000) | Per-hour ($1-30/hr) |
| Availability | Always on | Capacity-dependent |
| VRAM | Consumer (8-24GB) | Enterprise (40-80GB) |
| Best for | Dev/test, small models | Training, large models |
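A quick way to read the cost row: divide the one-time eGPU price by the cloud hourly rate to get break-even usage hours. A sketch with numbers drawn from the ranges above (the helper is illustrative, not part of Basin):

```rust
/// Hours of cloud GPU rental at `cloud_per_hour` dollars that equal a
/// one-time eGPU outlay of `egpu_cost` dollars. Illustrative helper only.
fn break_even_hours(egpu_cost: f64, cloud_per_hour: f64) -> f64 {
    egpu_cost / cloud_per_hour
}

fn main() {
    // A $1500 enclosure + GPU vs a $3/hr cloud instance:
    println!("{} hours", break_even_hours(1500.0, 3.0)); // 500 hours
}
```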