Agent Skills: EdgeVLA Skill

EdgeVLA - Open-source edge vision-language-action model for robotics. Standardizes Open-X Embodiment datasets for consistent VLA training and deployment.

ID: plurigrid/asi/evla-vla

Install this agent skill locally:

pnpm dlx add-skill https://github.com/plurigrid/asi/tree/HEAD/plugins/asi/skills/evla-vla

Skill Files

Browse the full folder contents for evla-vla.


plugins/asi/skills/evla-vla/SKILL.md

Skill Metadata

Name
evla-vla
Description
EdgeVLA - Open-source edge vision-language-action model for robotics. Standardizes Open-X Embodiment datasets for consistent VLA training and deployment.

EdgeVLA Skill

  • Trit: -1 (MINUS: analysis/verification)
  • Color: #DBA51D (Golden Yellow)
  • URI: skill://evla-vla#DBA51D

Overview

EdgeVLA is an open-source edge vision-language-action model for robotics. It standardizes diverse robotics datasets from the Open-X Embodiment (OXE) collection for consistent training and deployment.

Architecture

┌────────────────────────────────────────────────────────────────┐
│                     EdgeVLA ARCHITECTURE                        │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │              Open-X Embodiment Datasets                   │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐         │  │
│  │  │ DROID   │ │ Bridge  │ │ LIBERO  │ │ RT-X    │ + 60... │  │
│  │  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘         │  │
│  └───────┼───────────┼───────────┼───────────┼──────────────┘  │
│          │           │           │           │                  │
│          ▼           ▼           ▼           ▼                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │           OXE_DATASET_CONFIGS Standardization            │  │
│  │  • image_obs_keys: primary, secondary, wrist cameras     │  │
│  │  • state_encoding: POS_EULER, POS_QUAT, JOINT           │  │
│  │  • action_encoding: EEF_POS, JOINT_POS                  │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                  Unified Data Format                      │  │
│  │  ┌─────────────────────────────────────────────────────┐ │  │
│  │  │ Images: resized, normalized, multi-view             │ │  │
│  │  │ States: 8-dim standardized proprioception           │ │  │
│  │  │ Actions: 7-dim EEF or joint actions                 │ │  │
│  │  └─────────────────────────────────────────────────────┘ │  │
│  └──────────────────────────────────────────────────────────┘  │
│                              │                                  │
│                              ▼                                  │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                     VLA Model                             │  │
│  │  Vision Encoder → Language Model → Action Decoder        │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘

Dataset Configuration

from evla.config import OXE_DATASET_CONFIGS, StateEncoding, ActionEncoding

# DROID dataset configuration
droid_config = OXE_DATASET_CONFIGS["droid"]
# {
#     "image_obs_keys": {
#         "primary": "exterior_image_1_left",
#         "secondary": "exterior_image_2_left",
#         "wrist": "wrist_image_left",
#     },
#     "state_encoding": StateEncoding.POS_QUAT,
#     "action_encoding": ActionEncoding.EEF_POS,
# }

# Bridge dataset configuration
bridge_config = OXE_DATASET_CONFIGS["bridge"]
# {
#     "image_obs_keys": {
#         "primary": "image_0",
#         "wrist": "image_1",
#     },
#     "state_encoding": StateEncoding.POS_EULER,
#     "action_encoding": ActionEncoding.EEF_POS,
# }
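
The per-dataset `image_obs_keys` mapping is what makes standardization possible: each dataset's native camera keys are remapped to the unified view names (`primary`, `secondary`, `wrist`). A minimal sketch of that remapping, using a DROID-like config mirroring the commented entry above (the raw observation values here are illustrative placeholders):

```python
# Mirrors the commented OXE_DATASET_CONFIGS["droid"] image_obs_keys above.
droid_image_obs_keys = {
    "primary": "exterior_image_1_left",
    "secondary": "exterior_image_2_left",
    "wrist": "wrist_image_left",
}

def standardize_images(raw_obs: dict, image_obs_keys: dict) -> dict:
    """Remap dataset-specific camera keys to the unified view names,
    skipping views the dataset does not provide."""
    return {
        view: raw_obs[raw_key]
        for view, raw_key in image_obs_keys.items()
        if raw_key in raw_obs
    }

# Illustrative raw observation from a DROID-style episode.
raw_obs = {
    "exterior_image_1_left": "img_a",
    "exterior_image_2_left": "img_b",
    "wrist_image_left": "img_c",
}
standardized = standardize_images(raw_obs, droid_image_obs_keys)
# standardized == {"primary": "img_a", "secondary": "img_b", "wrist": "img_c"}
```

A Bridge-style config with only `primary` and `wrist` keys would produce a two-view dict from the same function, which is why downstream code can treat every dataset uniformly.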

Named Mixtures

from evla.config import OXE_NAMED_MIXTURES

# Comprehensive multi-dataset training
oxe_magic_soup = OXE_NAMED_MIXTURES["oxe_magic_soup"]

# RT-X reproduction
rtx_mixture = OXE_NAMED_MIXTURES["rtx"]

# Custom mixture with weights
custom_mixture = {
    "droid": 1.0,
    "bridge": 0.5,
    "libero": 0.3,
}
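
Mixture weights are relative, so they are typically normalized into per-dataset sampling probabilities before interleaving. A sketch under that assumption (the normalization helper is hypothetical, not part of the evla API):

```python
def mixture_probabilities(mixture: dict) -> dict:
    """Normalize relative per-dataset weights into sampling probabilities."""
    total = sum(mixture.values())
    return {name: weight / total for name, weight in mixture.items()}

custom_mixture = {"droid": 1.0, "bridge": 0.5, "libero": 0.3}
probs = mixture_probabilities(custom_mixture)
# DROID contributes 1.0 / 1.8 ≈ 0.556 of sampled batches.
```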

Usage

import torch

from evla import EdgeVLA, DataLoader

# Load model
model = EdgeVLA.from_pretrained("kscale/evla-base")

# Create dataloader with mixture
loader = DataLoader(
    mixture="oxe_magic_soup",
    batch_size=32,
    image_size=(224, 224),
)

# Training loop
for batch in loader:
    images = batch["images"]  # (B, V, H, W, C)
    states = batch["states"]  # (B, 8)
    actions = batch["actions"]  # (B, 7)
    
    loss = model.train_step(images, states, actions)

# Inference (camera and robot stand in for deployment-specific interfaces)
with torch.no_grad():
    image = camera.capture()
    state = robot.get_state()
    action = model.predict(image, state, "pick up the red block")
    robot.execute(action)

Key Contributors

  • budzianowski: Core architecture, dataset configs, finetuning
  • moojink: LIBERO eval, dataset transforms
  • WT-MM: README, integration

GF(3) Triads

This skill participates in balanced triads:

evla-vla (-1) ⊗ kos-firmware (+1) ⊗ mujoco-scenes (0) = 0 ✓
ksim-rl (-1) ⊗ topos-generate (+1) ⊗ evla-vla (-1) = -1 ✗ (needs balancing)
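
The balance rule above can be checked mechanically: a triad is balanced exactly when its trits sum to 0 in GF(3). A sketch, with trit values taken from the triads listed above (the helper name is hypothetical):

```python
def triad_balanced(trits) -> bool:
    """A triad is balanced when its trits sum to 0 modulo 3."""
    return sum(trits) % 3 == 0

# evla-vla (-1) ⊗ kos-firmware (+1) ⊗ mujoco-scenes (0): balanced
triad_balanced([-1, +1, 0])   # True

# ksim-rl (-1) ⊗ topos-generate (+1) ⊗ evla-vla (-1): needs balancing
triad_balanced([-1, +1, -1])  # False
```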

Related Skills

  • kos-firmware (+1): Robot firmware for deployment
  • ksim-rl (-1): RL training for locomotion
  • kbot-humanoid (-1): K-Bot configuration
  • mujoco-scenes (0): Scene composition

References

@misc{evla2024,
  title={EdgeVLA: Open-Source Edge Vision-Language-Action Model},
  author={K-Scale Labs},
  year={2024},
  url={https://github.com/kscalelabs/evla}
}

@article{openvla2024,
  title={OpenVLA: An Open-Source Vision-Language-Action Model},
  author={Kim, Moo Jin and others},
  journal={arXiv:2406.09246},
  year={2024}
}