Hugging Face Jobs Training Manager Skill

Run Unsloth training on Hugging Face Jobs (cloud GPU training).

Prerequisites

HF Authentication: run huggingface-cli whoami to confirm you are logged in; if not, run huggingface-cli login
HF Jobs Access: requires a PRO subscription or organization compute access
Training notebook/script: from funsloth-train

Workflow

1. Select Hardware

| GPU | VRAM | Cost | Best For |
|-----|------|------|----------|
| A10G | 24GB | ~$1.50/hr | 7-14B LoRA |
| A100 40GB | 40GB | ~$4/hr | 14-34B |
| A100 80GB | 80GB | ~$6/hr | 70B |
| H100 | 80GB | ~$8/hr | Fastest |

See references/HARDWARE_GUIDE.md for model-to-GPU mapping.
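
The table can be read as a simple lookup from model size to the cheapest listed GPU. The sketch below is one way to encode that, assuming the "Best For" column gives an upper bound in billions of parameters. The function name pick_gpu and the size thresholds are illustrative, not part of the skill, and the H100 row is left out because the table lists it for speed rather than for a size range.

```python
# Rows copied from the hardware table: (name, VRAM in GB, approx. USD/hr,
# assumed upper bound on model size in billions of parameters).
GPU_OPTIONS = [
    ("A10G", 24, 1.50, 14),
    ("A100 40GB", 40, 4.00, 34),
    ("A100 80GB", 80, 6.00, 70),
]


def pick_gpu(model_size_b: float) -> str:
    """Return the cheapest listed GPU whose assumed size bound covers the model."""
    if model_size_b <= 0:
        raise ValueError("model_size_b must be positive")
    # Options are ordered from cheapest to most expensive.
    for name, _vram_gb, _usd_per_hour, max_size_b in GPU_OPTIONS:
        if model_size_b <= max_size_b:
            return name
    raise ValueError(f"No listed GPU covers a {model_size_b}B model")
```

A 14B model appears in two rows of the table; this sketch resolves the overlap in favor of the cheaper A10G, which only makes sense for LoRA training as the table notes.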

2. Convert Notebook to Script

HF Jobs runs a standalone script, so declare its dependencies inline as a PEP 723 metadata block at the top of the file:

# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git",
#     "torch>=2.0",
#     "transformers>=4.45",
#     "trl>=0.12",
#     "peft>=0.13",
#     "datasets>=2.18",
# ]
# ///

Use scripts/train_sft.py as a template.

3. Optional: WandB Integration

Add to script:

import wandb
wandb.init(project="funsloth-training")
# Then pass report_to="wandb" to TrainingArguments so metrics are logged

Set the API key in the environment the script runs in: export WANDB_API_KEY="your-key"

4. Estimate Costs

Use the cost estimator:

python scripts/estimate_cost.py --tokens {total_tokens} --platform hfjobs
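
The internals of scripts/estimate_cost.py are not shown here, but the underlying arithmetic is training hours multiplied by the hourly rate. The sketch below illustrates that calculation under stated assumptions: tokens_per_second is a throughput you have to measure or guess for your model and GPU, and the function name estimate_cost_usd is hypothetical, not the script's actual interface.

```python
def estimate_cost_usd(
    total_tokens: int,
    tokens_per_second: float,  # assumed throughput; measure it on a short run
    usd_per_hour: float,       # approximate rate from the hardware table
    epochs: int = 1,
) -> float:
    """Rough cost: (tokens * epochs / throughput) hours times the hourly rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    hours = total_tokens * epochs / tokens_per_second / 3600
    return round(hours * usd_per_hour, 2)
```

For example, 36 million tokens at an assumed 1,000 tokens per second takes 10 hours, which at the A10G rate of about $1.50/hr comes to roughly $15.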

5. Launch Job

# Create job config
cat > job_config.yaml << 'EOF'
compute:
  gpu: {gpu_type}
  gpu_count: 1
script: train_hfjobs.py
outputs:
  - /outputs/*
EOF

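# Submit (command as written for this skill; newer huggingface_hub releases
# expose jobs under `hf jobs`, so confirm the syntax with `hf jobs --help`)
huggingface-cli jobs create --config job_config.yaml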
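
The {gpu_type} placeholder has to be filled in before the config is written. One way to do that from Python is sketched below, mirroring the YAML above field for field. The function name render_job_config is hypothetical, and the identifier passed as gpu_type (for example "a10g") is an assumption that should be checked against what the platform accepts.

```python
from pathlib import Path

# Same fields as the job config shown above, with placeholders to fill in.
JOB_CONFIG_TEMPLATE = """\
compute:
  gpu: {gpu_type}
  gpu_count: {gpu_count}
script: {script}
outputs:
  - /outputs/*
"""


def render_job_config(
    gpu_type: str,
    script: str = "train_hfjobs.py",
    gpu_count: int = 1,
) -> str:
    """Return the job config text with every placeholder substituted."""
    if not gpu_type.strip():
        raise ValueError("gpu_type must not be empty")
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return JOB_CONFIG_TEMPLATE.format(
        gpu_type=gpu_type.strip(), gpu_count=gpu_count, script=script
    )


def write_job_config(path: str, gpu_type: str) -> Path:
    """Write the rendered config to disk and return its path."""
    target = Path(path)
    target.write_text(render_job_config(gpu_type), encoding="utf-8")
    return target
```

Calling write_job_config("job_config.yaml", "a10g") would produce the same file as the heredoc, without a leftover placeholder.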

6. Monitor Progress

huggingface-cli jobs status {job_id}
huggingface-cli jobs logs {job_id} --follow

WandB: https://wandb.ai/{username}/funsloth-training
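
For unattended runs, the status check can be wrapped in a polling loop. The sketch below keeps the CLI out of the picture by taking a get_status callable, which you would implement yourself, for instance by running the status command and extracting the state. The terminal state names are assumptions and should be matched to what your CLI version actually prints.

```python
import time
from typing import Callable

# Assumed terminal state names; adjust to the values your CLI reports.
TERMINAL_STATES = {"COMPLETED", "ERROR", "CANCELED"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 1000,
) -> str:
    """Poll get_status until it returns a terminal state, then return that state."""
    for _ in range(max_polls):
        status = get_status().strip().upper()
        if status in TERMINAL_STATES:
            return status
        # Not finished yet: wait before asking again.
        time.sleep(poll_seconds)
    raise TimeoutError(f"Job still running after {max_polls} polls")
```

Capping the number of polls keeps a stuck job from blocking the caller indefinitely.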

7. Download Artifacts

from huggingface_hub import snapshot_download
snapshot_download(repo_id="{username}/funsloth-job", local_dir="./outputs")

8. Handoff

Offer funsloth-upload to publish the trained model to the Hub with a model card.

Error Handling

| Error | Resolution |
|-------|------------|
| No HF Jobs access | Get a PRO subscription or org compute access |
| OOM | Reduce batch size or upgrade GPU |
| Job timeout | Enable checkpointing |
| Script error | Check PEP 723 dependencies |
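
When reducing batch size after an OOM, a common approach is to raise gradient accumulation at the same time so the effective batch size stays the same. The sketch below is a minimal illustration of that trade. The function name shrink_batch is hypothetical; the two values correspond to per_device_train_batch_size and gradient_accumulation_steps in TrainingArguments.

```python
def shrink_batch(per_device_batch_size: int, grad_accum_steps: int) -> tuple[int, int]:
    """Halve the per-device batch size and raise accumulation to compensate."""
    if per_device_batch_size <= 1:
        raise ValueError("Batch size is already 1; upgrade the GPU instead")
    if grad_accum_steps < 1:
        raise ValueError("grad_accum_steps must be at least 1")
    effective = per_device_batch_size * grad_accum_steps
    new_batch = per_device_batch_size // 2
    # Ceiling division so the effective batch size never drops below the original.
    new_accum = -(-effective // new_batch)
    return new_batch, new_accum
```

For instance, a batch size of 8 with 4 accumulation steps becomes 4 with 8 steps, which lowers peak memory while keeping 32 samples per optimizer step.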

Bundled Resources

scripts/train_sft.py - PEP 723 script template
scripts/estimate_cost.py - Cost estimation
references/PLATFORM_COMPARISON.md - HF Jobs vs alternatives
references/HARDWARE_GUIDE.md - VRAM requirements
references/TROUBLESHOOTING.md - Common issues
