Agent Skills with tag: inference-optimization

6 skills match this tag. Use tags to discover related Agent Skills and explore similar workflows.

serving-llms-vllm

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

Tags: llm, vllm, inference-optimization, gpu-memory-management
ovachiever
81
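The PagedAttention idea behind that throughput can be illustrated with a toy block manager: KV-cache memory is carved into fixed-size blocks mapped to sequences on demand, the way an OS pages virtual memory. This is a pure-Python sketch, not vLLM's API (`ToyBlockManager` and its methods are hypothetical names); the real block manager also handles GPU tensors, copy-on-write for beam search, and CPU swapping.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class ToyBlockManager:
    """Maps each sequence to a list of physical cache blocks, like a page table."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of physical block ids

    def allocate(self, seq_id, num_tokens):
        # Reserve only as many blocks as the sequence actually needs right now,
        # instead of a contiguous max-length buffer per request.
        need = math.ceil(num_tokens / BLOCK_SIZE)
        if need > len(self.free):
            raise MemoryError("no free KV blocks; request must wait (preemption)")
        self.tables[seq_id] = [self.free.pop() for _ in range(need)]

    def release(self, seq_id):
        # Finished sequences return their blocks immediately, so memory is
        # never fragmented by over-reserved contiguous buffers.
        self.free.extend(self.tables.pop(seq_id))

mgr = ToyBlockManager(num_blocks=8)
mgr.allocate("req-a", 20)   # needs ceil(20/16) = 2 blocks
mgr.allocate("req-b", 40)   # needs ceil(40/16) = 3 blocks
mgr.release("req-a")
print(len(mgr.free))        # 5: freed blocks are reusable at once
```

Because blocks are small and non-contiguous, nearly all GPU memory can hold live KV-cache entries, which is what lets vLLM batch far more concurrent requests than naive pre-allocation.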

tensorrt-llm

Optimizes LLM inference with NVIDIA TensorRT for maximum throughput and lowest latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.

Tags: gpu, inference-optimization, tensorrt, llm
ovachiever
81
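The in-flight batching mentioned above (continuous batching, in vLLM's terminology) can be sketched as a toy scheduler that admits and retires requests at iteration granularity rather than waiting for an entire batch to drain. This is a pure-Python illustration with hypothetical names, not TensorRT-LLM's API.

```python
from collections import deque

def continuous_batching(requests, max_batch):
    """requests: name -> number of tokens to generate.
    Returns the decode step at which each request finishes."""
    queue = deque(requests.items())
    active, finished, step = {}, {}, 0
    while queue or active:
        # Admit waiting requests at iteration granularity: a slot freed by a
        # finished request is refilled on the very next step, not when the
        # whole batch drains (static batching's weakness).
        while queue and len(active) < max_batch:
            name, remaining = queue.popleft()
            active[name] = remaining
        step += 1
        for name in list(active):
            active[name] -= 1          # one decode token per request per step
            if active[name] == 0:
                finished[name] = step
                del active[name]
    return finished

print(continuous_batching({"a": 2, "b": 5, "c": 3}, max_batch=2))
# {'a': 2, 'b': 5, 'c': 5}: "c" starts decoding as soon as "a" finishes
```

With static batching, "c" would wait until step 5 to start; iteration-level scheduling is what keeps GPU utilization high under mixed-length workloads.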

gptq

Post-training 4-bit quantization for LLMs with minimal accuracy loss. Use for deploying large models (70B, 405B) on consumer GPUs, when you need 4× memory reduction with <2% perplexity degradation, or for faster inference (3-4× speedup) vs FP16. Integrates with transformers and PEFT for QLoRA fine-tuning.

Tags: model-compression, quantization, transformers, inference-optimization
ovachiever
81
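The storage side of 4-bit quantization can be sketched with group-wise round-to-nearest quantization. This is a deliberate simplification: real GPTQ additionally compensates rounding error column-by-column using second-order (Hessian) information, but it stores the same artifacts — integer codes plus one scale and zero-point per group. `quantize_4bit` is a hypothetical helper, not the AutoGPTQ or transformers API.

```python
def quantize_4bit(weights, group_size=4):
    """Group-wise asymmetric round-to-nearest INT4 quantization."""
    codes, scales, zeros = [], [], []
    for i in range(0, len(weights), group_size):
        g = weights[i:i + group_size]
        lo, hi = min(g), max(g)
        scale = (hi - lo) / 15 or 1.0          # 4 bits -> 16 levels (0..15)
        codes.append([round((w - lo) / scale) for w in g])
        scales.append(scale)
        zeros.append(lo)
    return codes, scales, zeros

def dequantize(codes, scales, zeros):
    # Reconstruct approximate FP weights: code * scale + zero-point.
    return [c * s + z for g, s, z in zip(codes, scales, zeros) for c in g]

w = [0.12, -0.55, 0.31, 0.07, 1.02, -0.98, 0.44, 0.0]
codes, scales, zeros = quantize_4bit(w)
err = max(abs(a - b) for a, b in zip(w, dequantize(codes, scales, zeros)))
print(err < 0.05)   # True: per-group error is bounded by scale/2
```

Each weight now costs 4 bits plus amortized group metadata instead of 16 bits, which is where the roughly 4× memory reduction in the description comes from.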

model-pruning

Reduces LLM size and accelerates inference using pruning techniques like Wanda and SparseGPT. Use when compressing models without retraining, achieving 50% sparsity with minimal accuracy loss, or enabling faster inference on hardware accelerators. Covers unstructured pruning, structured pruning, N:M sparsity, magnitude pruning, and one-shot methods.

Tags: model-compression, pruning, llm, inference-optimization
ovachiever
81
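Two of the techniques listed above, one-shot magnitude pruning and 2:4 (N:M) sparsity, can be sketched in a few lines. This is a toy pure-Python illustration with hypothetical helper names; Wanda additionally weights magnitudes by input activation norms, and SparseGPT uses Hessian-based weight reconstruction.

```python
def magnitude_prune(weights, sparsity=0.5):
    """One-shot unstructured pruning: zero out the smallest-|w| fraction."""
    k = int(len(weights) * sparsity)
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def prune_2_4(weights):
    """N:M sparsity (2:4): in every group of 4 weights, keep the 2 largest
    magnitudes. This fixed pattern is what NVIDIA sparse tensor cores
    accelerate, unlike arbitrary unstructured sparsity."""
    out = []
    for i in range(0, len(weights), 4):
        g = weights[i:i + 4]
        keep = set(sorted(range(len(g)), key=lambda j: abs(g[j]))[-2:])
        out.extend(w if j in keep else 0.0 for j, w in enumerate(g))
    return out

w = [0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.01, 0.6]
print(magnitude_prune(w))  # [0.9, 0.0, 0.4, 0.0, -0.7, 0.0, 0.0, 0.6]
print(prune_2_4(w))        # same result here, but the 2:4 layout is
                           # guaranteed per group, so hardware can exploit it
```

Both reach 50% sparsity; the difference is that 2:4's regular pattern trades a constrained mask for actual hardware speedups.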

llm-basics

LLM architecture, tokenization, transformers, and inference optimization. Use for understanding and working with language models.

Tags: llm, transformers, tokenization, inference-optimization
pluginagentmarketplace
1

model-serving

Covers model serving end to end: inference optimization, scaling, deployment, and edge serving.

Tags: model-deployment, inference-optimization, scaling, edge-computing
pluginagentmarketplace
1