Agent Skills: unsloth-inference

Deploying fine-tuned models for production inference using native kernel optimization, vLLM, or SGLang. Triggers: inference, serving, vllm, sglang, for_inference, model merging, openai api.

Category: Uncategorized
ID: cuba6112/skillfactory/unsloth-inference

Install this agent skill into your local environment:

pnpm dlx add-skill https://github.com/cuba6112/skillfactory/tree/HEAD/skills/unsloth-inference

Skill Files

Browse the full folder contents for unsloth-inference.

skills/unsloth-inference/SKILL.md

Skill Metadata

Name: unsloth-inference
Description: Deploying fine-tuned models for production inference using native kernel optimization, vLLM, or SGLang. Triggers: inference, serving, vllm, sglang, for_inference, model merging, openai api.

Overview

Unsloth models can be deployed using native optimized inference or through production serving engines like vLLM and SGLang. Native inference is accelerated 2x via for_inference(), while production serving requires merging LoRA weights into the base model.

When to Use

  • When performing local testing or simple application deployment.
  • When building high-throughput production endpoints using vLLM or SGLang.
  • When creating OpenAI-compatible APIs for drop-in replacement in existing apps.

Decision Tree

  1. Is throughput the priority?
    • Yes: Merge to 16-bit and use vLLM.
  2. Is VRAM for serving extremely limited?
    • Yes: Merge to 4-bit or use GGUF via Ollama.
  3. Running simple Python inference?
    • Yes: Call FastLanguageModel.for_inference(model).
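
Each branch above maps to a specific Unsloth call. A minimal sketch, assuming a fine-tuned adapter already saved under an illustrative "lora_model" directory (exact save_method and quantization_method strings may vary by Unsloth version):

    from unsloth import FastLanguageModel

    # Assumes a fine-tuned model or adapter saved in "lora_model" (illustrative path).
    model, tokenizer = FastLanguageModel.from_pretrained("lora_model", max_seq_length=2048)

    # Branch 1 - throughput priority: merge LoRA into 16-bit weights and serve with vLLM.
    model.save_pretrained_merged("serving_model", tokenizer, save_method="merged_16bit")

    # Branch 2 - very limited VRAM: merge to 4-bit, or export GGUF for Ollama.
    model.save_pretrained_merged("serving_model_4bit", tokenizer, save_method="merged_4bit")
    model.save_pretrained_gguf("serving_model_gguf", tokenizer, quantization_method="q4_k_m")

    # Branch 3 - simple local Python inference: enable Unsloth's optimized kernels.
    FastLanguageModel.for_inference(model)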

Workflows

Native Optimized Inference

  1. Load fine-tuned model and tokenizer using FastLanguageModel.
  2. Call FastLanguageModel.for_inference(model) to enable optimized kernels.
  3. Use model.generate() with inputs formatted via the chat template.
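
A minimal sketch of this workflow, assuming a fine-tuned model or LoRA adapter saved under an illustrative "lora_model" directory:

    from unsloth import FastLanguageModel

    # Load the fine-tuned model and tokenizer (path and settings are illustrative).
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="lora_model",
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Mandatory for the ~2x speedup: switches to Unsloth's optimized inference kernels.
    FastLanguageModel.for_inference(model)

    # Format the prompt with the model's chat template and generate.
    messages = [{"role": "user", "content": "Summarize LoRA in one sentence."}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(input_ids=input_ids, max_new_tokens=128, use_cache=True)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))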

Merging LoRA for vLLM Serving

  1. Select an export method: 'merged_16bit' (best quality) or 'merged_4bit' (lower VRAM).
  2. Run model.save_pretrained_merged("serving_model", tokenizer, save_method='merged_16bit').
  3. Start vLLM server pointing to the 'serving_model' directory.
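
A minimal sketch of steps 1-3, assuming an illustrative "lora_model" adapter directory and vLLM's built-in OpenAI-compatible server:

    from unsloth import FastLanguageModel

    # Load the fine-tuned adapter, then merge it into the base weights for serving.
    # Directory names are illustrative; 'merged_4bit' is the lower-VRAM alternative.
    model, tokenizer = FastLanguageModel.from_pretrained("lora_model", max_seq_length=2048)
    model.save_pretrained_merged("serving_model", tokenizer, save_method="merged_16bit")

    # Step 3, from the shell - vLLM serves an OpenAI-compatible API on port 8000:
    #   vllm serve serving_model --max-model-len 2048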

Non-Obvious Insights

  • Calling FastLanguageModel.for_inference(model) is mandatory for speed; it performs on-the-fly weight fusion and enables specialized Triton kernels for the forward pass.
  • Most production serving engines (vLLM, SGLang) cannot use LoRA adapters directly without performance hits; merging them into the base model during export is the standard best practice.
  • The unsloth-cli provides a pre-built OpenAI-compatible endpoint, making it easy to serve models locally and test with standard API clients.
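
Whichever process exposes the OpenAI-compatible endpoint (vLLM's built-in server or the CLI's), it can be exercised with the standard openai Python client. A sketch only; the base_url, port, placeholder API key, and model name are assumptions to adjust for your setup:

    from openai import OpenAI

    # Point the standard OpenAI client at a locally served, OpenAI-compatible
    # endpoint. Use whatever address and model name the serving process reports.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="serving_model",
        messages=[{"role": "user", "content": "Hello from a locally served model!"}],
        max_tokens=64,
    )
    print(response.choices[0].message.content)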

Evidence

  • "Unsloth itself provides 2x faster inference natively as well, so always do not forget to call FastLanguageModel.for_inference(model)." Source
  • "model.save_pretrained_merged("output", tokenizer, save_method = "merged_16bit")" Source

Scripts

  • scripts/unsloth-inference_tool.py: Script for running local optimized inference.
  • scripts/unsloth-inference_tool.js: Helper to test OpenAI-compatible endpoints.

Dependencies

  • unsloth
  • torch
  • vllm (optional for production)
  • sglang (optional for production)

References

  • [[references/README.md]]