Overview
Unsloth models can be deployed using native optimized inference or through production serving engines like vLLM and SGLang. Native inference is accelerated up to 2x via `FastLanguageModel.for_inference()`, while production serving typically requires merging LoRA weights into the base model.
When to Use
- When performing local testing or simple application deployment.
- When building high-throughput production endpoints using vLLM or SGLang.
- When creating OpenAI-compatible APIs for drop-in replacement in existing apps.
Decision Tree
- Is throughput the priority?
- Yes: Merge to 16-bit and use vLLM.
- Is VRAM for serving extremely limited?
- Yes: Merge to 4-bit or use GGUF via Ollama.
- Running simple Python inference?
- Yes: Call `FastLanguageModel.for_inference(model)`.
Workflows
Native Optimized Inference
- Load the fine-tuned model and tokenizer using `FastLanguageModel`.
- Call `FastLanguageModel.for_inference(model)` to enable optimized kernels.
- Use `model.generate()` with inputs formatted via the chat template.
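The steps above can be sketched as follows; this requires a GPU with the `unsloth` package installed, and the model path, prompt, and generation parameters are assumptions:

```python
# Sketch of native optimized inference with Unsloth.
# "lora_model" is an assumed path to a fine-tuned adapter directory.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",  # assumed local fine-tune directory
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable the faster inference path

# Format the prompt with the model's chat template.
messages = [{"role": "user", "content": "Summarize LoRA in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids=inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```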
Merging LoRA for vLLM Serving
- Select export method: 'merged_16bit' (quality) or 'merged_4bit' (VRAM).
- Run `model.save_pretrained_merged("serving_model", tokenizer, save_method='merged_16bit')`.
- Start the vLLM server pointing at the `serving_model` directory.
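The final serving step might look like the following shell session; the directory name and port are assumptions carried over from the merge call above:

```shell
# Serve the merged 16-bit model with vLLM's OpenAI-compatible server.
# "serving_model" is the directory written by save_pretrained_merged above.
vllm serve ./serving_model --port 8000

# Or equivalently, via the module entry point:
# python -m vllm.entrypoints.openai.api_server --model ./serving_model --port 8000
```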
Non-Obvious Insights
- Calling `FastLanguageModel.for_inference(model)` is mandatory for speed; it performs on-the-fly weight fusion and enables specialized Triton kernels for the forward pass.
- Most production serving engines (vLLM, SGLang) cannot use LoRA adapters directly without a performance hit; merging them into the base model during export is the standard best practice.
- The `unsloth-cli` provides a pre-built OpenAI-compatible endpoint, making it easy to serve models locally and test with standard API clients.
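To illustrate testing an OpenAI-compatible endpoint with a standard client, here is a stdlib-only sketch that builds a chat-completion request; the base URL, port, and model name are assumptions, and the actual POST is left commented out so the snippet stands alone:

```python
import json

def build_chat_request(base_url, model, messages, temperature=0.7):
    """Build the URL and JSON body for an OpenAI-style /v1/chat/completions call."""
    url = base_url.rstrip("/") + "/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": messages,
        "temperature": temperature,
    }).encode("utf-8")
    return url, body

url, body = build_chat_request(
    "http://localhost:8000",  # assumed local endpoint
    "serving_model",          # assumed served model name
    [{"role": "user", "content": "Hello"}],
)
print(url)  # http://localhost:8000/v1/chat/completions

# To actually send it (requires a running server), use e.g. urllib.request:
# req = urllib.request.Request(url, data=body,
#                              headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read().decode())
```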
Evidence
- "Unsloth itself provides 2x faster inference natively as well, so always do not forget to call FastLanguageModel.for_inference(model)." Source
- "model.save_pretrained_merged("output", tokenizer, save_method = "merged_16bit")" Source
Scripts
- scripts/unsloth-inference_tool.py: Script for running local optimized inference.
- scripts/unsloth-inference_tool.js: Helper to test OpenAI-compatible endpoints.
Dependencies
- unsloth
- torch
- vllm (optional for production)
- sglang (optional for production)
References
- [[references/README.md]]