serving-llms-vllm
Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.
llm, vllm, inference-optimization, gpu-memory-management
ovachiever
81
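As a reference point for this entry, vLLM's OpenAI-compatible server can be queried with the standard OpenAI client. A minimal sketch, assuming a server started with `vllm serve` on its default port 8000; the model name and launch flags are illustrative, not part of the skill itself:

```python
# Minimal sketch: query a vLLM OpenAI-compatible server with the OpenAI client.
# Assumes the server was launched with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --tensor-parallel-size 2 --gpu-memory-utilization 0.90
# (model name, port, and flags are assumptions about your deployment)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default OpenAI-compatible endpoint
    api_key="EMPTY",                      # ignored unless the server sets --api-key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```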
claude-api
anthropic-sdk, llm, api-integration, backend
ovachiever
81
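This entry carries no description, but its tags point at backend integration with the Anthropic Python SDK. A minimal sketch of a single call, assuming `anthropic` is installed and ANTHROPIC_API_KEY is set in the environment; the model identifier below is a placeholder:

```python
# Minimal sketch: one backend request with the Anthropic Python SDK.
# Assumes ANTHROPIC_API_KEY is set; the model name is a placeholder.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model identifier
    max_tokens=256,
    messages=[{"role": "user", "content": "Return a one-line status check."}],
)
print(message.content[0].text)
```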
llamafile
Use when setting up local LLM inference without cloud APIs, running GGUF models locally, needing an OpenAI-compatible API from a local model, building offline/air-gapped AI tools, or troubleshooting local LLM server connections.
llm-integration, local-development, offline-access, troubleshooting
Jamie-BitFlight
111
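Since llamafile exposes an OpenAI-compatible endpoint, the same OpenAI client can target it locally. A minimal sketch, assuming a llamafile running on its default port 8080; the launch command, port, and model field are assumptions about a typical local setup:

```python
# Minimal sketch: call a locally running llamafile's OpenAI-compatible endpoint.
# Assumes the llamafile was started on this machine, e.g.:
#   ./your-model.llamafile --server --nobrowser
# and is listening on its default port 8080 (port and flags are assumptions).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # local llamafile server
    api_key="sk-no-key-required",         # the local server does not validate the key
)

response = client.chat.completions.create(
    model="LLaMA_CPP",  # placeholder; the local server typically ignores this field
    messages=[{"role": "user", "content": "Hello from an offline machine."}],
)
print(response.choices[0].message.content)
```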