Agent Skills: Using MLX for LLMs on Apple Silicon

Running and fine-tuning LLMs on Apple Silicon with MLX. Use when working with models locally on Mac, converting Hugging Face models to MLX format, fine-tuning with LoRA/QLoRA on Apple Silicon, or serving models via HTTP API.

Skill ID: itsmostafa/llm-engineering-skills/mlx


Using MLX for LLMs on Apple Silicon

MLX-LM is a Python package for running large language models on Apple Silicon. It builds on the MLX framework and Apple's unified memory architecture to deliver optimized on-device performance.

Table of Contents

  • Core Concepts
  • Installation
  • Text Generation
  • Interactive Chat
  • Model Conversion
  • Quantization
  • Fine-tuning with LoRA
  • Serving Models
  • Best Practices
  • References

Core Concepts

Why MLX

| Aspect | PyTorch on Mac | MLX |
|--------|----------------|-----|
| Memory | Separate CPU/GPU copies | Unified memory, no copies |
| Optimization | Generic Metal backend | Apple Silicon native |
| Model loading | Slower, more memory | Lazy loading, efficient |
| Quantization | Limited support | Built-in 4/8-bit |

MLX arrays live in shared memory, accessible by both CPU and GPU without data transfer overhead.
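
To make the unified-memory point concrete, here is a small illustrative sketch using mlx.core directly (not needed for everyday mlx-lm use): the same arrays feed operations on the GPU and CPU with no explicit transfers.

import mlx.core as mx

# Arrays are allocated in unified memory, visible to both CPU and GPU.
a = mx.random.normal((1024, 1024))
b = mx.random.normal((1024, 1024))

# The same arrays can be consumed by operations on either device
# without an explicit host<->device copy.
c = mx.matmul(a, b, stream=mx.gpu)
d = mx.exp(c, stream=mx.cpu)

# MLX is lazy: computation actually runs when the result is evaluated.
mx.eval(d)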

Supported Models

MLX-LM supports most popular architectures: Llama, Mistral, Qwen, Phi, Gemma, Cohere, and many more. Check the mlx-community organization on Hugging Face for pre-converted models.

Installation

pip install mlx-lm

Requires macOS 13.5+ and Apple Silicon (M1/M2/M3/M4).
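
A quick way to confirm the environment is set up (illustrative; on Apple Silicon the default device should be the GPU):

import mlx.core as mx

print(mx.__version__)       # installed MLX framework version
print(mx.default_device())  # should report the GPU device on Apple Silicon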

Text Generation

Python API

from mlx_lm import load, generate

# Load model (from HF hub or local path)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate text
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    temp=0.7,
)
print(response)
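
Note: newer mlx-lm releases pass sampling parameters through a sampler object instead of a temp keyword. A hedged sketch of that variant, assuming mlx_lm.sample_utils.make_sampler is available in your installed version:

from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Sampling parameters are bundled into a sampler and passed to generate.
sampler = make_sampler(temp=0.7, top_p=0.95)

response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    sampler=sampler,
)
print(response)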

Streaming Generation

from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a haiku about programming:"
for response in stream_generate(model, tokenizer, prompt, max_tokens=100):
    print(response.text, end="", flush=True)
print()

Batch Generation

from mlx_lm import load, batch_generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompts = [
    "What is machine learning?",
    "Explain neural networks:",
    "Define deep learning:",
]

responses = batch_generate(
    model,
    tokenizer,
    prompts,
    max_tokens=100,
)

for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n")

CLI Generation

# Basic generation
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --prompt "Explain recursion:" \
    --max-tokens 256

# With sampling parameters
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --prompt "Write a poem about AI:" \
    --temp 0.8 \
    --top-p 0.95

Interactive Chat

CLI Chat

# Start chat REPL (context preserved between turns)
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit

Python Chat

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
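
Unlike the CLI chat REPL, the Python API does not keep history for you. A minimal multi-turn sketch that re-applies the chat template each turn (no KV-cache reuse; see Best Practices for prompt caching):

from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")
messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_turn in ["What's the capital of France?", "And its population?"]:
    messages.append({"role": "user", "content": user_turn})
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    reply = generate(model, tokenizer, prompt=prompt, max_tokens=256)
    # Append the reply so the next turn sees the full conversation.
    messages.append({"role": "assistant", "content": reply})
    print(f"Assistant: {reply}")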

Model Conversion

Convert Hugging Face models to MLX format:

CLI Conversion

# Convert with 4-bit quantization
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct \
    -q  # Quantize to 4-bit

# With specific quantization
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --q-bits 8 \
    --q-group-size 64

# Upload to Hugging Face Hub
mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct \
    -q \
    --upload-repo your-username/Llama-3.2-1B-Instruct-4bit-mlx

Python Conversion

from mlx_lm import convert

convert(
    hf_path="meta-llama/Llama-3.2-3B-Instruct",
    mlx_path="./llama-3.2-3b-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
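
The converted directory loads like any hub model, since load accepts a local path:

from mlx_lm import load, generate

# Load the locally converted, quantized model produced by convert() above
model, tokenizer = load("./llama-3.2-3b-mlx")
print(generate(model, tokenizer, prompt="Hello!", max_tokens=32))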

Conversion Options

| Option | Default | Description |
|--------|---------|-------------|
| --q-bits | 4 | Quantization bits (4 or 8) |
| --q-group-size | 64 | Group size for quantization |
| --dtype | float16 | Data type for non-quantized weights |

Quantization

MLX supports multiple quantization methods for different use cases:

| Method | Best For | Command |
|--------|----------|---------|
| Basic | Quick conversion | mlx_lm.convert -q |
| DWQ | Quality-preserving | mlx_lm.dwq |
| AWQ | Activation-aware | mlx_lm.awq |
| Dynamic | Per-layer precision | mlx_lm.dynamic_quant |
| GPTQ | Established method | mlx_lm.gptq |

Quick Quantization

# 4-bit quantization during conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q

# 8-bit for higher quality
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q --q-bits 8

For detailed coverage of each method, see reference/quantization.md.

Fine-tuning with LoRA

MLX supports LoRA and QLoRA fine-tuning for efficient adaptation on Apple Silicon.

Quick Start

# Prepare training data (JSONL format)
# {"text": "Your training text here"}
# or
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --train \
    --data ./data \
    --iters 1000

# Generate with adapter
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --prompt "Your prompt here"
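
Below is a minimal sketch for preparing the --data directory. mlx_lm.lora conventionally expects train.jsonl and valid.jsonl inside it; treat the exact file names as an assumption to check against your mlx-lm version.

import json
from pathlib import Path

# Chat-style examples in the JSONL layout shown above.
examples = [
    {"messages": [
        {"role": "user", "content": "What is MLX?"},
        {"role": "assistant", "content": "An array framework for Apple Silicon."},
    ]},
]

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)

# NOTE: file names assumed; verify against the mlx-lm LoRA docs for your version.
for name in ("train.jsonl", "valid.jsonl"):
    with open(data_dir / name, "w") as f:
        for example in examples:
            f.write(json.dumps(example) + "\n")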

Fuse Adapter into Model

# Merge LoRA weights into base model
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --save-path ./fused-model

# Or export to GGUF
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --export-gguf
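
If you prefer not to fuse, the adapter can also be applied at load time from Python. This sketch assumes the adapter_path parameter of load in your mlx-lm version:

from mlx_lm import load, generate

# Assumes load(..., adapter_path=...) is available in your installed mlx-lm.
model, tokenizer = load(
    "mlx-community/Llama-3.2-3B-Instruct-4bit",
    adapter_path="./adapters",
)
print(generate(model, tokenizer, prompt="Your prompt here", max_tokens=128))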

For detailed LoRA configuration and training patterns, see reference/fine-tuning.md.

Serving Models

OpenAI-Compatible Server

# Start server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

# Use with OpenAI client
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "default",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 256
    }'

Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
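
Streaming works through the same endpoint; this sketch assumes the server honors the standard OpenAI stream parameter:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()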

Best Practices

  1. Use pre-quantized models: Download from mlx-community on Hugging Face for immediate use

  2. Match quantization to your hardware: on 8 GB M1/M2 machines use 4-bit; on M2/M3 Pro/Max with more memory, 8-bit gives better quality

  3. Leverage unified memory: Unlike CUDA, MLX models can exceed "GPU memory" by using swap (slower but works)

  4. Use streaming for UX: stream_generate provides responsive output for interactive applications

  5. Cache prompt prefixes: Use mlx_lm.cache_prompt for repeated prompts with varying suffixes (a Python prompt-cache sketch follows this list)

  6. Batch similar requests: batch_generate is more efficient than sequential generation

  7. Start with 4-bit quantization: It offers a good quality/size tradeoff; switch to 8-bit if you notice quality issues

  8. Fuse adapters for deployment: After fine-tuning, fuse adapters for faster inference without loading separately

  9. Monitor memory with Activity Monitor: Watch memory pressure to avoid swap thrashing

  10. Use chat templates: Always apply tokenizer.apply_chat_template() for instruction-tuned models
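
As referenced in item 5, prompts can also be cached from Python. This sketch assumes make_prompt_cache from mlx_lm.models.cache and the prompt_cache argument to generate exist in your mlx-lm version; the cache keeps earlier turns' KV state so each call only processes the newly added text:

from mlx_lm import load, generate
from mlx_lm.models.cache import make_prompt_cache

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# The cache accumulates processed tokens across calls.
prompt_cache = make_prompt_cache(model)

for user_turn in ["What is MLX?", "How does it use unified memory?"]:
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": user_turn}],
        tokenize=False,
        add_generation_prompt=True,
    )
    reply = generate(
        model, tokenizer, prompt=prompt, max_tokens=128, prompt_cache=prompt_cache
    )
    print(reply)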

References

See reference/ for detailed documentation:

  • quantization.md - Detailed quantization methods and when to use each
  • fine-tuning.md - Complete LoRA/QLoRA training guide with data formats and configuration