Ollama, vLLM, TGI, and Quantization Techniques in Depth

Preface

Deploying LLMs locally protects data privacy, lowers cost, and reduces latency. This article walks through several local deployment options, including Ollama, vLLM, and TGI, as well as model quantization techniques.


Deployment Options Compared

┌─────────────────────────────────────────────────────────────────┐
│               Choosing a local deployment option                │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  What do you need?                                              │
│       │                                                         │
│       ├── Quick start / development → Ollama                    │
│       │                                                         │
│       ├── High-concurrency production → vLLM                    │
│       │                                                         │
│       ├── Hugging Face ecosystem → TGI                          │
│       │                                                         │
│       ├── Apple Silicon → MLX                                   │
│       │                                                         │
│       └── Edge devices → llama.cpp                              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
| Option | Ease of use | Performance | Concurrency | Quantization support | Best for |
|---|---|---|---|---|---|
| Ollama | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | GGUF | Development / personal |
| vLLM | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | GPTQ / AWQ | Production |
| TGI | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | bitsandbytes / GPTQ | HF ecosystem |
| llama.cpp | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | GGUF | CPU / edge |
| MLX | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐ | 4-bit | Mac |

Ollama Deployment

Installation and Basic Usage

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Start the server
ollama serve

# Pull models
ollama pull llama3.1:8b
ollama pull qwen2.5:7b
ollama pull codellama:13b

# Run a model
ollama run llama3.1:8b

# List installed models
ollama list

Python SDK

import ollama
from typing import Generator

class OllamaClient:
    """Thin wrapper around the Ollama Python client"""
    
    def __init__(self, host: str = "http://localhost:11434"):
        self.client = ollama.Client(host=host)
    
    def chat(
        self,
        model: str,
        messages: list,
        stream: bool = False
    ) -> str | Generator:
        """Chat completion, optionally streaming"""
        response = self.client.chat(
            model=model,
            messages=messages,
            stream=stream
        )
        
        if stream:
            return self._stream_response(response)
        return response["message"]["content"]
    
    def _stream_response(self, response) -> Generator:
        """Yield content chunks from a streaming response"""
        for chunk in response:
            yield chunk["message"]["content"]
    
    def generate(
        self,
        model: str,
        prompt: str,
        system: str | None = None
    ) -> str:
        """Plain text generation"""
        response = self.client.generate(
            model=model,
            prompt=prompt,
            system=system
        )
        return response["response"]
    
    def embed(self, model: str, text: str) -> list:
        """Get an embedding vector"""
        response = self.client.embeddings(
            model=model,
            prompt=text
        )
        return response["embedding"]
    
    def list_models(self) -> list:
        """List locally available models"""
        response = self.client.list()
        return [m["name"] for m in response["models"]]

# Usage
client = OllamaClient()

# Chat
messages = [
    {"role": "user", "content": "Write a quicksort in Python"}
]
response = client.chat("llama3.1:8b", messages)
print(response)

# Streaming chat
for chunk in client.chat("llama3.1:8b", messages, stream=True):
    print(chunk, end="", flush=True)

# Embeddings
embedding = client.embed("nomic-embed-text", "Hello world")

Custom Models

# Modelfile
FROM llama3.1:8b

# Sampling parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
PARAMETER top_p 0.9

# System prompt
SYSTEM """
You are a professional Python programming assistant.
You provide clear, concise, and efficient code solutions.
"""

# Create the model
ollama create python-assistant -f Modelfile

# Run it
ollama run python-assistant

REST API

import requests
import json

class OllamaAPI:
    """Minimal client for the Ollama REST API"""
    
    def __init__(self, base_url: str = "http://localhost:11434"):
        self.base_url = base_url
    
    def generate(self, model: str, prompt: str) -> str:
        """Non-streaming generation"""
        response = requests.post(
            f"{self.base_url}/api/generate",
            json={"model": model, "prompt": prompt, "stream": False}
        )
        return response.json()["response"]
    
    def chat_stream(self, model: str, messages: list):
        """Streaming chat"""
        response = requests.post(
            f"{self.base_url}/api/chat",
            json={"model": model, "messages": messages, "stream": True},
            stream=True
        )
        
        for line in response.iter_lines():
            if line:
                data = json.loads(line)
                if "message" in data:
                    yield data["message"]["content"]
    
    def pull_model(self, model: str):
        """Pull a model, printing progress as it downloads"""
        response = requests.post(
            f"{self.base_url}/api/pull",
            json={"name": model},
            stream=True
        )
        
        for line in response.iter_lines():
            if line:
                print(json.loads(line))

Production-Grade Deployment with vLLM

vLLM is a state-of-the-art inference engine; its core advantages are PagedAttention and Continuous Batching.

1. Core Techniques

  • PagedAttention: borrowing virtual-memory management from operating systems, it stores the KV cache in fixed-size pages scattered across non-contiguous memory. This eliminates fragmentation and raises GPU memory utilization above 90%.
  • Continuous Batching: unlike traditional static batching (which waits for every request in the batch to finish), vLLM can slot new requests in at every token-generation step, dramatically improving throughput.
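The paging idea can be illustrated with a toy block table. This is a conceptual sketch only (`BlockTable` and `BLOCK_SIZE` are made-up names here, not vLLM internals): each sequence maps logical token positions onto fixed-size physical KV blocks drawn from a shared pool, so the blocks never need to be contiguous.

```python
# Conceptual sketch of PagedAttention's block table (not vLLM's real code).
BLOCK_SIZE = 16  # tokens stored per physical KV block

class BlockTable:
    def __init__(self, pool_size: int):
        self.free_blocks = list(range(pool_size))  # shared physical pool
        self.table = []  # logical block index -> physical block id

    def slot_for(self, pos: int) -> tuple:
        """Return (physical_block, offset) for token position `pos`,
        allocating a new block from the pool on demand."""
        if pos // BLOCK_SIZE >= len(self.table):
            self.table.append(self.free_blocks.pop(0))
        return self.table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

seq = BlockTable(pool_size=8)
slots = [seq.slot_for(i) for i in range(20)]
# tokens 0-15 land in the first allocated block; token 16 triggers
# allocation of a second, possibly non-adjacent, block
```

Because blocks are allocated on demand and returned to the pool when a request finishes, no large contiguous region has to be reserved per sequence, which is what lifts memory utilization so high.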

2. Advanced Deployment Configuration

# Launch a vLLM server with multi-LoRA support,
# sharding the model across 2 GPUs via tensor parallelism
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-8B \
    --enable-lora \
    --lora-modules sql-lora=/path/to/sql-adapter chat-lora=/path/to/chat-adapter \
    --max-loras 4 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.95 \
    --max-model-len 8192

3. Speculative Decoding

A small draft model (e.g. Llama-68M) proposes tokens and the large model only verifies them, which can speed up inference by roughly 2-3x.

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B \
    --speculative-model JackFram/llama-68m \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 4
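The draft-and-verify loop behind these flags can be sketched as follows. This is a greedy toy version under stated assumptions: `draft` and `target` are stand-in functions over integer tokens, and real speculative decoding verifies proposals against the target model's probability distribution in a single batched forward pass.

```python
def speculative_step(draft, target, prefix, k=5):
    """Draft proposes k tokens; the target accepts the longest
    agreeing prefix and supplies one corrected token on mismatch."""
    # 1) cheap draft model proposes k tokens autoregressively
    ctx = list(prefix)
    proposed = []
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2) target verifies; in the real algorithm all k positions
    #    are checked in one parallel forward pass
    ctx = list(prefix)
    accepted = []
    for t in proposed:
        expected = target(ctx)
        if expected != t:
            accepted.append(expected)  # target's token replaces the miss
            break
        accepted.append(t)
        ctx.append(t)
    return accepted

# Toy models: the target continues an increasing sequence; the draft
# agrees except when the context length is a multiple of 4
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1 if len(ctx) % 4 else ctx[-1] + 2
out = speculative_step(draft, target, [0], k=5)
```

The first mismatch caps how much is accepted, but up to k tokens can be emitted per expensive target-model pass, which is where the speedup comes from.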

4. Monitoring with Prometheus

By default, vLLM exposes monitoring metrics at :8000/metrics.

# Key metrics
# vllm:num_requests_running: number of requests currently executing
# vllm:iteration_tokens_total: tokens processed per engine iteration
# vllm:gpu_cache_usage_perc: fraction of KV cache memory in use
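These are served in the standard Prometheus text format, so any HTTP client can scrape them. A minimal parser (a hypothetical helper, not part of vLLM) might look like this, with the inline sample standing in for a live `requests.get("http://localhost:8000/metrics").text` call:

```python
import re

def parse_vllm_metrics(text: str) -> dict:
    """Extract vllm:-prefixed samples, dropping comments and labels."""
    metrics = {}
    for line in text.splitlines():
        if not line.startswith("vllm:"):
            continue  # also skips '# HELP' / '# TYPE' comment lines
        name, value = line.rsplit(" ", 1)
        metrics[re.sub(r"\{.*\}", "", name)] = float(value)
    return metrics

sample = '''# HELP vllm:num_requests_running Number of running requests.
vllm:num_requests_running{model_name="llama"} 3.0
vllm:gpu_cache_usage_perc{model_name="llama"} 0.42
'''
stats = parse_vllm_metrics(sample)
```

In production you would normally point a Prometheus server at the endpoint instead and build dashboards on top, but a quick scrape like this is handy for smoke tests.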

Docker Deployment

# docker-compose.yml
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    command: >
      --model meta-llama/Llama-2-7b-chat-hf
      --dtype float16
      --max-model-len 4096
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Text Generation Inference (TGI)

Starting the Server

# Launch with Docker
docker run --gpus all -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id meta-llama/Llama-2-7b-chat-hf \
    --quantize bitsandbytes-nf4 \
    --max-input-length 4096 \
    --max-total-tokens 8192

Python Client

from huggingface_hub import InferenceClient
from text_generation import Client

class TGIClient:
    """TGI 客户端"""
    
    def __init__(self, url: str = "http://localhost:8080"):
        self.client = Client(url)
        self.hf_client = InferenceClient(url)
    
    def generate(
        self,
        prompt: str,
        max_new_tokens: int = 512,
        temperature: float = 0.7
    ) -> str:
        """生成"""
        response = self.client.generate(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=0.95,
            do_sample=True
        )
        return response.generated_text
    
    def stream(self, prompt: str, max_new_tokens: int = 512):
        """流式生成"""
        for token in self.client.generate_stream(
            prompt,
            max_new_tokens=max_new_tokens
        ):
            yield token.token.text
    
    def chat(self, messages: list) -> str:
        """对话"""
        response = self.hf_client.chat_completion(
            messages,
            max_tokens=512
        )
        return response.choices[0].message.content

# Usage
tgi_client = TGIClient()
response = tgi_client.generate("Write a Python function that computes the Fibonacci sequence")

for token in tgi_client.stream("Explain what machine learning is"):
    print(token, end="", flush=True)

llama.cpp Deployment

Building and Running

# Clone
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build (CUDA)
make LLAMA_CUDA=1

# Build (Metal, macOS)
make LLAMA_METAL=1

# Convert a downloaded model to GGUF
python convert.py /path/to/model --outtype f16

# Quantize
./quantize models/llama-7b/ggml-model-f16.gguf \
    models/llama-7b/ggml-model-q4_k_m.gguf q4_k_m

# Run
./main -m models/llama-7b/ggml-model-q4_k_m.gguf \
    -p "Hello" -n 128

# Start the HTTP server
./server -m models/llama-7b/ggml-model-q4_k_m.gguf \
    --host 0.0.0.0 --port 8080

Python Bindings

from llama_cpp import Llama

class LlamaCppClient:
    """llama.cpp Python 客户端"""
    
    def __init__(
        self,
        model_path: str,
        n_ctx: int = 4096,
        n_gpu_layers: int = -1  # -1 = offload all layers to the GPU
    ):
        self.llm = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_gpu_layers=n_gpu_layers,
            chat_format="llama-2",
            verbose=False
        )
    
    def generate(
        self,
        prompt: str,
        max_tokens: int = 512,
        temperature: float = 0.7
    ) -> str:
        """生成"""
        output = self.llm(
            prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            stop=["</s>", "\n\n"]
        )
        return output["choices"][0]["text"]
    
    def chat(self, messages: list) -> str:
        """对话"""
        response = self.llm.create_chat_completion(
            messages,
            max_tokens=512
        )
        return response["choices"][0]["message"]["content"]
    
    def stream(self, prompt: str):
        """流式生成"""
        for chunk in self.llm(
            prompt,
            max_tokens=512,
            stream=True
        ):
            yield chunk["choices"][0]["text"]

# Usage
client = LlamaCppClient(
    "models/llama-7b-q4_k_m.gguf",
    n_gpu_layers=35
)

response = client.chat([
    {"role": "user", "content": "Hello"}
])

Model Quantization

Quantization Methods Compared

| Method | Accuracy loss | Size | Speed | Recommended for |
|---|---|---|---|---|
| FP16 | None (baseline) | 1x | Baseline | High-accuracy requirements |
| INT8 | Very small | 0.5x | Fast | Balanced option |
| INT4 | Small | 0.25x | Faster | Resource-constrained |
| GPTQ | Small | 0.25x | Fast | GPU inference |
| AWQ | Very small | 0.25x | Fast | Accuracy-first |
| GGUF | Varies | Varies | Fast | CPU / general |
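The size and speed columns follow from simple arithmetic: an INT8 weight is half the size of an FP16 one, and INT4 a quarter. A minimal symmetric per-tensor sketch of the underlying math (illustrative only; GPTQ, AWQ, and GGUF all quantize per group or per channel and choose scales far more carefully):

```python
import numpy as np

def quantize(w: np.ndarray, bits: int):
    """Map float weights to signed integers with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                # 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.0, 0.25, 0.9], dtype=np.float32)
q8, s8 = quantize(w, bits=8)
q4, s4 = quantize(w, bits=4)
err8 = np.abs(dequantize(q8, s8) - w).max()
err4 = np.abs(dequantize(q4, s4) - w).max()
# fewer bits -> larger reconstruction error, mirroring the
# "accuracy loss" column above
```

Each weight collapses to an integer plus a shared scale, which is exactly why storage shrinks linearly with bit width while accuracy degrades as the integer grid gets coarser.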

GPTQ Quantization

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

def quantize_gptq(
    model_name: str,
    output_dir: str,
    bits: int = 4
):
    """Quantize a model with GPTQ"""
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Quantization config
    quantize_config = BaseQuantizeConfig(
        bits=bits,
        group_size=128,
        desc_act=True
    )
    
    # Calibration data: auto-gptq expects tokenized examples,
    # not raw strings
    calibration_texts = [
        "A passage of text used for calibration...",
        "Another calibration passage..."
    ]
    examples = [
        tokenizer(text, return_tensors="pt") for text in calibration_texts
    ]
    
    # Load and quantize
    model = AutoGPTQForCausalLM.from_pretrained(
        model_name,
        quantize_config
    )
    
    model.quantize(examples)
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)

# Load a quantized model for inference
model = AutoGPTQForCausalLM.from_quantized(
    "model-gptq",
    device="cuda:0"
)

AWQ Quantization

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

def quantize_awq(
    model_name: str,
    output_dir: str
):
    """Quantize a model with AWQ"""
    # Load model and tokenizer
    model = AutoAWQForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    
    # Quantization config
    quant_config = {
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4
    }
    
    # Quantize
    model.quantize(tokenizer, quant_config=quant_config)
    
    # Save
    model.save_quantized(output_dir)
    tokenizer.save_pretrained(output_dir)

# Load the quantized model
model = AutoAWQForCausalLM.from_quantized(
    "model-awq",
    fuse_layers=True
)

A Unified API Service

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional, List

app = FastAPI()

class ChatRequest(BaseModel):
    model: str
    messages: List[dict]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 512
    stream: Optional[bool] = False

class LocalLLMRouter:
    """本地 LLM 路由器"""
    
    def __init__(self):
        self.backends = {}
    
    def register_backend(self, name: str, client):
        """注册后端"""
        self.backends[name] = client
    
    def chat(self, request: ChatRequest) -> str:
        """路由请求"""
        if request.model not in self.backends:
            raise ValueError(f"Model {request.model} not found")
        
        client = self.backends[request.model]
        return client.chat(request.messages)

router = LocalLLMRouter()

# Register backends. VLLMServer is assumed here to be a chat-capable
# wrapper around a vLLM OpenAI-compatible server, defined elsewhere.
router.register_backend("llama3", OllamaClient())
router.register_backend("codellama", VLLMServer("codellama/CodeLlama-7b-hf"))

@app.post("/v1/chat/completions")
async def chat(request: ChatRequest):
    """OpenAI 兼容接口"""
    try:
        response = router.chat(request)
        return {
            "choices": [{
                "message": {"role": "assistant", "content": response}
            }]
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Best Practices

| Scenario | Recommended option | Configuration tips |
|---|---|---|
| Personal development | Ollama | Defaults |
| High concurrency | vLLM | Continuous batching |
| Local on a Mac | MLX / Ollama | Metal acceleration |
| CPU deployment | llama.cpp | Q4_K_M quantization |
| Production | vLLM + K8s | Multiple replicas |


Copyright: unless otherwise stated, this article is copyrighted by sshipanoo; please credit this link when republishing.

(Licensed under CC BY-NC-SA 4.0)

Title: LLM Application Development: Local Model Deployment

Link: https://www.sshipanoo.com/blog/ai/llm-app/本地模型部署/
