Model Compression and Acceleration Techniques

Preface

Quantization and pruning are key techniques for deploying deep learning models: they can significantly reduce model size and inference latency. This article walks through these compression techniques in detail.


Overview of Model Compression

import numpy as np

print("Why compress models:")
print("=" * 50)
print("• Deployment constraints: edge devices have limited memory and compute")
print("• Latency requirements: real-time applications need fast responses")
print("• Cost: reduce cloud inference spend")
print("• Energy: mobile devices run on limited battery")
print()

print("Main compression techniques:")
print("• Quantization: lower the numerical precision")
print("• Pruning: remove unimportant parameters")
print("• Knowledge distillation: a small model learns from a large one")
print("• Neural architecture search: automatically design efficient structures")

Quantization Basics

How Quantization Works
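
Linear quantization maps a floating-point value x to an integer q = round(x / s) + z, where the scale s and zero point z are derived from the tensor's value range; dequantization recovers the approximation s * (q - z). A NumPy implementation: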

def linear_quantize(x, bits=8, symmetric=True):
    """Linear quantization: returns (quantized, dequantized, scale, zero_point)."""
    
    if symmetric:
        # Symmetric quantization: zero point fixed at 0
        abs_max = np.max(np.abs(x))
        scale = abs_max / (2 ** (bits - 1) - 1)
        zero_point = 0
        qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    else:
        # Asymmetric quantization: use the full [min, max] range
        x_min, x_max = np.min(x), np.max(x)
        scale = (x_max - x_min) / (2 ** bits - 1)
        zero_point = int(round(-x_min / scale))
        qmin, qmax = 0, 2 ** bits - 1
    
    # Quantize and clip to the representable integer range
    x_quant = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    
    # Dequantize
    x_dequant = (x_quant - zero_point) * scale
    
    # Pick an integer dtype that can actually hold the quantized range
    # (the asymmetric 8-bit range [0, 255] needs an unsigned type)
    if bits <= 8:
        dtype = np.int8 if symmetric else np.uint8
    else:
        dtype = np.int32
    return x_quant.astype(dtype), x_dequant, scale, zero_point

# Example
np.random.seed(42)
weights = np.random.randn(100) * 0.5

# Quantize at different bit widths
print("Quantization error comparison:")
print("-" * 50)

for bits in [8, 4, 2]:
    _, w_dequant, scale, zp = linear_quantize(weights, bits=bits)
    error = np.mean((weights - w_dequant) ** 2)
    print(f"{bits}-bit quantization: MSE = {error:.6f}, compression = {32/bits:.1f}x")

Visualizing Quantization Error

import matplotlib.pyplot as plt

def visualize_quantization(weights, bits_list=[8, 4, 2]):
    """Plot weight histograms before and after quantization."""
    
    fig, axes = plt.subplots(1, len(bits_list) + 1, figsize=(15, 4))
    
    # Original distribution
    axes[0].hist(weights, bins=50, alpha=0.7)
    axes[0].set_title('Original weights (FP32)')
    axes[0].set_xlabel('Weight value')
    axes[0].set_ylabel('Count')
    
    for i, bits in enumerate(bits_list):
        _, w_dequant, _, _ = linear_quantize(weights, bits=bits)
        
        axes[i+1].hist(w_dequant, bins=50, alpha=0.7)
        axes[i+1].set_title(f'{bits}-bit quantized')
        axes[i+1].set_xlabel('Weight value')
    
    plt.tight_layout()
    plt.show()

# Visualize
weights_large = np.random.randn(10000) * 0.5
visualize_quantization(weights_large)

Post-Training Quantization (PTQ)

try:
    import torch
    import torch.nn as nn
    
    class PostTrainingQuantizer:
        """Post-training quantization"""
        
        def __init__(self, bits=8):
            self.bits = bits
        
        def calibrate(self, model, dataloader, num_batches=100):
            """Calibration: collect activation range statistics."""
            
            activation_ranges = {}
            
            def hook_fn(name):
                def hook(module, input, output):
                    if name not in activation_ranges:
                        activation_ranges[name] = {'min': float('inf'), 
                                                   'max': float('-inf')}
                    activation_ranges[name]['min'] = min(
                        activation_ranges[name]['min'], 
                        output.min().item()
                    )
                    activation_ranges[name]['max'] = max(
                        activation_ranges[name]['max'], 
                        output.max().item()
                    )
                return hook
            
            # Register hooks on the layers we want to quantize
            hooks = []
            for name, module in model.named_modules():
                if isinstance(module, (nn.Linear, nn.Conv2d)):
                    hooks.append(module.register_forward_hook(hook_fn(name)))
            
            # Run calibration data through the model
            model.eval()
            with torch.no_grad():
                for i, (data, _) in enumerate(dataloader):
                    if i >= num_batches:
                        break
                    model(data)
            
            # Remove hooks
            for hook in hooks:
                hook.remove()
            
            return activation_ranges
        
        def quantize_tensor(self, tensor, min_val, max_val):
            """Quantize a tensor using the calibrated range."""
            scale = (max_val - min_val) / (2 ** self.bits - 1)
            zero_point = int(round(-min_val / scale))
            
            q_tensor = torch.round(tensor / scale) + zero_point
            q_tensor = torch.clamp(q_tensor, 0, 2 ** self.bits - 1)
            
            # The asymmetric range [0, 2^bits - 1] needs an unsigned dtype
            return q_tensor.to(torch.uint8), scale, zero_point
    
    print("PTQ workflow:")
    print("  1. Prepare calibration data (typically 100-1000 samples)")
    print("  2. Run forward passes to collect activation ranges")
    print("  3. Compute quantization parameters (scale, zero_point)")
    print("  4. Quantize weights and activations")
    
except ImportError:
    print("PyTorch is not installed")

Quantization-Aware Training (QAT)

try:
    class FakeQuantize(nn.Module):
        """Fake-quantization module (for QAT)"""
        
        def __init__(self, bits=8):
            super().__init__()
            self.bits = bits
            self.register_buffer('scale', torch.tensor(1.0))
            self.register_buffer('zero_point', torch.tensor(0.0))
            self.register_buffer('min_val', torch.tensor(float('inf')))
            self.register_buffer('max_val', torch.tensor(float('-inf')))
        
        def forward(self, x):
            if self.training:
                # Update running range statistics
                self.min_val = torch.min(self.min_val, x.min())
                self.max_val = torch.max(self.max_val, x.max())
                
                # Recompute scale and zero_point (clamp guards against
                # division by zero on a constant first batch)
                self.scale = torch.clamp(
                    (self.max_val - self.min_val) / (2 ** self.bits - 1), min=1e-8)
                self.zero_point = torch.round(-self.min_val / self.scale)
            
            # Fake quantization: quantize, then immediately dequantize
            x_q = torch.round(x / self.scale) + self.zero_point
            x_q = torch.clamp(x_q, 0, 2 ** self.bits - 1)
            x_dq = (x_q - self.zero_point) * self.scale
            
            # Straight-through estimator: quantized values in the forward pass,
            # unmodified gradients in the backward pass
            return x + (x_dq - x).detach()
    
    
    class QATLinear(nn.Module):
        """Linear layer with quantization-aware training"""
        
        def __init__(self, in_features, out_features, bits=8):
            super().__init__()
            
            self.linear = nn.Linear(in_features, out_features)
            self.weight_quantizer = FakeQuantize(bits)
            self.activation_quantizer = FakeQuantize(bits)
        
        def forward(self, x):
            # Quantize weights
            q_weight = self.weight_quantizer(self.linear.weight)
            
            # Linear transform
            out = nn.functional.linear(x, q_weight, self.linear.bias)
            
            # Quantize activations
            out = self.activation_quantizer(out)
            
            return out
    
    print("QAT vs PTQ:")
    print("  PTQ: quantize after training; simple, but accuracy can drop more")
    print("  QAT: simulate quantization during training; higher accuracy, but requires retraining")
    
except NameError:
    print("PyTorch must be imported first")

Model Pruning

Unstructured Pruning

def unstructured_pruning(weights, sparsity=0.5):
    """Unstructured pruning: remove weights by magnitude."""
    
    # Compute the magnitude threshold
    threshold = np.percentile(np.abs(weights), sparsity * 100)
    
    # Build the mask
    mask = np.abs(weights) > threshold
    
    # Apply the mask
    pruned_weights = weights * mask
    
    actual_sparsity = 1 - np.count_nonzero(pruned_weights) / weights.size
    
    return pruned_weights, mask, actual_sparsity

# Test
weights = np.random.randn(100, 100) * 0.5

for sparsity in [0.5, 0.7, 0.9]:
    pruned, mask, actual = unstructured_pruning(weights, sparsity)
    print(f"Target sparsity: {sparsity*100:.0f}%, actual: {actual*100:.1f}%")
    print(f"  Non-zero parameters: {np.count_nonzero(pruned)}/{weights.size}")

Structured Pruning

def structured_pruning_channels(weights, prune_ratio=0.5):
    """Structured pruning: remove whole output channels."""
    
    # weights: [out_channels, in_channels, H, W] or [out, in]
    
    # Importance of each output channel (L1 norm)
    if weights.ndim == 4:
        importance = np.sum(np.abs(weights), axis=(1, 2, 3))
    else:
        importance = np.sum(np.abs(weights), axis=1)
    
    # Number of channels to keep
    n_keep = int(len(importance) * (1 - prune_ratio))
    
    # Keep the most important channels
    keep_indices = np.argsort(importance)[-n_keep:]
    
    # Prune
    pruned_weights = weights[sorted(keep_indices)]
    
    return pruned_weights, sorted(keep_indices)

# Test
conv_weights = np.random.randn(64, 32, 3, 3)  # 64 output channels, 3x3 kernels

pruned, kept = structured_pruning_channels(conv_weights, prune_ratio=0.5)
print(f"Original shape: {conv_weights.shape}")
print(f"After pruning: {pruned.shape}")
print(f"Channels kept: {len(kept)}")

Iterative Pruning

try:
    class IterativePruner:
        """Iterative magnitude pruning"""
        
        def __init__(self, model, final_sparsity=0.9, num_iterations=10):
            self.model = model
            self.final_sparsity = final_sparsity
            self.num_iterations = num_iterations
            self.masks = {}
        
        def compute_mask(self, weight, sparsity):
            """Compute the pruning mask."""
            threshold = torch.quantile(weight.abs().flatten(), sparsity)
            return (weight.abs() > threshold).float()
        
        def prune_step(self, iteration):
            """One pruning step."""
            # Ramp the sparsity up gradually
            current_sparsity = self.final_sparsity * (iteration / self.num_iterations)
            
            for name, module in self.model.named_modules():
                if isinstance(module, nn.Linear):
                    mask = self.compute_mask(module.weight.data, current_sparsity)
                    self.masks[name] = mask
                    module.weight.data *= mask
            
            return current_sparsity
        
        def apply_masks(self):
            """Re-apply the masks (e.g. after a fine-tuning step)."""
            for name, module in self.model.named_modules():
                if name in self.masks:
                    module.weight.data *= self.masks[name]
    
    print("Iterative pruning strategy:")
    print("  1. Train the full model")
    print("  2. Prune a small fraction of the parameters")
    print("  3. Fine-tune to recover accuracy")
    print("  4. Repeat until the target sparsity is reached")
    
except NameError:
    print("PyTorch must be imported first")

Combining Quantization and Pruning

def compress_model_stats(original_params, sparsity=0.9, bits=8):
    """Estimate model size after compression."""
    
    # Original size (FP32)
    original_size = original_params * 4  # 4 bytes per parameter
    
    # Size after quantization
    quantized_size = original_params * bits / 8
    
    # Size after pruning (assuming sparse storage)
    # Must store: non-zero values + their indices
    non_zero = original_params * (1 - sparsity)
    pruned_quantized_size = non_zero * (bits / 8 + 4)  # value + 32-bit index
    
    # With block sparsity: one bitmap bit per block instead of per-value indices
    block_size = 4
    block_sparse_size = non_zero * bits / 8 + (original_params / block_size) / 8
    
    print(f"Original model: {original_size/1e6:.1f} MB")
    print(f"Quantized ({bits}-bit): {quantized_size/1e6:.1f} MB ({quantized_size/original_size*100:.1f}%)")
    print(f"Pruned + quantized ({sparsity*100:.0f}% sparse, {bits}-bit): {pruned_quantized_size/1e6:.1f} MB ({pruned_quantized_size/original_size*100:.1f}%)")
    print(f"Block-sparse + quantized: {block_sparse_size/1e6:.1f} MB ({block_sparse_size/original_size*100:.1f}%)")

# Compressing a 7B-parameter model
compress_model_stats(7e9, sparsity=0.5, bits=4)

Comparison of Quantization Methods

Method            Accuracy           Speedup   Typical use
INT8 PTQ          Slight drop        2-4x      General deployment
INT8 QAT          Nearly lossless    2-4x      High-accuracy requirements
INT4              Moderate drop      4-8x      Edge devices
Mixed precision   Minimal drop       1.5-2x    Training acceleration
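
As a concrete INT8 PTQ example, PyTorch's built-in dynamic quantization converts the Linear layers of a trained model to INT8 in a single call. A minimal sketch (the toy model is just an illustration):

import torch
import torch.nn as nn

# A toy FP32 model standing in for a real network
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic quantization: weights stored as INT8, activations quantized on the fly
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

print(quantized)  # the Linear layers are replaced by dynamic quantized variants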

FAQ

Q1: How much accuracy is lost to quantization?

  • INT8 PTQ: usually <1%
  • INT4: possibly 2-5%, depending on the task
  • QAT can significantly reduce the loss

Q2: Is fine-tuning needed after pruning?

Yes. Fine-tuning is usually needed to recover accuracy.

Q3: Structured or unstructured pruning?

  • Unstructured: higher compression, but needs special hardware for real speedups
  • Structured: hardware-friendly, but lower compression

Q4: How should the quantization bit width be chosen?

  • INT8: balances accuracy and speed
  • INT4: maximum compression
  • Mixed precision: keep sensitive layers at higher precision (a sketch follows)
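
A minimal sketch of that last point: a hypothetical per-layer bit plan that keeps the first and last layers (commonly the most sensitive) at higher precision. The layer names and the assign_bits helper are illustrative, not a library API:

def assign_bits(layer_names, default_bits=4, sensitive_bits=8):
    """Hypothetical mixed-precision plan: protect the first and last layers."""
    plan = {name: default_bits for name in layer_names}
    if layer_names:
        plan[layer_names[0]] = sensitive_bits   # e.g. input embedding
        plan[layer_names[-1]] = sensitive_bits  # e.g. output head
    return plan

print(assign_bits(['embed', 'block1', 'block2', 'head']))
# {'embed': 8, 'block1': 4, 'block2': 4, 'head': 8}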

The LLM-Era Quantization Trio

For large language models, plain linear quantization often causes severe accuracy loss. Current industry practice centers on the following three techniques:

1. GPTQ (Post-Training Quantization)

  • Idea: uses second-order information (the Hessian matrix) and adjusts the remaining weights to minimize the output error between the original and quantized layer (see the simplified sketch below).
  • Strengths: fast to quantize, well suited to 4-bit, excellent inference performance.
  • Typical use: GPU inference (AutoGPTQ).
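
A heavily simplified sketch of GPTQ's inner loop. It assumes a precomputed inverse Hessian H_inv and a round-to-grid quantize_fn; real implementations process weights in blocks and use a Cholesky factorization of H_inv:

import numpy as np

def gptq_style_quantize_row(w, H_inv, quantize_fn):
    """Quantize one weight row column by column, compensating the error."""
    w = w.astype(np.float64).copy()
    q = np.zeros_like(w)
    for j in range(len(w)):
        q[j] = quantize_fn(w[j])
        # Spread this column's quantization error onto the remaining columns,
        # weighted by the inverse Hessian
        err = (w[j] - q[j]) / H_inv[j, j]
        w[j + 1:] -= err * H_inv[j, j + 1:]
    return q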

2. AWQ (Activation-aware Weight Quantization)

  • Idea: observes that only about 1% of weights are “salient” and critical for accuracy. AWQ inspects the activation distribution and protects these salient weights (see the sketch below).
  • Strengths: preserves accuracy better than GPTQ and is less sensitive to the particular calibration dataset.
  • Typical use: high-performance GPU inference (supported by vLLM out of the box).
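
A minimal NumPy sketch of the AWQ idea, reusing the linear_quantize helper defined earlier. The fixed protection factor of 2.0 is a placeholder; the real method grid-searches a per-channel scale:

def awq_style_protect(W, act_magnitude, bits=4, protect_ratio=0.01):
    """Scale up salient input channels before quantization, then fold back.
    W: [out, in]; act_magnitude: mean |activation| per input channel."""
    s = np.ones(W.shape[1])
    n_protect = max(1, int(round(W.shape[1] * protect_ratio)))
    salient = np.argsort(act_magnitude)[-n_protect:]  # most active channels
    s[salient] = 2.0  # placeholder scale; AWQ searches for the best value
    
    _, W_dq, _, _ = linear_quantize(W * s, bits=bits)
    # At inference the layer input would be divided by s instead;
    # here we fold the scale back for a drop-in weight replacement
    return W_dq / s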

3. GGUF (llama.cpp)

  • Idea: a file format designed for CPU inference, supporting many quantization levels (q2_k, q4_k_m, q8_0, etc.).
  • Strengths: supports Apple Silicon (Metal) acceleration and lets the model keep some layers in system memory when VRAM is insufficient (see the usage sketch below).
  • Typical use: local deployment, personal computers, edge devices.
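
A possible local-inference sketch with the llama-cpp-python bindings (the model path is a placeholder; n_gpu_layers controls how many layers are offloaded to the GPU, with the rest staying in system memory):

from llama_cpp import Llama

# Hypothetical path to a 4-bit GGUF model file
llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=20)

out = llm("Explain model quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])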

Pruning

Pruning reduces computation by removing “unimportant” connections or neurons.

Type                   Description                        Pros                                       Cons
Unstructured pruning   Removes individual weight entries  Smallest accuracy loss                     Needs special hardware support for speedups
Structured pruning     Removes whole channels or layers   Speeds up directly on commodity hardware   Relatively larger accuracy loss
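
Both flavors are available in PyTorch's torch.nn.utils.prune module; a minimal sketch (note that structured pruning here zeroes whole channels rather than physically shrinking the tensor):

import torch.nn as nn
import torch.nn.utils.prune as prune

fc = nn.Linear(128, 64)
conv = nn.Conv2d(32, 64, 3)

# Unstructured: zero the 50% smallest-magnitude weights of the Linear layer
prune.l1_unstructured(fc, name='weight', amount=0.5)

# Structured: zero 25% of the Conv2d output channels by L2 norm (dim=0)
prune.ln_structured(conv, name='weight', amount=0.25, n=2, dim=0)

# Make pruning permanent: fold the mask into the weight tensor
prune.remove(fc, 'weight')
prune.remove(conv, 'weight')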

Knowledge Distillation

Train a small model (the student) to mimic the output probability distribution of a large model (the teacher).

# Example distillation loss
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # 1. Soft-target loss (KL divergence between temperature-softened
    #    distributions); 'batchmean' matches the definition of KL divergence
    soft_loss = nn.KLDivLoss(reduction='batchmean')(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits.detach() / T, dim=1)
    ) * (T * T)
    
    # 2. Hard-target loss (cross-entropy against the true labels)
    hard_loss = F.cross_entropy(student_logits, labels)
    
    return alpha * soft_loss + (1 - alpha) * hard_loss
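
A runnable toy usage of the loss above; the two Linear layers are stand-ins for a real teacher/student pair:

import torch

teacher = nn.Linear(16, 4)   # stand-in for a trained teacher
student = nn.Linear(16, 4)   # stand-in for a smaller student
x = torch.randn(8, 16)
y = torch.randint(0, 4, (8,))

teacher.eval()
with torch.no_grad():
    teacher_logits = teacher(x)  # soft targets from the teacher, no gradients

loss = distillation_loss(student(x), teacher_logits, y, T=2.0, alpha=0.5)
loss.backward()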

Summary

Technique              Idea                        Compression
INT8 quantization      Lower numerical precision   4x
INT4 quantization      Very low precision          8x
Unstructured pruning   Remove small weights        10-20x
Structured pruning     Remove whole channels       2-4x

References

  • Jacob, B. et al. (2018). “Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference”
  • Han, S. et al. (2015). “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”
  • Frankle, J. & Carbin, M. (2019). “The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks”
  • Dettmers, T. et al. (2022). “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”
