MSE, Cross-Entropy, and Gradient Descent Variants

Preface

The loss function is the soul of machine learning: it defines what counts as a "good prediction". Choosing an appropriate loss function, paired with an efficient optimization algorithm, is key to training a successful model.


Overview of Loss Functions

What Is a Loss Function?

A loss function measures the gap between a model's predictions and the ground-truth values:

\[L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(y_i, \hat{y}_i)\]
  • $y_i$: ground-truth label
  • $\hat{y}_i$: model prediction
  • $\theta$: model parameters
  • $\ell$: the per-sample loss

Choosing a Loss Function

| Task type | Recommended loss |
| --- | --- |
| Regression | MSE, MAE, Huber |
| Binary classification | Binary cross-entropy |
| Multi-class classification | Categorical cross-entropy |
| Ranking | Contrastive loss, Triplet loss |

Regression Loss Functions

Mean Squared Error (MSE)

\[L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2\]
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mse_gradient(y_true, y_pred):
    return -2 * (y_true - y_pred) / len(y_true)

# Example
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 2.2, 2.8, 4.1])

loss = mse_loss(y_true, y_pred)
print(f"MSE Loss: {loss:.4f}")  # 0.025

Properties

  • Penalizes large errors more heavily (squared term)
  • Differentiable, with a simple gradient
  • Sensitive to outliers

Mean Absolute Error (MAE)

\[L_{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|\]
def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

loss = mae_loss(y_true, y_pred)
print(f"MAE Loss: {loss:.4f}")  # 0.15

Properties

  • More robust to outliers
  • Not differentiable at zero
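
The difference in outlier sensitivity is easy to check numerically; a small sketch (the sample arrays are made up for illustration):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae_loss(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_clean = y_true + 0.1              # uniform small error
y_outlier = y_clean.copy()
y_outlier[-1] += 10.0               # one badly corrupted prediction

# MSE inflates far more than MAE when a single outlier appears
mse_ratio = mse_loss(y_true, y_outlier) / mse_loss(y_true, y_clean)
mae_ratio = mae_loss(y_true, y_outlier) / mae_loss(y_true, y_clean)
print(f"MSE ratio: {mse_ratio:.0f}, MAE ratio: {mae_ratio:.0f}")
```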

Huber Loss

Combines the strengths of MSE and MAE:

\[L_{\delta}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & |y - \hat{y}| \leq \delta \\ \delta(|y - \hat{y}| - \frac{1}{2}\delta) & |y - \hat{y}| > \delta \end{cases}\]
def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    
    squared_loss = 0.5 * error ** 2
    linear_loss = delta * np.abs(error) - 0.5 * delta ** 2
    
    return np.mean(np.where(is_small, squared_loss, linear_loss))

loss = huber_loss(y_true, y_pred, delta=0.5)
print(f"Huber Loss: {loss:.4f}")

Comparing the Loss Functions

import matplotlib.pyplot as plt

errors = np.linspace(-3, 3, 100)

mse = errors ** 2
mae = np.abs(errors)
huber = np.where(np.abs(errors) <= 1, 0.5 * errors**2, np.abs(errors) - 0.5)

plt.figure(figsize=(10, 6))
plt.plot(errors, mse, label='MSE')
plt.plot(errors, mae, label='MAE')
plt.plot(errors, huber, label='Huber (δ=1)')
plt.xlabel('Prediction error')
plt.ylabel('Loss')
plt.legend()
plt.title('Regression loss function comparison')
plt.grid(True)
plt.show()

Classification Loss Functions

Binary Cross-Entropy (BCE)

\[L_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]\]
def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(
        y_true * np.log(y_pred) + 
        (1 - y_true) * np.log(1 - y_pred)
    )

def bce_gradient(y_true, y_pred, epsilon=1e-15):
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) / len(y_true)

# Example
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.7, 0.2])

loss = binary_cross_entropy(y_true, y_pred)
print(f"BCE Loss: {loss:.4f}")  # ≈ 0.2027

Categorical Cross-Entropy (CCE)

\[L_{CCE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} y_{ic} \log(\hat{y}_{ic})\]
def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """
    y_true: one-hot encoded labels, shape (N, C)
    y_pred: softmax outputs, shape (N, C)
    """
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Example: 3-class classification
y_true = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1]
])
y_pred = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6]
])

loss = categorical_cross_entropy(y_true, y_pred)
print(f"CCE Loss: {loss:.4f}")

Cross-Entropy and Information Theory

The information-theoretic interpretation of cross-entropy:

  • $H(p)$: the entropy of the true distribution $p$
  • $H(p, q)$: the average code length when encoding samples from $p$ with a code optimized for $q$
  • $D_{KL}(p \| q) = H(p, q) - H(p)$: the KL divergence

Since $H(p)$ does not depend on the model, minimizing cross-entropy is equivalent to minimizing the KL divergence.
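
The identity $H(p, q) = H(p) + D_{KL}(p \| q)$ can be verified numerically; a quick sketch with a made-up pair of distributions:

```python
import numpy as np

p = np.array([0.7, 0.3])   # "true" distribution (hypothetical)
q = np.array([0.5, 0.5])   # model distribution (hypothetical)

entropy = -np.sum(p * np.log(p))          # H(p)
cross_entropy = -np.sum(p * np.log(q))    # H(p, q)
kl = np.sum(p * np.log(p / q))            # D_KL(p || q)

# H(p, q) = H(p) + D_KL(p || q), and KL >= 0,
# so cross-entropy is minimized exactly when q matches p
print(f"H(p)={entropy:.4f}, H(p,q)={cross_entropy:.4f}, KL={kl:.4f}")
```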


Other Common Loss Functions

Focal Loss

Designed for class-imbalanced problems (the formula below is shown for the positive class):

\[L_{FL} = -\alpha (1-\hat{y})^\gamma \log(\hat{y})\]
def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    # Cross-entropy term
    ce = -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)
    
    # Modulating factor
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    modulating_factor = (1 - p_t) ** gamma
    
    # Alpha class weighting
    alpha_weight = y_true * alpha + (1 - y_true) * (1 - alpha)
    
    return np.mean(alpha_weight * modulating_factor * ce)
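
To see the down-weighting at work, compare an easy example (confident and correct) with a hard one. The snippet restates focal_loss so it runs standalone; the probabilities are made up:

```python
import numpy as np

def focal_loss(y_true, y_pred, gamma=2.0, alpha=0.25):
    # same implementation as above, restated so this snippet is self-contained
    eps = 1e-15
    y_pred = np.clip(y_pred, eps, 1 - eps)
    ce = -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)
    p_t = y_true * y_pred + (1 - y_true) * (1 - y_pred)
    alpha_weight = y_true * alpha + (1 - y_true) * (1 - alpha)
    return np.mean(alpha_weight * (1 - p_t) ** gamma * ce)

easy = focal_loss(np.array([1.0]), np.array([0.95]))  # well-classified positive
hard = focal_loss(np.array([1.0]), np.array([0.30]))  # misclassified positive

# The (1 - p_t)^gamma factor shrinks the easy example's loss drastically,
# so hard examples dominate the training signal
print(f"easy={easy:.6f}, hard={hard:.6f}, ratio={hard / easy:.0f}")
```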

Contrastive Loss

Used for similarity learning:

\[L = (1-y) \frac{1}{2} D^2 + y \frac{1}{2} \max(0, m-D)^2\]
def contrastive_loss(y, d, margin=1.0):
    """
    y: 1 for dissimilar pairs, 0 for similar pairs
    d: distance between the two samples in the pair
    """
    return np.mean(
        (1 - y) * 0.5 * d**2 + 
        y * 0.5 * np.maximum(0, margin - d)**2
    )

Triplet Loss

\[L = \max(0, d(a, p) - d(a, n) + m)\]
def triplet_loss(anchor, positive, negative, margin=1.0):
    """
    Triplet loss
    anchor: anchor sample
    positive: positive sample (same class as the anchor)
    negative: negative sample (different class from the anchor)
    """
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return np.maximum(0, d_pos - d_neg + margin)
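
A tiny usage sketch with made-up 2-D embeddings (the vectors are arbitrary, chosen so the distances are easy to read off):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # restated from above so the snippet is self-contained
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return np.maximum(0, d_pos - d_neg + margin)

anchor   = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])   # distance 1 from the anchor

# Negative well beyond the margin: constraint satisfied, loss is zero
far_negative = np.array([3.0, 0.0])
print(triplet_loss(anchor, positive, far_negative))   # 0.0

# Negative too close: d_pos - d_neg + margin = 1 - 1.5 + 1 = 0.5
near_negative = np.array([1.5, 0.0])
print(triplet_loss(anchor, positive, near_negative))  # 0.5
```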

Optimization Algorithms

Batch Gradient Descent (BGD)

\[\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta; X, y)\]
def batch_gradient_descent(X, y, theta, lr=0.01, n_iters=1000):
    m = len(y)
    history = []
    
    for _ in range(n_iters):
        gradient = (1/m) * X.T @ (X @ theta - y)
        theta = theta - lr * gradient
        loss = np.mean((X @ theta - y) ** 2)
        history.append(loss)
    
    return theta, history

Stochastic Gradient Descent (SGD)

Updates the parameters using a single sample at a time:

def stochastic_gradient_descent(X, y, theta, lr=0.01, n_epochs=100):
    m = len(y)
    history = []
    
    for epoch in range(n_epochs):
        indices = np.random.permutation(m)
        for i in indices:
            xi = X[i:i+1]
            yi = y[i:i+1]
            gradient = xi.T @ (xi @ theta - yi)
            theta = theta - lr * gradient
        
        loss = np.mean((X @ theta - y) ** 2)
        history.append(loss)
    
    return theta, history

Mini-batch Gradient Descent

def minibatch_gradient_descent(X, y, theta, lr=0.01, batch_size=32, n_epochs=100):
    m = len(y)
    history = []
    
    for epoch in range(n_epochs):
        indices = np.random.permutation(m)
        X_shuffled = X[indices]
        y_shuffled = y[indices]
        
        for i in range(0, m, batch_size):
            xi = X_shuffled[i:i+batch_size]
            yi = y_shuffled[i:i+batch_size]
            gradient = (1/len(yi)) * xi.T @ (xi @ theta - yi)
            theta = theta - lr * gradient
        
        loss = np.mean((X @ theta - y) ** 2)
        history.append(loss)
    
    return theta, history

Momentum and Adaptive Learning Rates

Momentum

\[v_t = \gamma v_{t-1} + \eta \nabla_\theta L, \qquad \theta_t = \theta_{t-1} - v_t\]

def sgd_momentum(X, y, theta, lr=0.01, momentum=0.9, n_epochs=100):
    velocity = np.zeros_like(theta)
    history = []
    
    for epoch in range(n_epochs):
        gradient = (1/len(y)) * X.T @ (X @ theta - y)
        velocity = momentum * velocity + lr * gradient
        theta = theta - velocity
        
        loss = np.mean((X @ theta - y) ** 2)
        history.append(loss)
    
    return theta, history

RMSprop

\[s_t = \rho s_{t-1} + (1-\rho) g_t^2, \qquad \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{s_t} + \epsilon} g_t\]

def rmsprop(X, y, theta, lr=0.01, rho=0.9, epsilon=1e-8, n_epochs=100):
    s = np.zeros_like(theta)
    history = []
    
    for epoch in range(n_epochs):
        gradient = (1/len(y)) * X.T @ (X @ theta - y)
        s = rho * s + (1 - rho) * gradient ** 2
        theta = theta - lr * gradient / (np.sqrt(s) + epsilon)
        
        loss = np.mean((X @ theta - y) ** 2)
        history.append(loss)
    
    return theta, history

Adam

Combines Momentum with RMSprop:

\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\]
\[\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]

def adam(X, y, theta, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, n_epochs=100):
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    history = []
    
    for t in range(1, n_epochs + 1):
        gradient = (1/len(y)) * X.T @ (X @ theta - y)
        
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * gradient ** 2
        
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)
        
        loss = np.mean((X @ theta - y) ** 2)
        history.append(loss)
    
    return theta, history
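
The update rules above can be compared on a toy one-dimensional problem. This sketch minimizes f(θ) = (θ − 3)² with plain gradient descent and with the Adam updates from above specialized to a scalar; the hyperparameters are illustrative, not tuned:

```python
import numpy as np

def grad(theta):
    # gradient of f(theta) = (theta - 3)^2
    return 2.0 * (theta - 3.0)

# Plain gradient descent
theta_gd = 0.0
for _ in range(200):
    theta_gd -= 0.1 * grad(theta_gd)

# Adam, scalar version of the update rule above
theta_adam, m, v = 0.0, 0.0, 0.0
beta1, beta2, lr, eps = 0.9, 0.999, 0.01, 1e-8
for t in range(1, 2001):
    g = grad(theta_adam)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta_adam -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(f"GD: {theta_gd:.4f}, Adam: {theta_adam:.4f}")  # both approach 3
```

Note that early on Adam's effective step is roughly the learning rate regardless of gradient magnitude, which is why it needs more iterations here but is far less sensitive to gradient scale on real problems.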

Optimizer Comparison

| Optimizer | Characteristics | Best suited for |
| --- | --- | --- |
| SGD | Simple; requires careful tuning | Convex problems |
| Momentum | Accelerates convergence | Noisy gradients |
| RMSprop | Adaptive learning rate | Non-stationary objectives |
| Adam | Strong all-around performance | Default choice |

Learning Rate Scheduling

Learning Rate Decay

def learning_rate_decay(initial_lr, epoch, decay_rate=0.1, decay_steps=10):
    return initial_lr * (decay_rate ** (epoch // decay_steps))

# Cosine annealing
def cosine_annealing(initial_lr, epoch, T_max):
    return initial_lr * (1 + np.cos(np.pi * epoch / T_max)) / 2

# Warmup
def warmup_lr(initial_lr, epoch, warmup_epochs=5):
    if epoch < warmup_epochs:
        return initial_lr * (epoch + 1) / warmup_epochs
    return initial_lr
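
Warmup and cosine annealing are often combined in practice; below is a sketch of one common way to stitch them together (the function name and defaults are made up for this example):

```python
import math

def warmup_cosine_lr(initial_lr, epoch, warmup_epochs=5, total_epochs=100):
    """Linear warmup followed by cosine decay toward zero (illustrative)."""
    if epoch < warmup_epochs:
        # ramp linearly from initial_lr / warmup_epochs up to initial_lr
        return initial_lr * (epoch + 1) / warmup_epochs
    # fraction of the post-warmup schedule completed, in [0, 1)
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return initial_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Ramps up for 5 epochs, then decays smoothly toward zero
lrs = [warmup_cosine_lr(0.1, e) for e in range(100)]
print(f"epoch 0: {lrs[0]:.4f}, epoch 4: {lrs[4]:.4f}, epoch 99: {lrs[99]:.6f}")
```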

Common Questions

Q1: Why not use MSE for classification?

  • With a sigmoid or softmax output, the MSE gradient becomes tiny when the predicted probability is near 0 or 1 (vanishing gradients)
  • The cross-entropy gradient is proportional to the error, which is better for optimization
  • Cross-entropy has a probabilistic justification (maximum likelihood)
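
The first point can be checked directly: for a sigmoid output σ(z), the BCE gradient with respect to the logit is σ(z) − y, while the MSE gradient picks up an extra σ(z)(1 − σ(z)) factor that vanishes for confident predictions. A quick check on a confidently wrong prediction (the logit value is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = 6.0, 0.0          # model is confidently wrong: sigmoid(6) ≈ 0.998
p = sigmoid(z)

# d/dz of (p - y)^2: the chain rule brings in p * (1 - p), which is ~0 here
mse_grad = 2 * (p - y) * p * (1 - p)
# d/dz of binary cross-entropy: the sigmoid factor cancels, leaving p - y
bce_grad = p - y

print(f"MSE grad: {mse_grad:.5f}, BCE grad: {bce_grad:.5f}")
```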

Q2: Is Adam always the best choice?

Not necessarily. In some situations:

  • SGD with momentum generalizes better on image classification
  • AdamW (Adam with decoupled weight decay) usually outperforms plain Adam
  • The right choice depends on the task; experiment

Q3: How should the batch size be chosen?

  • Larger batches: more stable training, but generalization may suffer
  • Smaller batches: noisier updates, which can help escape poor local minima
  • In practice, 32, 64, 128, or 256 are common

Q4: How do I tell whether training has converged?

  • The training loss stops decreasing
  • The validation loss starts rising (a sign of overfitting)
  • The gradient norm approaches zero

Summary

| Category | Common options | Default recommendation |
| --- | --- | --- |
| Regression loss | MSE, MAE, Huber | MSE |
| Classification loss | BCE, CCE | Cross-entropy |
| Optimizer | SGD, Adam, AdamW | Adam |
| Learning rate | 1e-3 ~ 1e-4 | 3e-4 |


Copyright notice: Unless otherwise stated, the copyright of this article belongs to sshipanoo. Please credit this article's link when reposting.

(Licensed under the CC BY-NC-SA 4.0 license)

Title: "Machine Learning Fundamentals Series: Loss Functions and Optimization"
