Ridge, Lasso, and ElasticNet

Preface

The previous post covered basic linear regression. When a model becomes too complex (too many features), it tends to overfit. Regularization is the classic remedy: it adds a penalty term to the loss function to constrain model complexity.


The Overfitting Problem

What Is Overfitting?

| State | Training error | Test error | Notes |
|---|---|---|---|
| Underfitting | High | High | Model too simple |
| Good fit | Low | Low | Ideal state |
| Overfitting | Very low | High | Model too complex |

Polynomial Regression Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Generate data
n_samples = 30
X = np.sort(np.random.rand(n_samples) * 10).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(n_samples) * 0.3

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Models of increasing complexity
degrees = [1, 3, 15]
X_plot = np.linspace(0, 10, 100).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for ax, degree in zip(axes, degrees):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    ax.scatter(X_train, y_train, label='Train')
    ax.scatter(X_test, y_test, marker='x', label='Test')
    ax.plot(X_plot, model.predict(X_plot), 'r-', label='Prediction')
    ax.set_title(f'Degree={degree}\nTrain R²={train_score:.3f}, Test R²={test_score:.3f}')
    ax.legend()

plt.tight_layout()
plt.show()

Ridge Regression (L2 Regularization)

Principle

Add an L2 penalty term to the squared-error loss (scaled by 1/2 so that the closed-form solution below comes out clean):

\[L(\mathbf{w}) = \frac{1}{2}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2 = \frac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}\]

Closed-Form Solution

\[\mathbf{w}^* = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]
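As a quick numerical sanity check (synthetic data, not from the original post), the closed-form solution can be compared against scikit-learn's Ridge, which solves the same penalized least-squares problem when fit_intercept=False (scikit-learn calls $\lambda$ alpha):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(42)
X = rng.standard_normal((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_normal(50) * 0.1

# Closed-form ridge solution: w* = (X^T X + lambda*I)^{-1} X^T y
lam = 1.0
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Same problem via sklearn (no intercept, so the formula applies directly)
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(w_closed, w_sklearn, atol=1e-6))
```

Note the use of np.linalg.solve rather than explicitly inverting the matrix; it is numerically more stable.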

Implementation from Scratch

class RidgeRegression:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.w = None
        self.b = None
    
    def fit(self, X, y):
        # Add a bias column
        X_b = np.c_[np.ones((len(X), 1)), X]
        n_features = X_b.shape[1]
        
        # Regularization matrix (do not penalize the bias term)
        reg_matrix = self.alpha * np.eye(n_features)
        reg_matrix[0, 0] = 0
        
        # Closed-form solution
        theta = np.linalg.inv(X_b.T @ X_b + reg_matrix) @ X_b.T @ y
        
        self.b = theta[0]
        self.w = theta[1:]
        return self
    
    def predict(self, X):
        return X @ self.w + self.b

# Quick check on the polynomial-example data
ridge = RidgeRegression(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge coefficients: {ridge.w[:5]}")

Using Scikit-learn

from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Build a pipeline
ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=15)),
    ('ridge', Ridge(alpha=1.0))
])

ridge_pipeline.fit(X_train, y_train)
train_score = ridge_pipeline.score(X_train, y_train)
test_score = ridge_pipeline.score(X_test, y_test)
print(f"Ridge: Train R²={train_score:.3f}, Test R²={test_score:.3f}")

Choosing the Regularization Strength

from sklearn.model_selection import cross_val_score

alphas = [0.001, 0.01, 0.1, 1, 10, 100]

for alpha in alphas:
    ridge = Pipeline([
        ('poly', PolynomialFeatures(degree=15)),
        ('ridge', Ridge(alpha=alpha))
    ])
    scores = cross_val_score(ridge, X, y, cv=5, scoring='r2')
    print(f"alpha={alpha:6.3f}: CV R² = {scores.mean():.3f} ± {scores.std():.3f}")

Lasso Regression (L1 Regularization)

Principle

Penalize with the L1 norm:

\[L(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda\|\mathbf{w}\|_1 = \frac{1}{2N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\sum_{j=1}^{p}|w_j|\]

Sparsity

Lasso's key property: it shrinks some coefficients to exactly zero, which amounts to automatic feature selection.
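The mechanism behind those exact zeros is the soft-thresholding operator, which the coordinate descent solver described later in this post applies coordinate-wise. A minimal sketch:

```python
import numpy as np

def soft_threshold(z, t):
    # Values with |z| <= t map to exactly 0; larger values shrink toward 0 by t
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

print(soft_threshold(np.array([-3.0, -0.5, 0.2, 2.0]), 1.0))
# → [-2.  0.  0.  1.]
```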

Implementation

from sklearn.linear_model import Lasso

# High-dimensional example
np.random.seed(42)
n_samples, n_features = 100, 50
n_informative = 5  # only 5 features are actually informative

# Generate data
X_hd = np.random.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:n_informative] = np.random.randn(n_informative) * 5
y_hd = X_hd @ true_coef + np.random.randn(n_samples) * 0.5

# Fit Lasso
lasso = Lasso(alpha=0.5, max_iter=10000)
lasso.fit(X_hd, y_hd)

# Inspect the nonzero coefficients
nonzero = np.sum(lasso.coef_ != 0)
print(f"Nonzero coefficients: {nonzero}")
print(f"True nonzero count: {n_informative}")

# Visualize the coefficients
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.stem(true_coef, linefmt='b-', markerfmt='bo', basefmt='r-')
plt.title('True coefficients')

plt.subplot(1, 2, 2)
plt.stem(lasso.coef_, linefmt='g-', markerfmt='go', basefmt='r-')
plt.title('Lasso-estimated coefficients')

plt.tight_layout()
plt.show()

Coordinate Descent

Lasso has no closed-form solution; it is typically solved with coordinate descent:

def lasso_coordinate_descent(X, y, alpha, max_iter=1000, tol=1e-4):
    """Solve the Lasso problem with coordinate descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    
    for iteration in range(max_iter):
        w_old = w.copy()
        
        for j in range(n_features):
            # Fix all other coordinates and optimize the j-th
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j
            
            # Soft thresholding
            if rho_j < -alpha * n_samples:
                w[j] = (rho_j + alpha * n_samples) / (X[:, j] @ X[:, j])
            elif rho_j > alpha * n_samples:
                w[j] = (rho_j - alpha * n_samples) / (X[:, j] @ X[:, j])
            else:
                w[j] = 0
        
        # Check convergence
        if np.linalg.norm(w - w_old) < tol:
            break
    
    return w
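As a sanity check (synthetic data, not from the original), a compact version of the same update rule can be compared against scikit-learn's Lasso; fit_intercept=False and a tight tol are assumed so the two objectives and solutions line up:

```python
import numpy as np
from sklearn.linear_model import Lasso

def cd_lasso(X, y, alpha, n_sweeps=500):
    # Compact coordinate descent using the same soft-threshold update
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_sweeps):
        for j in range(p):
            r_j = y - X @ w + X[:, j] * w[j]  # residual excluding feature j
            rho_j = X[:, j] @ r_j
            w[j] = np.sign(rho_j) * max(abs(rho_j) - alpha * n, 0.0) / (X[:, j] @ X[:, j])
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.standard_normal(100) * 0.1

w_cd = cd_lasso(X, y, alpha=0.1)
w_sk = Lasso(alpha=0.1, fit_intercept=False, tol=1e-8, max_iter=100000).fit(X, y).coef_
print(np.allclose(w_cd, w_sk, atol=1e-3))
```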

ElasticNet

Principle

Combine the L1 and L2 penalties:

\[L(\mathbf{w}) = \frac{1}{2N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda_1\|\mathbf{w}\|_1 + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2\]

Or, using a mixing parameter $\rho$:

\[L = \text{MSE} + \alpha\left(\rho\|\mathbf{w}\|_1 + \frac{1-\rho}{2}\|\mathbf{w}\|_2^2\right)\]
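In scikit-learn's ElasticNet the mixing parameter $\rho$ is called l1_ratio. As a quick consistency check (made-up data), l1_ratio=1 removes the L2 term and reduces ElasticNet to plain Lasso:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))
y = X @ np.array([1.0, -1.0, 0.0, 0.0]) + rng.standard_normal(100) * 0.1

# With l1_ratio=1 the L2 term vanishes, so the two models coincide
enet = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.allclose(enet.coef_, lasso.coef_))
```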

Advantages

  • When features are highly correlated, Lasso tends to pick one of them arbitrarily
  • ElasticNet tends to select groups of correlated features together

from sklearn.linear_model import ElasticNet

# Create correlated features
X_corr = X_hd.copy()
X_corr[:, n_informative:2*n_informative] = X_corr[:, :n_informative] + np.random.randn(n_samples, n_informative) * 0.1

# Compare models
models = {
    'Lasso': Lasso(alpha=0.1),
    'Ridge': Ridge(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}

for name, model in models.items():
    model.fit(X_corr, y_hd)
    nonzero = np.sum(np.abs(model.coef_) > 1e-6)
    print(f"{name}: nonzero coefficients = {nonzero}")

Comparing the Regularizers

Geometric Interpretation

| Regularizer | Constraint region | Solution behavior |
|---|---|---|
| L2 (Ridge) | Circle (sphere) | Coefficients shrink but stay nonzero |
| L1 (Lasso) | Diamond | Some coefficients become exactly zero |
| ElasticNet | In between | Blend of both behaviors |

Selection Guide

| Scenario | Recommended method |
|---|---|
| All features likely useful | Ridge |
| Feature selection needed | Lasso |
| Highly correlated features | ElasticNet |
| More features than samples | Lasso / ElasticNet |

# Visualize the regularization path
from sklearn.linear_model import lasso_path

alphas_lasso, coefs_lasso, _ = lasso_path(X_hd, y_hd, alphas=np.logspace(-3, 1, 50))

plt.figure(figsize=(10, 6))
for coef in coefs_lasso:
    plt.plot(np.log10(alphas_lasso), coef)

plt.xlabel('log(alpha)')
plt.ylabel('Coefficient value')
plt.title('Lasso regularization path')
plt.axhline(0, color='k', linestyle='--', alpha=0.3)
plt.show()

Selecting Hyperparameters with Cross-Validation

from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

# RidgeCV
ridge_cv = RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5)
ridge_cv.fit(X_hd, y_hd)
print(f"Ridge best alpha: {ridge_cv.alpha_:.4f}")

# LassoCV
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000)
lasso_cv.fit(X_hd, y_hd)
print(f"Lasso best alpha: {lasso_cv.alpha_:.4f}")

# ElasticNetCV
elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1],
    alphas=np.logspace(-4, 1, 50),
    cv=5,
    max_iter=10000
)
elastic_cv.fit(X_hd, y_hd)
print(f"ElasticNet best: alpha={elastic_cv.alpha_:.4f}, l1_ratio={elastic_cv.l1_ratio_}")

Case Study: California Housing Price Prediction

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit several models
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.01),
    'ElasticNet': ElasticNet(alpha=0.01, l1_ratio=0.5)
}

results = []
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    results.append({
        'Model': name,
        'MSE': mse,
        'R²': r2,
        'Non-zero Coefs': np.sum(np.abs(model.coef_) > 1e-6)
    })

# Display results
import pandas as pd
pd.DataFrame(results)

FAQ

Q1: How do I choose the regularization parameter?

Use cross-validation (CV) to select it automatically:

  • RidgeCV, LassoCV, ElasticNetCV
  • Search on a log scale, typically np.logspace(-4, 4, 50)

Q2: Is standardization required?

Yes, always standardize! Because:

  • The L1/L2 penalties are sensitive to coefficient magnitude
  • Features on different scales would be penalized unevenly
  • The penalty strength is only meaningful after standardization
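A toy illustration of the scaling issue (synthetic data; the numbers are made up): the same signal expressed in two different units gets penalized very differently:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.standard_normal(200)
y = 2.0 * x + rng.standard_normal(200) * 0.1

X_m = x.reshape(-1, 1)            # feature in "meters"
X_km = (x / 1000).reshape(-1, 1)  # same feature in "kilometers": coefficient must be 1000x larger

lasso = Lasso(alpha=0.5)
print(lasso.fit(X_m, y).coef_)   # nonzero: the useful feature survives
print(lasso.fit(X_km, y).coef_)  # exactly 0: the penalty wipes out the small-scale unit
```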

Q3: Which is better, Ridge or Lasso?

There is no absolute answer:

  • Ridge: when you believe all features contribute
  • Lasso: when you want a sparse model / feature selection
  • When in doubt, use ElasticNet

Q4: Why does L1 produce sparse solutions?

Geometric explanation: the L1 constraint region is a diamond, and the contours of the loss function tend to first touch it at a corner, where some coordinates are exactly 0.
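A minimal two-feature demonstration of this (synthetic data): with one informative and one irrelevant feature, Lasso lands exactly on a corner of the diamond (second coefficient 0), while Ridge only shrinks it:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
y = 2.0 * X[:, 0] + rng.standard_normal(200) * 0.5  # feature 1 is pure noise

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)
print(lasso.coef_)  # second coefficient is exactly 0
print(ridge.coef_)  # second coefficient is small but nonzero
```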


Summary

| Method | Penalty | Property | Typical use |
|---|---|---|---|
| Ridge | $\lambda\Vert\mathbf{w}\Vert_2^2$ | Coefficient shrinkage | Multicollinearity |
| Lasso | $\lambda\Vert\mathbf{w}\Vert_1$ | Sparse solutions | Feature selection |
| ElasticNet | $\lambda_1\Vert\mathbf{w}\Vert_1 + \lambda_2\Vert\mathbf{w}\Vert_2^2$ | Combines both | Correlated features |

References

  • An Introduction to Statistical Learning (ISLR), Chapter 6
  • The Elements of Statistical Learning, Chapter 3
  • scikit-learn documentation: Regularization

Copyright notice: Unless stated otherwise, this article is copyrighted by sshipanoo. Please include a link to this article when reposting.

(Licensed under CC BY-NC-SA 4.0)

Title: Machine Learning Basics Series: Polynomial Regression and Regularization

Link: http://localhost:3015/ai/%E5%A4%9A%E9%A1%B9%E5%BC%8F%E5%9B%9E%E5%BD%92%E4%B8%8E%E6%AD%A3%E5%88%99%E5%8C%96.html
