Ridge, Lasso, and ElasticNet

Introduction

The previous post covered basic linear regression. When a model becomes too complex (too many features), it tends to overfit. Regularization is the classic remedy: it adds a penalty term to the loss function to constrain model complexity.
The Overfitting Problem

What Is Overfitting
| State | Training error | Test error | Meaning |
|---|---|---|---|
| Underfitting | High | High | Model too simple |
| Good fit | Low | Low | The ideal |
| Overfitting | Very low | High | Model too complex |
A Polynomial Regression Example
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Generate data
n_samples = 30
X = np.sort(np.random.rand(n_samples) * 10).reshape(-1, 1)
y = np.sin(X).ravel() + np.random.randn(n_samples) * 0.3
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Models of increasing complexity
degrees = [1, 3, 15]
X_plot = np.linspace(0, 10, 100).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, degree in zip(axes, degrees):
    model = Pipeline([
        ('poly', PolynomialFeatures(degree=degree)),
        ('linear', LinearRegression())
    ])
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)

    ax.scatter(X_train, y_train, label='train')
    ax.scatter(X_test, y_test, marker='x', label='test')
    ax.plot(X_plot, model.predict(X_plot), 'r-', label='prediction')
    ax.set_title(f'Degree={degree}\nTrain R²={train_score:.3f}, Test R²={test_score:.3f}')
    ax.legend()
plt.tight_layout()
plt.show()
```
Ridge Regression (L2 Regularization)

Principle

Add an L2 penalty on top of the MSE:
\[L(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2\]

\[= \frac{1}{2N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}\]

Closed-Form Solution
\[\mathbf{w}^* = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}\]

(Strictly, with the $\frac{1}{2N}$ loss above the penalty enters as $N\lambda\mathbf{I}$; the constant factor is conventionally absorbed into $\lambda$.)

Implementation from Scratch
```python
class RidgeRegression:
    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.w = None
        self.b = None

    def fit(self, X, y):
        # Prepend a bias column
        X_b = np.c_[np.ones((len(X), 1)), X]
        n_features = X_b.shape[1]
        # Regularization matrix (do not penalize the bias term)
        reg_matrix = self.alpha * np.eye(n_features)
        reg_matrix[0, 0] = 0
        # Closed-form solution
        theta = np.linalg.inv(X_b.T @ X_b + reg_matrix) @ X_b.T @ y
        self.b = theta[0]
        self.w = theta[1:]
        return self

    def predict(self, X):
        return X @ self.w + self.b

# Quick test
ridge = RidgeRegression(alpha=1.0)
ridge.fit(X_train, y_train)
print(f"Ridge coefficients: {ridge.w[:5]}")
```
Using Scikit-learn
```python
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Build a pipeline
ridge_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=15)),
    ('ridge', Ridge(alpha=1.0))
])
ridge_pipeline.fit(X_train, y_train)

train_score = ridge_pipeline.score(X_train, y_train)
test_score = ridge_pipeline.score(X_test, y_test)
print(f"Ridge: Train R²={train_score:.3f}, Test R²={test_score:.3f}")
```
Choosing the Regularization Strength
```python
from sklearn.model_selection import cross_val_score

alphas = [0.001, 0.01, 0.1, 1, 10, 100]
for alpha in alphas:
    ridge = Pipeline([
        ('poly', PolynomialFeatures(degree=15)),
        ('ridge', Ridge(alpha=alpha))
    ])
    scores = cross_val_score(ridge, X, y, cv=5, scoring='r2')
    print(f"alpha={alpha:6.3f}: CV R² = {scores.mean():.3f} ± {scores.std():.3f}")
```
Lasso Regression (L1 Regularization)

Principle

Penalize with the L1 norm:
\[L(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2 + \lambda\|\mathbf{w}\|_1\]

\[= \frac{1}{2N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda\sum_{j=1}^{p}|w_j|\]

Sparsity
Lasso's key property: it shrinks some coefficients to exactly zero, performing automatic feature selection.

Implementation
```python
from sklearn.linear_model import Lasso

# High-dimensional example
np.random.seed(42)
n_samples, n_features = 100, 50
n_informative = 5  # only 5 features are informative

# Generate data
X_hd = np.random.randn(n_samples, n_features)
true_coef = np.zeros(n_features)
true_coef[:n_informative] = np.random.randn(n_informative) * 5
y_hd = X_hd @ true_coef + np.random.randn(n_samples) * 0.5

# Fit Lasso
lasso = Lasso(alpha=0.5, max_iter=10000)
lasso.fit(X_hd, y_hd)

# Count non-zero coefficients
nonzero = np.sum(lasso.coef_ != 0)
print(f"Non-zero coefficients: {nonzero}")
print(f"True non-zero count: {n_informative}")

# Visualize the coefficients
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.stem(true_coef, linefmt='b-', markerfmt='bo', basefmt='r-')
plt.title('True coefficients')
plt.subplot(1, 2, 2)
plt.stem(lasso.coef_, linefmt='g-', markerfmt='go', basefmt='r-')
plt.title('Lasso estimates')
plt.tight_layout()
plt.show()
```
Coordinate Descent

Lasso has no closed-form solution; it is typically solved by coordinate descent:
```python
def lasso_coordinate_descent(X, y, alpha, max_iter=1000, tol=1e-4):
    """Solve Lasso by coordinate descent."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for iteration in range(max_iter):
        w_old = w.copy()
        for j in range(n_features):
            # Optimize the j-th coordinate with the others held fixed
            r_j = y - X @ w + X[:, j] * w[j]
            rho_j = X[:, j] @ r_j
            # Soft-thresholding
            if rho_j < -alpha * n_samples:
                w[j] = (rho_j + alpha * n_samples) / (X[:, j] @ X[:, j])
            elif rho_j > alpha * n_samples:
                w[j] = (rho_j - alpha * n_samples) / (X[:, j] @ X[:, j])
            else:
                w[j] = 0
        # Convergence check
        if np.linalg.norm(w - w_old) < tol:
            break
    return w
```
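As a self-contained sanity check, the sketch below restates the solver compactly (the helper names `lasso_cd` and `soft_threshold` are illustrative, not from the original post) and compares it against scikit-learn's `Lasso` with `fit_intercept=False`, so both minimize the same objective $\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^2 + \alpha\|\mathbf{w}\|_1$:

```python
import numpy as np
from sklearn.linear_model import Lasso

def soft_threshold(rho, lam):
    """Scalar soft-thresholding operator."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, alpha, n_iter=500):
    """Coordinate descent for (1/2n)||y - Xw||^2 + alpha*||w||_1, no intercept."""
    n, p = X.shape
    col_sq = (X ** 2).sum(axis=0)  # precomputed ||x_j||^2
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ w + X[:, j] * w[j]  # partial residual excluding x_j
            rho_j = X[:, j] @ r_j
            w[j] = soft_threshold(rho_j, alpha * n) / col_sq[j]
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
w_true = np.zeros(10)
w_true[:3] = [3.0, -2.0, 1.5]
y = X @ w_true + rng.standard_normal(200) * 0.1

w_cd = lasso_cd(X, y, alpha=0.1)
w_sk = Lasso(alpha=0.1, fit_intercept=False, max_iter=10000).fit(X, y).coef_
print(np.max(np.abs(w_cd - w_sk)))  # the two solvers agree closely
```

Close agreement confirms that the soft-threshold update optimizes the same objective as scikit-learn's own coordinate-descent solver.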
ElasticNet

Principle

Combine the L1 and L2 penalties:
\[L(\mathbf{w}) = \frac{1}{2N}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 + \lambda_1\|\mathbf{w}\|_1 + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2\]

or, with a mixing parameter $\rho$ (scikit-learn's `l1_ratio`):
\[L = \mathrm{MSE} + \alpha\left(\rho\|\mathbf{w}\|_1 + \frac{1-\rho}{2}\|\mathbf{w}\|_2^2\right)\]

Advantages
- When features are highly correlated, Lasso tends to pick one of them arbitrarily
- ElasticNet tends to select groups of correlated features together
```python
from sklearn.linear_model import ElasticNet

# Create correlated features
X_corr = X_hd.copy()
X_corr[:, n_informative:2*n_informative] = X_corr[:, :n_informative] + np.random.randn(n_samples, n_informative) * 0.1

# Compare the three regularizers
models = {
    'Lasso': Lasso(alpha=0.1),
    'Ridge': Ridge(alpha=0.1),
    'ElasticNet': ElasticNet(alpha=0.1, l1_ratio=0.5)
}
for name, model in models.items():
    model.fit(X_corr, y_hd)
    nonzero = np.sum(np.abs(model.coef_) > 1e-6)
    print(f"{name}: non-zero coefficients = {nonzero}")
```
Comparing Regularizers

Geometric Interpretation
| Regularizer | Constraint region | Solution behavior |
|---|---|---|
| L2 (Ridge) | Circle | Coefficients shrink but stay non-zero |
| L1 (Lasso) | Diamond | Some coefficients become exactly zero |
| ElasticNet | In between | Combines both behaviors |
Selection Guide

| Scenario | Recommended method |
|---|---|
| All features may be useful | Ridge |
| Feature selection needed | Lasso |
| Highly correlated features | ElasticNet |
| More features than samples | Lasso / ElasticNet |
```python
# Visualize the regularization path
from sklearn.linear_model import lasso_path

alphas_lasso, coefs_lasso, _ = lasso_path(X_hd, y_hd, alphas=np.logspace(-3, 1, 50))

plt.figure(figsize=(10, 6))
for coef in coefs_lasso:
    plt.plot(np.log10(alphas_lasso), coef)
plt.xlabel('log(alpha)')
plt.ylabel('Coefficient')
plt.title('Lasso regularization path')
plt.axhline(0, color='k', linestyle='--', alpha=0.3)
plt.show()
```
Choosing Hyperparameters with Cross-Validation
```python
from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV

# RidgeCV
ridge_cv = RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5)
ridge_cv.fit(X_hd, y_hd)
print(f"Ridge best alpha: {ridge_cv.alpha_:.4f}")

# LassoCV
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000)
lasso_cv.fit(X_hd, y_hd)
print(f"Lasso best alpha: {lasso_cv.alpha_:.4f}")

# ElasticNetCV
elastic_cv = ElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.7, 0.9, 0.95, 1],
    alphas=np.logspace(-4, 1, 50),
    cv=5,
    max_iter=10000
)
elastic_cv.fit(X_hd, y_hd)
print(f"ElasticNet best: alpha={elastic_cv.alpha_:.4f}, l1_ratio={elastic_cv.l1_ratio_}")
```
Hands-On: California Housing Price Prediction
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Load data
housing = fetch_california_housing()
X, y = housing.data, housing.target
feature_names = housing.feature_names

# Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the different models
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.01),
    'ElasticNet': ElasticNet(alpha=0.01, l1_ratio=0.5)
}
results = []
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results.append({
        'Model': name,
        'MSE': mse,
        'R²': r2,
        'Non-zero Coefs': np.sum(np.abs(model.coef_) > 1e-6)
    })

# Show the results
import pandas as pd
pd.DataFrame(results)
```
Common Questions

Q1: How do I choose the regularization parameter?

Use cross-validation (CV) to select it automatically:

- Use the built-in `RidgeCV`, `LassoCV`, `ElasticNetCV` estimators
- Search on a log scale, typically `np.logspace(-4, 4, 50)`
Q2: Do I need to standardize?

Yes, always standardize, because:

- The L1/L2 penalties are sensitive to coefficient magnitudes
- Features on different scales would be penalized unevenly
- The penalty strength is only meaningful after standardization
Q3: Which is better, Ridge or Lasso?

There is no universal answer:

- Ridge: when you believe every feature contributes
- Lasso: when you want a sparse model / feature selection
- When unsure, use ElasticNet
Q4: Why does L1 produce sparse solutions?

Geometric explanation: the L1 constraint region is a diamond, so the loss contours are more likely to first touch it at a corner, where some coordinates are exactly zero.
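The same point can be seen algebraically. For a single coefficient with a unit-variance feature, both penalized problems have closed forms: the L2 solution merely rescales the estimate, while the L1 solution is the soft-threshold, which maps a whole interval around zero to exactly zero. The function names below are illustrative:

```python
import numpy as np

def soft_threshold(z, lam):
    """Minimizer of 0.5*(w - z)^2 + lam*|w|  (the scalar Lasso problem)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Minimizer of 0.5*(w - z)^2 + 0.5*lam*w^2  (the scalar Ridge problem)."""
    return z / (1.0 + lam)

z = np.array([-2.0, -0.3, 0.1, 0.5, 3.0])
print(soft_threshold(z, 1.0))  # entries with |z| <= 1 snap exactly to 0
print(ridge_shrink(z, 1.0))    # everything shrinks, but nothing reaches 0
```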
Summary

| Method | Penalty | Behavior | Use case |
|---|---|---|---|
| Ridge | $\lambda\Vert\mathbf{w}\Vert_2^2$ | Coefficient shrinkage | Multicollinearity |
| Lasso | $\lambda\Vert\mathbf{w}\Vert_1$ | Sparse solutions | Feature selection |
| ElasticNet | $\lambda_1\Vert\mathbf{w}\Vert_1 + \lambda_2\Vert\mathbf{w}\Vert_2^2$ | Combines both | Correlated features |
References

- An Introduction to Statistical Learning (ISLR), Chapter 6
- The Elements of Statistical Learning, Chapter 3
- scikit-learn documentation: Regularization
Copyright notice: Unless otherwise stated, this article is copyrighted by sshipanoo; please credit the link to this article when reposting.
(Licensed under CC BY-NC-SA 4.0)
Title: "Machine Learning Basics Series: Polynomial Regression and Regularization"