机器学习基础系列——线性回归

前言

线性回归是机器学习的”Hello World”。虽然简单，但它包含了机器学习的核心思想：建模、损失函数、优化。深入理解线性回归，是学习更复杂算法的基础。

问题定义

什么是线性回归

给定输入特征 $\mathbf{x} \in \mathbb{R}^n$，预测连续输出 $y \in \mathbb{R}$：

\[\hat{y} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b = \mathbf{w}^T \mathbf{x} + b\]

向量形式：

\[\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b\]

应用场景

场景	特征	目标
房价预测	面积、房间数、位置	价格
销量预测	广告投入、季节	销售额
温度预测	历史温度、湿度	明日温度

数学推导

损失函数

使用均方误差（MSE）：

\[L(\mathbf{w}, b) = \frac{1}{2N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{2N} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2\]

正规方程（解析解）

对 $\mathbf{w}$ 求导并令其为零：

\[\nabla_\mathbf{w} L = -\frac{1}{N} \mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w}) = 0\]

解得：

\[\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\]

梯度下降（迭代解）

\[\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla_\mathbf{w} L = \mathbf{w}_t + \frac{\eta}{N} \mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w}_t)\]

从零实现

生成数据

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

# 真实参数
true_w = np.array([2.0, -3.5])
true_b = 4.2

# 生成数据
n_samples = 100
X = np.random.randn(n_samples, 2)
y = X @ true_w + true_b + np.random.randn(n_samples) * 0.5

print(f"X shape: {X.shape}")  # (100, 2)
print(f"y shape: {y.shape}")  # (100,)

正规方程实现

def linear_regression_normal(X, y):
    """使用正规方程求解"""
    # 添加偏置项
    X_b = np.c_[np.ones((len(X), 1)), X]
    
    # 正规方程
    theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
    
    return theta[0], theta[1:]  # b, w

b, w = linear_regression_normal(X, y)
print(f"估计的 w: {w}")
print(f"估计的 b: {b:.4f}")
print(f"真实的 w: {true_w}")
print(f"真实的 b: {true_b}")

梯度下降实现

class LinearRegressionGD:
    def __init__(self, learning_rate=0.01, n_iterations=1000):
        self.lr = learning_rate
        self.n_iterations = n_iterations
        self.w = None
        self.b = None
        self.losses = []
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        
        # 初始化参数
        self.w = np.zeros(n_features)
        self.b = 0
        
        # 梯度下降
        for _ in range(self.n_iterations):
            # 预测
            y_pred = X @ self.w + self.b
            
            # 计算损失
            loss = np.mean((y - y_pred) ** 2) / 2
            self.losses.append(loss)
            
            # 计算梯度
            dw = -(1 / n_samples) * X.T @ (y - y_pred)
            db = -(1 / n_samples) * np.sum(y - y_pred)
            
            # 更新参数
            self.w -= self.lr * dw
            self.b -= self.lr * db
        
        return self
    
    def predict(self, X):
        return X @ self.w + self.b

# 训练
model = LinearRegressionGD(learning_rate=0.1, n_iterations=1000)
model.fit(X, y)

print(f"GD 估计的 w: {model.w}")
print(f"GD 估计的 b: {model.b:.4f}")

# 绘制损失曲线
plt.figure(figsize=(10, 4))
plt.plot(model.losses)
plt.xlabel('迭代次数')
plt.ylabel('损失')
plt.title('训练损失曲线')
plt.show()

特征缩放

为什么需要特征缩放

不同特征的量级差异大时，梯度下降会很慢：

某些方向梯度很大，某些方向梯度很小
等高线变成细长的椭圆

标准化（Z-score）

\[x' = \frac{x - \mu}{\sigma}\]

class StandardScaler:
    def fit(self, X):
        self.mean = np.mean(X, axis=0)
        self.std = np.std(X, axis=0)
        return self
    
    def transform(self, X):
        return (X - self.mean) / self.std
    
    def fit_transform(self, X):
        return self.fit(X).transform(X)

# 使用
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 标准化后训练
model_scaled = LinearRegressionGD(learning_rate=0.1, n_iterations=100)
model_scaled.fit(X_scaled, y)

归一化（Min-Max）

\[x' = \frac{x - x_{min}}{x_{max} - x_{min}}\]

class MinMaxScaler:
    def fit(self, X):
        self.min = np.min(X, axis=0)
        self.max = np.max(X, axis=0)
        return self
    
    def transform(self, X):
        return (X - self.min) / (self.max - self.min)

多项式回归

处理非线性关系

通过特征变换处理非线性：

\[\hat{y} = w_0 + w_1 x + w_2 x^2 + \cdots + w_d x^d\]

class PolynomialFeatures:
    def __init__(self, degree=2):
        self.degree = degree
    
    def transform(self, X):
        n_samples = len(X)
        features = [np.ones(n_samples)]
        
        for d in range(1, self.degree + 1):
            features.append(X ** d)
        
        return np.column_stack(features)

# 生成非线性数据
X_nonlinear = np.linspace(-3, 3, 100)
y_nonlinear = 0.5 * X_nonlinear**2 + X_nonlinear + 2 + np.random.randn(100) * 0.5

# 多项式特征
poly = PolynomialFeatures(degree=2)
X_poly = poly.transform(X_nonlinear)

# 正规方程求解
theta = np.linalg.lstsq(X_poly, y_nonlinear, rcond=None)[0]
print(f"多项式系数: {theta}")

# 预测
y_pred = X_poly @ theta

plt.scatter(X_nonlinear, y_nonlinear, alpha=0.5, label='数据')
plt.plot(X_nonlinear, y_pred, 'r-', label='拟合曲线')
plt.legend()
plt.show()

使用 Scikit-learn

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# 划分数据
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 方法1：基本线性回归
model = LinearRegression()
model.fit(X_train, y_train)

print(f"权重: {model.coef_}")
print(f"偏置: {model.intercept_:.4f}")

# 评估
y_pred = model.predict(X_test)
print(f"MSE: {mean_squared_error(y_test, y_pred):.4f}")
print(f"R²: {r2_score(y_test, y_pred):.4f}")

# 方法2：Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', LinearRegression())
])

pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f"Pipeline R²: {r2_score(y_test, y_pred):.4f}")

模型评估

评估指标

指标	公式	含义
MSE	$\frac{1}{N}\sum(y-\hat{y})^2$	平均平方误差
RMSE	$\sqrt{MSE}$	与y同量纲
MAE	$\frac{1}{N}\sum	y-\hat{y}	$	平均绝对误差
R²	$1 - \frac{SS_{res}}{SS_{tot}}$	解释方差比例

def evaluate_regression(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(y_true - y_pred))
    
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - ss_res / ss_tot
    
    return {
        'MSE': mse,
        'RMSE': rmse,
        'MAE': mae,
        'R²': r2
    }

metrics = evaluate_regression(y_test, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")

残差分析

residuals = y_test - y_pred

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# 残差分布
axes[0].hist(residuals, bins=20, edgecolor='black')
axes[0].axvline(0, color='r', linestyle='--')
axes[0].set_xlabel('残差')
axes[0].set_ylabel('频数')
axes[0].set_title('残差分布')

# 残差 vs 预测值
axes[1].scatter(y_pred, residuals, alpha=0.5)
axes[1].axhline(0, color='r', linestyle='--')
axes[1].set_xlabel('预测值')
axes[1].set_ylabel('残差')
axes[1].set_title('残差 vs 预测值')

plt.tight_layout()
plt.show()

常见问题

Q1: 什么时候用正规方程，什么时候用梯度下降？

方法	适用条件	优点	缺点
正规方程	特征少（< 10000）	一步到位	计算复杂度 $O(n^3)$
梯度下降	特征多	可扩展	需要调参

Q2: R² 可以是负数吗？

可以。当模型预测还不如取均值时，R² 为负。这说明模型很差。

Q3: 如何处理多重共线性？

多重共线性会导致：

系数估计不稳定
$\mathbf{X}^T\mathbf{X}$ 接近奇异

解决方案：

删除相关特征
使用正则化（Ridge、Lasso）
PCA降维

Q4: 线性回归有什么假设？

线性性：$y$ 与 $\mathbf{x}$ 线性相关
独立性：样本独立同分布
同方差性：误差方差恒定
正态性：误差服从正态分布

总结

方面	内容
模型	$\hat{y} = \mathbf{w}^T\mathbf{x} + b$
损失	MSE
求解	正规方程 / 梯度下降
评估	MSE, RMSE, R²
注意	特征缩放、共线性

线性回归虽然简单，但是很多复杂模型的基础。下一篇我们将介绍如何用正则化来改进线性回归。

参考资料

《统计学习方法》李航第3章
《Hands-On Machine Learning》Aurélien Géron 第4章
scikit-learn 文档：Linear Models

（采用 CC BY-NC-SA 4.0 许可协议进行授权）

本文标题：《机器学习基础系列——线性回归》

本文链接：http://localhost:3015/ai/%E7%BA%BF%E6%80%A7%E5%9B%9E%E5%BD%92.html

本文最后一次更新为天前，文章中的某些内容可能已过时！