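Machine Learning Fundamentals Series: Logistic Regression

Preface

Logistic regression is one of the most classic classification algorithms. Despite the word "regression" in its name, it is used for classification: it outputs the probability that a sample belongs to a given class.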

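From Linear Regression to Classification

Motivation

Linear regression produces outputs in $(-\infty, +\infty)$, but a class probability must lie in $[0, 1]$.

Solution: use the sigmoid function to map the linear output to a probability: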

\[\sigma(z) = \frac{1}{1 + e^{-z}}\]

Properties of the Sigmoid Function

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)

plt.figure(figsize=(10, 4))

# Sigmoid function
plt.subplot(1, 2, 1)
plt.plot(z, sigmoid(z), 'b-', linewidth=2)
plt.axhline(0.5, color='r', linestyle='--', alpha=0.5)
plt.axvline(0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('z')
plt.ylabel('σ(z)')
plt.title('Sigmoid function')
plt.grid(True, alpha=0.3)

# Derivative
plt.subplot(1, 2, 2)
sig = sigmoid(z)
derivative = sig * (1 - sig)
plt.plot(z, derivative, 'g-', linewidth=2)
plt.xlabel('z')
plt.ylabel("σ'(z)")
plt.title('Sigmoid derivative: σ(1-σ)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Key properties:

Output range: $(0, 1)$
$\sigma(0) = 0.5$
Derivative: $\sigma'(z) = \sigma(z)(1-\sigma(z))$
Symmetry: $\sigma(-z) = 1 - \sigma(z)$
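
The properties listed above can be checked numerically. The following is a minimal sketch, assuming only NumPy; the grid `z` and the step size `h` are arbitrary choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, as defined in the text."""
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
h = 1e-5  # step for the central finite difference

# Derivative: compare the closed form with a numerical estimate
numeric_grad = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
analytic_grad = sigmoid(z) * (1 - sigmoid(z))

print(sigmoid(0.0))                               # 0.5
print(np.allclose(numeric_grad, analytic_grad))   # derivative identity holds
print(np.allclose(sigmoid(-z), 1 - sigmoid(z)))   # symmetry holds
```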

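The Logistic Regression Model

Model Definition

Given an input $\mathbf{x}$, the model predicts the probability that it belongs to the positive class: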

\[P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}\]

The probability of the negative class is:

\[P(y=0|\mathbf{x}) = 1 - P(y=1|\mathbf{x})\]

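Decision Boundary

The usual threshold is 0.5:

\[\hat{y} = \begin{cases} 1 & \text{if } P(y=1|\mathbf{x}) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}\]

Since $\sigma(z) \geq 0.5$ exactly when $z \geq 0$, the decision boundary is $\mathbf{w}^T\mathbf{x} + b = 0$, which is a hyperplane.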
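
To make the link between the 0.5 threshold and the hyperplane concrete, here is a small sketch. The parameters `w` and `b` and the points in `X_demo` are hypothetical values chosen only for illustration; they do not come from a fitted model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen only for illustration
w = np.array([2.0, -1.0])
b = 0.5

X_demo = np.array([[1.0, 1.0],    # z = 1.5
                   [0.0, 0.5],    # z = 0.0 (exactly on the boundary)
                   [-1.0, 2.0]])  # z = -3.5

z = X_demo @ w + b
pred_from_prob = (sigmoid(z) >= 0.5).astype(int)  # threshold the probability
pred_from_sign = (z >= 0).astype(int)             # threshold the linear score

print(pred_from_prob, pred_from_sign)  # the two rules agree
```

Thresholding the probability at 0.5 and thresholding the linear score at 0 give the same labels, which is why the boundary is linear in $\mathbf{x}$.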

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Generate binary classification data
np.random.seed(42)
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)

# Train the model
lr = LogisticRegression()
lr.fit(X, y)

# Plot the decision boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min()-1, X[:, 0].max()+1, 100),
                     np.linspace(X[:, 1].min()-1, X[:, 1].max()+1, 100))
Z = lr.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.contourf(xx, yy, Z, levels=np.linspace(0, 1, 11), cmap='RdYlBu_r', alpha=0.8)
plt.colorbar(label='P(y=1)')
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', edgecolors='k', label='Class 0')
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', edgecolors='k', label='Class 1')
plt.contour(xx, yy, Z, levels=[0.5], colors='k', linewidths=2)
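plt.title('Decision boundary and probability contours')
plt.legend()

# 3D probability surface (the 3D axes is created directly; creating a 2D
# subplot in the same slot first is unnecessary)
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib
ax = plt.subplot(1, 2, 2, projection='3d')
ax.plot_surface(xx, yy, Z, cmap='RdYlBu_r', alpha=0.8)
ax.set_xlabel('X1')
ax.set_ylabel('X2')
ax.set_zlabel('P(y=1)')
ax.set_title('Probability surface')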

plt.tight_layout()
plt.show()

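Loss Function

Why Not MSE?

With MSE, $L = (y - \sigma(z))^2$, the loss is non-convex in the parameters, so gradient descent can get stuck in local minima or stall in flat regions where the gradient nearly vanishes.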
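
The non-convexity can be seen on a single sample. The sketch below, assuming a positive sample ($y=1$) and an arbitrary grid over $z$, compares the curvature of the squared error with that of the negative log-likelihood $-\log\sigma(z)$. Because $z$ is linear in the parameters, negative curvature in $z$ carries over to the parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 241)  # grid spacing 0.05
p = sigmoid(z)

# Losses for a single positive sample (y = 1), viewed as functions of z
mse = (1 - p) ** 2
cross_entropy = -np.log(p)

# Discrete second differences: their sign matches the sign of the curvature
mse_curvature = np.diff(mse, n=2)
ce_curvature = np.diff(cross_entropy, n=2)

print("MSE has negative curvature somewhere:", bool((mse_curvature < 0).any()))
print("Log loss curvature is positive everywhere:", bool((ce_curvature > 0).all()))
```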

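Cross-Entropy Loss

For a single sample, where $\hat{p} = P(y=1|\mathbf{x})$ is the predicted probability:

\[L(y, \hat{p}) = -[y\log(\hat{p}) + (1-y)\log(1-\hat{p})]\]

For the whole dataset:

\[J(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}[y_i\log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)]\]

Intuition:

When $y=1$, $L = -\log(\hat{p})$, and the loss approaches 0 for a correct prediction ($\hat{p} \to 1$)
When $y=0$, $L = -\log(1-\hat{p})$, and the loss approaches 0 for a correct prediction ($\hat{p} \to 0$)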
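
A few concrete numbers may help with the intuition above. This is a minimal sketch; the helper name `binary_cross_entropy` and the example probabilities are illustrative choices, not part of any library.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-15):
    """Per-sample cross-entropy; p is clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 1, 0, 0])
p = np.array([0.9, 0.1, 0.1, 0.9])  # confident and right, confident and wrong, ...

losses = binary_cross_entropy(y, p)
print(np.round(losses, 4))  # about 0.1054 for the correct cases, 2.3026 for the wrong ones
print(losses.mean())        # dataset-level loss J
```

A confident wrong prediction costs roughly 20 times more than a confident correct one, and the penalty grows without bound as the predicted probability of the true class approaches 0.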

# Visualize the cross-entropy loss
p = np.linspace(0.001, 0.999, 100)

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
plt.plot(p, -np.log(p), 'b-', linewidth=2)
plt.xlabel('Predicted probability p')
plt.ylabel('Loss')
plt.title('Loss when y=1: -log(p)')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.plot(p, -np.log(1-p), 'r-', linewidth=2)
plt.xlabel('Predicted probability p')
plt.ylabel('Loss')
plt.title('Loss when y=0: -log(1-p)')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

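Gradient Derivation

Gradient of the Loss with Respect to the Parameters

By the chain rule and $\sigma'(z) = \sigma(z)(1-\sigma(z))$, the per-sample derivative of the loss with respect to $z_i = \mathbf{w}^T\mathbf{x}_i + b$ simplifies to $\hat{p}_i - y_i$, which gives:

\[\frac{\partial J}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)x_{ij}\]

\[\frac{\partial J}{\partial b} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)\]

In vector form:

\[\nabla_{\mathbf{w}}J = \frac{1}{N}\mathbf{X}^T(\hat{\mathbf{p}} - \mathbf{y})\]

This has exactly the same form as the gradient of linear regression under squared-error loss, with $\hat{\mathbf{p}}$ in place of the linear prediction. This is no coincidence: it is a shared property of generalized linear models.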
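
One way to gain confidence in these formulas is a finite-difference gradient check. The sketch below uses small random data generated only for illustration; the names `X_demo`, `y_demo`, and the step `h` are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, X, y):
    p = sigmoid(X @ w + b)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w, b, X, y):
    error = sigmoid(X @ w + b) - y           # p_hat - y
    return X.T @ error / len(y), error.mean()

# Small synthetic problem (random data, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (rng.random(50) < 0.5).astype(float)
w, b = rng.normal(size=3), 0.2

dw, db = analytic_grad(w, b, X_demo, y_demo)

# Central finite differences, one coordinate at a time
h = 1e-6
dw_num = np.array([
    (loss(w + h * e, b, X_demo, y_demo) - loss(w - h * e, b, X_demo, y_demo)) / (2 * h)
    for e in np.eye(3)
])
db_num = (loss(w, b + h, X_demo, y_demo) - loss(w, b - h, X_demo, y_demo)) / (2 * h)

print(np.allclose(dw, dw_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```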

Implementation from Scratch

class LogisticRegressionScratch:
    def __init__(self, learning_rate=0.1, n_iterations=1000, regularization=None, alpha=0.01):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.reg = regularization  # 'l1', 'l2', or None
        self.alpha = alpha
        self.w = None
        self.b = None
        self.history = []
    
    def _sigmoid(self, z):
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))
    
    def _loss(self, y, p):
        """Compute the cross-entropy loss (plus the penalty term, if any)."""
        eps = 1e-15
        p = np.clip(p, eps, 1 - eps)
        loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
        
        if self.reg == 'l2':
            loss += 0.5 * self.alpha * np.sum(self.w ** 2)
        elif self.reg == 'l1':
            loss += self.alpha * np.sum(np.abs(self.w))
        
        return loss
    
    def fit(self, X, y):
        n_samples, n_features = X.shape
        
        # Initialize parameters
        self.w = np.zeros(n_features)
        self.b = 0
        self.history = []
        
        for i in range(self.n_iter):
            # Forward pass
            z = X @ self.w + self.b
            p = self._sigmoid(z)
            
            # Record the loss
            loss = self._loss(y, p)
            self.history.append(loss)
            
            # Compute gradients
            error = p - y
            dw = (X.T @ error) / n_samples
            db = np.mean(error)
            
            # Gradient of the regularization term (the bias is not penalized)
            if self.reg == 'l2':
                dw += self.alpha * self.w
            elif self.reg == 'l1':
                dw += self.alpha * np.sign(self.w)
            
            # Update parameters
            self.w -= self.lr * dw
            self.b -= self.lr * db
        
        return self
    
    def predict_proba(self, X):
        z = X @ self.w + self.b
        return self._sigmoid(z)
    
    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
    
    def score(self, X, y):
        return np.mean(self.predict(X) == y)

# Test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr_scratch = LogisticRegressionScratch(learning_rate=0.5, n_iterations=500)
lr_scratch.fit(X_train, y_train)

print(f"Training accuracy: {lr_scratch.score(X_train, y_train):.4f}")
print(f"Test accuracy: {lr_scratch.score(X_test, y_test):.4f}")

# Loss curve
plt.figure(figsize=(8, 4))
plt.plot(lr_scratch.history)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Training loss curve')
plt.grid(True, alpha=0.3)
plt.show()

Multiclass Extensions

Softmax Regression

For a $K$-class problem:

\[P(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^T\mathbf{x}}}{\sum_{j=1}^{K}e^{\mathbf{w}_j^T\mathbf{x}}}\]
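
The formula can be sketched directly in NumPy. The version below subtracts the row maximum before exponentiating, a common trick for numerical stability that leaves the result unchanged. The score values are hypothetical, and the last lines illustrate that with $K=2$ and scores $(z, 0)$ the softmax reduces to the sigmoid.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical class scores w_k^T x for 2 samples and K = 3 classes
scores = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1001.0, 1002.0]])  # large values would overflow a naive exp
probs = softmax(scores)
print(probs.sum(axis=1))  # each row sums to 1

# With K = 2, softmax over the scores (z, 0) reproduces sigmoid(z)
z = np.array([-2.0, 0.0, 3.0])
two_class = softmax(np.column_stack([z, np.zeros_like(z)]))
print(np.allclose(two_class[:, 0], sigmoid(z)))
```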

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

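# Generate three-class data
X_multi, y_multi = make_classification(n_samples=300, n_features=2, n_informative=2,
                                        n_redundant=0, n_classes=3, n_clusters_per_class=1,
                                        random_state=42)

# Train a multiclass model. With the lbfgs solver, scikit-learn fits a
# multinomial (softmax) model by default when there are more than two classes;
# the multi_class argument is deprecated since scikit-learn 1.5.
lr_multi = LogisticRegression(solver='lbfgs')
lr_multi.fit(X_multi, y_multi)

# Plot the decision regions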
xx, yy = np.meshgrid(np.linspace(X_multi[:, 0].min()-1, X_multi[:, 0].max()+1, 100),
                     np.linspace(X_multi[:, 1].min()-1, X_multi[:, 1].max()+1, 100))
Z = lr_multi.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap='viridis')
for i, color in enumerate(['blue', 'orange', 'green']):
    plt.scatter(X_multi[y_multi==i, 0], X_multi[y_multi==i, 1], 
                c=color, edgecolors='k', label=f'Class {i}')
plt.title('Softmax multiclass decision regions')
plt.legend()
plt.show()

One-vs-Rest (OvR)

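# OvR strategy: one binary classifier per class.
# OneVsRestClassifier is used here because the multi_class='ovr' argument
# is deprecated since scikit-learn 1.5.
from sklearn.multiclass import OneVsRestClassifier

lr_ovr = OneVsRestClassifier(LogisticRegression())
lr_ovr.fit(X_multi, y_multi)
print(f"OvR accuracy: {lr_ovr.score(X_multi, y_multi):.4f}")

# Inspect the coefficients of each binary classifier
coef_ovr = np.vstack([est.coef_ for est in lr_ovr.estimators_])
print(f"Coefficient shape: {coef_ovr.shape}")  # (3, 2): 3 classifiers, 2 features each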

Regularization

L2 Regularization (Default)

from sklearn.model_selection import cross_val_score

C_values = [0.001, 0.01, 0.1, 1, 10, 100]

for C in C_values:
    lr = LogisticRegression(C=C, penalty='l2')  # C = 1/lambda
    scores = cross_val_score(lr, X, y, cv=5)
    print(f"C={C:6.3f}: Accuracy = {scores.mean():.4f} ± {scores.std():.4f}")
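
The shrinkage effect of the L2 penalty can be illustrated without scikit-learn. The sketch below is an assumption-laden toy: it runs plain gradient descent on the mean cross-entropy plus $(\lambda/2)\|\mathbf{w}\|^2$, on synthetic data. Here `lam` plays the role of $\lambda$, which corresponds to $1/C$ only up to scikit-learn's scaling convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_l2(X, y, lam, lr=0.5, n_iter=2000):
    """Gradient descent on mean cross-entropy + (lam / 2) * ||w||^2."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        error = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ error / len(y) + lam * w)  # the bias is not penalized
        b -= lr * error.mean()
    return w, b

# Synthetic data (illustration only): the label depends on the first feature
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

norms = {}
for lam in [0.0, 0.1, 1.0]:
    w, _ = fit_l2(X_demo, y_demo, lam)
    norms[lam] = np.linalg.norm(w)
    print(f"lambda={lam:4.1f}: ||w|| = {norms[lam]:.3f}")
```

A larger `lam` (a smaller `C`) pulls the weight vector toward zero, which is the shrinkage that the cross-validation loop above trades off against fit.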

L1 Regularization (Feature Selection)

# High-dimensional data
X_hd, y_hd = make_classification(n_samples=200, n_features=100, n_informative=5,
                                  n_redundant=0, random_state=42)

lr_l1 = LogisticRegression(penalty='l1', C=0.1, solver='saga', max_iter=1000)
lr_l1.fit(X_hd, y_hd)

print(f"Number of non-zero coefficients: {np.sum(lr_l1.coef_ != 0)}")
print(f"Accuracy: {lr_l1.score(X_hd, y_hd):.4f}")

Evaluation Metrics

Confusion Matrix and Metrics

from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, roc_auc_score, precision_recall_curve)

# Train the model
lr = LogisticRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
y_proba = lr.predict_proba(X_test)[:, 1]

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(cm)

# Detailed report
print("\nClassification report:")
print(classification_report(y_test, y_pred))

# ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(12, 4))

plt.subplot(1, 3, 1)
import seaborn as sns
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion matrix')

plt.subplot(1, 3, 2)
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'AUC = {auc:.3f}')
plt.plot([0, 1], [0, 1], 'k--', alpha=0.5)
plt.xlabel('False positive rate (FPR)')
plt.ylabel('True positive rate (TPR)')
plt.title('ROC curve')
plt.legend()

plt.subplot(1, 3, 3)
precision, recall, _ = precision_recall_curve(y_test, y_proba)
plt.plot(recall, precision, 'g-', linewidth=2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve')

plt.tight_layout()
plt.show()

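Threshold Tuning

# Effect of different thresholds
# (named candidate_thresholds to avoid overwriting the roc_curve output)
candidate_thresholds = [0.3, 0.5, 0.7]

for thresh in candidate_thresholds:
    y_pred_thresh = (y_proba >= thresh).astype(int)
    # labels=[0, 1] keeps the matrix 2x2 even if one class is never predicted
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred_thresh, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    print(f"Threshold={thresh}: Precision={precision:.3f}, Recall={recall:.3f}")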
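
The precision and recall trade-off can also be traced by hand on a tiny example. The labels and scores below are made up for illustration, and `precision_recall_at` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting 1 for scores >= threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])
scores = np.array([0.05, 0.2, 0.35, 0.45, 0.4, 0.6, 0.55, 0.65, 0.8, 0.95])

for t in [0.3, 0.5, 0.7]:
    prec, rec = precision_recall_at(y_true, scores, t)
    print(f"threshold={t}: precision={prec:.3f}, recall={rec:.3f}")
# threshold=0.3: precision=0.625, recall=1.000
# threshold=0.5: precision=0.800, recall=0.800
# threshold=0.7: precision=1.000, recall=0.400
```

Raising the threshold makes positive predictions rarer and more reliable (higher precision) at the cost of missing more true positives (lower recall).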

Worked Example

from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

# Load the breast cancer dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target

# Preprocessing: split first, then fit the scaler on the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model
lr = LogisticRegression(C=1.0, max_iter=1000)
lr.fit(X_train_scaled, y_train)

print(f"Training accuracy: {lr.score(X_train_scaled, y_train):.4f}")
print(f"Test accuracy: {lr.score(X_test_scaled, y_test):.4f}")

# Feature importance (coefficient magnitudes are comparable because the features were standardized)
importance = np.abs(lr.coef_[0])
indices = np.argsort(importance)[-10:]

plt.figure(figsize=(10, 6))
plt.barh(range(10), importance[indices])
plt.yticks(range(10), [cancer.feature_names[i] for i in indices])
plt.xlabel('|Coefficient|')
plt.title('Top 10 most important features')
plt.tight_layout()
plt.show()

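Frequently Asked Questions

Q1: Can logistic regression handle nonlinear problems?

Logistic regression itself is a linear classifier. Ways to handle nonlinearity:

Add polynomial features
Use kernel methods
Switch to a nonlinear model (such as a kernel SVM or a neural network)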
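
The first option can be sketched on synthetic data where the classes are separated by a circle. This is a minimal illustration with plain NumPy gradient descent; the data, the helper names, and the hyperparameters are all assumptions of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def fit_logreg(X, y, lr=0.1, n_iter=5000):
    """Plain gradient descent on the mean cross-entropy loss."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        error = sigmoid(X @ w + b) - y
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b

def accuracy(X, y, w, b):
    return np.mean((sigmoid(X @ w + b) >= 0.5) == y)

# Synthetic data: class 1 lies inside the unit circle, class 0 outside
rng = np.random.default_rng(0)
X_raw = rng.uniform(-2, 2, size=(400, 2))
y_circle = (np.sum(X_raw ** 2, axis=1) < 1.0).astype(float)

# Degree-2 features: the circle x1^2 + x2^2 = 1 is linear in these
X_poly = np.column_stack([X_raw, X_raw ** 2])

acc_linear = accuracy(X_raw, y_circle, *fit_logreg(X_raw, y_circle))
acc_poly = accuracy(X_poly, y_circle, *fit_logreg(X_poly, y_circle))
print(f"raw features: {acc_linear:.3f}, with squared features: {acc_poly:.3f}")
```

The model is still linear in its inputs; only the inputs changed. A circular boundary in the original space becomes a hyperplane in the expanded feature space.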

Q2: How do I handle class imbalance?

Set class_weight='balanced'
Adjust the decision threshold
Use oversampling (SMOTE) or undersampling

lr_balanced = LogisticRegression(class_weight='balanced')
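
For reference, the scikit-learn documentation describes the 'balanced' mode as weighting classes by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula on made-up labels to show what the weights look like; it does not call scikit-learn.

```python
import numpy as np

# Made-up imbalanced labels: 90 negatives, 10 positives
y_imb = np.array([0] * 90 + [1] * 10)

n_samples = len(y_imb)
n_classes = len(np.unique(y_imb))
class_weights = n_samples / (n_classes * np.bincount(y_imb))
print(class_weights)  # class 0 -> 100 / (2 * 90), class 1 -> 100 / (2 * 10)

# Per-sample weights that would multiply each sample's loss term
sample_weights = class_weights[y_imb]
print(sample_weights[y_imb == 0].sum(), sample_weights[y_imb == 1].sum())
```

With these weights each class contributes the same total weight to the loss, so the minority class is no longer drowned out by the majority.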

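Q3: Why is it called logistic "regression"?

Mostly for historical reasons. The model is a linear regression on the log-odds: $\log\frac{p}{1-p} = \mathbf{w}^T\mathbf{x} + b$, where $p = P(y=1|\mathbf{x})$. Passing that linear output through the sigmoid gives the probability, so the name can be read as "regression of the log-odds".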
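
The log-odds reading can be verified in a few lines. The parameters `w`, `b` and the inputs `X_demo` are hypothetical values used only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid function."""
    return np.log(p / (1 - p))

# Hypothetical parameters and inputs, for illustration only
w = np.array([1.5, -0.5])
b = 0.25
X_demo = np.array([[0.0, 0.0], [1.0, 2.0], [-1.0, 0.5]])

z = X_demo @ w + b   # linear part of the model
p = sigmoid(z)       # predicted P(y=1|x)
print(np.allclose(logit(p), z))  # the log-odds recover the linear part
```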

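Q4: What are the pros and cons of logistic regression?

Pros	Cons
Simple and efficient	Linear decision boundary
Highly interpretable	Struggles with nonlinear relationships
Outputs probabilities	Relies heavily on feature engineering
Less prone to overfitting (especially with regularization)	Full-batch training can be slow on very large datasets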

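Summary

Concept	Description
Model	$P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^T\mathbf{x} + b)$
Loss function	Cross-entropy (log loss)
Optimization	Gradient descent (a convex optimization problem)
Multiclass	Softmax / OvR
Regularization	L1 (sparsity) / L2 (shrinkage)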


(Licensed under CC BY-NC-SA 4.0)
