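Machine Learning Fundamentals Series: Model Evaluation Metrics

Introduction

Choosing the right evaluation metric is central to model development. Different business scenarios call for different metrics, and this article covers the metrics most commonly used for classification and regression tasks.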

Classification Metrics

Confusion Matrix

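Confusion matrix for binary classification. Note that scikit-learn's confusion_matrix sorts labels in ascending order, so for 0/1 labels it returns [[TN, FP], [FN, TP]] rather than the positive-first layout shown here:

	Predicted positive	Predicted negative
Actual positive	TP	FN
Actual negative	FP	TN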

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, accuracy_score, precision_score,
                             recall_score, f1_score, classification_report)
import seaborn as sns

np.random.seed(42)

# Generate an imbalanced dataset (about 90% negative, 10% positive)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           weights=[0.9, 0.1], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
lr = LogisticRegression(random_state=42, max_iter=1000)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Confusion matrix (rows = actual, columns = predicted)
cm = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Negative', 'Positive'],
            yticklabels=['Negative', 'Positive'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion matrix')
plt.show()

tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")

Accuracy

\[\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}\]

acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.4f}")

# Manual computation from the confusion matrix counts
acc_manual = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy (manual): {acc_manual:.4f}")

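Note: accuracy can be misleading when the classes are imbalanced, because a model that always predicts the majority class still scores highly.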
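To make the warning concrete, here is a minimal sketch using hypothetical labels (90 negatives, 10 positives; these are illustrative and not the article's dataset). A "model" that always predicts the negative class reaches 90% accuracy while finding none of the positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 90 negatives and 10 positives (illustrative only).
y_true_demo = np.array([0] * 90 + [1] * 10)

# A trivial "model" that always predicts the majority (negative) class.
y_majority = np.zeros_like(y_true_demo)

acc_majority = accuracy_score(y_true_demo, y_majority)
rec_majority = recall_score(y_true_demo, y_majority)

print(f"Accuracy: {acc_majority:.2f}")  # 0.90
print(f"Recall:   {rec_majority:.2f}")  # 0.00
```

The high accuracy here reflects only the class ratio, which is why the sections that follow look at precision, recall, and related metrics.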

Precision

\[\text{Precision} = \frac{TP}{TP + FP}\]

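Of all samples predicted positive, the fraction that are actually positive.

prec = precision_score(y_test, y_pred)
print(f"Precision: {prec:.4f}")

Use case: spam detection (you do not want legitimate mail flagged as spam)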

Recall (Sensitivity)

\[\text{Recall} = \frac{TP}{TP + FN}\]

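Of all actual positive samples, the fraction that are correctly predicted.

rec = recall_score(y_test, y_pred)
print(f"Recall: {rec:.4f}")

Use case: cancer screening (you do not want to miss any patient)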

F1 Score

The harmonic mean of precision and recall:

\[F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}\]

f1 = f1_score(y_test, y_pred)
print(f"F1 score: {f1:.4f}")

# Full per-class report
print("\nClassification report:")
print(classification_report(y_test, y_pred))

F-beta Score

\[F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}\]

$\beta > 1$: recall is weighted more heavily
$\beta < 1$: precision is weighted more heavily
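As a rough worked example of the formula above, the sketch below evaluates F-beta directly from a precision and recall pair. The helper name `f_beta` and the values 0.9 and 0.3 are assumptions chosen for illustration; they are not results from the article's model.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta computed from precision and recall, following the formula above."""
    denom = beta**2 * precision + recall
    # Return 0 when both precision and recall are 0 to avoid dividing by zero.
    return 0.0 if denom == 0 else (1 + beta**2) * precision * recall / denom

# Hypothetical classifier: high precision, low recall.
p_demo, r_demo = 0.9, 0.3

for beta in (0.5, 1.0, 2.0):
    print(f"beta={beta}: F={f_beta(p_demo, r_demo, beta):.4f}")
```

Because recall is the weaker of the two values in this example, the score falls as beta grows: F0.5 is about 0.64, F1 is 0.45, and F2 is about 0.35.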

from sklearn.metrics import fbeta_score

f2 = fbeta_score(y_test, y_pred, beta=2)  # weights recall more heavily
f05 = fbeta_score(y_test, y_pred, beta=0.5)  # weights precision more heavily

print(f"F2 score: {f2:.4f}")
print(f"F0.5 score: {f05:.4f}")

Evaluating Probabilistic Predictions

ROC Curve and AUC

from sklearn.metrics import roc_curve, roc_auc_score

# Predicted probability of the positive class
y_proba = lr.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'ROC (AUC={auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random guess')
plt.fill_between(fpr, tpr, alpha=0.3)
plt.xlabel('False positive rate (FPR)')
plt.ylabel('True positive rate (TPR)')
plt.title('ROC curve')
plt.legend()
plt.grid(True, alpha=0.3)

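# Metrics as a function of the threshold.
# roc_curve returns fpr, tpr, and thresholds with the same length; the first threshold is a
# sentinel above every score (inf in recent scikit-learn versions), so it is skipped here.
plt.subplot(1, 2, 2)
plt.plot(thresholds[1:], tpr[1:], 'b-', label='TPR')
plt.plot(thresholds[1:], fpr[1:], 'r-', label='FPR')
plt.plot(thresholds[1:], tpr[1:] - fpr[1:], 'g-', label='TPR-FPR')
plt.xlabel('Threshold')
plt.ylabel('Value')
plt.title('Threshold selection')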
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
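The TPR-FPR curve in the threshold plot is Youden's J statistic, and one common heuristic is to pick the threshold where it peaks. The sketch below is self-contained, so it rebuilds a dataset and model with the same settings as the article under new variable names (`X_d`, `proba`, `best_threshold`, and so on are illustrative). Whether maximizing J is the right criterion depends on the relative cost of false positives and false negatives.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Rebuild an imbalanced dataset and a simple model (same settings as the article).
X_d, y_d = make_classification(n_samples=1000, n_features=20, n_informative=10,
                               weights=[0.9, 0.1], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_d, y_d, test_size=0.2, random_state=42)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# fpr, tpr, and thresholds all have the same length.
fpr_d, tpr_d, thr_d = roc_curve(y_te, proba)

# Youden's J = TPR - FPR; take the threshold where it is largest.
j_scores = tpr_d - fpr_d
best_idx = int(np.argmax(j_scores))
best_threshold = thr_d[best_idx]

print(f"Best threshold by Youden's J: {best_threshold:.3f}")
print(f"TPR={tpr_d[best_idx]:.3f}, FPR={fpr_d[best_idx]:.3f}")
```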

PR Curve

from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds_pr = precision_recall_curve(y_test, y_proba)
ap = average_precision_score(y_test, y_proba)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, 'g-', linewidth=2, label=f'PR (AP={ap:.3f})')
plt.fill_between(recall, precision, alpha=0.3)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('PR curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

ROC vs PR

Scenario	Recommended curve
Balanced classes	ROC
Imbalanced classes	PR
Focus on the positive class	PR
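The recommendation for imbalanced data can be illustrated with a small simulation. This is a sketch under assumed conditions: positive scores are drawn from N(1, 1) and negative scores from N(0, 1), so the scorer's ranking quality is the same in both runs and only the class ratio changes. The function name `simulate` and the sample sizes are illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

def simulate(n_pos: int, n_neg: int):
    """Draw synthetic scores and return (ROC-AUC, average precision)."""
    y = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])
    s = np.concatenate([rng.normal(1.0, 1.0, n_pos), rng.normal(0.0, 1.0, n_neg)])
    return roc_auc_score(y, s), average_precision_score(y, s)

auc_bal, ap_bal = simulate(5000, 5000)  # 50% positives
auc_imb, ap_imb = simulate(100, 9900)   # 1% positives

print(f"Balanced:   ROC-AUC={auc_bal:.3f}, AP={ap_bal:.3f}")
print(f"Imbalanced: ROC-AUC={auc_imb:.3f}, AP={ap_imb:.3f}")
```

ROC-AUC stays roughly the same across the two runs because it does not depend on class prevalence, while average precision drops sharply when positives are rare. That sensitivity is why the PR curve tends to be more informative for imbalanced problems.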

Multiclass Evaluation

from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, classification_report

# Load a multiclass dataset
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

X_train_i, X_test_i, y_train_i, y_test_i = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42)

lr_multi = LogisticRegression(random_state=42, max_iter=200)
lr_multi.fit(X_train_i, y_train_i)
y_pred_i = lr_multi.predict(X_test_i)

# Confusion matrix
cm_multi = confusion_matrix(y_test_i, y_pred_i)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_multi, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Multiclass confusion matrix')
plt.show()

print(classification_report(y_test_i, y_pred_i, target_names=iris.target_names))

Averaging Methods

# Compare the different averaging methods
print("Multiclass F1 scores:")
for average in ['micro', 'macro', 'weighted']:
    f1_avg = f1_score(y_test_i, y_pred_i, average=average)
    print(f"  {average}: {f1_avg:.4f}")

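Averaging method	How it is computed
micro	Computes TP, FP, and FN globally across all classes
macro	Unweighted mean of the per-class metrics
weighted	Mean of the per-class metrics, weighted by each class's support (number of true samples)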
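The three averaging rules in the table can be reproduced by hand from per-class F1 scores. The labels below are hypothetical and deliberately imbalanced (class 2 is never predicted) so that the three averages differ noticeably.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical 3-class labels with unequal class sizes (illustrative only).
y_true_m = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
y_pred_m = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 1])

# Per-class F1; class 2 has no predictions, so its F1 is defined as 0 here.
per_class = f1_score(y_true_m, y_pred_m, average=None, zero_division=0)
support = np.bincount(y_true_m)  # number of true samples per class

macro_manual = per_class.mean()                           # simple mean
weighted_manual = np.average(per_class, weights=support)  # support-weighted mean
# For single-label multiclass data, micro-averaged F1 equals accuracy.
micro_manual = np.mean(y_true_m == y_pred_m)

print(f"per-class: {np.round(per_class, 4)}")
print(f"macro={macro_manual:.4f}, weighted={weighted_manual:.4f}, micro={micro_manual:.4f}")
```

Macro averaging is pulled down the most by the missed minority class, because every class counts equally regardless of size.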

Regression Metrics

from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (mean_squared_error, mean_absolute_error,
                             r2_score, mean_absolute_percentage_error)

# Load a regression dataset
housing = fetch_california_housing()
X_house, y_house = housing.data, housing.target

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_house, y_house, test_size=0.2, random_state=42)

# Train the model
reg = LinearRegression()
reg.fit(X_train_h, y_train_h)
y_pred_h = reg.predict(X_test_h)

MSE and RMSE

\[\text{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2\]

\[\text{RMSE} = \sqrt{\text{MSE}}\]

mse = mean_squared_error(y_test_h, y_pred_h)
rmse = np.sqrt(mse)

print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")

MAE

\[\text{MAE} = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y}_i|\]

mae = mean_absolute_error(y_test_h, y_pred_h)
print(f"MAE: {mae:.4f}")

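Metric	Sensitivity to outliers	Units
MSE	Sensitive	Squared units of the target
RMSE	Sensitive	Original units of the target
MAE	Robust	Original units of the target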
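A small numeric sketch of the outlier behavior summarized in the table. The arrays are hypothetical: every prediction is off by exactly 1, then a single prediction is replaced with one that is off by 20.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true_r = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred_clean = np.array([11.0, 11.0, 12.0, 12.0, 13.0])  # every error has magnitude 1
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[-1] = 32.0                                 # one error of magnitude 20

mae_clean = mean_absolute_error(y_true_r, y_pred_clean)
mse_clean = mean_squared_error(y_true_r, y_pred_clean)
mae_out = mean_absolute_error(y_true_r, y_pred_outlier)
mse_out = mean_squared_error(y_true_r, y_pred_outlier)

print(f"Clean:        MAE={mae_clean:.2f}, MSE={mse_clean:.2f}, RMSE={np.sqrt(mse_clean):.2f}")
print(f"With outlier: MAE={mae_out:.2f}, MSE={mse_out:.2f}, RMSE={np.sqrt(mse_out):.2f}")
```

One bad prediction raises MAE from 1.0 to 4.8 but raises MSE from 1.0 to 80.8, because squaring amplifies large errors.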

R² (Coefficient of Determination)

\[R^2 = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}\]

r2 = r2_score(y_test_h, y_pred_h)
print(f"R²: {r2:.4f}")

Interpretation: the proportion of the target variable's variance that the model explains.

MAPE

\[\text{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|\]

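# scikit-learn returns MAPE as a fraction rather than a percentage,
# so multiply by 100 to match the formula above.
mape = mean_absolute_percentage_error(y_test_h, y_pred_h)
print(f"MAPE: {mape * 100:.2f}%")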
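A short check of the MAPE formula against scikit-learn, using hypothetical values. It also shows why MAPE becomes unstable when a true value is close to zero: the same absolute error divided by a tiny target produces an enormous percentage.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Hypothetical targets and predictions (illustrative only).
y_true_p = np.array([100.0, 200.0, 400.0])
y_pred_p = np.array([110.0, 180.0, 400.0])

# Relative errors are 0.10, 0.10, and 0.00.
mape_manual = np.mean(np.abs((y_true_p - y_pred_p) / y_true_p))
mape_sklearn = mean_absolute_percentage_error(y_true_p, y_pred_p)  # a fraction, not a percentage
print(f"manual={mape_manual * 100:.2f}%, sklearn={mape_sklearn * 100:.2f}%")

# A target near zero dominates the average even though the absolute error is only 1.
y_true_small = np.array([0.1, 200.0, 400.0])
y_pred_small = np.array([1.1, 180.0, 400.0])
mape_small = mean_absolute_percentage_error(y_true_small, y_pred_small)
print(f"with a near-zero target: {mape_small * 100:.2f}%")
```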

Residual Analysis

residuals = y_test_h - y_pred_h

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Residuals vs. predicted values
axes[0].scatter(y_pred_h, residuals, alpha=0.5)
axes[0].axhline(0, color='r', linestyle='--')
axes[0].set_xlabel('Predicted value')
axes[0].set_ylabel('Residual')
axes[0].set_title('Residual plot')

# Distribution of residuals
axes[1].hist(residuals, bins=50, edgecolor='black')
axes[1].set_xlabel('Residual')
axes[1].set_ylabel('Count')
axes[1].set_title('Residual distribution')

# Q-Q plot against a normal distribution
from scipy import stats
stats.probplot(residuals, dist="norm", plot=axes[2])
axes[2].set_title('Q-Q plot')

plt.tight_layout()
plt.show()

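Metric Selection Guide

Classification Tasks

Scenario	Recommended metrics
Balanced classes	Accuracy, F1
Imbalanced classes	F1, PR-AUC
Missed detections are costly	Recall, F2
False alarms are costly	Precision, F0.5
Probabilistic predictions	AUC-ROC, Log Loss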
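Log Loss appears in the table but is not demonstrated elsewhere in the article, so here is a minimal sketch with hypothetical probabilities. The helper `binary_log_loss` is an illustrative manual implementation of binary cross-entropy, compared against `sklearn.metrics.log_loss`.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true_ll = np.array([1, 0, 1, 0])
p_good = np.array([0.9, 0.1, 0.8, 0.2])   # confident and correct
p_bad = np.array([0.9, 0.1, 0.8, 0.95])   # last negative is confidently wrong

def binary_log_loss(y, p):
    """Mean negative log-likelihood of the true labels under predicted P(y=1)."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

ll_good = log_loss(y_true_ll, p_good)
ll_bad = log_loss(y_true_ll, p_bad)

print(f"good: sklearn={ll_good:.4f}, manual={binary_log_loss(y_true_ll, p_good):.4f}")
print(f"bad:  sklearn={ll_bad:.4f}, manual={binary_log_loss(y_true_ll, p_bad):.4f}")
```

A single confident mistake raises the loss substantially, which is the property that makes Log Loss useful for judging the quality of probabilities and not only the ranking.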

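Regression Tasks

Scenario	Recommended metrics
General purpose	RMSE, R²
Outliers present	MAE
Relative error matters	MAPE
Explaining results to the business	MAE (more intuitive)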

Combined Multi-Metric Evaluation

from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier

# Cross-validation with several metrics at once
rf = RandomForestClassifier(random_state=42)

scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'roc_auc': 'roc_auc'
}

cv_results = cross_validate(rf, X, y, cv=5, scoring=scoring)

print("Cross-validation results:")
for metric in scoring.keys():
    scores = cv_results[f'test_{metric}']
    print(f"  {metric}: {scores.mean():.4f} ± {scores.std():.4f}")

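FAQ

Q1: How should precision and recall be traded off?

It depends on the business requirement:

Medical diagnosis: better a false alarm than a missed case (favor high recall)
Spam filtering: better to let some spam through than to flag legitimate mail (favor high precision)
No clear preference: use the F1 score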
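In practice the trade-off is usually controlled through the decision threshold. The sketch below uses hypothetical scores and labels to show precision rising and recall falling as the threshold increases; the variable names and thresholds are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical predicted scores and true labels (illustrative only).
y_true_t = np.array([0, 0, 0, 1, 0, 0, 1, 0, 1, 1])
scores_t = np.array([0.05, 0.1, 0.2, 0.3, 0.35, 0.45, 0.5, 0.6, 0.7, 0.9])

results = {}
for threshold in (0.25, 0.5, 0.65):
    # Predict positive whenever the score reaches the threshold.
    y_hat = (scores_t >= threshold).astype(int)
    results[threshold] = (precision_score(y_true_t, y_hat),
                          recall_score(y_true_t, y_hat))
    p, r = results[threshold]
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")
```

A low threshold suits the medical-diagnosis case (recall 1.0 at threshold 0.25), while a high threshold suits the spam case (precision 1.0 at threshold 0.65).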

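Q2: What does AUC = 0.5 mean?

The model ranks positives above negatives no better than random guessing, so it has no discriminative power.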
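One way to see this is through the ranking interpretation of AUC: it equals the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below uses synthetic labels and scores that are unrelated by construction, and checks both the library value and the pairwise estimate.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true_a = rng.integers(0, 2, size=2000)   # random binary labels
random_scores = rng.random(2000)           # scores unrelated to the labels

auc_random = roc_auc_score(y_true_a, random_scores)

# Fraction of (positive, negative) pairs in which the positive is scored higher.
pos = random_scores[y_true_a == 1]
neg = random_scores[y_true_a == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(f"roc_auc_score: {auc_random:.4f}")
print(f"pairwise rate: {pairwise:.4f}")
```

Both numbers land close to 0.5, and they agree with each other because they measure the same quantity when there are no tied scores.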

Q3: Can R² be negative?

Yes. R² is negative whenever the model performs worse than simply predicting the mean of the target.
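A tiny numeric example with hypothetical values: predicting the mean gives R² = 0, and predictions that are anti-correlated with the target give a negative value.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true_q = np.array([1.0, 2.0, 3.0])
mean_pred = np.full_like(y_true_q, y_true_q.mean())  # always predict the mean (2.0)
reversed_pred = np.array([3.0, 2.0, 1.0])            # anti-correlated predictions

r2_mean = r2_score(y_true_q, mean_pred)
r2_reversed = r2_score(y_true_q, reversed_pred)

print(f"Mean predictor:     R² = {r2_mean:.1f}")      # 0.0
print(f"Reversed predictor: R² = {r2_reversed:.1f}")  # -3.0
```

For the reversed predictions the residual sum of squares is 8 and the total sum of squares is 2, so R² = 1 - 8/2 = -3.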

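Q4: How should multi-label classification be evaluated?

Use the micro- or macro-averaged version of each metric, or evaluate each label separately.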
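scikit-learn's `f1_score` accepts multi-label indicator matrices directly, so all three options can be computed with the `average` argument. The matrices below are hypothetical (rows are samples, columns are labels).

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical multi-label data: 4 samples, 3 labels (illustrative only).
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0],
                   [0, 0, 1]])

f1_per_label = f1_score(Y_true, Y_pred, average=None)  # one score per label
f1_micro = f1_score(Y_true, Y_pred, average='micro')   # pools TP/FP/FN over all labels
f1_macro = f1_score(Y_true, Y_pred, average='macro')   # unweighted mean over labels

print(f"per label: {np.round(f1_per_label, 4)}")
print(f"micro={f1_micro:.4f}, macro={f1_macro:.4f}")
```

Here the per-label scores are 1.0, about 0.667, and about 0.667, giving a macro average of about 0.778 and a micro average of 0.8.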

Summary

Task	Primary metrics	Supplementary metrics
Binary classification	F1, AUC	Precision, Recall
Multiclass classification	Macro-F1	Confusion Matrix
Regression	RMSE, R²	MAE, MAPE


(Licensed under CC BY-NC-SA 4.0)
