网格搜索、随机搜索、贝叶斯优化

前言

超参数是在训练之前设定的参数(如学习率、树的深度),而不是通过训练学习得到的。选择合适的超参数对模型性能至关重要。


参数 vs 超参数

类型 定义 示例
参数 训练中学习 权重、偏置
超参数 训练前设定 学习率、正则化系数
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

# Synthetic binary classification data: 1000 samples, 20 features
# (10 informative), then an 80/20 train/test split. random_state fixed
# so every search below is reproducible.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

常见模型的超参数

# Inspect every tunable hyperparameter exposed by a RandomForest model.
rf = RandomForestClassifier()
hyperparams = rf.get_params()
print("RandomForest超参数:")
print(hyperparams)
模型 重要超参数
决策树 max_depth, min_samples_split
随机森林 n_estimators, max_depth, max_features
SVM C, gamma, kernel
神经网络 learning_rate, hidden_layers, batch_size
XGBoost learning_rate, max_depth, n_estimators

手动调参

逐一调整

# Manually sweep a single hyperparameter (n_estimators) and score each
# setting with 5-fold cross-validation.
candidate_counts = [10, 50, 100, 200, 500]
mean_scores = []

for n in candidate_counts:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    fold_scores = cross_val_score(model, X_train, y_train, cv=5)
    mean_scores.append(fold_scores.mean())
    print(f"n_estimators={n}: {fold_scores.mean():.4f} ± {fold_scores.std():.4f}")

# Plot mean CV accuracy against the number of trees.
plt.figure(figsize=(10, 5))
plt.plot(candidate_counts, mean_scores, 'bo-')
plt.xlabel('n_estimators')
plt.ylabel('CV Score')
plt.title('手动调参: n_estimators')
plt.grid(True, alpha=0.3)
plt.show()

缺点

  • 效率低
  • 难以发现参数交互
  • 主观性强

网格搜索

原理

穷举所有参数组合,从中找到最优的一组。

from sklearn.model_selection import GridSearchCV

# Parameter grid: 3 * 4 * 3 * 3 = 108 combinations; with cv=5 that is
# 540 model fits in total.
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Exhaustive grid search: 5-fold CV, accuracy scoring, all CPU cores.
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print(f"\n最佳参数: {grid_search.best_params_}")
print(f"最佳CV分数: {grid_search.best_score_:.4f}")
print(f"测试分数: {grid_search.score(X_test, y_test):.4f}")

结果分析

import pandas as pd

# Put every CV result into a DataFrame and rank by mean test score.
cv_table = pd.DataFrame(grid_search.cv_results_)
ranked = cv_table.sort_values('mean_test_score', ascending=False)

print("\nTop 10 参数组合:")
print(ranked[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head(10))

# Average over the remaining parameters to isolate the joint effect of
# max_depth and n_estimators.
heat = cv_table.pivot_table(
    values='mean_test_score',
    index='param_max_depth',
    columns='param_n_estimators',
    aggfunc='mean'
)

plt.figure(figsize=(10, 6))
plt.imshow(heat, cmap='viridis', aspect='auto')
plt.colorbar(label='CV Score')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.xticks(range(len(heat.columns)), heat.columns)
plt.yticks(range(len(heat.index)), heat.index)
plt.title('Grid Search: max_depth vs n_estimators')

# Annotate each heatmap cell with its mean CV score.
for row in range(heat.shape[0]):
    for col in range(heat.shape[1]):
        plt.text(col, row, f'{heat.iloc[row, col]:.3f}', ha='center', va='center', color='white')

plt.show()

网格搜索的问题

  • 计算成本:$O(n^k)$,n是每个参数的取值数,k是参数数量
  • 网格间隔可能错过最优值

随机搜索

原理

从参数分布中随机采样若干组合进行评估,而非穷举全部组合。

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Sample hyperparameters from distributions instead of a fixed grid;
# lists are sampled uniformly, scipy distributions are drawn from.
param_dist = {
    'n_estimators': randint(50, 500),
    'max_depth': [5, 10, 20, 30, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)  # scipy uniform(loc, scale) -> [0.1, 1.0)
}

random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=100,  # number of sampled configurations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print(f"\n最佳参数: {random_search.best_params_}")
print(f"最佳CV分数: {random_search.best_score_:.4f}")
print(f"测试分数: {random_search.score(X_test, y_test):.4f}")

网格搜索 vs 随机搜索

# Side-by-side view of where each strategy places its 25 trials in the
# (n_estimators, max_depth) plane.
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Grid search: a fixed 5x5 lattice of combinations.
lattice = np.array([(n, d)
                    for n in [50, 100, 200, 300, 400]
                    for d in [5, 10, 15, 20, 25]])
axes[0].scatter(lattice[:, 0], lattice[:, 1], c='blue', s=100)
axes[0].set_xlabel('n_estimators')
axes[0].set_ylabel('max_depth')
axes[0].set_title('网格搜索采样点 (25个)')
axes[0].grid(True, alpha=0.3)

# Random search: 25 independent draws over the same ranges.
np.random.seed(42)
draws = np.column_stack([
    np.random.randint(50, 400, 25),
    np.random.randint(5, 25, 25)
])
axes[1].scatter(draws[:, 0], draws[:, 1], c='red', s=100)
axes[1].set_xlabel('n_estimators')
axes[1].set_ylabel('max_depth')
axes[1].set_title('随机搜索采样点 (25个)')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
方法 优点 缺点
网格搜索 覆盖全面 计算成本高
随机搜索 效率高 可能错过最优

贝叶斯优化

原理

基于历史评估结果,构建目标函数的概率模型,智能选择下一个评估点。

try:
    from skopt import BayesSearchCV
    from skopt.space import Real, Integer, Categorical

    # Search space expressed with skopt dimension objects.
    search_spaces = {
        'n_estimators': Integer(50, 500),
        'max_depth': Integer(5, 30),
        'min_samples_split': Integer(2, 20),
        'min_samples_leaf': Integer(1, 10),
        'max_features': Real(0.1, 0.9)
    }

    # Bayesian search: a surrogate model of the objective picks each
    # next candidate, so 50 iterations can cover the space efficiently.
    bayes_search = BayesSearchCV(
        RandomForestClassifier(random_state=42),
        search_spaces,
        n_iter=50,
        cv=5,
        scoring='accuracy',
        n_jobs=-1,
        random_state=42
    )

    bayes_search.fit(X_train, y_train)

    print(f"最佳参数: {bayes_search.best_params_}")
    print(f"最佳CV分数: {bayes_search.best_score_:.4f}")
    print(f"测试分数: {bayes_search.score(X_test, y_test):.4f}")

except ImportError:
    # scikit-optimize is optional; skip the demo if it is missing.
    print("skopt未安装,跳过贝叶斯优化示例")
    print("安装命令: pip install scikit-optimize")

使用Optuna

try:
    import optuna

    def objective(trial):
        """Sample one RF configuration and return its mean 5-fold CV accuracy."""
        config = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 500),
            'max_depth': trial.suggest_int('max_depth', 5, 30),
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
            'max_features': trial.suggest_float('max_features', 0.1, 0.9)
        }
        model = RandomForestClassifier(**config, random_state=42)
        return cross_val_score(model, X_train, y_train, cv=5).mean()

    # Maximize mean CV accuracy over 50 trials.
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50, show_progress_bar=True)

    print(f"\n最佳参数: {study.best_params}")
    print(f"最佳分数: {study.best_value:.4f}")

    # Plot how the objective evolved across trials.
    fig = optuna.visualization.plot_optimization_history(study)
    fig.show()

except ImportError:
    print("optuna未安装,跳过示例")
    print("安装命令: pip install optuna")

Halving搜索

原理

逐步增加资源(数据量或迭代次数),淘汰表现差的参数组合。

from sklearn.experimental import enable_halving_search_cv  # noqa: F401 - side-effect import enabling Halving* estimators
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV

# Successive halving: start every candidate on a small budget and keep
# only the best fraction (1/factor) for the next, larger budget.
param_grid_halving = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}

halving_search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_halving,
    cv=5,
    factor=3,  # keep the top 1/3 of candidates each round
    resource='n_samples',  # grow the training-set size as the budget
    min_resources=100,  # samples granted to each candidate in round 0
    scoring='accuracy',
    n_jobs=-1
)

halving_search.fit(X_train, y_train)

print(f"最佳参数: {halving_search.best_params_}")
print(f"最佳分数: {halving_search.best_score_:.4f}")
# FIX: n_candidates_ is a per-iteration list, not a count; sum it to
# report the total number of candidate evaluations (same usage as the
# method-comparison section later in this file).
print(f"评估次数: {sum(halving_search.n_candidates_)}")

实用技巧

参数范围选择

# Search on a logarithmic scale whenever a hyperparameter spans
# several orders of magnitude.

# Learning rate: 10 points evenly spaced in log-space, 1e-5 .. 1
learning_rates = np.logspace(-5, 0, num=10)
print(f"学习率范围: {learning_rates}")

# Regularization strength: 10 points, 0.001 .. 1000
C_values = np.logspace(-3, 3, num=10)
print(f"C值范围: {C_values}")

分阶段调优

# Stage 1: coarse search over a wide, sparse grid (cheap cv=3).
param_grid_coarse = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20]
}

grid_coarse = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_coarse, cv=3, n_jobs=-1
)
grid_coarse.fit(X_train, y_train)

best_n = grid_coarse.best_params_['n_estimators']
best_d = grid_coarse.best_params_['max_depth']

print(f"粗略搜索最佳: n_estimators={best_n}, max_depth={best_d}")

# Stage 2: fine search in a narrow window around the stage-1 winner,
# scored more carefully with cv=5. The fallback list covers the case
# where best_d is None/0 (falsy), i.e. unlimited depth won stage 1.
param_grid_fine = {
    'n_estimators': list(range(max(10, best_n-50), best_n+50, 20)),
    'max_depth': list(range(max(1, best_d-5), best_d+5)) if best_d else [15, 20, 25, None]
}

grid_fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_fine, cv=5, n_jobs=-1
)
grid_fine.fit(X_train, y_train)

print(f"精细搜索最佳: {grid_fine.best_params_}")
print(f"最终CV分数: {grid_fine.best_score_:.4f}")

并行化

import time

# Measure how wall-clock time changes with the level of parallelism;
# n_jobs=-1 uses every available core. A small 2x2 grid on a
# 500-sample subset keeps the benchmark quick.
for n_jobs in [1, 2, 4, -1]:
    # perf_counter is the recommended monotonic timer for benchmarks.
    start = time.perf_counter()

    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        {'n_estimators': [50, 100], 'max_depth': [5, 10]},
        cv=3, n_jobs=n_jobs
    )
    grid.fit(X_train[:500], y_train[:500])

    elapsed = time.perf_counter() - start
    # FIX: include the unit (seconds), consistent with the
    # method-comparison output later in the file.
    print(f"n_jobs={n_jobs}: {elapsed:.2f}s")

不同模型的调优策略

随机森林

# Random-forest tuning strategy: list parameters in order of impact.
rf_param_grid = {
    # Most important: number of trees and tree depth
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, None],

    # Next: split conditions
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],

    # Feature subsampling
    'max_features': ['sqrt', 'log2', 0.5]
}

SVM

# SVM tuning: C and gamma both live on logarithmic scales.
svm_param_grid = dict(
    C=np.logspace(-3, 3, 7),
    gamma=np.logspace(-4, 1, 6),
    kernel=['rbf', 'poly'],
)

base_svc = SVC(random_state=42)
svm_search = GridSearchCV(
    base_svc,
    svm_param_grid,
    cv=5, n_jobs=-1
)
svm_search.fit(X_train, y_train)

print(f"SVM最佳参数: {svm_search.best_params_}")
print(f"SVM最佳分数: {svm_search.best_score_:.4f}")

XGBoost

try:
    import xgboost as xgb

    # XGBoost tuning grid: tree count/depth, learning rate, and
    # row/column subsampling ratios (2*3*3*2*2 = 72 combinations).
    xgb_param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0]
    }

    # Random search over 20 of the 72 combinations with 5-fold CV.
    # NOTE(review): use_label_encoder was deprecated and later removed
    # in newer xgboost releases — confirm the installed version still
    # accepts this keyword.
    xgb_search = RandomizedSearchCV(
        xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss'),
        xgb_param_grid,
        n_iter=20, cv=5, n_jobs=-1, random_state=42
    )
    xgb_search.fit(X_train, y_train)

    print(f"XGBoost最佳参数: {xgb_search.best_params_}")
    print(f"XGBoost最佳分数: {xgb_search.best_score_:.4f}")

except ImportError:
    print("XGBoost未安装")

调优方法对比

# Benchmark the three strategies on the same model, data, and budget.
import time

comparison = {}

# --- Grid search: exhaustive 3x3 grid (9 candidates) ---
start = time.time()
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]},
    cv=3, n_jobs=-1
)
grid.fit(X_train, y_train)
comparison['Grid Search'] = {
    'time': time.time() - start,
    'score': grid.best_score_,
    'n_iter': 9
}

# --- Random search: 9 sampled configurations for a matched budget ---
# (renamed from `random` to avoid shadowing the stdlib module name)
start = time.time()
rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(50, 200), 'max_depth': randint(5, 15)},
    n_iter=9, cv=3, n_jobs=-1, random_state=42
)
rand_search.fit(X_train, y_train)
comparison['Random Search'] = {
    'time': time.time() - start,
    'score': rand_search.best_score_,
    'n_iter': 9
}

# --- Successive halving: 20 candidates, weak ones dropped early ---
start = time.time()
halving = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(50, 200), 'max_depth': randint(5, 15)},
    n_candidates=20, cv=3, n_jobs=-1, random_state=42
)
halving.fit(X_train, y_train)
comparison['Halving Search'] = {
    'time': time.time() - start,
    'score': halving.best_score_,
    'n_iter': sum(halving.n_candidates_)  # total evaluations across rounds
}

print("\n方法对比:")
for name, result in comparison.items():
    print(f"{name}: Score={result['score']:.4f}, Time={result['time']:.2f}s, Iters={result['n_iter']}")

常见问题

Q1: 什么时候用网格搜索vs随机搜索?

场景 推荐方法
参数少、范围小 网格搜索
参数多、范围大 随机搜索
计算资源充足 贝叶斯优化
需要快速结果 随机搜索

Q2: 如何避免过拟合验证集?

使用嵌套交叉验证:

  • 外层:评估最终性能
  • 内层:选择超参数

Q3: 应该调哪些超参数?

优先级:

  1. 影响模型容量的参数(max_depth, n_estimators)
  2. 正则化参数(C, alpha)
  3. 其他参数

Q4: 调参需要多少时间?

取决于:

  • 数据集大小
  • 模型复杂度
  • 搜索空间大小
  • 可用计算资源

总结

方法 原理 适用场景
网格搜索 穷举 参数少
随机搜索 随机采样 通用
贝叶斯优化 概率模型 评估成本高
Halving 逐步淘汰 快速筛选

参考资料

版权声明: 如无特别声明,本文版权归 sshipanoo 所有,转载请注明本文链接。

(采用 CC BY-NC-SA 4.0 许可协议进行授权)

本文标题:《 机器学习基础系列——超参数调优 》

本文链接:http://localhost:3015/ai/%E8%B6%85%E5%8F%82%E6%95%B0%E8%B0%83%E4%BC%98.html

本文最后一次更新距今已有一段时间,文章中的某些内容可能已过时!