Grid Search, Random Search, and Bayesian Optimization
Preface
Hyperparameters are settings fixed before training (such as the learning rate or tree depth), as opposed to parameters learned during training. Choosing good hyperparameters is critical to model performance.
Parameters vs. Hyperparameters
| Type | Definition | Examples |
|---|---|---|
| Parameters | Learned during training | Weights, biases |
| Hyperparameters | Set before training | Learning rate, regularization strength |
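As a quick, self-contained illustration (my own minimal example, not part of the walkthrough below): with scikit-learn's LogisticRegression, `C` is a hyperparameter chosen before fitting, while `coef_` and `intercept_` are parameters learned from the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# C is a hyperparameter: chosen before fit()
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X_demo, y_demo)

# coef_ and intercept_ are parameters: learned during fit()
print(clf.coef_.shape)       # one weight per feature
print(clf.intercept_.shape)
```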
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
# Generate a synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Hyperparameters of Common Models
# Inspect a random forest's hyperparameters
rf = RandomForestClassifier()
print("RandomForest hyperparameters:")
print(rf.get_params())
| Model | Key hyperparameters |
|---|---|
| Decision tree | max_depth, min_samples_split |
| Random forest | n_estimators, max_depth, max_features |
| SVM | C, gamma, kernel |
| Neural network | learning_rate, hidden_layers, batch_size |
| XGBoost | learning_rate, max_depth, n_estimators |
Manual Tuning
One Parameter at a Time
# Manually tune n_estimators
n_estimators_range = [10, 50, 100, 200, 500]
scores = []
for n in n_estimators_range:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    cv_scores = cross_val_score(rf, X_train, y_train, cv=5)
    scores.append(cv_scores.mean())
    print(f"n_estimators={n}: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
plt.figure(figsize=(10, 5))
plt.plot(n_estimators_range, scores, 'bo-')
plt.xlabel('n_estimators')
plt.ylabel('CV Score')
plt.title('Manual tuning: n_estimators')
plt.grid(True, alpha=0.3)
plt.show()
Drawbacks
- Inefficient
- Hard to discover parameter interactions
- Highly subjective
Grid Search
How It Works
Exhaustively evaluate every combination of the candidate parameter values and keep the best one.
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
# Grid search
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)
print(f"\nBest params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {grid_search.score(X_test, y_test):.4f}")
Analyzing the Results
import pandas as pd
# Inspect all results
results = pd.DataFrame(grid_search.cv_results_)
results_sorted = results.sort_values('mean_test_score', ascending=False)
print("\nTop 10 parameter combinations:")
print(results_sorted[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head(10))
# Visualize the joint effect of two parameters
pivot = results.pivot_table(
    values='mean_test_score',
    index='param_max_depth',
    columns='param_n_estimators',
    aggfunc='mean'
)
plt.figure(figsize=(10, 6))
plt.imshow(pivot, cmap='viridis', aspect='auto')
plt.colorbar(label='CV Score')
plt.xlabel('n_estimators')
plt.ylabel('max_depth')
plt.xticks(range(len(pivot.columns)), pivot.columns)
plt.yticks(range(len(pivot.index)), pivot.index)
plt.title('Grid Search: max_depth vs n_estimators')
for i in range(len(pivot.index)):
    for j in range(len(pivot.columns)):
        plt.text(j, i, f'{pivot.iloc[i, j]:.3f}', ha='center', va='center', color='white')
plt.show()
Problems with Grid Search
- Computational cost: $O(n^k)$, where n is the number of values per parameter and k is the number of parameters
- The grid spacing may step over the true optimum
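To make that cost concrete, the four-parameter grid defined above has 3 × 4 × 3 × 3 combinations, and with 5-fold CV each combination is fit 5 times. A quick sanity check:

```python
from itertools import product

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# number of combinations = product of the per-parameter grid sizes
n_combos = len(list(product(*param_grid.values())))
cv = 5
print(n_combos)       # 3 * 4 * 3 * 3 = 108 combinations
print(n_combos * cv)  # 108 combinations x 5 folds = 540 model fits
```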
Random Search
How It Works
Sample parameter settings at random from specified distributions instead of enumerating a grid.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
# Define the parameter distributions
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': [5, 10, 20, 30, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': uniform(0.1, 0.9)
}
# Random search
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,  # number of sampled parameter settings
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)
random_search.fit(X_train, y_train)
print(f"\nBest params: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
print(f"Test score: {random_search.score(X_test, y_test):.4f}")
Grid Search vs. Random Search
# Visual comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Grid search sample points
ax = axes[0]
grid_points = []
for n in [50, 100, 200, 300, 400]:
    for d in [5, 10, 15, 20, 25]:
        grid_points.append((n, d))
grid_points = np.array(grid_points)
ax.scatter(grid_points[:, 0], grid_points[:, 1], c='blue', s=100)
ax.set_xlabel('n_estimators')
ax.set_ylabel('max_depth')
ax.set_title('Grid search sample points (25)')
ax.grid(True, alpha=0.3)
# Random search sample points
ax = axes[1]
np.random.seed(42)
random_points = np.column_stack([
    np.random.randint(50, 400, 25),
    np.random.randint(5, 25, 25)
])
ax.scatter(random_points[:, 0], random_points[:, 1], c='red', s=100)
ax.set_xlabel('n_estimators')
ax.set_ylabel('max_depth')
ax.set_title('Random search sample points (25)')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
| Method | Pros | Cons |
|---|---|---|
| Grid search | Thorough coverage | Computationally expensive |
| Random search | Efficient | May miss the optimum |
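The classic argument for random search (Bergstra & Bengio, 2012) is that when only one parameter really matters, a 5 × 5 grid still tries just 5 distinct values of that parameter, while 25 random draws try up to 25 distinct values. A tiny numeric check (my own illustration; the value range stands in for a learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)

# 5 x 5 grid: 25 trials, but only 5 distinct values along each axis
grid_lr = np.repeat(np.linspace(0.01, 0.1, 5), 5)
print(len(np.unique(grid_lr)))   # 5

# 25 random trials: (almost surely) 25 distinct values along each axis
rand_lr = rng.uniform(0.01, 0.1, 25)
print(len(np.unique(rand_lr)))   # 25
```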
Bayesian Optimization
How It Works
Build a probabilistic model of the objective from past evaluations and use it to choose the most promising point to evaluate next.
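Before reaching for a library, the loop can be sketched by hand: fit a surrogate to the evaluation history, maximize an acquisition function over the candidates, evaluate the true objective there, repeat. The sketch below uses a Gaussian-process surrogate and an upper-confidence-bound acquisition on a toy 1-D objective; the function `f` and the constants are illustrative choices of mine, not part of the original post.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):
    # toy objective standing in for a CV score to maximize; optimum at x = 0.3
    return -(x - 0.3) ** 2

rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, 3).reshape(-1, 1)   # a few random initial evaluations
y_obs = f(X_obs).ravel()
candidates = np.linspace(0, 1, 201).reshape(-1, 1)

for _ in range(10):
    # 1. fit a probabilistic surrogate model to the evaluation history
    gp = GaussianProcessRegressor(alpha=1e-6, random_state=0).fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    # 2. choose the next point with an acquisition function (upper confidence bound)
    x_next = candidates[np.argmax(mu + 1.96 * sigma)]
    # 3. evaluate the true objective there and add the result to the history
    X_obs = np.vstack([X_obs, [x_next]])
    y_obs = np.append(y_obs, f(x_next))

best_x = X_obs[np.argmax(y_obs), 0]
print(f"best x found: {best_x:.2f}")  # near the true optimum x = 0.3
```

Real implementations (skopt, Optuna below) use more careful surrogates and acquisition functions, but the fit/acquire/evaluate loop is the same.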
try:
    from skopt import BayesSearchCV
    from skopt.space import Real, Integer, Categorical

    # Define the search space
    search_spaces = {
        'n_estimators': Integer(50, 500),
        'max_depth': Integer(5, 30),
        'min_samples_split': Integer(2, 20),
        'min_samples_leaf': Integer(1, 10),
        'max_features': Real(0.1, 0.9)
    }
    # Bayesian search
    bayes_search = BayesSearchCV(
        RandomForestClassifier(random_state=42),
        search_spaces,
        n_iter=50,
        cv=5,
        scoring='accuracy',
        n_jobs=-1,
        random_state=42
    )
    bayes_search.fit(X_train, y_train)
    print(f"Best params: {bayes_search.best_params_}")
    print(f"Best CV score: {bayes_search.best_score_:.4f}")
    print(f"Test score: {bayes_search.score(X_test, y_test):.4f}")
except ImportError:
    print("skopt is not installed; skipping the Bayesian optimization example")
    print("Install with: pip install scikit-optimize")
Using Optuna
try:
    import optuna

    def objective(trial):
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 500),
            'max_depth': trial.suggest_int('max_depth', 5, 30),
            'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
            'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
            'max_features': trial.suggest_float('max_features', 0.1, 0.9)
        }
        rf = RandomForestClassifier(**params, random_state=42)
        scores = cross_val_score(rf, X_train, y_train, cv=5)
        return scores.mean()

    # Create a study
    study = optuna.create_study(direction='maximize')
    study.optimize(objective, n_trials=50, show_progress_bar=True)
    print(f"\nBest params: {study.best_params}")
    print(f"Best value: {study.best_value:.4f}")
    # Visualize the optimization history
    fig = optuna.visualization.plot_optimization_history(study)
    fig.show()
except ImportError:
    print("optuna is not installed; skipping the example")
    print("Install with: pip install optuna")
Halving Search
How It Works
Allocate resources (training samples or iterations) incrementally, eliminating poorly performing parameter combinations at each round.
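Under some illustrative assumptions (start with 24 candidates and 100 samples each, `factor=3`), the elimination schedule can be computed by hand. scikit-learn's actual schedule additionally respects `min_resources`/`max_resources` and the dataset size, so treat this as a back-of-the-envelope sketch:

```python
import math

n_candidates, resources, factor = 24, 100, 3
round_no = 0
while n_candidates > 1:
    print(f"round {round_no}: {n_candidates} candidates, {resources} samples each")
    n_candidates = math.ceil(n_candidates / factor)  # keep roughly the best 1/factor
    resources *= factor                              # give the survivors more data
    round_no += 1
# rounds: 24@100 -> 8@300 -> 3@900 -> single winner
```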
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV, HalvingRandomSearchCV
param_grid_halving = {
    'n_estimators': [50, 100, 200, 300],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10]
}
halving_search = HalvingGridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_halving,
    cv=5,
    factor=3,  # keep roughly the best 1/3 of candidates each round
    resource='n_samples',  # use the number of training samples as the resource
    min_resources=100,
    scoring='accuracy',
    n_jobs=-1
)
halving_search.fit(X_train, y_train)
print(f"Best params: {halving_search.best_params_}")
print(f"Best score: {halving_search.best_score_:.4f}")
print(f"Candidates per round: {halving_search.n_candidates_}")
Practical Tips
Choosing Parameter Ranges
# Search on a log scale
# Use this when a parameter spans several orders of magnitude
# Learning rate
learning_rates = np.logspace(-5, 0, 10)  # 1e-5 to 1
print(f"Learning rate range: {learning_rates}")
# Regularization parameter
C_values = np.logspace(-3, 3, 10)  # 0.001 to 1000
print(f"C range: {C_values}")
Coarse-to-Fine Tuning
# Stage 1: coarse search
param_grid_coarse = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20]
}
grid_coarse = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_coarse, cv=3, n_jobs=-1
)
grid_coarse.fit(X_train, y_train)
best_n = grid_coarse.best_params_['n_estimators']
best_d = grid_coarse.best_params_['max_depth']
print(f"Coarse search best: n_estimators={best_n}, max_depth={best_d}")
# Stage 2: fine search around the coarse optimum
param_grid_fine = {
    'n_estimators': list(range(max(10, best_n - 50), best_n + 50, 20)),
    'max_depth': list(range(max(1, best_d - 5), best_d + 5)) if best_d else [15, 20, 25, None]
}
grid_fine = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid_fine, cv=5, n_jobs=-1
)
grid_fine.fit(X_train, y_train)
print(f"Fine search best: {grid_fine.best_params_}")
print(f"Final CV score: {grid_fine.best_score_:.4f}")
Parallelization
import time
# Compare different n_jobs settings
for n_jobs in [1, 2, 4, -1]:
    start = time.time()
    grid = GridSearchCV(
        RandomForestClassifier(random_state=42),
        {'n_estimators': [50, 100], 'max_depth': [5, 10]},
        cv=3, n_jobs=n_jobs
    )
    grid.fit(X_train[:500], y_train[:500])
    elapsed = time.time() - start
    print(f"n_jobs={n_jobs}: {elapsed:.2f}s")
Tuning Strategies for Different Models
Random Forest
# Random forest tuning strategy
rf_param_grid = {
    # Most important: number and depth of trees
    'n_estimators': [100, 200, 500],
    'max_depth': [10, 20, None],
    # Next: split conditions
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    # Feature subsampling
    'max_features': ['sqrt', 'log2', 0.5]
}
SVM
# SVM tuning strategy
svm_param_grid = {
    'C': np.logspace(-3, 3, 7),
    'gamma': np.logspace(-4, 1, 6),
    'kernel': ['rbf', 'poly']
}
svm_search = GridSearchCV(
    SVC(random_state=42),
    svm_param_grid,
    cv=5, n_jobs=-1
)
svm_search.fit(X_train, y_train)
print(f"SVM best params: {svm_search.best_params_}")
print(f"SVM best score: {svm_search.best_score_:.4f}")
XGBoost
try:
    import xgboost as xgb

    xgb_param_grid = {
        'n_estimators': [100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2],
        'subsample': [0.8, 1.0],
        'colsample_bytree': [0.8, 1.0]
    }
    xgb_search = RandomizedSearchCV(
        # use_label_encoder was removed in recent XGBoost versions, so it is omitted here
        xgb.XGBClassifier(random_state=42, eval_metric='logloss'),
        xgb_param_grid,
        n_iter=20, cv=5, n_jobs=-1, random_state=42
    )
    xgb_search.fit(X_train, y_train)
    print(f"XGBoost best params: {xgb_search.best_params_}")
    print(f"XGBoost best score: {xgb_search.best_score_:.4f}")
except ImportError:
    print("XGBoost is not installed")
Comparing Tuning Methods
# Compare the efficiency of different methods
import time
methods = {}
# Grid search
start = time.time()
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15]},
    cv=3, n_jobs=-1
)
grid.fit(X_train, y_train)
methods['Grid Search'] = {
    'time': time.time() - start,
    'score': grid.best_score_,
    'n_iter': 9
}
# Random search
start = time.time()
rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(50, 200), 'max_depth': randint(5, 15)},
    n_iter=9, cv=3, n_jobs=-1, random_state=42
)
rand_search.fit(X_train, y_train)
methods['Random Search'] = {
    'time': time.time() - start,
    'score': rand_search.best_score_,
    'n_iter': 9
}
# Halving search
start = time.time()
halving = HalvingRandomSearchCV(
    RandomForestClassifier(random_state=42),
    {'n_estimators': randint(50, 200), 'max_depth': randint(5, 15)},
    n_candidates=20, cv=3, n_jobs=-1, random_state=42
)
halving.fit(X_train, y_train)
methods['Halving Search'] = {
    'time': time.time() - start,
    'score': halving.best_score_,
    'n_iter': sum(halving.n_candidates_)
}
# Results
print("\nMethod comparison:")
for name, result in methods.items():
    print(f"{name}: Score={result['score']:.4f}, Time={result['time']:.2f}s, Iters={result['n_iter']}")
FAQ
Q1: When should I use grid search vs. random search?
| Scenario | Recommended method |
|---|---|
| Few parameters, small ranges | Grid search |
| Many parameters, large ranges | Random search |
| Each evaluation is expensive | Bayesian optimization |
| Results needed quickly | Random search |
Q2: How do I avoid overfitting the validation set?
Use nested cross-validation:
- Outer loop: estimates final performance
- Inner loop: selects hyperparameters
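The two loops can be expressed directly by wrapping a GridSearchCV (the inner loop) inside cross_val_score (the outer loop); the tiny grid and synthetic data here are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Inner loop: GridSearchCV selects hyperparameters on each outer training split
inner = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {'max_depth': [3, 5, None]},
    cv=3,
)
# Outer loop: cross_val_score estimates how well the whole procedure generalizes
outer_scores = cross_val_score(inner, X_demo, y_demo, cv=5)
print(f"nested CV accuracy: {outer_scores.mean():.3f} ± {outer_scores.std():.3f}")
```

Because the hyperparameters are re-selected inside every outer fold, the outer score never sees data that influenced the tuning.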
Q3: Which hyperparameters should I tune?
In order of priority:
- Parameters that control model capacity (max_depth, n_estimators)
- Regularization parameters (C, alpha)
- Everything else
Q4: How long does tuning take?
It depends on:
- Dataset size
- Model complexity
- Size of the search space
- Available compute
Summary
| Method | Idea | When to use |
|---|---|---|
| Grid search | Exhaustive enumeration | Few parameters |
| Random search | Random sampling | General-purpose default |
| Bayesian optimization | Probabilistic surrogate model | Expensive evaluations |
| Halving | Successive elimination | Fast screening |
References
- scikit-learn documentation: Tuning the hyper-parameters of an estimator
- Bergstra, J., & Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization". Journal of Machine Learning Research.
- Optuna documentation: https://optuna.org/
Copyright: Unless otherwise stated, this article is copyright sshipanoo; please credit the original link when reposting.
(Licensed under CC BY-NC-SA 4.0)
Title: "Machine Learning Basics Series: Hyperparameter Tuning"