XGBoost, LightGBM, CatBoost
Preface
Gradient boosting is another powerful ensemble method. Unlike bagging, which trains models in parallel, boosting trains them sequentially: each new model focuses on correcting the errors of the models before it.
Boosting Basics
Bagging vs. Boosting
| Property | Bagging | Boosting |
|---|---|---|
| Training | Parallel | Sequential |
| Sample weights | Uniform | Dynamically adjusted |
| Base learners | Mutually independent | Dependent on each other |
| Primarily reduces | Variance | Bias |
Core Idea of Boosting

\[F_m(x) = F_{m-1}(x) + \alpha_m h_m(x)\]

- $F_m(x)$: the ensemble model after round $m$
- $h_m(x)$: the $m$-th base learner
- $\alpha_m$: the learning rate (shrinkage factor)
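To make the update rule concrete, here is a small self-contained NumPy sketch (an illustration added here, not part of the original post) that boosts hand-rolled depth-1 regression stumps on a toy 1-D problem. `fit_stump` and every other name are invented for the demo:

```python
import numpy as np

def fit_stump(X, r):
    """Fit a depth-1 regression stump to residuals r on a 1-D feature."""
    best = (np.inf, None, 0.0, 0.0)  # (sse, threshold, left_mean, right_mean)
    for t in np.unique(X):
        left, right = r[X <= t], r[X > t]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: np.where(x <= t, lm, rm)

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 200)
y = np.sin(X) + rng.normal(0, 0.1, 200)

alpha = 0.1                        # learning rate alpha_m, held constant here
F = np.full_like(y, y.mean())      # F_0: initial constant model
mse_start = np.mean((y - F) ** 2)
for m in range(50):
    h = fit_stump(X, y - F)        # h_m fits the current residuals
    F = F + alpha * h(X)           # F_m = F_{m-1} + alpha_m * h_m
mse_end = np.mean((y - F) ** 2)
print(f"train MSE: {mse_start:.3f} -> {mse_end:.3f}")
```

Each round projects the current residuals onto a weak learner, so with a learning rate in (0, 2) the training MSE decreases monotonically.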
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error

np.random.seed(42)
```
AdaBoost
Principle
Adaptive Boosting: reweight the training samples according to how often they are misclassified, so that later learners concentrate on the hard cases.
```python
from sklearn.ensemble import AdaBoostClassifier

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# AdaBoost (sklearn >= 1.2 uses `estimator`; older releases call it `base_estimator`)
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner: a decision stump
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)
ada.fit(X_train, y_train)
print(f"AdaBoost accuracy: {ada.score(X_test, y_test):.4f}")

# Accuracy after each boosting round
staged_scores = list(ada.staged_score(X_test, y_test))
plt.figure(figsize=(10, 5))
plt.plot(range(1, len(staged_scores) + 1), staged_scores, 'b-')
plt.xlabel('Boosting round')
plt.ylabel('Test accuracy')
plt.title('AdaBoost learning curve')
plt.grid(True, alpha=0.3)
plt.show()
```
Implementing AdaBoost from Scratch
```python
class AdaBoostClassifierScratch:
    def __init__(self, n_estimators=50, learning_rate=1.0):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.estimators = []
        self.alphas = []

    def fit(self, X, y):
        n_samples = len(X)
        # Map labels {0, 1} to {-1, +1}
        y_transformed = np.where(y == 0, -1, 1)
        # Initialize uniform sample weights
        weights = np.ones(n_samples) / n_samples
        for _ in range(self.n_estimators):
            # Train a weak learner on the weighted samples
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=weights)
            # Predict
            predictions = stump.predict(X)
            pred_transformed = np.where(predictions == 0, -1, 1)
            # Weighted error rate
            incorrect = (pred_transformed != y_transformed)
            error = np.sum(weights * incorrect) / np.sum(weights)
            # Guard against division by zero
            error = np.clip(error, 1e-10, 1 - 1e-10)
            # Learner weight
            alpha = self.learning_rate * 0.5 * np.log((1 - error) / error)
            # Update sample weights
            weights *= np.exp(-alpha * y_transformed * pred_transformed)
            weights /= np.sum(weights)  # normalize
            self.estimators.append(stump)
            self.alphas.append(alpha)
        return self

    def predict(self, X):
        # Weighted vote
        predictions = np.zeros(len(X))
        for alpha, estimator in zip(self.alphas, self.estimators):
            pred = estimator.predict(X)
            pred_transformed = np.where(pred == 0, -1, 1)
            predictions += alpha * pred_transformed
        return np.where(predictions >= 0, 1, 0)

    def score(self, X, y):
        return np.mean(self.predict(X) == y)

# Test
ada_scratch = AdaBoostClassifierScratch(n_estimators=100)
ada_scratch.fit(X_train, y_train)
print(f"From-scratch AdaBoost accuracy: {ada_scratch.score(X_test, y_test):.4f}")
```
Gradient Boosted Decision Trees (GBDT)
Principle
Treat boosting as gradient descent in function space: each round fits a new tree to the negative gradient of the loss.
For regression with squared loss (with the conventional 1/2 factor): \(r_m = -\frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)} = y - F_{m-1}(x)\)
That is, each tree fits the residuals.
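As a quick numerical sanity check of this identity (added here for illustration), we can differentiate the squared loss by central differences and compare against the residuals:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=5)   # targets
F = rng.normal(size=5)   # current model outputs F_{m-1}(x)

def loss(F):
    # Squared-error loss with the conventional 1/2 factor
    return 0.5 * np.sum((y - F) ** 2)

# Numerical negative gradient via central differences
eps = 1e-6
neg_grad = np.array([
    -(loss(F + eps * np.eye(5)[i]) - loss(F - eps * np.eye(5)[i])) / (2 * eps)
    for i in range(5)
])

residuals = y - F
print(np.allclose(neg_grad, residuals, atol=1e-5))  # True
```

This is why "fit the negative gradient" and "fit the residuals" coincide for squared loss; for other losses (e.g. absolute error or log loss) the negative gradient is no longer the raw residual.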
```python
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Classification
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gbc.fit(X_train, y_train)
print(f"GBDT classification accuracy: {gbc.score(X_test, y_test):.4f}")

# Regression example
X_reg, y_reg = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)
gbr = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gbr.fit(X_train_r, y_train_r)
print(f"GBDT regression R²: {gbr.score(X_test_r, y_test_r):.4f}")
```
Implementing GBDT Regression from Scratch
```python
class GBDTRegressorScratch:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.init_prediction = None

    def fit(self, X, y):
        # Initial prediction: the target mean
        self.init_prediction = np.mean(y)
        predictions = np.full(len(y), self.init_prediction)
        for _ in range(self.n_estimators):
            # Residuals (the negative gradient of the squared loss)
            residuals = y - predictions
            # Fit a tree to the residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            # Update the predictions
            predictions += self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        predictions = np.full(len(X), self.init_prediction)
        for tree in self.trees:
            predictions += self.learning_rate * tree.predict(X)
        return predictions

    def score(self, X, y):
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

# Test
gbdt_scratch = GBDTRegressorScratch(n_estimators=100, learning_rate=0.1, max_depth=3)
gbdt_scratch.fit(X_train_r, y_train_r)
print(f"From-scratch GBDT regression R²: {gbdt_scratch.score(X_test_r, y_test_r):.4f}")
```
Visualizing the Residual-Fitting Process
```python
# 1-D regression visualization
np.random.seed(42)
X_1d = np.sort(np.random.rand(100) * 10).reshape(-1, 1)
y_1d = np.sin(X_1d.ravel()) + np.random.randn(100) * 0.2

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
X_plot = np.linspace(0, 10, 100).reshape(-1, 1)

init_pred = np.mean(y_1d)
predictions = np.full(len(X_1d), init_pred)
stages = [0, 1, 5, 10, 50, 100]
trees_list = []
for i in range(100):
    residuals = y_1d - predictions
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X_1d, residuals)
    predictions += 0.1 * tree.predict(X_1d)
    trees_list.append(tree)

for ax, stage in zip(axes.ravel(), stages):
    # Reconstruct the ensemble's prediction after `stage` rounds
    pred_plot = np.full(100, init_pred)
    for t in trees_list[:stage]:
        pred_plot += 0.1 * t.predict(X_plot)
    ax.scatter(X_1d, y_1d, c='blue', alpha=0.5, label='data')
    ax.plot(X_plot, pred_plot, 'r-', linewidth=2, label='prediction')
    ax.plot(X_plot, np.sin(X_plot), 'g--', alpha=0.5, label='ground truth')
    ax.set_title(f'After {stage} iterations')
    ax.legend()
plt.tight_layout()
plt.show()
```
XGBoost
Highlights
- Regularized objective function
- Efficient engineering (column blocks, cache-aware access)
- Sparse-data support
- Built-in cross-validation
- GPU support
```python
try:
    import xgboost as xgb

    # Classification (use_label_encoder was deprecated and later removed, so it is omitted here)
    xgb_clf = xgb.XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        min_child_weight=1,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0,    # L1 regularization
        reg_lambda=1,   # L2 regularization
        random_state=42,
        eval_metric='logloss'
    )
    xgb_clf.fit(X_train, y_train)
    print(f"XGBoost classification accuracy: {xgb_clf.score(X_test, y_test):.4f}")

    # Regression
    xgb_reg = xgb.XGBRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    xgb_reg.fit(X_train_r, y_train_r)
    print(f"XGBoost regression R²: {xgb_reg.score(X_test_r, y_test_r):.4f}")
except ImportError:
    print("XGBoost not installed, skipping")
```
XGBoost Feature Importances
```python
try:
    # Feature importances
    importances = xgb_clf.feature_importances_
    plt.figure(figsize=(10, 6))
    indices = np.argsort(importances)[::-1][:15]
    plt.barh(range(len(indices)), importances[indices])
    plt.yticks(range(len(indices)), [f'Feature {i}' for i in indices])
    plt.xlabel('Importance')
    plt.title('XGBoost feature importances')
    plt.tight_layout()
    plt.show()
except NameError:  # xgb_clf does not exist when XGBoost is unavailable
    pass
```
LightGBM
Highlights
- Histogram-based split finding (faster)
- Leaf-wise tree growth
- Native categorical-feature support
- Lower memory footprint
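The histogram trick above can be sketched in plain NumPy (this is an illustrative toy added here, not LightGBM's actual implementation or API; all names are made up): bucket a continuous feature into a small number of bins, then search splits over bin boundaries instead of raw feature values.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)   # one continuous feature
# Toy "gradients": a step at x = 0.3 plus noise
g = np.where(x > 0.3, 1.0, -1.0) + rng.normal(0, 0.5, size=x.size)

# Bucket the feature into 255 bins (255 is LightGBM's default max_bin)
n_bins = 255
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
bins = np.searchsorted(edges, x)  # per-sample bin index in [0, n_bins - 1]

# One pass accumulates per-bin gradient sums and counts ...
grad_hist = np.bincount(bins, weights=g, minlength=n_bins)
cnt_hist = np.bincount(bins, minlength=n_bins)

# ... then split search scans n_bins - 1 boundaries instead of 10,000 raw values
left_g, left_n = np.cumsum(grad_hist)[:-1], np.cumsum(cnt_hist)[:-1]
right_g, right_n = g.sum() - left_g, x.size - left_n
# Variance-gain-style score: (sum of gradients)^2 / count on each side
score = left_g**2 / np.maximum(left_n, 1) + right_g**2 / np.maximum(right_n, 1)
best_bin = int(np.argmax(score))
print(f"best split near x = {edges[best_bin]:.3f}")  # close to the true step at 0.3
```

The histogram reduces split search from O(#samples) to O(#bins) per feature and is also why LightGBM's memory footprint is low: each feature value is stored as a small bin index rather than a float.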
```python
try:
    import lightgbm as lgb

    # Classification
    lgb_clf = lgb.LGBMClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=-1,          # -1 means no depth limit
        num_leaves=31,         # number of leaves
        min_child_samples=20,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0,
        reg_lambda=0,
        random_state=42
    )
    lgb_clf.fit(X_train, y_train)
    print(f"LightGBM classification accuracy: {lgb_clf.score(X_test, y_test):.4f}")

    # Regression
    lgb_reg = lgb.LGBMRegressor(
        n_estimators=100,
        learning_rate=0.1,
        random_state=42
    )
    lgb_reg.fit(X_train_r, y_train_r)
    print(f"LightGBM regression R²: {lgb_reg.score(X_test_r, y_test_r):.4f}")
except ImportError:
    print("LightGBM not installed, skipping")
```
CatBoost
Highlights
- Native categorical-feature handling
- Symmetric (oblivious) trees
- Ordered boosting (reduces overfitting)
- Strong default hyperparameters
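The native categorical handling deserves a concrete sketch, since the post's other datasets are purely numeric. The following example (added for illustration; the synthetic `color`/`size` columns are invented) passes raw string categories straight to CatBoost via the `cat_features` argument, with no manual encoding:

```python
import numpy as np
import pandas as pd

acc = None  # stays None when catboost is unavailable
try:
    from catboost import CatBoostClassifier

    rng = np.random.default_rng(42)
    n = 2000
    color = rng.choice(['red', 'green', 'blue'], size=n)  # categorical feature
    size = rng.normal(size=n)                             # numeric feature
    # The label depends on the raw category string
    y_cat = ((color == 'red') ^ (size > 0)).astype(int)
    df = pd.DataFrame({'color': color, 'size': size})

    clf = CatBoostClassifier(n_estimators=50, verbose=False, random_state=42)
    # cat_features tells CatBoost which columns to treat as categorical
    clf.fit(df, y_cat, cat_features=['color'])
    acc = clf.score(df, y_cat)
    print(f"CatBoost on raw string categories, train accuracy: {acc:.3f}")
except ImportError:
    print("catboost not installed, skipping")
```

With XGBoost or plain sklearn GBDT, the `color` column would first need one-hot or ordinal encoding; CatBoost consumes it directly and applies its ordered target statistics internally.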
```python
try:
    from catboost import CatBoostClassifier, CatBoostRegressor

    # Classification
    cat_clf = CatBoostClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        verbose=False
    )
    cat_clf.fit(X_train, y_train)
    print(f"CatBoost classification accuracy: {cat_clf.score(X_test, y_test):.4f}")

    # Regression
    cat_reg = CatBoostRegressor(
        n_estimators=100,
        learning_rate=0.1,
        random_state=42,
        verbose=False
    )
    cat_reg.fit(X_train_r, y_train_r)
    print(f"CatBoost regression R²: {cat_reg.score(X_test_r, y_test_r):.4f}")
except ImportError:
    print("CatBoost not installed, skipping")
```
Comparing the Three Frameworks
| Property | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree growth | Level-wise | Leaf-wise | Symmetric |
| Categorical features | Require encoding | Native support | Native support |
| Speed | Medium | Fastest | Medium |
| Memory | Medium | Lowest | Medium |
| Out-of-the-box quality | Needs tuning | Needs tuning | Works well by default |
```python
import time
import pandas as pd

# Benchmark whichever libraries are installed
results = []
models = {
    'sklearn GBDT': GradientBoostingClassifier(n_estimators=100, random_state=42)
}
try:
    import xgboost as xgb
    models['XGBoost'] = xgb.XGBClassifier(n_estimators=100, random_state=42,
                                          eval_metric='logloss')
except ImportError:
    pass
try:
    import lightgbm as lgb
    models['LightGBM'] = lgb.LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
except ImportError:
    pass
try:
    from catboost import CatBoostClassifier
    models['CatBoost'] = CatBoostClassifier(n_estimators=100, random_state=42, verbose=False)
except ImportError:
    pass

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    start = time.time()
    accuracy = model.score(X_test, y_test)
    pred_time = time.time() - start
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Train Time': train_time,
        'Predict Time': pred_time
    })
    print(f"{name}: Acc={accuracy:.4f}, Train={train_time:.3f}s")

df_results = pd.DataFrame(results)
print(df_results)
```
Hyperparameter Tuning
```python
from sklearn.model_selection import RandomizedSearchCV

# XGBoost tuning example
try:
    import xgboost as xgb

    param_grid = {
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [50, 100, 200],
        'min_child_weight': [1, 3, 5],
        'subsample': [0.7, 0.8, 0.9]
    }
    xgb_search = xgb.XGBClassifier(random_state=42, eval_metric='logloss')

    # Randomized search saves time compared with an exhaustive grid
    random_search = RandomizedSearchCV(
        xgb_search,
        param_grid,
        n_iter=20,
        cv=3,
        scoring='accuracy',
        random_state=42,
        n_jobs=-1
    )
    random_search.fit(X_train, y_train)
    print(f"Best parameters: {random_search.best_params_}")
    print(f"Best CV score: {random_search.best_score_:.4f}")
    print(f"Test score: {random_search.score(X_test, y_test):.4f}")
except ImportError:
    print("XGBoost not installed, skipping")
```
Early Stopping
```python
try:
    import xgboost as xgb

    # XGBoost early stopping
    xgb_early = xgb.XGBClassifier(
        n_estimators=1000,
        learning_rate=0.1,
        random_state=42,
        eval_metric='logloss',
        early_stopping_rounds=10
    )
    xgb_early.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        verbose=False
    )
    print(f"Best iteration: {xgb_early.best_iteration}")
    print(f"Test accuracy: {xgb_early.score(X_test, y_test):.4f}")
except ImportError:
    print("XGBoost not installed, skipping")
```
Hands-On: Kaggle-Style Data
```python
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load data
housing = fetch_california_housing()
X_house, y_house = housing.data, housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_house, y_house, test_size=0.2, random_state=42)

# Standardize
scaler = StandardScaler()
X_train_h = scaler.fit_transform(X_train_h)
X_test_h = scaler.transform(X_test_h)

# Compare methods
regressors = {
    'GBDT': GradientBoostingRegressor(n_estimators=100, random_state=42)
}
try:
    import xgboost as xgb
    regressors['XGBoost'] = xgb.XGBRegressor(n_estimators=100, random_state=42)
except ImportError:
    pass
try:
    import lightgbm as lgb
    regressors['LightGBM'] = lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1)
except ImportError:
    pass

print("California housing results:")
for name, reg in regressors.items():
    reg.fit(X_train_h, y_train_h)
    train_r2 = reg.score(X_train_h, y_train_h)
    test_r2 = reg.score(X_test_h, y_test_h)
    rmse = np.sqrt(mean_squared_error(y_test_h, reg.predict(X_test_h)))
    print(f"  {name}: Train R²={train_r2:.4f}, Test R²={test_r2:.4f}, RMSE={rmse:.4f}")
```
FAQ
Q1: GBDT vs. Random Forest?
| Property | Random Forest | GBDT |
|---|---|---|
| Training | Parallel | Sequential |
| Overfitting risk | Low | Higher |
| Training speed | Fast | Slow |
| Typical performance | Stable | Higher accuracy |
Q2: How to choose between XGBoost, LightGBM, and CatBoost?
- Large datasets: LightGBM (fastest)
- Many categorical features: CatBoost
- Need maximum stability: XGBoost
- Quick prototyping: CatBoost (good defaults)
Q3: How are the learning rate and the number of trees related?
- Smaller learning rate → more trees needed
- Larger learning rate → easier to overfit
- Common starting point: learning_rate=0.1, n_estimators=100-1000
Q4: How to prevent overfitting?
- Lower the learning rate and add more trees
- Limit tree depth
- Use subsampling (subsample)
- Increase regularization (reg_alpha, reg_lambda)
- Use early stopping
Summary
| Concept | Description |
|---|---|
| Boosting | Sequential training; each round corrects the previous round's errors |
| AdaBoost | Reweights samples |
| GBDT | Fits residuals (the negative gradient) |
| XGBoost | Regularization + engineering optimizations |
| LightGBM | Histograms + leaf-wise growth |
| CatBoost | Categorical features + symmetric trees |
References
- Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine."
- Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System."
- Ke, G., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree."
- Prokhorenkova, L., et al. (2018). "CatBoost: Unbiased Boosting with Categorical Features."
Copyright: Unless stated otherwise, this article is copyrighted by sshipanoo; please include a link to this article when reposting.
(Licensed under CC BY-NC-SA 4.0)
Title: "Machine Learning Basics Series: Gradient Boosting"
Link: http://localhost:3015/ai/%E6%A2%AF%E5%BA%A6%E6%8F%90%E5%8D%87.html