XGBoost, LightGBM, and CatBoost

Preface

Gradient boosting is another powerful ensemble method. Unlike bagging, which trains its base learners in parallel, boosting trains them sequentially: each new model focuses on correcting the errors of the previous ones.


Boosting Basics

Bagging vs. Boosting

Feature          Bagging        Boosting
Training         Parallel       Sequential
Sample weights   Uniform        Dynamically adjusted
Base learners    Independent    Dependent on one another
Reduces          Variance       Bias

The Core Idea of Boosting

\[F_m(x) = F_{m-1}(x) + \alpha_m h_m(x)\]
  • $F_m(x)$: the ensemble model after round m
  • $h_m(x)$: the m-th base learner
  • $\alpha_m$: the weight (learning rate) of the m-th learner
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error

np.random.seed(42)

AdaBoost

Principle

Adaptive Boosting: sample weights are adjusted according to how often each sample is misclassified, so later learners focus on the hard cases.

from sklearn.ensemble import AdaBoostClassifier

# Generate data
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# AdaBoost
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner (decision stump)
    n_estimators=100,
    learning_rate=1.0,
    random_state=42
)
ada.fit(X_train, y_train)

print(f"AdaBoost accuracy: {ada.score(X_test, y_test):.4f}")

# Inspect test accuracy after each boosting round
staged_scores = list(ada.staged_score(X_test, y_test))

plt.figure(figsize=(10, 5))
plt.plot(range(1, len(staged_scores)+1), staged_scores, 'b-')
plt.xlabel('Boosting rounds')
plt.ylabel('Test accuracy')
plt.title('AdaBoost learning curve')
plt.grid(True, alpha=0.3)
plt.show()

AdaBoost from Scratch

class AdaBoostClassifierScratch:
    def __init__(self, n_estimators=50, learning_rate=1.0):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.estimators = []
        self.alphas = []
    
    def fit(self, X, y):
        n_samples = len(X)
        # Map labels {0, 1} to {-1, +1}
        y_transformed = np.where(y == 0, -1, 1)
        
        # Initialize uniform sample weights
        weights = np.ones(n_samples) / n_samples
        
        for _ in range(self.n_estimators):
            # Train a weak learner (decision stump)
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=weights)
            
            # Predict on the training set
            predictions = stump.predict(X)
            pred_transformed = np.where(predictions == 0, -1, 1)
            
            # Weighted error rate
            incorrect = (pred_transformed != y_transformed)
            error = np.sum(weights * incorrect) / np.sum(weights)
            
            # Guard against division by zero and log(0)
            error = np.clip(error, 1e-10, 1 - 1e-10)
            
            # Weak learner weight
            alpha = self.learning_rate * 0.5 * np.log((1 - error) / error)
            
            # Up-weight misclassified samples, down-weight correct ones
            weights *= np.exp(-alpha * y_transformed * pred_transformed)
            weights /= np.sum(weights)  # normalize
            
            self.estimators.append(stump)
            self.alphas.append(alpha)
        
        return self
    
    def predict(self, X):
        # Weighted vote over all weak learners
        predictions = np.zeros(len(X))
        
        for alpha, estimator in zip(self.alphas, self.estimators):
            pred = estimator.predict(X)
            pred_transformed = np.where(pred == 0, -1, 1)
            predictions += alpha * pred_transformed
        
        return np.where(predictions >= 0, 1, 0)
    
    def score(self, X, y):
        return np.mean(self.predict(X) == y)

# Test the scratch implementation
ada_scratch = AdaBoostClassifierScratch(n_estimators=100)
ada_scratch.fit(X_train, y_train)
print(f"Scratch AdaBoost accuracy: {ada_scratch.score(X_test, y_test):.4f}")

Gradient Boosted Decision Trees (GBDT)

Principle

Boosting can be viewed as gradient descent in function space: each round fits a new learner to the negative gradient of the loss function.

For regression with MSE loss: \(r_m = -\frac{\partial L(y, F_{m-1}(x))}{\partial F_{m-1}(x)} = y - F_{m-1}(x)\)

That is, each tree simply fits the residuals.
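Before reaching for a library, the residual idea can be checked in a few lines of plain NumPy: start from the mean prediction, repeatedly fit a crude best-single-split "stump" to the residuals, and watch the MSE fall round by round. The `fit_stump` helper here is a hypothetical stand-in for a depth-1 regression tree, not any library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)

def fit_stump(x, residuals):
    """Best single-split piecewise-constant predictor (a depth-1 'tree')."""
    best = None
    for t in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = residuals[x <= t].mean(), residuals[x > t].mean()
        loss = np.mean((residuals - np.where(x <= t, left, right)) ** 2)
        if best is None or loss < best[0]:
            best = (loss, t, left, right)
    _, t, left, right = best
    return lambda xs: np.where(xs <= t, left, right)

lr = 0.5
F = np.full_like(y, y.mean())          # F_0: constant initial model
losses = [np.mean((y - F) ** 2)]
for _ in range(20):
    residuals = y - F                  # negative gradient of the MSE loss
    stump = fit_stump(x, residuals)
    F = F + lr * stump(x)              # F_m = F_{m-1} + lr * h_m
    losses.append(np.mean((y - F) ** 2))

print(f"MSE: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Each round removes part of the remaining error, which is exactly the additive-model update $F_m = F_{m-1} + \alpha_m h_m$ from above.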

from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

# Classification
gbc = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gbc.fit(X_train, y_train)
print(f"GBDT classification accuracy: {gbc.score(X_test, y_test):.4f}")

# Regression example
X_reg, y_reg = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=42)
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_reg, y_reg, test_size=0.2, random_state=42)

gbr = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gbr.fit(X_train_r, y_train_r)
print(f"GBDT regression R²: {gbr.score(X_test_r, y_test_r):.4f}")

GBDT Regression from Scratch

class GBDTRegressorScratch:
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.trees = []
        self.init_prediction = None
    
    def fit(self, X, y):
        # Initial prediction: the mean of y
        self.init_prediction = np.mean(y)
        predictions = np.full(len(y), self.init_prediction)
        
        for _ in range(self.n_estimators):
            # Compute residuals (the negative gradient of MSE)
            residuals = y - predictions
            
            # Fit a tree to the residuals
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            
            # Update the running predictions
            predictions += self.learning_rate * tree.predict(X)
            
            self.trees.append(tree)
        
        return self
    
    def predict(self, X):
        predictions = np.full(len(X), self.init_prediction)
        
        for tree in self.trees:
            predictions += self.learning_rate * tree.predict(X)
        
        return predictions
    
    def score(self, X, y):
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred) ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        return 1 - ss_res / ss_tot

# Test the scratch implementation
gbdt_scratch = GBDTRegressorScratch(n_estimators=100, learning_rate=0.1, max_depth=3)
gbdt_scratch.fit(X_train_r, y_train_r)
print(f"Scratch GBDT regression R²: {gbdt_scratch.score(X_test_r, y_test_r):.4f}")

Visualizing the Residual-Fitting Process

# 1D regression visualization
np.random.seed(42)
X_1d = np.sort(np.random.rand(100) * 10).reshape(-1, 1)
y_1d = np.sin(X_1d.ravel()) + np.random.randn(100) * 0.2

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

X_plot = np.linspace(0, 10, 100).reshape(-1, 1)
predictions = np.zeros(len(X_1d))
init_pred = np.mean(y_1d)
predictions[:] = init_pred

stages = [0, 1, 5, 10, 50, 100]
trees_list = []

for i in range(100):
    residuals = y_1d - predictions
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X_1d, residuals)
    predictions += 0.1 * tree.predict(X_1d)
    trees_list.append(tree)

for ax, stage in zip(axes.ravel(), stages):
    if stage == 0:
        pred_plot = np.full(100, init_pred)
    else:
        pred_plot = np.full(100, init_pred)
        for t in trees_list[:stage]:
            pred_plot += 0.1 * t.predict(X_plot)
    
    ax.scatter(X_1d, y_1d, c='blue', alpha=0.5, label='data')
    ax.plot(X_plot, pred_plot, 'r-', linewidth=2, label='prediction')
    ax.plot(X_plot, np.sin(X_plot), 'g--', alpha=0.5, label='true function')
    ax.set_title(f'Iteration {stage}')
    ax.legend()

plt.tight_layout()
plt.show()

XGBoost

Features

  • Regularized objective function
  • Efficient implementation (column blocks, cache-aware access)
  • Sparse-data support
  • Built-in cross-validation
  • GPU support
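The first bullet refers to the regularized objective from the XGBoost paper (Chen & Guestrin, 2016): the usual training loss plus a complexity penalty on every tree,

```latex
\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k),
\qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2
```

where $T$ is the number of leaves and $w$ the vector of leaf weights; $\gamma$ and $\lambda$ correspond to XGBoost's `gamma` and `reg_lambda` parameters.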
try:
    import xgboost as xgb
    
    # Classification
    # (note: use_label_encoder was deprecated and then removed in XGBoost 2.0)
    xgb_clf = xgb.XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        min_child_weight=1,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0,         # L1 regularization
        reg_lambda=1,        # L2 regularization
        random_state=42,
        eval_metric='logloss'
    )
    xgb_clf.fit(X_train, y_train)
    print(f"XGBoost classification accuracy: {xgb_clf.score(X_test, y_test):.4f}")
    
    # Regression
    xgb_reg = xgb.XGBRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    xgb_reg.fit(X_train_r, y_train_r)
    print(f"XGBoost regression R²: {xgb_reg.score(X_test_r, y_test_r):.4f}")
    
except ImportError:
    print("XGBoost not installed, skipping")

XGBoost Feature Importance

try:
    # Feature importances from the trained classifier
    importances = xgb_clf.feature_importances_
    
    plt.figure(figsize=(10, 6))
    indices = np.argsort(importances)[::-1][:15]
    plt.barh(range(len(indices)), importances[indices])
    plt.yticks(range(len(indices)), [f'Feature {i}' for i in indices])
    plt.xlabel('Importance')
    plt.title('XGBoost feature importance')
    plt.tight_layout()
    plt.show()
except NameError:
    # xgb_clf does not exist if XGBoost is not installed
    pass

LightGBM

Features

  • Histogram-based split finding (faster)
  • Leaf-wise tree growth
  • Native categorical feature support
  • Lower memory usage
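The histogram idea behind the first bullet can be sketched in plain NumPy: continuous feature values are bucketed into a fixed number of bins once, after which split finding only scans bin boundaries instead of every unique value. The bin count of 255 mirrors LightGBM's `max_bin` default, but everything else here is illustrative, not LightGBM's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
feature = rng.normal(size=10_000)
gradients = rng.normal(size=10_000)   # per-sample gradients from the current model

# Build the histogram once: bin edges from quantiles of the feature
n_bins = 255
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
bin_idx = np.searchsorted(edges, feature)      # each sample -> bin id in [0, n_bins-1]

# Accumulate gradient sums and sample counts per bin
grad_hist = np.bincount(bin_idx, weights=gradients, minlength=n_bins)
count_hist = np.bincount(bin_idx, minlength=n_bins)

# Split evaluation now needs one pass over 255 bins, not 10,000 samples
left_grad = np.cumsum(grad_hist)[:-1]
left_cnt = np.cumsum(count_hist)[:-1]
total_grad, total_cnt = gradients.sum(), len(feature)
gain = (left_grad**2 / np.maximum(left_cnt, 1)
        + (total_grad - left_grad)**2 / np.maximum(total_cnt - left_cnt, 1))
best_bin = int(np.argmax(gain))
print(f"Best split after bin {best_bin}")
```

The gain formula is a simplified squared-gradient score; real gradient boosting libraries use a second-order version with Hessians and regularization, but the memory and speed win of scanning bins instead of samples is the same.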
try:
    import lightgbm as lgb
    
    # Classification
    lgb_clf = lgb.LGBMClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=-1,         # -1 means no depth limit
        num_leaves=31,        # maximum number of leaves
        min_child_samples=20,
        subsample=0.8,
        colsample_bytree=0.8,
        reg_alpha=0,
        reg_lambda=0,
        random_state=42
    )
    lgb_clf.fit(X_train, y_train)
    print(f"LightGBM classification accuracy: {lgb_clf.score(X_test, y_test):.4f}")
    
    # Regression
    lgb_reg = lgb.LGBMRegressor(
        n_estimators=100,
        learning_rate=0.1,
        random_state=42
    )
    lgb_reg.fit(X_train_r, y_train_r)
    print(f"LightGBM regression R²: {lgb_reg.score(X_test_r, y_test_r):.4f}")
    
except ImportError:
    print("LightGBM not installed, skipping")

CatBoost

Features

  • Native categorical feature support
  • Symmetric (oblivious) trees
  • Ordered boosting (reduces overfitting)
  • Strong default parameters
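The ordered-boosting bullet can be illustrated with the related trick CatBoost uses for categorical features, ordered target statistics: each sample's category is encoded using only the target values of samples that come *before* it in a random permutation, so the encoding never leaks the sample's own label. A rough NumPy sketch, with the smoothing prior and permutation scheme simplified from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
categories = rng.integers(0, 3, size=12)       # a categorical feature with 3 levels
target = rng.integers(0, 2, size=12).astype(float)

prior = target.mean()                          # global prior used for smoothing
perm = rng.permutation(len(target))            # random ordering of the samples

encoded = np.empty(len(target))
sums = np.zeros(3)      # running target sum per category
counts = np.zeros(3)    # running sample count per category
for i in perm:
    c = categories[i]
    # Encode with statistics of *earlier* samples only -> no target leakage
    encoded[i] = (sums[c] + prior) / (counts[c] + 1.0)
    sums[c] += target[i]
    counts[c] += 1

print(np.round(encoded, 3))
```

A plain target-mean encoding computed on the full dataset would include each sample's own label in its encoding, which is exactly the leakage this ordering avoids.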
try:
    from catboost import CatBoostClassifier, CatBoostRegressor
    
    # Classification
    cat_clf = CatBoostClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        verbose=False
    )
    cat_clf.fit(X_train, y_train)
    print(f"CatBoost classification accuracy: {cat_clf.score(X_test, y_test):.4f}")
    
    # Regression
    cat_reg = CatBoostRegressor(
        n_estimators=100,
        learning_rate=0.1,
        random_state=42,
        verbose=False
    )
    cat_reg.fit(X_train_r, y_train_r)
    print(f"CatBoost regression R²: {cat_reg.score(X_test_r, y_test_r):.4f}")
    
except ImportError:
    print("CatBoost not installed, skipping")

Comparison of the Three Frameworks

Feature               XGBoost         LightGBM         CatBoost
Tree growth           Level-wise      Leaf-wise        Symmetric
Categorical features  Need encoding   Native support   Native support
Speed                 Medium          Fastest          Medium
Memory                Medium          Lowest           Medium
Default performance   Needs tuning    Needs tuning     Works out of the box
import time

# Benchmark whichever libraries are installed
results = []

models = {
    'sklearn GBDT': GradientBoostingClassifier(n_estimators=100, random_state=42)
}

try:
    import xgboost as xgb
    models['XGBoost'] = xgb.XGBClassifier(n_estimators=100, random_state=42,
                                          eval_metric='logloss')
except ImportError:
    pass

try:
    import lightgbm as lgb
    models['LightGBM'] = lgb.LGBMClassifier(n_estimators=100, random_state=42, verbose=-1)
except ImportError:
    pass

try:
    from catboost import CatBoostClassifier
    models['CatBoost'] = CatBoostClassifier(n_estimators=100, random_state=42, verbose=False)
except ImportError:
    pass

for name, model in models.items():
    start = time.time()
    model.fit(X_train, y_train)
    train_time = time.time() - start
    
    start = time.time()
    accuracy = model.score(X_test, y_test)
    pred_time = time.time() - start
    
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Train Time': train_time,
        'Predict Time': pred_time
    })
    print(f"{name}: Acc={accuracy:.4f}, Train={train_time:.3f}s")

import pandas as pd
df_results = pd.DataFrame(results)
print(df_results)

Hyperparameter Tuning

from sklearn.model_selection import RandomizedSearchCV

# XGBoost hyperparameter tuning example
try:
    param_grid = {
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2],
        'n_estimators': [50, 100, 200],
        'min_child_weight': [1, 3, 5],
        'subsample': [0.7, 0.8, 0.9]
    }
    
    xgb_search = xgb.XGBClassifier(random_state=42, eval_metric='logloss')
    
    # Randomized search over the grid saves time compared to an exhaustive search
    
    random_search = RandomizedSearchCV(
        xgb_search,
        param_grid,
        n_iter=20,
        cv=3,
        scoring='accuracy',
        random_state=42,
        n_jobs=-1
    )
    
    random_search.fit(X_train, y_train)
    
    print(f"Best parameters: {random_search.best_params_}")
    print(f"Best CV score: {random_search.best_score_:.4f}")
    print(f"Test score: {random_search.score(X_test, y_test):.4f}")
except NameError:
    # xgb is undefined when XGBoost is not installed
    pass

Early Stopping

try:
    # XGBoost early stopping on a held-out evaluation set
    xgb_early = xgb.XGBClassifier(
        n_estimators=1000,
        learning_rate=0.1,
        random_state=42,
        eval_metric='logloss',
        early_stopping_rounds=10
    )
    
    xgb_early.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        verbose=False
    )
    
    print(f"Best iteration: {xgb_early.best_iteration}")
    print(f"Test accuracy: {xgb_early.score(X_test, y_test):.4f}")
except NameError:
    # xgb is undefined when XGBoost is not installed
    pass

Hands-On: A Kaggle-Style Dataset

from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

# Load the data
housing = fetch_california_housing()
X_house, y_house = housing.data, housing.target

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_house, y_house, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_h = scaler.fit_transform(X_train_h)
X_test_h = scaler.transform(X_test_h)

# Compare different regressors
regressors = {
    'GBDT': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

try:
    import xgboost as xgb
    regressors['XGBoost'] = xgb.XGBRegressor(n_estimators=100, random_state=42)
except ImportError:
    pass

try:
    import lightgbm as lgb
    regressors['LightGBM'] = lgb.LGBMRegressor(n_estimators=100, random_state=42, verbose=-1)
except ImportError:
    pass

print("House price prediction results:")
for name, reg in regressors.items():
    reg.fit(X_train_h, y_train_h)
    train_r2 = reg.score(X_train_h, y_train_h)
    test_r2 = reg.score(X_test_h, y_test_h)
    rmse = np.sqrt(mean_squared_error(y_test_h, reg.predict(X_test_h)))
    print(f"  {name}: Train R²={train_r2:.4f}, Test R²={test_r2:.4f}, RMSE={rmse:.4f}")

Frequently Asked Questions

Q1: GBDT vs. Random Forest?

Feature            Random Forest   GBDT
Training           Parallel        Sequential
Overfitting risk   Lower           Higher
Training speed     Faster          Slower
Typical result     Stable          Higher accuracy

Q2: How to choose between XGBoost, LightGBM, and CatBoost?

  • Large datasets: LightGBM (fastest)
  • Many categorical features: CatBoost
  • Need stability and maturity: XGBoost
  • Quick prototyping: CatBoost (good defaults)

Q3: How do the learning rate and the number of trees interact?

  • Smaller learning rate → more trees needed
  • Larger learning rate → easier to overfit
  • Common choice: learning_rate=0.1, n_estimators=100–1000

Q4: How to prevent overfitting?

  • Lower the learning rate and increase the number of trees
  • Limit tree depth
  • Use subsampling (subsample)
  • Add regularization (reg_alpha, reg_lambda)
  • Use early stopping
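Several of these knobs are available directly in scikit-learn's GradientBoostingClassifier; a minimal sketch combining them (the specific values are illustrative, not tuned):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

gb = GradientBoostingClassifier(
    learning_rate=0.05,        # lower learning rate ...
    n_estimators=500,          # ... paired with a larger tree budget
    max_depth=3,               # limit tree depth
    subsample=0.8,             # row subsampling (stochastic gradient boosting)
    validation_fraction=0.1,   # hold out 10% of the training data ...
    n_iter_no_change=10,       # ... and stop early when it stops improving
    random_state=42,
)
gb.fit(X, y)
print(f"Trees actually fitted: {gb.n_estimators_}")
```

With `n_iter_no_change` set, `n_estimators` acts as an upper bound rather than a fixed count; `gb.n_estimators_` reports how many trees were actually fitted before early stopping kicked in.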

Summary

Concept    Description
Boosting   Sequential training; each round corrects the previous round's errors
AdaBoost   Re-weights samples
GBDT       Fits residuals (negative gradients)
XGBoost    Regularization + engineering optimizations
LightGBM   Histograms + leaf-wise growth
CatBoost   Categorical features + symmetric trees

References

  • Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine."
  • Chen, T., & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System."
  • Ke, G., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree."
  • Prokhorenkova, L., et al. (2018). "CatBoost: Unbiased Boosting with Categorical Features."

Copyright notice: Unless otherwise stated, the copyright of this article belongs to sshipanoo. Please include a link to this article when reposting.

(Licensed under CC BY-NC-SA 4.0)

Title: 机器学习基础系列——梯度提升 (Machine Learning Basics Series: Gradient Boosting)

Link: http://localhost:3015/ai/%E6%A2%AF%E5%BA%A6%E6%8F%90%E5%8D%87.html
