Paper Title
LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Basic algorithm and main steps
Basic ideas
It introduces a novel GBDT algorithm to reduct features and data, including :
-
Gradient-based One-Side Sampling (GOSS) : Retainde some small gradient samples randomly while preserving large gradient samples.
-
Exclusive Feature Bundling (EFB) : merge sparse features with little conflict to reduct feature deminision.
Main Agorithm
-
GOSS
- Sort the traning data descendly based on the gradient of the data, and retain the top A th set.
- Ramdomly sample in the U\A set.
-
EFB
-
Construct a weighted undirected graph with vertices as features and edges as weights.Sort the nodes by the degree. The greater the degree, the greater the conflict with other features.
-
Improve the first step: sort the features by the number of non-zero values.
-
Calculate the conflict ratio between different features
-
Iterate through each feature and merge the features to minimize the conflict ratio. (need to add constant bia)
?
Motivation
- Conventional GBDT models need to scan all the data instances for every feature to estimate the information gain of all the possible split points, which is in high complexity.
- Pre-sort, one of the method to find the best split points, is inefficient in both training speed and memory consumption.
- Histogram-based algorithm, another main method, does not have efficient sparse optimization solutions
Contribution
- Proposed a novel GBDT algorithm called LightGBM, containing Gradient-based One-Side Sampling and Exclusive Feature Bundling, which bring a significant speed-up for large GBDT training process.
- LightGBM can significantly outperform XGBoost and SGB in terms of computational speed and memory consumption
My own idea
- Relations to what I had read.
- Shortcomings I assume: the LightGBM may be sensitive to noisy points since the EFB is based on bias.
- Potential change I assume: explore the process of find the optimal selection of a and b in Gradient-based One-Side Sampling; generate the model to non-sparse data; paralle the LightGBM.
【DM论文阅读杂记】LightGBM
原文:https://www.cnblogs.com/chensy86/p/13623704.html