Comparing randomized search and grid search for hyperparameter estimation
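This example compares randomized search and grid search for optimizing the hyperparameters of a linear SVM trained with SGD. All parameters that influence learning are searched simultaneously (except for the number of estimators, which is left out because it poses a trade-off between training time and quality).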
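The randomized search and the grid search explore exactly the same parameter space. The results for the selected parameter settings are quite similar (best mean validation scores of 0.920 and 0.931 in the run recorded below), while the run time of the randomized search is drastically lower: about 29 seconds for 20 candidates versus about 151 seconds for 100 candidates.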
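To make the difference in search cost concrete, the following minimal sketch counts the candidates each strategy would evaluate. It reuses the parameter ranges from the script on this page; `ParameterGrid` and `ParameterSampler` are used here only for counting, `scipy.stats.loguniform` assumes SciPy 1.4 or later, and the `random_state=0` seed is an arbitrary choice for reproducibility.

```python
import numpy as np
import scipy.stats as stats
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Discrete grid: every combination is evaluated.
param_grid = {
    "average": [True, False],
    "l1_ratio": np.linspace(0, 1, num=10),
    "alpha": np.power(10, np.arange(-4, 1, dtype=float)),
}

# Distributions over the same ranges: only n_iter draws are evaluated.
param_dist = {
    "average": [True, False],
    "l1_ratio": stats.uniform(0, 1),
    "alpha": stats.loguniform(1e-4, 1e0),
}

n_grid = len(ParameterGrid(param_grid))  # 2 * 10 * 5 combinations
sampled = list(ParameterSampler(param_dist, n_iter=20, random_state=0))
n_random = len(sampled)

print("grid candidates:", n_grid)
print("random candidates:", n_random)
```

The grid size is the product of the list lengths, so it grows multiplicatively with every added parameter, whereas the randomized budget stays at whatever `n_iter` is set to.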
The randomized search may perform slightly worse, but this is likely a noise effect and would not carry over to a held-out test set.
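One way to check the held-out claim is to set aside a test split before searching and score both winners on it. The sketch below is an illustration under assumptions that are not part of the original script: a 25% stratified test split, a much smaller search (5 random draws, a 3 by 3 grid, 3-fold CV) to keep it quick, and fixed `random_state` values. It is not meant to reproduce the numbers on this page.

```python
from scipy.stats import loguniform, uniform
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     train_test_split)

X, y = load_digits(return_X_y=True)

# Assumption: hold out 25% of the data; the original script searches on all of it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

clf = SGDClassifier(loss="hinge", penalty="elasticnet", random_state=0)

# Deliberately small searches so the sketch runs in a few seconds.
random_search = RandomizedSearchCV(
    clf,
    param_distributions={"alpha": loguniform(1e-4, 1e0),
                         "l1_ratio": uniform(0, 1)},
    n_iter=5, cv=3, random_state=0)
grid_search = GridSearchCV(
    clf,
    param_grid={"alpha": [1e-4, 1e-2, 1e0], "l1_ratio": [0.0, 0.5, 1.0]},
    cv=3)

scores = {}
for name, search in [("random", random_search), ("grid", grid_search)]:
    search.fit(X_train, y_train)
    # best_score_ is the mean CV score; score() evaluates the refit best model.
    scores[name] = (search.best_score_, search.score(X_test, y_test))
    print("%s: CV %.3f, held-out %.3f" % ((name,) + scores[name]))
```

If the gap between the two searches on the held-out split is smaller than the cross-validation standard deviations, that would be consistent with the noise explanation.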
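Note that in practice one would not search over this many parameters simultaneously with grid search, but would pick only those deemed most important.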
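As a rough illustration of such a restricted search, the sketch below grids over `alpha` only. Treating `alpha` as the most important parameter, fixing `average=True` and `l1_ratio=0.15` by hand, and using 3-fold CV are all assumptions made for this example; the page itself does not say which parameters matter most.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)

# Assumption: 'average' and 'l1_ratio' are fixed by hand instead of searched.
clf = SGDClassifier(loss="hinge", penalty="elasticnet",
                    average=True, l1_ratio=0.15, random_state=0)

# Only the regularization strength is searched: five powers of ten.
alpha_grid = {"alpha": np.power(10, np.arange(-4, 1, dtype=float))}

alpha_search = GridSearchCV(clf, param_grid=alpha_grid, cv=3)
alpha_search.fit(X, y)

n_candidates = len(alpha_search.cv_results_["params"])
print("candidates evaluated:", n_candidates)
print("best alpha:", alpha_search.best_params_["alpha"])
```

Dropping two of the three searched parameters reduces the grid from 100 candidates to 5, which is the kind of saving the note above has in mind.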
Output:
RandomizedSearchCV took 29.15 seconds for 20 candidates parameter settings.
Model with rank: 1
Mean validation score: 0.920 (std: 0.028)
Parameters: {'alpha': 0.07316411520495676, 'average': False, 'l1_ratio': 0.29007760721044407}
Model with rank: 2
Mean validation score: 0.920 (std: 0.029)
Parameters: {'alpha': 0.0005223493320259539, 'average': True, 'l1_ratio': 0.7936977033574206}
Model with rank: 3
Mean validation score: 0.918 (std: 0.031)
Parameters: {'alpha': 0.00025790124268693137, 'average': True, 'l1_ratio': 0.5699649107012649}
GridSearchCV took 150.68 seconds for 100 candidate parameter settings.
Model with rank: 1
Mean validation score: 0.931 (std: 0.026)
Parameters: {'alpha': 0.0001, 'average': True, 'l1_ratio': 0.0}
Model with rank: 2
Mean validation score: 0.928 (std: 0.030)
Parameters: {'alpha': 0.0001, 'average': True, 'l1_ratio': 0.1111111111111111}
Model with rank: 3
Mean validation score: 0.927 (std: 0.026)
Parameters: {'alpha': 0.0001, 'average': True, 'l1_ratio': 0.5555555555555556}
Input:
print(__doc__)
import numpy as np
from time import time
import scipy.stats as stats
# loguniform ships with SciPy >= 1.4, so the scikit-learn backport is not needed
from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier

# Get some data
X, y = load_digits(return_X_y=True)

# Build a classifier
clf = SGDClassifier(loss='hinge', penalty='elasticnet',
                    fit_intercept=True)

# Utility function to report the best scores
def report(results, n_top=3):
    for i in range(1, n_top + 1):
        # Indices of all candidates that share rank i (ties are possible)
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})"
                  .format(results['mean_test_score'][candidate],
                          results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

# Specify the parameters and the distributions to sample from
param_dist = {'average': [True, False],
              'l1_ratio': stats.uniform(0, 1),
              'alpha': loguniform(1e-4, 1e0)}

# Run the randomized search
n_iter_search = 20
random_search = RandomizedSearchCV(clf, param_distributions=param_dist,
                                   n_iter=n_iter_search)

start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidates"
      " parameter settings." % ((time() - start), n_iter_search))
report(random_search.cv_results_)

# Use a full grid over all parameters
param_grid = {'average': [True, False],
              'l1_ratio': np.linspace(0, 1, num=10),
              'alpha': np.power(10, np.arange(-4, 1, dtype=float))}

# Run the grid search
grid_search = GridSearchCV(clf, param_grid=param_grid)
start = time()
grid_search.fit(X, y)

print("GridSearchCV took %.2f seconds for %d candidate parameter settings."
      % (time() - start, len(grid_search.cv_results_['params'])))
report(grid_search.cv_results_)
Total running time of the script: (2 minutes 59.930 seconds)
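The `report` helper selects candidates by rank with `np.flatnonzero`, which is why it can print more than one model for a rank, or none at all. The sketch below shows this on a toy dictionary with hypothetical values (not taken from a real search), assuming that tied candidates share a rank and the following rank is skipped.

```python
import numpy as np

# Toy stand-in for cv_results_; the values are made up for illustration.
toy_results = {
    "rank_test_score": np.array([3, 1, 1, 4]),
    "mean_test_score": np.array([0.90, 0.93, 0.93, 0.88]),
}

# Two candidates are tied for first place, so rank 2 is never assigned.
rank_1 = np.flatnonzero(toy_results["rank_test_score"] == 1)
rank_2 = np.flatnonzero(toy_results["rank_test_score"] == 2)

print("rank 1 indices:", rank_1)
print("rank 2 indices:", rank_2)
```

With ranks like these, `report(results, n_top=3)` would print two models for rank 1, nothing for rank 2, and one model for rank 3.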