Demonstrated on 64-bit Windows 10.
I. Preface
In this installment, we introduce CatBoost regression.
As before, we use the same data:
the public dataset from a 2015 PLoS One article, "Comparison of Two Hybrid Models for Forecasting the Incidence of Hemorrhagic Fever with Renal Syndrome in Jiangsu Province, China". The data are the monthly incidence of hemorrhagic fever with renal syndrome in Jiangsu Province, China, from January 2004 to December 2012. Data from January 2004 through December 2011 are used to forecast the incidence for the 12 months of 2012.
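To orient ourselves before the modeling code: data.csv has a time column with month labels such as 'Jan-04' (hence the '%b-%y' format string in the scripts below) and an incidence column with the monthly rates. A minimal sketch of loading and splitting it:

import pandas as pd

# Parse 'Jan-04'-style labels into datetimes
data = pd.read_csv('data.csv')
data['time'] = pd.to_datetime(data['time'], format='%b-%y')

train = data[data['time'] <= '2011-12-31']  # Jan 2004 - Dec 2011, for fitting
test = data[data['time'] >= '2012-01-01']   # the 12 months of 2012, to forecast
print(train.shape, test.shape)              # expect 96 and 12 monthly rows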
II. CatBoost Regression
(1) Parameter overview
Most CatBoost parameters are shared between regression and classification, but the two tasks differ enough that some parameters are only meaningful for one of them.
Here is a brief overview of the key parameters (a usage sketch follows after the list):
(a) Common parameters:
learning_rate: the learning rate, which controls the step size of each boosting iteration. Common values are 0.01, 0.03, and 0.1.
iterations: the number of trees.
depth: the depth of each tree.
l2_leaf_reg: the coefficient of the L2 regularization term.
cat_features: the list of column indices of the categorical features.
loss_function: the loss function. For classification, Logloss (binary) and MultiClass (multi-class) are common; for regression, RMSE is common.
border_count: the number of bins used to discretize numerical features. Higher values can lead to overfitting, lower values to underfitting.
verbose: the verbosity of the training log.
(b) Classification-specific parameters:
classes_count: the number of classes in a multi-class task.
class_weights: per-class weights, for imbalanced classification tasks.
auto_class_weights: automatic class-weight computation for handling class imbalance.
(c) Regression-specific parameters:
Beyond choosing a regression loss_function (e.g., RMSE, MAE, Quantile), CatBoost has essentially no parameters that apply only to regression. Note that scale_pos_weight, sometimes listed here, actually sets the weight of the positive class in imbalanced binary classification; it is not a regression parameter.
(d) Similarities and differences:
Similarities: most parameters (learning_rate, depth, l2_leaf_reg, etc.) are shared by regression and classification, with the same meaning and effect in both.
Differences: loss_function is determined by the task (regression or classification), and some parameters (such as classes_count, class_weights, and scale_pos_weight) are only meaningful for classification.
Finally, when using CatBoost it is worth consulting the official documentation, since the library is updated often and new parameters and features keep being added. It lives at:
https://catboost.ai/docs/
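As a quick illustration of the parameters above, here is a minimal sketch of configuring a CatBoostRegressor by hand; every value is an arbitrary placeholder rather than a tuned recommendation:

from catboost import CatBoostRegressor

# Minimal sketch: all values are arbitrary placeholders, not tuned settings.
model = CatBoostRegressor(
    iterations=100,        # number of trees
    learning_rate=0.03,    # step size of each boosting iteration
    depth=6,               # depth of each tree
    l2_leaf_reg=3.0,       # L2 regularization coefficient
    loss_function='RMSE',  # a regression loss
    border_count=254,      # bins for discretizing numerical features
    verbose=0,             # silence the training log
)
# model.fit(X, y) once the features and target are prepared.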
(2) Single-step rolling prediction
import pandas as pd
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error
from catboost import CatBoostRegressor
from sklearn.model_selection import GridSearchCV

# Read the data
data = pd.read_csv('data.csv')

# Convert the time column to datetime ('Jan-04'-style labels)
data['time'] = pd.to_datetime(data['time'], format='%b-%y')

# Create lag features: lag_6 is the most recent month, lag_1 the oldest
lag_period = 6
for i in range(lag_period, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(lag_period - i + 1)

# Drop the rows containing NaN introduced by the shifts
data = data.dropna().reset_index(drop=True)

# Split into training and validation sets
train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
validation_data = data[(data['time'] >= '2012-01-01') & (data['time'] <= '2012-12-31')]

# Define features and target
X_train = train_data[['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5', 'lag_6']]
y_train = train_data['incidence']
X_validation = validation_data[['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5', 'lag_6']]
y_validation = validation_data['incidence']

# Initialize the CatBoostRegressor model
catboost_model = CatBoostRegressor(verbose=0)

# Define the parameter grid
param_grid = {
    'iterations': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
    'depth': [4, 6, 8],
    'loss_function': ['RMSE']
}

# Set up and run the grid search
grid_search = GridSearchCV(catboost_model, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Refit a CatBoostRegressor with the best parameters on the training set
best_params = grid_search.best_params_
best_catboost_model = CatBoostRegressor(**best_params, verbose=0)
best_catboost_model.fit(X_train, y_train)

# For the validation set, predict each point iteratively: each new feature
# vector is row i's observed lag_2..lag_6 with the previous prediction
# appended in the newest-lag (lag_6) position
y_validation_pred = []
for i in range(len(X_validation)):
    if i == 0:
        pred = best_catboost_model.predict([X_validation.iloc[0]])
    else:
        new_features = list(X_validation.iloc[i, 1:]) + [pred[0]]
        pred = best_catboost_model.predict([new_features])
    y_validation_pred.append(pred[0])
y_validation_pred = np.array(y_validation_pred)

# Validation-set MAE, MAPE, MSE and RMSE
mae_validation = mean_absolute_error(y_validation, y_validation_pred)
mape_validation = np.mean(np.abs((y_validation - y_validation_pred) / y_validation))
mse_validation = mean_squared_error(y_validation, y_validation_pred)
rmse_validation = np.sqrt(mse_validation)

# Training-set MAE, MAPE, MSE and RMSE
y_train_pred = best_catboost_model.predict(X_train)
mae_train = mean_absolute_error(y_train, y_train_pred)
mape_train = np.mean(np.abs((y_train - y_train_pred) / y_train))
mse_train = mean_squared_error(y_train, y_train_pred)
rmse_train = np.sqrt(mse_train)

print("Train Metrics:", mae_train, mape_train, mse_train, rmse_train)
print("Validation Metrics:", mae_validation, mape_validation, mse_validation, rmse_validation)
The results:
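Beyond the four printed metrics, it is often useful to check which hyperparameters the grid search actually selected and how the model weighs the six lag features. A small sketch, reusing the objects defined in the script above (get_feature_importance is CatBoost's built-in importance method):

# Winning hyperparameter combination from the grid search
print(grid_search.best_params_)

# Relative importance of each lag feature
for name, score in zip(X_train.columns, best_catboost_model.get_feature_importance()):
    print(name, round(score, 2))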
(3) Multi-step rolling prediction, vol. 1
For CatBoost regression, the target y_train cannot be a multi-column DataFrame (at least not with a single-target loss such as RMSE), so, as in earlier installments with the same limitation, this scheme is skipped; see the note below.
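One caveat worth flagging: recent CatBoost releases do document a MultiRMSE loss for multi-target regression, which accepts a target with one column per forecast horizon. A minimal sketch, assuming a reasonably recent catboost version; y_train_multi is a hypothetical m-column target, not an object defined in this series:

from catboost import CatBoostRegressor

# Hypothetical multi-target setup: y_train_multi would hold one column per horizon.
multi_model = CatBoostRegressor(loss_function='MultiRMSE', verbose=0)
# multi_model.fit(X_train, y_train_multi)
# multi_model.predict(X_validation)  # returns one column of predictions per horizon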
(4) Multi-step rolling prediction, vol. 2
Same as above.
(5) Multi-step rolling prediction, vol. 3
import pandas as pd
import numpy as np
from catboost import CatBoostRegressor  # import CatBoostRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Read and preprocess the data
data = pd.read_csv('data.csv')
data_y = pd.read_csv('data.csv')
data['time'] = pd.to_datetime(data['time'], format='%b-%y')
data_y['time'] = pd.to_datetime(data_y['time'], format='%b-%y')

# Create n lag features, as in the single-step script
n = 6
for i in range(n, 0, -1):
    data[f'lag_{i}'] = data['incidence'].shift(n - i + 1)
data = data.dropna().reset_index(drop=True)

train_data = data[(data['time'] >= '2004-01-01') & (data['time'] <= '2011-12-31')]
X_train = train_data[[f'lag_{i}' for i in range(1, n + 1)]]

# Build m training targets, one per forecast horizon:
# model i learns to predict the value i + 1 steps ahead
m = 3
X_train_list = []
y_train_list = []
for i in range(m):
    X_temp = X_train
    y_temp = data_y['incidence'].iloc[n + i:len(data_y) - m + 1 + i]
    X_train_list.append(X_temp)
    y_train_list.append(y_temp)

# Trim so every (X, y) pair has matching lengths
for i in range(m):
    X_train_list[i] = X_train_list[i].iloc[:-(m - 1)]
    y_train_list[i] = y_train_list[i].iloc[:len(X_train_list[i])]

# Model training: grid-search one CatBoostRegressor per horizon
param_grid = {
    'iterations': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1],
    'depth': [4, 6, 8]
}

best_catboost_models = []
for i in range(m):
    # use CatBoostRegressor inside the grid search
    grid_search = GridSearchCV(CatBoostRegressor(verbose=0), param_grid, cv=5,
                               scoring='neg_mean_squared_error')
    grid_search.fit(X_train_list[i], y_train_list[i])
    best_catboost_model = CatBoostRegressor(**grid_search.best_params_, verbose=0)
    best_catboost_model.fit(X_train_list[i], y_train_list[i])
    best_catboost_models.append(best_catboost_model)

# The validation window starts the month after the training data ends
validation_start_time = train_data['time'].iloc[-1] + pd.DateOffset(months=1)
validation_data = data[data['time'] >= validation_start_time]
X_validation = validation_data[[f'lag_{i}' for i in range(1, n + 1)]]

y_validation_pred_list = [model.predict(X_validation) for model in best_catboost_models]
y_train_pred_list = [model.predict(X_train_list[i]) for i, model in enumerate(best_catboost_models)]

# Interleave the m models' outputs back into one chronological series
def concatenate_predictions(pred_list):
    concatenated = []
    for j in range(len(pred_list[0])):
        for i in range(m):
            concatenated.append(pred_list[i][j])
    return concatenated

y_validation_pred = np.array(concatenate_predictions(y_validation_pred_list))[:len(validation_data['incidence'])]
y_train_pred = np.array(concatenate_predictions(y_train_pred_list))[:len(train_data['incidence']) - m + 1]

mae_validation = mean_absolute_error(validation_data['incidence'], y_validation_pred)
mape_validation = np.mean(np.abs((validation_data['incidence'] - y_validation_pred) / validation_data['incidence']))
mse_validation = mean_squared_error(validation_data['incidence'], y_validation_pred)
rmse_validation = np.sqrt(mse_validation)
print("Validation set:", mae_validation, mape_validation, mse_validation, rmse_validation)

mae_train = mean_absolute_error(train_data['incidence'][:-(m - 1)], y_train_pred)
mape_train = np.mean(np.abs((train_data['incidence'][:-(m - 1)] - y_train_pred) / train_data['incidence'][:-(m - 1)]))
mse_train = mean_squared_error(train_data['incidence'][:-(m - 1)], y_train_pred)
rmse_train = np.sqrt(mse_train)
print("Training set:", mae_train, mape_train, mse_train, rmse_train)
The results:
III. Data
Link: https://pan.baidu.com/s/1EFaWfHoG14h15KCEhn1STg?pwd=q41n
Extraction code: q41n