

news 2025/7/13 11:01:45


Contents

I. The XGBoost Algorithm
II. Python Code vs. the Sentosa_DSML Community Edition
- (1) Data Loading and Statistical Analysis
- (2) Data Preprocessing
- (3) Model Training and Evaluation
- (4) Model Visualization
III. Summary

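I. The XGBoost Algorithm

The principles of the XGBoost algorithm within ensemble learning have already been introduced and summarized; see the article "[Machine Learning (1)] Classification and Regression Tasks - XGBoost Algorithm - Sentosa_DSML Community Edition". This article uses a diabetes dataset to build an XGBoost classification model twice, once in Python code and once in the Sentosa_DSML Community Edition. The model is then evaluated, including the choice and analysis of the evaluation metrics. Finally, the experimental results are summarized to show how effective and accurate the model is at diabetes classification, which offers technical means and decision support for early diagnosis and intervention.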

II. Python Code vs. the Sentosa_DSML Community Edition

(1) Data Loading and Statistical Analysis

1. Python implementation

import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc
from matplotlib import rcParams
from datetime import datetime
from sklearn.preprocessing import LabelEncoder

# Input CSV and output folder (Windows-style relative paths)
file_path = r'.\xgboost分類案例-糖尿病結果預測.csv'
output_dir = r'.\xgb分類'

if not os.path.exists(file_path):
    raise FileNotFoundError(f"File not found: {file_path}")
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

df = pd.read_csv(file_path)

# Quick sanity checks on the raw data
print("Missing values per column:")
print(df.isnull().sum())
print("First 5 rows of the raw data:")
print(df.head())

After loading, compute summary statistics for every column.

# Font able to render CJK glyphs; only needed when labels contain Chinese text
rcParams['font.family'] = 'sans-serif'
rcParams['font.sans-serif'] = ['SimHei']

stats_df = pd.DataFrame(columns=[
    'column', 'dtype', 'max', 'min', 'mean', 'non_null_count', 'null_count',
    'mode', 'true_count', 'false_count', 'std', 'variance', 'median',
    'kurtosis', 'skewness', 'extreme_count', 'outlier_count'
])


def detect_extremes_and_outliers(column, extreme_factor=3, outlier_factor=6):
    """Count values outside Q1/Q3 -/+ factor * IQR for the two factors."""
    if not np.issubdtype(column.dtype, np.number):
        return None, None
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    lower_extreme = q1 - extreme_factor * iqr
    upper_extreme = q3 + extreme_factor * iqr
    lower_outlier = q1 - outlier_factor * iqr
    upper_outlier = q3 + outlier_factor * iqr
    extremes = column[(column < lower_extreme) | (column > upper_extreme)]
    outliers = column[(column < lower_outlier) | (column > upper_outlier)]
    return len(extremes), len(outliers)


for col in df.columns:
    col_data = df[col]
    dtype = col_data.dtype
    if np.issubdtype(dtype, np.number):
        max_value = col_data.max()
        min_value = col_data.min()
        mean_value = col_data.mean()
        std_value = col_data.std()
        var_value = col_data.var()
        median_value = col_data.median()
        kurtosis_value = col_data.kurt()
        skew_value = col_data.skew()
        extreme_count, outlier_count = detect_extremes_and_outliers(col_data)
    else:
        # Numeric statistics are undefined for non-numeric columns
        max_value = min_value = mean_value = std_value = var_value = median_value = kurtosis_value = skew_value = None
        extreme_count = outlier_count = None

    non_null_count = col_data.count()
    null_count = col_data.isna().sum()
    mode_value = col_data.mode().iloc[0] if not col_data.mode().empty else None
    true_count = col_data[col_data == True].count() if dtype == 'bool' else None
    false_count = col_data[col_data == False].count() if dtype == 'bool' else None

    new_row = pd.DataFrame({
        'column': [col],
        'dtype': [dtype],
        'max': [max_value],
        'min': [min_value],
        'mean': [mean_value],
        'non_null_count': [non_null_count],
        'null_count': [null_count],
        'mode': [mode_value],
        'true_count': [true_count],
        'false_count': [false_count],
        'std': [std_value],
        'variance': [var_value],
        'median': [median_value],
        'kurtosis': [kurtosis_value],
        'skewness': [skew_value],
        'extreme_count': [extreme_count],
        'outlier_count': [outlier_count]
    })
    stats_df = pd.concat([stats_df, new_row], ignore_index=True)

print(stats_df)

Output:

                column    dtype     max    min  ...   kurtosis  skewness  extreme_count  outlier_count
0               gender   object     NaN    NaN  ...        NaN       NaN           None           None
1                  age  float64   80.00   0.08  ...  -1.003835 -0.051979              0              0
2         hypertension    int64    1.00   0.00  ...   8.441441  3.231296           7485           7485
3        heart_disease    int64    1.00   0.00  ...  20.409952  4.733872           3942           3942
4      smoking_history   object     NaN    NaN  ...        NaN       NaN           None           None
5                  bmi  float64   95.69  10.01  ...   3.520772  1.043836           1258             46
6          HbA1c_level  float64    9.00   3.50  ...   0.215392 -0.066854              0              0
7  blood_glucose_level    int64  300.00  80.00  ...   1.737624  0.821655              0              0
8             diabetes    int64    1.00   0.00  ...   6.858005  2.976217           8500           8500

# Save one histogram per column
for col in df.columns:
    plt.figure(figsize=(10, 6))
    df[col].dropna().hist(bins=30)
    plt.title(f"{col} - distribution")
    plt.ylabel("Frequency")
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    file_name = f"{col}_distribution_{timestamp}.png"
    plot_path = os.path.join(output_dir, file_name)
    plt.savefig(plot_path)
    plt.close()

# Pie chart: count() gives the number of records in each smoking_history category
grouped_data = df.groupby('smoking_history')['diabetes'].count()
plt.figure(figsize=(8, 8))
plt.pie(grouped_data, labels=grouped_data.index, autopct='%1.1f%%', startangle=90, colors=plt.cm.Paired.colors)
plt.title("Pie chart\nRecords per smoking_history category", fontsize=16)
plt.axis('equal')
timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
file_name = f"smoking_history_diabetes_distribution_{timestamp}.png"
plot_path = os.path.join(output_dir, file_name)
plt.savefig(plot_path)
plt.close()


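2. Sentosa_DSML Community Edition implementation

First, load the data: use the Text operator to read the file directly by selecting the path where the data is stored.
Next, the Description operator performs the statistical analysis and produces, for every column, a distribution chart together with extreme values, outliers and other statistics. Connect the Description operator and, in the panel on the right, set the extreme-value multiplier to 3 and the outlier multiplier to 6.
Click Execute to obtain the statistical analysis results.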
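
The two multipliers are applied to the interquartile range (IQR), in the same way as the `detect_extremes_and_outliers` helper in the Python listing. The following is a minimal standard-library sketch on made-up numbers; the function name `iqr_fence_counts` and the sample values are illustrative assumptions, and the source does not state which quartile interpolation the Description operator uses.

```python
from statistics import quantiles


def iqr_fence_counts(values, extreme_factor=3, outlier_factor=6):
    """Count values outside Q1/Q3 -/+ factor * IQR for two factors."""
    # method="inclusive" interpolates linearly between data points,
    # which matches the default behaviour of pandas' Series.quantile.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    def count_outside(factor):
        low, high = q1 - factor * iqr, q3 + factor * iqr
        return sum(1 for v in values if v < low or v > high)

    return count_outside(extreme_factor), count_outside(outlier_factor)


# Hypothetical BMI-like sample: most values are close together, two are far away.
sample = [22, 23, 24, 25, 26, 27, 28, 29, 45, 80]
print(iqr_fence_counts(sample))  # (2, 1): 45 and 80 are extremes, only 80 is an outlier

# Hypothetical 0/1 flag with few ones: Q1 = Q3 = 0, so the IQR is 0
# and every 1 falls outside both fences.
binary_flag = [0] * 9 + [1]
print(iqr_fence_counts(binary_flag))  # (1, 1)
```

The second case suggests why binary columns such as hypertension and diabetes report identical extreme and outlier counts in the statistics table: with an IQR of 0, both fences collapse onto the same value.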

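You can also connect a chart operator, such as the Pie Chart operator, to summarize the relationship between smoking history (smoking_history) and diabetes (diabetes); executing it produces the corresponding pie chart.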

(2) Data Preprocessing

1. Python implementation

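# Drop the rows whose gender is 'Other'
df_filtered = df[df['gender'] != 'Other']
if df_filtered.empty:
    raise ValueError("No rows left after removing `gender`='Other'")
else:
    print(df_filtered.head())

# Continue with the filtered data so that the filter actually takes effect
df = df_filtered.copy()

if 'Partition_Column' in df.columns:
    df['Partition_Column'] = df['Partition_Column'].astype('category')

# One-hot encode the categorical columns, dropping the first level of each
df = pd.get_dummies(df, columns=['gender', 'smoking_history'], drop_first=True)

X = df.drop(columns=['diabetes'])
y = df['diabetes']

# Hold out 10% of the rows as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)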

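2. Sentosa_DSML Community Edition implementation
After the Text operator, connect a Filter operator with the condition gender='Other' and choose not to keep the matching rows, which removes every row whose 'gender' value is 'Other'.
Connect the Sample Partition operator to set the ratio between the training set and the test set.

Then connect the Type operator, which shows each column's storage type, measurement type and model type, and set the model type of the diabetes column to Label.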
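
The partition step can be sketched without any library. The snippet below performs a 90/10 split that keeps the share of positive labels equal in both partitions (a stratified split), which can matter here because the statistics table suggests that positive diabetes rows are a small minority. It is an illustration only: the source does not say whether the Sample Partition operator stratifies, the Python listing uses a plain random split, and `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def positive_rate(labels, indices):
    """Share of label 1 among the selected rows."""
    return sum(labels[i] for i in indices) / len(indices)


# Toy labels with 10% positives (made-up data).
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(positive_rate(labels, train_idx), positive_rate(labels, test_idx))
```

In scikit-learn the same effect is available by passing `stratify=y` to `train_test_split`.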

(3) Model Training and Evaluation

1. Python implementation

# DMatrix for the native XGBoost API; the scikit-learn wrapper below does not use it
dtrain = xgb.DMatrix(X_train, label=y_train, enable_categorical=True)

params = {
    'n_estimators': 300,        # number of boosting rounds
    'learning_rate': 0.3,
    'min_split_loss': 0,        # minimum loss reduction needed to split (gamma)
    'max_depth': 30,
    'min_child_weight': 1,
    'subsample': 1,
    'colsample_bytree': 0.8,
    'lambda': 1,                # L2 regularization
    'alpha': 0,                 # L1 regularization
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'missing': np.nan
}

# use_label_encoder is deprecated in recent XGBoost releases and is omitted here
xgb_model = xgb.XGBClassifier(**params)
# eval_set only reports the log loss on the test set after each round
xgb_model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=True)

y_train_pred = xgb_model.predict(X_train)
y_test_pred = xgb_model.predict(X_test)


def evaluate_model(y_true, y_pred, dataset_name=''):
    """Print and return accuracy and the weighted precision, recall and F1."""
    accuracy = accuracy_score(y_true, y_pred)
    weighted_precision = precision_score(y_true, y_pred, average='weighted')
    weighted_recall = recall_score(y_true, y_pred, average='weighted')
    weighted_f1 = f1_score(y_true, y_pred, average='weighted')
    print(f"Evaluation results - {dataset_name}")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Weighted Precision: {weighted_precision:.4f}")
    print(f"Weighted Recall: {weighted_recall:.4f}")
    print(f"Weighted F1 Score: {weighted_f1:.4f}\n")
    return {
        'accuracy': accuracy,
        'weighted_precision': weighted_precision,
        'weighted_recall': weighted_recall,
        'weighted_f1': weighted_f1
    }


train_eval_results = evaluate_model(y_train, y_train_pred, dataset_name='Training Set')

Output:

Evaluation results - Training Set
Accuracy: 0.9991
Weighted Precision: 0.9991
Weighted Recall: 0.9991
Weighted F1 Score: 0.9991

test_eval_results = evaluate_model(y_test, y_test_pred, dataset_name='Test Set')

Output:

Evaluation results - Test Set
Accuracy: 0.9657
Weighted Precision: 0.9641
Weighted Recall: 0.9657
Weighted F1 Score: 0.9643

Plot the ROC curve to evaluate the classifier's performance on the test set.

def save_plot(filename):
    """Save the current figure to output_dir with a timestamped name."""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    file_path = os.path.join(output_dir, f"{filename}_{timestamp}.png")
    plt.savefig(file_path)
    plt.close()


def plot_roc_curve(model, X_test, y_test):
    """Plot the ROC curve for the positive class."""
    y_probs = model.predict_proba(X_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_probs)
    roc_auc = auc(fpr, tpr)
    plt.figure(figsize=(10, 6))
    plt.plot(fpr, tpr, color='blue', label='ROC curve (area = {:.2f})'.format(roc_auc))
    # Diagonal reference line: a classifier that guesses at random
    plt.plot([0, 1], [0, 1], color='red', linestyle='--')
    plt.xlabel('False Positive Rate (FPR)')
    plt.ylabel('True Positive Rate (TPR)')
    plt.title('Receiver Operating Characteristic (ROC) curve')
    plt.legend(loc='lower right')
    save_plot("ROC_curve")


plot_roc_curve(xgb_model, X_test, y_test)

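2. Sentosa_DSML Community Edition implementation
After preprocessing, connect the XGBoost Classification operator and configure its properties in the panel on the right. The evaluation metric is the algorithm's loss function, with two choices: log loss and classification error rate. The learning rate, maximum tree depth, minimum sum of sample weights in a leaf node, subsample ratio, minimum split loss, per-tree column sampling ratio, L1 regularization term and L2 regularization term are all used to prevent overfitting. A node is not split any further when the sum of sample weights in a child node would be less than the configured minimum sum of sample weights in a leaf node. The minimum split loss specifies the smallest reduction of the loss function that a node split must achieve. When the tree construction method is hist, three more properties have to be configured: node mode, maximum number of bins, and whether to use single precision.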
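
How the minimum split loss, the L2 term and the minimum leaf weight interact can be made concrete with the split-gain formula from the XGBoost paper. With the L1 term set to 0, splitting a node into a left and a right child has gain 1/2 * [G_L^2 / (H_L + lambda) + G_R^2 / (H_R + lambda) - (G_L + G_R)^2 / (H_L + H_R + lambda)] - gamma, where G and H are the sums of the first and second derivatives of the loss over the samples in each child, lambda is the L2 term and gamma is the minimum split loss. The sketch below uses made-up gradient sums and hypothetical function names; it illustrates the rule and is not code from XGBoost or Sentosa_DSML.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, min_split_loss=0.0):
    """Gain of a candidate split (L1 regularization assumed to be 0)."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    gain = 0.5 * (
        score(g_left, h_left)
        + score(g_right, h_right)
        - score(g_left + g_right, h_left + h_right)
    )
    return gain - min_split_loss


def split_allowed(g_left, h_left, g_right, h_right,
                  reg_lambda=1.0, min_split_loss=0.0, min_child_weight=1.0):
    """Keep a split only if both children are heavy enough and the gain is positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right,
                      reg_lambda, min_split_loss) > 0


# Made-up gradient and hessian sums for one candidate split.
print(split_gain(-4.0, 3.0, 5.0, 4.0))                            # 4.4375
print(split_allowed(-4.0, 3.0, 5.0, 4.0))                         # True
print(split_allowed(-4.0, 0.5, 5.0, 4.0))                         # False: left child too light
print(split_allowed(-4.0, 3.0, 5.0, 4.0, min_split_loss=10.0))    # False: gain below threshold
```

Raising the L2 term shrinks every score, raising the minimum split loss subtracts directly from the gain, and raising the minimum leaf weight rejects splits that isolate only a few samples; all three make the trees more conservative.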
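In this case the classification model is configured as follows: number of iterations: 300; learning rate: 0.3; minimum split loss: 0; maximum tree depth: 30; minimum sum of sample weights in a leaf node: 1; subsample ratio: 1; tree construction algorithm: auto; per-tree column sampling ratio: 0.8; L2 regularization term: 1; L1 regularization term: 0; evaluation metric: log loss; initial prediction score: 0.5. Feature importance and the confusion matrix of the training data are computed as well.
Right-click and choose Execute to obtain the XGBoost classification model.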

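After the classification model, connect the Evaluation operator and the ROC-AUC Evaluation operator to evaluate the model's predictions on the training set and the test set.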

The Evaluation operator measures performance on the training set and the test set mainly through accuracy, weighted precision, weighted recall and weighted F1 score.
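
The weighted metrics average the per-class scores, using each class's share of the true labels as its weight. One consequence, visible in the Python results where accuracy and weighted recall are both 0.9657 on the test set, is that weighted recall always equals accuracy. A small standard-library sketch on hypothetical labels shows the calculation; `weighted_scores` is an illustrative helper, not part of either tool.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        predicted = sum(1 for p in y_pred if p == cls)
        cls_precision = tp / predicted if predicted else 0.0
        cls_recall = tp / count
        denom = cls_precision + cls_recall
        cls_f1 = 2 * cls_precision * cls_recall / denom if denom else 0.0
        weight = count / n  # share of this class among the true labels
        precision += weight * cls_precision
        recall += weight * cls_recall
        f1 += weight * cls_f1
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    return accuracy, precision, recall, f1


# Hypothetical imbalanced labels: 8 negatives, 2 positives, one positive missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
accuracy, precision, recall, f1 = weighted_scores(y_true, y_pred)
print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f}")
```

Because the majority class dominates the weights, the weighted scores stay high even though only half of the positive cases are found, so the per-class recall of the positive class is worth checking separately on imbalanced data.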

The ROC-AUC operator evaluates how well the classification model trained on the current data performs: it displays the ROC curve and the AUC value of the classification results.
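
The AUC value has a direct interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by pair counting on hypothetical scores; it illustrates the metric only and is not how the ROC-AUC operator or scikit-learn computes it internally.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of (positive, negative) pairs that are ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities for four samples.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true, scores))  # 0.75: three of the four pairs are ordered correctly
```

Pair counting is quadratic in the number of samples, so it suits small illustrations; for real data the curve-based computation in the Python listing (`roc_curve` followed by `auc`) is the practical route.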

You can also use the Table operator from the chart analysis group to output the model data in tabular form; executing the Table operator displays the resulting table.

(4) Model Visualization

1. Python implementation

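def save_plot(filename):
    """Save the current figure to output_dir with a timestamped name."""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    file_path = os.path.join(output_dir, f"{filename}_{timestamp}.png")
    plt.savefig(file_path)
    plt.close()


def plot_confusion_matrix(y_true, y_pred):
    """Draw the confusion matrix as an annotated heatmap."""
    confusion = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(confusion, annot=True, fmt='d', cmap='Blues')
    plt.title("Confusion matrix")
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    save_plot("confusion_matrix")


def print_model_params(model):
    """Print every parameter of the fitted model."""
    params = model.get_params()
    print("Model parameters:")
    for key, value in params.items():
        print(f"{key}: {value}")


def plot_feature_importance(model):
    """Plot the 10 most important features by split count ('weight')."""
    # Pass the axes explicitly; otherwise plot_importance opens its own figure
    fig, ax = plt.subplots(figsize=(12, 8))
    xgb.plot_importance(model, importance_type='weight', max_num_features=10, ax=ax)
    plt.title('Feature importance')
    plt.xlabel('Feature importance (weight)')
    plt.ylabel('Feature')
    save_plot("feature_importance")


print_model_params(xgb_model)
plot_feature_importance(xgb_model)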

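2. Sentosa_DSML Community Edition implementation
Right-click the model and view its model information, which displays the feature importance chart, the confusion matrix, the decision trees and other model results.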

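By connecting operators and configuring their parameters, the whole XGBoost-based diabetes classification workflow is completed, from data import and preprocessing through model training to prediction and performance evaluation. The model evaluation operator reports key metrics such as precision, recall and F1 score, which show how the model performs on the diabetes classification task.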

III. Summary

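Compared with the traditional code-based approach, building a machine learning workflow in the Sentosa_DSML Community Edition is more efficient and more automated. The traditional approach requires a large amount of hand-written code for data cleaning, feature engineering, model training and evaluation, whereas in the Sentosa_DSML Community Edition these steps are simplified through a visual interface, prebuilt modules and automated flows. This lowers the technical barrier: non-specialist developers can build applications by dragging, dropping and configuring, which reduces the dependence on professional developers.
The Sentosa_DSML Community Edition provides operator flows that are easy to configure, which cuts the time spent writing and debugging code and makes model development and deployment more efficient. Because the resulting application has a clearer structure, it is easier to maintain and update, and the platform generally provides version control and update features that make continuous improvement more convenient.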

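The lightweight and completely free Sentosa_DSML Community Edition was released so that researchers, scholars and developers with non-commercial purposes can learn, discuss and practice machine learning techniques. Its main features are lightweight one-click installation, free use of the platform, video tutorials and a community forum, where users can exchange ideas, share experience and solve problems with other data scientists and machine learning enthusiasts. The official website is linked below for anyone interested in downloading and trying the tool.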

https://sentosa.znv.com/