
news 2025/7/7 22:20:29


Contents

Level 5: Gini Coefficient
- Code
Level 6: Pre-pruning and Post-pruning
- Code
Level 7: Iris Recognition
- Code

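Level 5: Gini Coefficient

Gini coefficient
The ID3 algorithm selects features by information gain, preferring the feature with the largest gain. The C4.5 algorithm selects features by information gain ratio instead, which reduces information gain's bias toward features with many distinct values. Both ID3 and C4.5, however, are built on the entropy model from information theory, which involves a large number of logarithm computations. Can the model be simplified without completely losing the advantages of the entropy model? Yes: that is what the Gini coefficient offers.

The CART algorithm uses the Gini coefficient in place of the information gain ratio. The Gini coefficient measures the impurity of the model: the smaller the Gini coefficient, the lower the impurity and the better the feature. This is the opposite of information gain and information gain ratio, where larger is better.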
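For reference, the standard definition of the Gini coefficient can be written as below. This is a reconstruction from the textbook definition and from the code in this level, not a copy of the original figure. The symbols are notation assumed here: D is a sample set, p_k is the proportion of class k in D, and D_v is the subset of D in which feature A takes the value v.

```latex
\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} p_k^{2}
\qquad
\mathrm{Gini}(D, A) = \sum_{v} \frac{|D_v|}{|D|}\,\mathrm{Gini}(D_v)
```

No logarithm appears in either expression, which is why the computation is cheaper than the entropy-based criteria.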

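[Image: Gini coefficient formula]
As the formula shows, the Gini coefficient is simpler to compute than information gain or information gain ratio. As an example, take the dataset introduced in Level 2: the first column is the ID, the second is gender, the third is activity level, and the fourth is the label indicating whether the customer churned (0 means not churned, 1 means churned).

[Image: customer churn dataset]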
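Since the dataset figure is not available here, the following is a minimal sketch of the calculation on a small hypothetical dataset. The four rows, the column names `gender` and `churned`, and the helper names `gini` and `weighted_gini` are assumptions made for illustration; they are not the data from Level 2.

```python
from collections import Counter


def gini(labels):
    """Gini coefficient of one set of labels: 1 - sum(p_k^2)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def weighted_gini(feature_column, labels):
    """Gini coefficient after splitting on one feature.

    Each distinct feature value defines a subset; the subset Gini values
    are averaged, weighted by subset size.
    """
    total = len(labels)
    groups = {}
    for value, label in zip(feature_column, labels):
        groups.setdefault(value, []).append(label)
    return sum(len(g) / total * gini(g) for g in groups.values())


# Hypothetical toy data (not the dataset from the original figure)
gender = ["M", "M", "F", "F"]
churned = [0, 1, 1, 1]

print(gini(churned))                   # 0.375
print(weighted_gini(gender, churned))  # 0.25
```

Splitting on `gender` lowers the Gini coefficient from 0.375 to 0.25 in this toy case, so the split makes the subsets purer.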

Code

import numpy as np


def calcGini(feature, label, index):
    '''
    Compute the Gini coefficient.
    :param feature: feature from the test-case dictionary, type ndarray
    :param label: label from the test-case dictionary, type ndarray
    :param index: index from the test-case dictionary, i.e. the index of a feature column in feature; for example, index 0 means the first feature is used to compute the Gini coefficient
    :return: Gini coefficient, type float
    '''

    # Gini coefficient of a single subset of labels
    def calcGiniIndex(label_subset):
        total = len(label_subset)
        if total == 0:
            return 0
        # np.bincount expects non-negative integer labels
        label_counts = np.bincount(label_subset)
        probabilities = label_counts / total
        gini = 1.0 - np.sum(np.square(probabilities))
        return gini

    # Convert feature and label to numpy arrays
    f = np.array(feature)
    l = np.array(label)

    # Distinct values of the selected feature column
    unique_values = np.unique(f[:, index])

    total_gini = 0
    total_samples = len(label)

    # Split the dataset by each distinct value of the feature
    for value in unique_values:
        # Row indices of the samples with this feature value
        subset_indices = np.where(f[:, index] == value)[0]
        # Labels of that subset
        subset_label = l[subset_indices]
        # Gini coefficient of the subset
        subset_gini = calcGiniIndex(subset_label)
        # Add the subset's contribution, weighted by its share of the samples
        weighted_gini = (len(subset_label) / total_samples) * subset_gini
        total_gini += weighted_gini

    return total_gini

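Level 6: Pre-pruning and Post-pruning

Why pruning is needed
A decision tree is built recursively until it can be split no further. A tree grown this way usually classifies the training data very accurately but predicts unseen test data less accurately; in other words, it overfits.

Decision trees overfit easily because the construction process focuses heavily on classifying the training set correctly, which produces a very complex tree (large in both width and depth). As noted in earlier exercises, the more complex a model is, the more prone it is to overfitting. Reducing the tree's complexity therefore mitigates overfitting, and the most common way to simplify a decision tree is pruning. Pruning comes in two forms: pre-pruning and post-pruning.

Pre-pruning
The core idea of pre-pruning is to evaluate each node before splitting it while the tree is being built. If splitting the current node does not improve the tree's generalization performance, the split is abandoned and the node is marked as a leaf.

Evaluating generalization performance is straightforward. Randomly hold out part of the training data as a validation set. Before splitting a node using the training data, compute the accuracy of the tree in its current state on the validation set. Higher accuracy indicates better generalization. If splitting the node lowers the validation accuracy or fails to raise it, stop splitting and mark the node as a leaf whose class is decided by majority vote.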
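The check described above can be sketched for a single node as follows. This is an illustration under stated assumptions, not the exercise's reference code: the function names `majority`, `accuracy` and `should_split` are hypothetical, the node is treated in isolation, and the toy data is made up.

```python
from collections import Counter


def majority(labels):
    """Most frequent label (majority vote)."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(predictions, labels):
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def should_split(train_x, train_y, val_x, val_y, index):
    """Pre-pruning test for one node.

    Compares validation accuracy when the node stays a majority-vote leaf
    against validation accuracy after a one-level split on feature `index`.
    Returns (split?, accuracy_before, accuracy_after).
    """
    leaf_label = majority(train_y)
    acc_before = accuracy([leaf_label] * len(val_y), val_y)

    # After the split, each branch predicts the majority label of its
    # own training subset.
    branches = {}
    for row, y in zip(train_x, train_y):
        branches.setdefault(row[index], []).append(y)
    branch_label = {v: majority(ys) for v, ys in branches.items()}

    # Feature values unseen in training fall back to the leaf label.
    preds = [branch_label.get(row[index], leaf_label) for row in val_x]
    acc_after = accuracy(preds, val_y)

    # Split only if validation accuracy strictly improves.
    return acc_after > acc_before, acc_before, acc_after


# Hypothetical toy data with a single feature
train_x = [[0], [0], [1], [1], [1]]
train_y = [0, 0, 1, 1, 0]
val_x = [[0], [1], [1], [0]]
val_y = [0, 1, 1, 0]

print(should_split(train_x, train_y, val_x, val_y, 0))  # (True, 0.5, 1.0)
```

Here the split raises validation accuracy from 0.5 to 1.0, so the node would be split. With equal or lower accuracy after the split, the node would stay a leaf.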

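For example, suppose the decision tree from the previous level, which decides whether to buy a watermelon, has already overfit. The model looks like this:
[Image: overfitted decision tree]
Suppose that before splitting the "is it cheap" node, the model's accuracy on the validation set is 0.81, and that after the split the accuracy drops to 0.67. In that case the "is it cheap" node should not be split, and the pre-pruned model looks like this:

[Image: decision tree after pre-pruning]
As the figure shows, pre-pruning reduces the complexity of the decision tree. Pre-pruning is a greedy strategy, and greed has a drawback: a split may lower generalization performance by itself while later splits built on it could improve performance significantly. Pre-pruning can therefore leave the tree underfitted.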

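Post-pruning
Post-pruning first grows a complete decision tree from the training set, then examines the non-leaf nodes from the bottom up. If replacing the subtree rooted at a node with a leaf improves the tree's generalization performance, the subtree is replaced with that leaf.

The procedure is direct. For the subtree rooted at each non-leaf node, try replacing it with a leaf whose class is the most frequent class among the training samples the subtree covers. This yields a simplified tree. Compare the two trees on the validation set: if the simplified tree is more accurate, the subtree is replaced with the leaf. The algorithm traverses all subtrees bottom-up and stops when no replacement improves performance on the validation set.

Because post-pruning decides from a global view whether to prune, it is less likely to cause underfitting. However, since it must first grow the complete tree and then prune it, its training time is higher.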
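A compact sketch of the bottom-up procedure is given below. It uses a node layout chosen for readability rather than the one in this level's reference code. Each internal node is a dict with the hypothetical keys `feature`, `children` and `majority`, where `majority` is assumed to hold the most frequent training class under that node. Leaves are plain labels, and the tree and validation data are invented.

```python
def predict_one(node, row):
    """Walk from the root to a leaf; leaves are plain labels."""
    while isinstance(node, dict):
        child = node["children"].get(row[node["feature"]])
        if child is None:          # unseen feature value: fall back
            return node["majority"]
        node = child
    return node


def accuracy(tree, xs, ys):
    return sum(predict_one(tree, x) == y for x, y in zip(xs, ys)) / len(ys)


def post_prune(tree, val_x, val_y):
    """Bottom-up pruning driven by validation accuracy."""
    holder = {"root": tree}  # lets the root be replaced like any child

    def visit(parent, key):
        node = parent[key]
        if not isinstance(node, dict):
            return
        # Handle the children first so pruning proceeds bottom-up.
        for value in list(node["children"]):
            visit(node["children"], value)
        acc_before = accuracy(holder["root"], val_x, val_y)
        parent[key] = node["majority"]   # tentatively replace with a leaf
        acc_after = accuracy(holder["root"], val_x, val_y)
        if acc_after <= acc_before:      # no improvement: restore subtree
            parent[key] = node

    visit(holder, "root")
    return holder["root"]


# Hypothetical tree: split on feature 0, then on feature 1
tree = {
    "feature": 0, "majority": 0,
    "children": {
        0: 0,
        1: {"feature": 1, "majority": 1, "children": {0: 1, 1: 0}},
    },
}
val_x = [[0, 0], [1, 0], [1, 1], [1, 1]]
val_y = [0, 1, 1, 1]

pruned = post_prune(tree, val_x, val_y)
print(pruned["children"])                # {0: 0, 1: 1}
print(accuracy(pruned, val_x, val_y))    # 1.0
```

The inner subtree is replaced because doing so raises validation accuracy from 0.5 to 1.0. The root is kept because replacing it with a single leaf would lower accuracy.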

Code

import numpy as np
from copy import deepcopy
from collections import Counter


class DecisionTree(object):
    def __init__(self):
        # Decision tree model
        self.tree = {}

    def calcInfoGain(self, feature, label, index):
        # Information gain of the feature column at `index`
        def calcInfoEntropy(feature, label):
            # Information entropy of the labels
            label_set = set(label)
            result = 0
            for l in label_set:
                count = 0
                for j in range(len(label)):
                    if label[j] == l:
                        count += 1
                p = count / len(label)
                result -= p * np.log2(p)
            return result

        def calcHDA(feature, label, index, value):
            # Conditional entropy term for one feature value
            count = 0
            sub_feature = []
            sub_label = []
            for i in range(len(feature)):
                if feature[i][index] == value:
                    count += 1
                    sub_feature.append(feature[i])
                    sub_label.append(label[i])
            pHA = count / len(feature)
            e = calcInfoEntropy(sub_feature, sub_label)
            return pHA * e

        base_e = calcInfoEntropy(feature, label)
        f = np.array(feature)
        f_set = set(f[:, index])
        sum_HDA = 0
        for value in f_set:
            sum_HDA += calcHDA(feature, label, index, value)
        return base_e - sum_HDA

    def getBestFeature(self, feature, label):
        # Index of the feature with the largest information gain
        max_infogain = 0
        best_feature = 0
        for i in range(len(feature[0])):
            infogain = self.calcInfoGain(feature, label, i)
            if infogain > max_infogain:
                max_infogain = infogain
                best_feature = i
        return best_feature

    def calc_acc_val(self, the_tree, val_feature, val_label):
        # Accuracy of `the_tree` on the validation set
        result = []

        def classify(tree, feature):
            if not isinstance(tree, dict):
                return tree
            t_index, t_value = list(tree.items())[0]
            f_value = feature[t_index]
            if isinstance(t_value, dict):
                classLabel = classify(tree[t_index][f_value], feature)
                return classLabel
            else:
                return t_value

        for f in val_feature:
            result.append(classify(the_tree, f))
        result = np.array(result)
        return np.mean(result == val_label)

    def createTree(self, train_feature, train_label):
        # Build the decision tree recursively
        if len(set(train_label)) == 1:
            return train_label[0]
        if len(train_feature[0]) == 1 or len(np.unique(train_feature, axis=0)) == 1:
            # Nothing left to split on: return the majority label
            vote = {}
            for l in train_label:
                if l in vote.keys():
                    vote[l] += 1
                else:
                    vote[l] = 1
            max_count = 0
            vote_label = None
            for k, v in vote.items():
                if v > max_count:
                    max_count = v
                    vote_label = k
            return vote_label
        best_feature = self.getBestFeature(train_feature, train_label)
        tree = {best_feature: {}}
        f = np.array(train_feature)
        f_set = set(f[:, best_feature])
        for v in f_set:
            sub_feature = []
            sub_label = []
            for i in range(len(train_feature)):
                if train_feature[i][best_feature] == v:
                    sub_feature.append(train_feature[i])
                    sub_label.append(train_label[i])
            tree[best_feature][v] = self.createTree(sub_feature, sub_label)
        return tree

    def post_cut(self, val_feature, val_label):
        # Post-pruning
        def get_non_leaf_node_count(tree):
            non_leaf_node_path = []

            def dfs(tree, path, all_path):
                for k in tree.keys():
                    if isinstance(tree[k], dict):
                        path.append(k)
                        dfs(tree[k], path, all_path)
                        if len(path) > 0:
                            path.pop()
                    else:
                        all_path.append(path[:])

            dfs(tree, [], non_leaf_node_path)
            unique_non_leaf_node = []
            for path in non_leaf_node_path:
                isFind = False
                for p in unique_non_leaf_node:
                    if path == p:
                        isFind = True
                        break
                if not isFind:
                    unique_non_leaf_node.append(path)
            return len(unique_non_leaf_node)

        def get_the_most_deep_path(tree):
            non_leaf_node_path = []

            def dfs(tree, path, all_path):
                for k in tree.keys():
                    if isinstance(tree[k], dict):
                        path.append(k)
                        dfs(tree[k], path, all_path)
                        if len(path) > 0:
                            path.pop()
                    else:
                        all_path.append(path[:])

            dfs(tree, [], non_leaf_node_path)
            max_depth = 0
            result = None
            for path in non_leaf_node_path:
                if len(path) > max_depth:
                    max_depth = len(path)
                    result = path
            return result

        def set_vote_label(tree, path, label):
            for i in range(len(path)-1):
                tree = tree[path[i]]
            tree[path[len(path)-1]] = label

        # A tree that is already a single leaf has nothing to prune
        if not isinstance(self.tree, dict):
            return

        acc_before_cut = self.calc_acc_val(self.tree, val_feature, val_label)
        # Note: each iteration only considers the current deepest path
        for _ in range(get_non_leaf_node_count(self.tree)):
            path = get_the_most_deep_path(self.tree)
            if not path:
                break
            tree = deepcopy(self.tree)
            step = deepcopy(tree)
            for k in path:
                step = step[k]
            # Replacement label: the most common leaf label under this node.
            # (The flattened original took the feature value of the sorted
            # first item here, which is not a class label.)
            vote_label = Counter(step.values()).most_common(1)[0][0]
            set_vote_label(tree, path, vote_label)
            acc_after_cut = self.calc_acc_val(tree, val_feature, val_label)
            if acc_after_cut > acc_before_cut:
                set_vote_label(self.tree, path, vote_label)
                acc_before_cut = acc_after_cut

    def fit(self, train_feature, train_label, val_feature, val_label):
        # Train the decision tree, then post-prune it on the validation set
        self.tree = self.createTree(train_feature, train_label)
        self.post_cut(val_feature, val_label)

    def predict(self, feature):
        # Predict a label for each sample
        result = []

        def classify(tree, feature):
            if not isinstance(tree, dict):
                return tree
            t_index, t_value = list(tree.items())[0]
            f_value = feature[t_index]
            if isinstance(t_value, dict):
                classLabel = classify(tree[t_index][f_value], feature)
                return classLabel
            else:
                return t_value

        for f in feature:
            result.append(classify(self.tree, f))
        return np.array(result)

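Level 7: Iris Recognition

Learn how to use the DecisionTreeClassifier provided by sklearn.

Data overview:
The iris dataset is a multivariate dataset. Four attributes (sepal length, sepal width, petal length, and petal width) are used to predict which of three species (Setosa, Versicolour, Virginica) an iris belongs to; the species are encoded as 0, 1, and 2 respectively.

Part of the data and its labels are shown in the figure below:
[Image: sample rows of the iris data and labels]

DecisionTreeClassifier
Two commonly used parameters can be set in the DecisionTreeClassifier constructor:

criterion: the metric used when splitting a node. Options include gini (Gini coefficient) and entropy (information gain). Defaults to gini if not set.
max_depth: the maximum depth of the decision tree. If the model is overfitting, try lowering this parameter. Defaults to None if not set.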

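Like the other classifiers in sklearn, DecisionTreeClassifier uses its fit function to train the model. fit takes two array inputs:

X: an ndarray of shape [n_samples, n_features] holding the training samples;
Y: an ndarray of integer values of shape [n_samples] holding the class labels of the training samples.

The predict function of DecisionTreeClassifier makes predictions and returns the predicted labels. predict takes one array input:

X: an ndarray of shape [n_samples, n_features] holding the samples to predict.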

DecisionTreeClassifier is used as follows:

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
clf.fit(X_train, Y_train)
result = clf.predict(X_test)
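As a hedged illustration of the two constructor parameters, the sketch below sets `criterion` and `max_depth` explicitly. It uses the copy of the iris data bundled with sklearn (`load_iris`) rather than the exercise's CSV files. The 70/30 split, `random_state=0` and `max_depth=3` are arbitrary choices made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Bundled iris data: X has 4 feature columns, y holds labels 0, 1, 2
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Split on entropy instead of the default gini, and cap the depth at 3
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
test_accuracy = clf.score(X_test, y_test)  # mean accuracy on the held-out set
print(clf.get_depth(), test_accuracy)
```

Comparing `test_accuracy` for a few values of `max_depth` is one simple way to see whether a shallower tree generalizes better on a given split.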

Code

#********* Begin *********#
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same ndarray
train_df = pd.read_csv('./step7/train_data.csv').values
# Flatten the labels to a 1-D array for fit (assumes the label file has one column)
train_label = pd.read_csv('./step7/train_label.csv').values.ravel()
test_df = pd.read_csv('./step7/test_data.csv').values

dt = DecisionTreeClassifier()
dt.fit(train_df, train_label)
result = dt.predict(test_df)

result = pd.DataFrame({'target': result})
result.to_csv('./step7/predict.csv', index=False)
#********* End *********#
