Table of Contents
- 1. Dataset Introduction
- Iris plants dataset
- 2. Code
- 3. Choosing the Value of K
1. Dataset Introduction
The Iris Dataset
The Iris dataset is a classic dataset in machine learning. It contains 150 iris samples, 50 from each of three species: Versicolour, Setosa, and Virginica.
Each flower is described by four numeric attributes (sepal length, sepal width, petal length, and petal width, all in cm), plus its class label, as shown in the dataset description below:
from sklearn.datasets import load_iris
# 1. Load the dataset
iris = load_iris()
iris.data    # feature matrix, shape (150, 4)
iris.target  # class labels encoded as 0, 1 and 2
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
print(iris.DESCR)
Iris plants dataset
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
        - Iris-Setosa
        - Iris-Versicolour
        - Iris-Virginica

:Summary Statistics:

============== ==== ==== ======= ===== ====================
                Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
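The same metadata can also be read directly off the loaded Bunch object. A minimal sketch for reference (feature_names, target_names, data and target are standard load_iris fields; the inline output comments are illustrative, not taken from the original post):

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)  # ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.data.shape)     # (150, 4): 150 samples, 4 numeric features
print(iris.target.shape)   # (150,): one integer label (0, 1 or 2) per sample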
2. Code
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

if __name__ == '__main__':
    # 1. Load the dataset; iris.data holds the feature values, iris.target the target labels
    iris = load_iris()
    # 2. Standardize the features
    transformer = StandardScaler()
    x_ = transformer.fit_transform(iris.data)
    # 3. Train the model; n_neighbors is the number of neighbors, i.e. the K in KNN
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_, iris.target)  # fit on the features and target labels
    # 4. Predict with the trained model
    result = estimator.predict(x_)
    print(result)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
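A few entries in the printed predictions differ from iris.target. To quantify this, a short fragment that could be appended to the end of the script above (accuracy_score comes from sklearn.metrics; the value it prints is produced at runtime and is not part of the original post):

from sklearn.metrics import accuracy_score

# Fraction of training samples whose prediction matches the true label.
print('training accuracy:', accuracy_score(iris.target, result))

Note that this measures accuracy on the very data the model was fitted on; section 3 evaluates on a held-out test set instead.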
3. Choosing the Value of K
The crux of the KNN algorithm is the choice of K. In the classic illustration, with K=3 the query point is assigned to the red triangles, while with K=5 it is assigned to the blue squares; this is exactly the situation where choosing K becomes difficult.
In KNN, a K that is either too large or too small hurts performance, so a relatively small value is usually chosen.
Cross-validation (splitting the training data further into a training set and a validation set) is used to select the optimal K, as in the grid-search example below.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Load the dataset
x, y = load_iris(return_X_y=True)
# Standardize the features
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x_scaled, y, test_size=0.2, random_state=0)
# Create the grid-search object over candidate K values
knn = KNeighborsClassifier()
param_grid = {'n_neighbors': [1, 3, 5, 7]}
estimator = GridSearchCV(knn, param_grid, cv=5)
# Train the model
estimator.fit(x_train, y_train)
# Print the best parameter combination and its score on the validation folds
print('best parameters:', estimator.best_params_, 'best CV score:', estimator.best_score_)
# Evaluate the model on the held-out test set
print('test accuracy:', estimator.score(x_test, y_test))
best parameters: {'n_neighbors': 7} best CV score: 0.9416666666666667
test accuracy: 1.0
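To see how the validation accuracy varies across all candidate K values rather than only the best one, the fitted GridSearchCV object exposes cv_results_. A minimal sketch continuing from the code above (its output is produced at runtime and is not part of the original post):

# Mean cross-validation accuracy for each candidate K.
for k, score in zip(estimator.cv_results_['param_n_neighbors'],
                    estimator.cv_results_['mean_test_score']):
    print(f'K={k}: mean CV accuracy = {score:.4f}')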