news 2025/7/5 13:42:36

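The Yelp dataset is a good resource for studying B2C businesses. Identifying potentially popular businesses is a multi-dimensional analysis that involves user behavior, business characteristics, and community structure. The following information, all of which can be mined from the Yelp dataset, helps identify popular businesses.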

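User ratings and review analysis

Average rating: A business's average rating is an important indicator of its popularity. A high average rating usually means high customer satisfaction, so the business is more likely to become popular.
Review count: The number of reviews reflects how active a business is and how engaged its users are. Businesses with many reviews are more likely to attract wide attention.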

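User activity

User rating behavior: Analyzing the ratings given by active users (those who rate frequently) shows which businesses are more popular among the user base.
User influence: Some users' ratings strongly influence other users' choices (for example, social media influencers). Looking at how these high-influence users rate a business can help identify potentially popular businesses.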
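One simple way to let active users count for more is to weight each rating by the reviewer's activity. The sketch below is only an illustration: the weight `log(1 + review_count)`, the function name `weighted_ratings`, and the sample records are assumptions, not something defined by the Yelp dataset or by this analysis.

```python
import math
from collections import defaultdict

def weighted_ratings(reviews, user_review_counts):
    """Average rating per business, weighting each review by log(1 + reviewer activity)."""
    totals = defaultdict(float)   # sum of weight * stars per business
    weights = defaultdict(float)  # sum of weights per business
    for user_id, business_id, stars in reviews:
        w = math.log1p(user_review_counts.get(user_id, 0))
        totals[business_id] += w * stars
        weights[business_id] += w
    return {b: totals[b] / weights[b] for b in totals if weights[b] > 0}

# Hypothetical sample data: (user_id, business_id, stars)
sample_reviews = [("u1", "b1", 5), ("u2", "b1", 1), ("u1", "b2", 4)]
sample_counts = {"u1": 99, "u2": 1}  # u1 is far more active than u2
scores = weighted_ratings(sample_reviews, sample_counts)
```

With these numbers the plain mean for `b1` would be 3.0, while the weighted mean sits much closer to the rating given by the more active user.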

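Social network analysis

User-business relationship network: Use algorithms such as graph neural networks to analyze the relationships between users and businesses. A business that interacts with many users, especially users with high influence in the network, is a likely candidate for a popular business.
Community discovery: By analyzing the user-business relationship network, identify groups of similar users, then identify the businesses that are popular within those groups.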
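Full community detection would normally use a graph library, but a useful first step is to project the user-business network onto businesses: two businesses are linked when the same user reviewed both. The sketch below counts those shared reviewers. The function name `co_review_counts` and the sample pairs are hypothetical, and the pairwise loop is only practical for users with a modest number of reviews.

```python
from collections import defaultdict
from itertools import combinations

def co_review_counts(reviews):
    """Count, for each pair of businesses, how many users reviewed both."""
    by_user = defaultdict(set)
    for user_id, business_id in reviews:
        by_user[user_id].add(business_id)
    pair_counts = defaultdict(int)
    for businesses in by_user.values():
        # Sorting makes (a, b) and (b, a) land on the same key
        for a, b in combinations(sorted(businesses), 2):
            pair_counts[(a, b)] += 1
    return dict(pair_counts)

# Hypothetical sample data: (user_id, business_id)
sample = [("u1", "b1"), ("u1", "b2"), ("u2", "b1"), ("u2", "b2"), ("u3", "b3")]
pairs = co_review_counts(sample)
```

Pairs with high counts share an audience, which is the raw material a community-detection step would group.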

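Multi-dimensional evaluation

Composite evaluation: Combine several indicators (such as rating, review count, user activity, and geographic location) using a weighted method or a multi-criteria decision model to assess a business's overall popularity.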
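As a minimal sketch of the weighted method, each metric can be min-max scaled to [0, 1] and then combined with a weight per metric. The metric names, weights, and sample rows below are assumptions chosen for illustration only.

```python
def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def composite_scores(rows, weights):
    """rows: list of dicts with one numeric value per metric; weights: metric -> weight."""
    scaled = {m: min_max([r[m] for r in rows]) for m in weights}
    total = sum(weights.values())
    return [
        sum(weights[m] * scaled[m][i] for m in weights) / total
        for i in range(len(rows))
    ]

# Hypothetical metrics and weights
rows = [
    {"rating": 4.5, "reviews": 200},
    {"rating": 3.0, "reviews": 50},
    {"rating": 5.0, "reviews": 50},
]
scores = composite_scores(rows, {"rating": 0.6, "reviews": 0.4})
```

Here the first business ranks highest even though the third has a better rating, because the review count carries 40% of the weight.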

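Files used

yelp_academic_dataset_business.json:
- Basic business information such as business ID, name, categories, and location.
yelp_academic_dataset_review.json:
- Users' reviews and ratings of businesses; used to analyze business popularity and user behavior.
yelp_academic_dataset_user.json:
- Basic user information such as user ID, registration date, and number of reviews; used to analyze user activity and influence.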
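Each of these files stores one JSON object per line, and the review file in particular can be large. When experimenting, it can help to stream the file and keep only the fields that are needed. The sketch below assumes that approach; `read_json_lines` is a hypothetical helper, and an in-memory `StringIO` stands in for the real file.

```python
import io
import json
from itertools import islice

def read_json_lines(handle, fields, limit=None):
    """Yield one dict per line, keeping only the requested fields."""
    for line in islice(handle, limit):
        record = json.loads(line)
        yield {k: record.get(k) for k in fields}

# In-memory stand-in for a file such as yelp_academic_dataset_review.json
fake_file = io.StringIO(
    '{"user_id": "u1", "business_id": "b1", "stars": 5, "text": "great"}\n'
    '{"user_id": "u2", "business_id": "b1", "stars": 3, "text": "ok"}\n'
    '{"user_id": "u3", "business_id": "b2", "stars": 4, "text": "good"}\n'
)
rows = list(read_json_lines(fake_file, ["user_id", "business_id", "stars"], limit=2))
```

With a real file, `fake_file` would be replaced by `open(path, encoding='utf-8')`, and the resulting rows can be passed straight to `pd.DataFrame`.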

Identifying business influence with a graph neural network (GNN):

First, load the required libraries and read the data files:

import json
import pandas as pd

# Read the data (each file holds one JSON object per line)
with open('yelp_academic_dataset_business.json', 'r', encoding='utf-8') as f:
    businesses = pd.DataFrame([json.loads(line) for line in f])

with open('yelp_academic_dataset_review.json', 'r', encoding='utf-8') as f:
    reviews = pd.DataFrame([json.loads(line) for line in f])

with open('yelp_academic_dataset_user.json', 'r', encoding='utf-8') as f:
    users = pd.DataFrame([json.loads(line) for line in f])

Clean the data to extract the useful information:

# Keep only the business and user columns we need
businesses = businesses[['business_id', 'name', 'categories', 'city', 'state', 'review_count', 'stars']].copy()
reviews = reviews[['user_id', 'business_id', 'stars']]
users = users[['user_id', 'review_count', 'average_stars']]

# Process categories: keep the first listed category (None when categories is missing)
businesses['categories'] = businesses['categories'].str.split(', ').apply(
    lambda x: x[0] if isinstance(x, list) and x else None
)

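Build a graph of businesses and users: the nodes are businesses and users, and an edge connects a user to each business that user has rated.

    # Tail of build_graph(); node_mapping (ID -> node index) and total_nodes
    # are built earlier in the function (the full script further down shows it all).
    edges = []
    for _, row in reviews.iterrows():
        if row['user_id'] in node_mapping and row['business_id'] in node_mapping:
            edges.append([node_mapping[row['user_id']], node_mapping[row['business_id']]])
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return node_mapping, edge_index, total_nodes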
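The snippet above adds each edge in one direction only, from user to business. PyTorch Geometric passes messages from source to target by default, so with one-way edges the user nodes never receive information from the businesses they reviewed. A minimal sketch of a two-way edge list follows, in plain Python so it runs without PyTorch. `build_edge_list` and the sample pairs are hypothetical.

```python
def build_edge_list(review_pairs):
    """Map users and businesses to integer node ids and return edges in both directions."""
    users = sorted({u for u, _ in review_pairs})
    businesses = sorted({b for _, b in review_pairs})
    # Users take indices 0..U-1; business indices follow on from there
    node_id = {u: i for i, u in enumerate(users)}
    node_id.update({b: i + len(users) for i, b in enumerate(businesses)})

    sources, targets = [], []
    for u, b in set(review_pairs):           # drop duplicate user-business pairs
        sources += [node_id[u], node_id[b]]  # user -> business, business -> user
        targets += [node_id[b], node_id[u]]
    return node_id, [sources, targets]

# Hypothetical sample data: (user_id, business_id), with one duplicate
pairs = [("u1", "b1"), ("u2", "b1"), ("u1", "b1")]
node_id, edge_index = build_edge_list(pairs)
```

The nested list already has the `[2, num_edges]` layout, so `torch.tensor(edge_index, dtype=torch.long)` would turn it into a PyG edge index. PyTorch Geometric also ships a `to_undirected` utility in `torch_geometric.utils` that symmetrises an existing edge index.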

Business influence can be computed from two statistics:

Average user rating: indicates how popular the business is.
Review count: a direct indicator of the business's influence.

business_reviews = reviews.groupby('business_id').agg({'stars': ['mean', 'count']}).reset_index()
business_reviews.columns = ['business_id', 'average_rating', 'review_count']

# Merge business information with the review statistics
merged_data = businesses.merge(business_reviews, on='business_id', how='left')

# 3. Define the target variable
# Criteria for a popular business
merged_data['is_popular'] = (
    (merged_data['average_rating'] > 4.0) &
    (merged_data['review_count'] > 10)
).astype(int)

To analyze business influence further with a GNN, build and train a GNN model. Below is a basic example of such a model using PyTorch Geometric:

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv


class GNNModel(torch.nn.Module):
    def __init__(self, num_node_features):
        super().__init__()
        # Three graph convolution layers followed by a linear output head
        self.conv1 = GCNConv(num_node_features, 64)
        self.conv2 = GCNConv(64, 32)
        self.conv3 = GCNConv(32, 16)
        self.fc = torch.nn.Linear(16, 1)
        self.dropout = torch.nn.Dropout(0.3)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = self.dropout(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.dropout(x)
        x = F.relu(self.conv3(x, edge_index))
        x = self.fc(x)  # one raw logit per node
        return x
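For reference, each `GCNConv` layer follows the propagation rule of the graph convolutional network of Kipf and Welling, with self-loops added to the adjacency matrix. The symbols below are introduced here for illustration and do not appear elsewhere in this article: $A$ is the adjacency matrix built from `edge_index`, $H^{(l)}$ is the node feature matrix entering layer $l$, $W^{(l)}$ is that layer's weight matrix, and $\sigma$ is the nonlinearity.

```latex
H^{(l+1)} = \sigma\!\left( \tilde{D}^{-1/2}\, \tilde{A}\, \tilde{D}^{-1/2}\, H^{(l)}\, W^{(l)} \right),
\qquad \tilde{A} = A + I,
\qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}
```

In the model above, `GCNConv` computes the normalised aggregation and the linear map, while $\sigma$ corresponds to the `F.relu` call applied in `forward`. With three stacked layers, each node's output depends on neighbours up to three hops away.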

Use the model's output to identify potentially popular businesses: the per-node logit is passed through a sigmoid to give a popularity score, and businesses that score high but are not yet labelled popular are the candidates.

print("Making predictions...")
model.eval()
with torch.no_grad():
    predictions = torch.sigmoid(model(data.x.to(device), data.edge_index.to(device))).cpu()

# Add the predictions to the DataFrame
merged_data['predicted_popularity'] = 0.0
for _, row in merged_data.iterrows():
    if row['business_id'] in node_mapping:
        idx = node_mapping[row['business_id']]
        merged_data.loc[row.name, 'predicted_popularity'] = predictions[idx].item()

# Output the potential hot businesses
potential_hot = merged_data[
    (merged_data['predicted_popularity'] > 0.5) &
    (merged_data['is_popular'] == 0)
].sort_values('predicted_popularity', ascending=False)

print("\nPotential Hot Businesses:")
print(potential_hot[['name', 'average_rating', 'review_count', 'predicted_popularity']].head())

Running training with the pipeline defined above raised an error:

Traceback (most recent call last):
  File "/opt/miniconda3/envs/lora/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'review_count'

Add print('merged_data', merged_data) and try again:

[150346 rows x 16 columns]
Index(['business_id', 'name', 'address', 'city', 'state', 'postal_code',
       'latitude', 'longitude', 'stars', 'review_count_x', 'is_open',
       'attributes', 'categories', 'hours', 'average_rating',
       'review_count_y'],
      dtype='object')

The review_count column has been renamed to review_count_x and review_count_y. This happens because both DataFrames in the merge contain a review_count column. To continue, one of the two has to be chosen as the review count. pandas appends _x to the column from the left DataFrame (here businesses) and _y to the column from the right DataFrame (here business_reviews).
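The suffix behaviour is easy to reproduce on a toy example. The sketch below uses two small hypothetical frames to show the default `_x`/`_y` columns and two common ways around them: explicit `suffixes`, or dropping the clashing column on one side before merging.

```python
import pandas as pd

left = pd.DataFrame({"business_id": ["b1", "b2"], "review_count": [10, 3]})
right = pd.DataFrame({"business_id": ["b1", "b2"], "review_count": [8, 3]})

# Default behaviour: the clashing column is split into _x (left) and _y (right).
default_cols = list(left.merge(right, on="business_id", how="left").columns)

# Option 1: pick explicit suffixes so the origin of each column is obvious.
named = left.merge(right, on="business_id", how="left",
                   suffixes=("_listed", "_computed"))

# Option 2: drop the left-hand column before merging so only one remains.
single = left.drop(columns="review_count").merge(right, on="business_id", how="left")
```

Option 2 keeps a single `review_count` column, so downstream code that refers to `review_count` works unchanged.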

The revised code:

import torch
import pandas as pd
import numpy as np
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from sklearn.preprocessing import StandardScaler


# 1. Data loading
def load_data():
    businesses = pd.read_json('yelp_academic_dataset_business.json', lines=True)
    reviews = pd.read_json('yelp_academic_dataset_review.json', lines=True)
    users = pd.read_json('yelp_academic_dataset_user.json', lines=True)
    return businesses, reviews, users


# 2. Data preprocessing
def preprocess_data(businesses, reviews):
    # Aggregate the review data per business
    business_reviews = reviews.groupby('business_id').agg({
        'stars': ['mean', 'count'],
        'useful': 'sum',
        'funny': 'sum',
        'cool': 'sum'
    }).reset_index()

    # Flatten the column names
    business_reviews.columns = ['business_id', 'average_rating', 'review_count',
                                'total_useful', 'total_funny', 'total_cool']

    # Drop review_count from businesses (if present) so the merge keeps a single column
    if 'review_count' in businesses.columns:
        businesses = businesses.drop('review_count', axis=1)

    # Merge business information with the review statistics
    merged_data = businesses.merge(business_reviews, on='business_id', how='left')

    # Fill missing values
    merged_data = merged_data.fillna(0)
    return merged_data


# 3. Feature engineering
def engineer_features(merged_data):
    # Engagement per review; add 1 to avoid division by zero
    merged_data['engagement_score'] = (
        merged_data['total_useful'] +
        merged_data['total_funny'] +
        merged_data['total_cool']
    ) / (merged_data['review_count'] + 1)

    # Define popular businesses: rating >= 4.0 and review count in the top quartile
    merged_data['is_popular'] = (
        (merged_data['average_rating'] >= 4.0) &
        (merged_data['review_count'] >= merged_data['review_count'].quantile(0.75))
    ).astype(int)
    return merged_data


# 4. Graph construction
def build_graph(merged_data, reviews):
    # Create the node mapping
    business_ids = merged_data['business_id'].unique()
    user_ids = reviews['user_id'].unique()

    # User nodes take indices starting from 0
    node_mapping = {user_id: i for i, user_id in enumerate(user_ids)}

    # Business node indices continue after the user node indices
    business_start_idx = len(user_ids)
    node_mapping.update({business_id: i + business_start_idx
                         for i, business_id in enumerate(business_ids)})

    # Total number of nodes
    total_nodes = len(user_ids) + len(business_ids)

    # Create the edges (user -> business)
    edges = []
    for _, row in reviews.iterrows():
        if row['user_id'] in node_mapping and row['business_id'] in node_mapping:
            edges.append([node_mapping[row['user_id']], node_mapping[row['business_id']]])
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    return node_mapping, edge_index, total_nodes


def prepare_node_features(merged_data, node_mapping, num_user_nodes, total_nodes):
    feature_cols = ['average_rating', 'review_count', 'engagement_score']

    # Make sure all feature columns are numeric
    for col in feature_cols:
        merged_data[col] = merged_data[col].astype(float)

    # Standardize the features.
    # Note: this overwrites the raw columns in merged_data with z-scores,
    # so anything printed from merged_data afterwards shows standardized values.
    scaler = StandardScaler()
    merged_data[feature_cols] = scaler.fit_transform(merged_data[feature_cols])

    # Create the feature matrix for all nodes
    num_features = len(feature_cols)
    x = torch.zeros(total_nodes, num_features, dtype=torch.float)

    # User node features (use the mean of the business features)
    mean_values = merged_data[feature_cols].mean().values.astype(np.float32)
    x[:num_user_nodes] = torch.tensor(mean_values, dtype=torch.float)

    # Business node features
    for _, row in merged_data.iterrows():
        if row['business_id'] in node_mapping:
            idx = node_mapping[row['business_id']]
            feature_values = row[feature_cols].values.astype(np.float32)
            if not np.isfinite(feature_values).all():
                print(f"Warning: found invalid values {feature_values}")
                feature_values = np.nan_to_num(feature_values, nan=0.0)
            x[idx] = torch.tensor(feature_values, dtype=torch.float)
    return x


def main():
    print("Starting the program...")

    # Select the device
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    print(f"Using device: {device}")

    # Load the data
    print("Loading data...")
    businesses, reviews, users = load_data()

    # Preprocess the data
    print("Preprocessing data...")
    merged_data = preprocess_data(businesses, reviews)
    merged_data = engineer_features(merged_data)

    # Build the graph
    print("Building graph...")
    node_mapping, edge_index, total_nodes = build_graph(merged_data, reviews)
    num_user_nodes = len(reviews['user_id'].unique())

    # Print node information
    print(f"Total nodes: {total_nodes}")
    print(f"User nodes: {num_user_nodes}")
    print(f"Business nodes: {total_nodes - num_user_nodes}")
    print(f"Max node index in mapping: {max(node_mapping.values())}")

    # Prepare the features
    print("Preparing node features...")
    x = prepare_node_features(merged_data, node_mapping, num_user_nodes, total_nodes)

    # Prepare the labels; only business nodes carry a label
    print("Preparing labels...")
    labels = torch.zeros(total_nodes)
    business_mask = torch.zeros(total_nodes, dtype=torch.bool)
    for _, row in merged_data.iterrows():
        if row['business_id'] in node_mapping:
            idx = node_mapping[row['business_id']]
            labels[idx] = row['is_popular']
            business_mask[idx] = True

    # Create the graph data object
    data = Data(x=x, edge_index=edge_index)

    # Initialize the model
    print("Initializing model...")
    model = GNNModel(num_node_features=x.size(1)).to(device)

    # Train the model
    print("Training model...")
    train_model(model, data, labels, business_mask, device)

    # Predict
    print("Making predictions...")
    model.eval()
    with torch.no_grad():
        predictions = torch.sigmoid(model(data.x.to(device), data.edge_index.to(device))).cpu()

    # Add the predictions to the DataFrame
    merged_data['predicted_popularity'] = 0.0
    for _, row in merged_data.iterrows():
        if row['business_id'] in node_mapping:
            idx = node_mapping[row['business_id']]
            merged_data.loc[row.name, 'predicted_popularity'] = predictions[idx].item()

    # Output the potential hot businesses
    potential_hot = merged_data[
        (merged_data['predicted_popularity'] > 0.5) &
        (merged_data['is_popular'] == 0)
    ].sort_values('predicted_popularity', ascending=False)

    print("\nPotential Hot Businesses:")
    print(potential_hot[['name', 'average_rating', 'review_count', 'predicted_popularity']].head())


# 6. GNN model definition
class GNNModel(torch.nn.Module):
    def __init__(self, num_node_features):
        super().__init__()
        self.conv1 = GCNConv(num_node_features, 64)
        self.conv2 = GCNConv(64, 32)
        self.conv3 = GCNConv(32, 16)
        self.fc = torch.nn.Linear(16, 1)
        self.dropout = torch.nn.Dropout(0.3)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = self.dropout(x)
        x = F.relu(self.conv2(x, edge_index))
        x = self.dropout(x)
        x = F.relu(self.conv3(x, edge_index))
        x = self.fc(x)
        return x


# 7. Training function
def train_model(model, data, labels, business_mask, device, epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
    criterion = torch.nn.BCEWithLogitsLoss()
    model.train()
    for epoch in range(epochs):
        optimizer.zero_grad()
        out = model(data.x.to(device), data.edge_index.to(device))
        # The loss is computed on business nodes only
        loss = criterion(out[business_mask], labels[business_mask].unsqueeze(1).to(device))
        loss.backward()
        optimizer.step()
        print(f'Epoch [{epoch + 1}/{epochs}], Loss: {loss.item():.4f}')


if __name__ == "__main__":
    main()
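The training loop above computes the loss on every business node and never evaluates on held-out nodes, so a falling loss alone does not show how well the model generalises. One possible extension, sketched here in plain Python, is to split the business node indices into training and validation sets and report accuracy on the validation part. `split_indices`, `accuracy`, the 80/20 ratio, and the sample indices are all assumptions, not part of the original script.

```python
import random

def split_indices(indices, val_fraction=0.2, seed=0):
    """Shuffle node indices and split them into train and validation lists."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]

def accuracy(probabilities, labels, indices, threshold=0.5):
    """Fraction of the given nodes whose thresholded prediction matches the label."""
    hits = sum((probabilities[i] > threshold) == bool(labels[i]) for i in indices)
    return hits / len(indices)

business_nodes = range(100, 110)   # hypothetical business node indices
train_idx, val_idx = split_indices(business_nodes)
```

The two index lists can be turned into boolean masks in the same way as `business_mask`, with the loss computed on the training mask and `accuracy` reported on the validation mask after each epoch.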

Now run the actual training, starting with epochs=100 as a test run; the loss moves toward convergence.

Potential hot businesses identified:

Potential Hot Businesses:
                                   name  average_rating  review_count  predicted_popularity
100024              Mother's Restaurant       -0.154731     41.821089              0.999941
31033                       Royal House        0.207003     40.953749              0.999933
113983             Pat's King of Steaks       -0.361171     34.103369              0.999805
64541   Felix's Restaurant & Oyster Bar        0.389155     32.023360              0.999725
42331                        Gumbo Shop        0.340872     31.517411              0.999701
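The average_rating and review_count columns in this output are not raw values. `prepare_node_features` writes the `StandardScaler` result back into `merged_data`, so the table shows z-scores: a rating of -0.15 is slightly below the dataset mean, and a review count of 41.8 is about 42 standard deviations above it. One way to keep the printed table readable, sketched below with plain Python and hypothetical numbers, is to scale into a separate structure and leave the original values untouched.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores as a new list; the input list is left unchanged."""
    mu, sigma = mean(values), pstdev(values)  # population std, as StandardScaler uses
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

raw_review_counts = [10, 20, 30, 7000]      # hypothetical raw counts
scaled = standardize(raw_review_counts)     # would be used as model features
```

In the pandas version, the equivalent change would be to pass `merged_data[feature_cols]` to the scaler and store the result in a new array instead of assigning it back to the same columns.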
