Table of Contents
Introduction
The dataset
Setup
Prepare the data
Transform the movie ratings data into sequences
Define metadata
Create tf.data.Dataset for training and evaluation
Create model inputs
Encode input features
Create the BST model
Run the training and evaluation experiment
Goal of this article: predict movie ratings on Movielens using the Behavior Sequence Transformer (BST) model.
Introduction
This example demonstrates the Behavior Sequence Transformer (BST) model, by Qiwei Chen et al., using the Movielens dataset. The BST model leverages the sequential behavior of users in watching and rating movies, as well as user profile and movie features, to predict the user's rating of a target movie.

More precisely, the BST model aims to predict the rating of a target movie by accepting the following inputs:

- A fixed-length sequence of the movie_ids watched by a user.
- A fixed-length sequence of the ratings of the movies watched by a user.
- A set of user features, including user_id, sex, age_group, and occupation.
- A set of genres for each movie in the input sequence and for the target movie.
- The target_movie_id for which to predict the rating.

This example modifies the original BST model in the following ways:

1. We incorporate the movie features (genres) into the processing of the embedding of each movie in the input sequence and of the target movie, rather than treating them as "other features" outside the transformer layer.
2. We utilize the ratings of the movies in the input sequence, together with their positions in the sequence, to update the movie embeddings before feeding them into the self-attention layer.

(Note that this example should run with TensorFlow 2.4 or higher.)
The dataset
We use the 1M version of the Movielens dataset. The dataset contains around 1 million ratings from 6,000 users on 4,000 movies, along with some user features and movie genres. In addition, the dataset provides the timestamp of each user-movie rating, which allows creating sequences of movie ratings for each user, as expected by the BST model.
Setup
import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import math
from zipfile import ZipFile
from urllib.request import urlretrieve

import keras
import numpy as np
import pandas as pd
import tensorflow as tf
from keras import layers
from keras.layers import StringLookup
Prepare the data
Download and prepare the DataFrames
First, let's download the movielens data.

The downloaded folder will contain three data files: users.dat, movies.dat, and ratings.dat.
urlretrieve("http://files.grouplens.org/datasets/movielens/ml-1m.zip", "movielens.zip")
ZipFile("movielens.zip", "r").extractall()
Then, we load the data into pandas DataFrames with their proper column names.
users = pd.read_csv(
    "ml-1m/users.dat",
    sep="::",
    names=["user_id", "sex", "age_group", "occupation", "zip_code"],
    encoding="ISO-8859-1",
    engine="python",
)

ratings = pd.read_csv(
    "ml-1m/ratings.dat",
    sep="::",
    names=["user_id", "movie_id", "rating", "unix_timestamp"],
    encoding="ISO-8859-1",
    engine="python",
)

movies = pd.read_csv(
    "ml-1m/movies.dat",
    sep="::",
    names=["movie_id", "title", "genres"],
    encoding="ISO-8859-1",
    engine="python",
)
Here, we do some simple data processing to fix the data types of the columns.
users["user_id"] = users["user_id"].apply(lambda x: f"user_{x}")
users["age_group"] = users["age_group"].apply(lambda x: f"group_{x}")
users["occupation"] = users["occupation"].apply(lambda x: f"occupation_{x}")movies["movie_id"] = movies["movie_id"].apply(lambda x: f"movie_{x}")ratings["movie_id"] = ratings["movie_id"].apply(lambda x: f"movie_{x}")
ratings["user_id"] = ratings["user_id"].apply(lambda x: f"user_{x}")
ratings["rating"] = ratings["rating"].apply(lambda x: float(x))
Each movie has multiple genres. We split them into separate columns in the movies DataFrame.
genres = ["Action", "Adventure", "Animation", "Children's", "Comedy", "Crime"]
genres += ["Documentary", "Drama", "Fantasy", "Film-Noir", "Horror", "Musical"]
genres += ["Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"]

for genre in genres:
    movies[genre] = movies["genres"].apply(
        lambda values: int(genre in values.split("|"))
    )
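As a quick, optional sanity check (illustrative only; it assumes the movie_id prefixing done above), a movie tagged "Animation|Children's|Comedy", such as movie_1 (Toy Story), should now have 1 in exactly those genre columns and 0 elsewhere:

# Illustrative check: multi-hot genre columns for one movie.
print(movies.loc[movies.movie_id == "movie_1", ["title", "Animation", "Children's", "Comedy", "Action"]])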
Transform the movie ratings data into sequences
First, let's sort the ratings data using unix_timestamp, and then group the movie_id values and the rating values by user_id.
ratings_group = ratings.sort_values(by=["unix_timestamp"]).groupby("user_id")

ratings_data = pd.DataFrame(
    data={
        "user_id": list(ratings_group.groups.keys()),
        "movie_ids": list(ratings_group.movie_id.apply(list)),
        "ratings": list(ratings_group.rating.apply(list)),
        "timestamps": list(ratings_group.unix_timestamp.apply(list)),
    }
)
Now, let's split the movie_ids list into a set of sequences of a fixed length. We do the same for the ratings. Set the sequence_length variable to change the length of the input sequence to the model. You can also change the step_size to control the number of sequences to generate for each user.
sequence_length = 4
step_size = 2


def create_sequences(values, window_size, step_size):
    sequences = []
    start_index = 0
    while True:
        end_index = start_index + window_size
        seq = values[start_index:end_index]
        if len(seq) < window_size:
            seq = values[-window_size:]
            if len(seq) == window_size:
                sequences.append(seq)
            break
        sequences.append(seq)
        start_index += step_size
    return sequences


ratings_data.movie_ids = ratings_data.movie_ids.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

ratings_data.ratings = ratings_data.ratings.apply(
    lambda ids: create_sequences(ids, sequence_length, step_size)
)

del ratings_data["timestamps"]
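To see what create_sequences produces, here is a small illustrative call on toy input (not part of the original pipeline): a list of 7 values with window_size=4 and step_size=2 yields two strided windows, plus a final window taken from the tail of the list so the last values are covered.

# Toy example: strided windows, then one tail window to cover the end.
print(create_sequences(list(range(1, 8)), window_size=4, step_size=2))
# [[1, 2, 3, 4], [3, 4, 5, 6], [4, 5, 6, 7]]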
Then, we process the output to have each sequence in a separate record in the DataFrame. In addition, we join the user features with the ratings data.
ratings_data_movies = ratings_data[["user_id", "movie_ids"]].explode(
    "movie_ids", ignore_index=True
)
ratings_data_rating = ratings_data[["ratings"]].explode("ratings", ignore_index=True)
ratings_data_transformed = pd.concat([ratings_data_movies, ratings_data_rating], axis=1)
ratings_data_transformed = ratings_data_transformed.join(
    users.set_index("user_id"), on="user_id"
)
ratings_data_transformed.movie_ids = ratings_data_transformed.movie_ids.apply(
    lambda x: ",".join(x)
)
ratings_data_transformed.ratings = ratings_data_transformed.ratings.apply(
    lambda x: ",".join([str(v) for v in x])
)

del ratings_data_transformed["zip_code"]

ratings_data_transformed.rename(
    columns={"movie_ids": "sequence_movie_ids", "ratings": "sequence_ratings"},
    inplace=True,
)
With sequence_length of 4 and step_size of 2, we end up with 498,623 sequences. Finally, we split the data into training and testing splits, with 85% and 15% of the instances respectively, and store them in CSV files.
random_selection = np.random.rand(len(ratings_data_transformed.index)) <= 0.85
train_data = ratings_data_transformed[random_selection]
test_data = ratings_data_transformed[~random_selection]

train_data.to_csv("train_data.csv", index=False, sep="|", header=False)
test_data.to_csv("test_data.csv", index=False, sep="|", header=False)
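As an optional check (the random mask makes the split approximate, not exact), the two parts should add up to the 498,623 sequences mentioned above:

# Illustrative: verify the total row count and the rough 85/15 split.
print(len(ratings_data_transformed))    # 498623
print(len(train_data), len(test_data))  # approximately 85% / 15%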
Define metadata
CSV_HEADER = list(ratings_data_transformed.columns)

CATEGORICAL_FEATURES_WITH_VOCABULARY = {
    "user_id": list(users.user_id.unique()),
    "movie_id": list(movies.movie_id.unique()),
    "sex": list(users.sex.unique()),
    "age_group": list(users.age_group.unique()),
    "occupation": list(users.occupation.unique()),
}

USER_FEATURES = ["sex", "age_group", "occupation"]

MOVIE_FEATURES = ["genres"]
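As an optional illustration, the vocabulary sizes defined here matter later: encode_input_features sets each embedding dimension to the integer square root of the corresponding vocabulary size.

# Illustrative: vocabulary sizes and the embedding dims derived from them.
for name, vocab in CATEGORICAL_FEATURES_WITH_VOCABULARY.items():
    print(name, len(vocab), int(math.sqrt(len(vocab))))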
Create tf.data.Dataset for training and evaluation
def get_dataset_from_csv(csv_file_path, shuffle=False, batch_size=128):
    def process(features):
        movie_ids_string = features["sequence_movie_ids"]
        sequence_movie_ids = tf.strings.split(movie_ids_string, ",").to_tensor()

        # The last movie id in the sequence is the target movie.
        features["target_movie_id"] = sequence_movie_ids[:, -1]
        features["sequence_movie_ids"] = sequence_movie_ids[:, :-1]

        ratings_string = features["sequence_ratings"]
        sequence_ratings = tf.strings.to_number(
            tf.strings.split(ratings_string, ","), tf.dtypes.float32
        ).to_tensor()

        # The last rating in the sequence is the target for the model to predict.
        target = sequence_ratings[:, -1]
        features["sequence_ratings"] = sequence_ratings[:, :-1]

        return features, target

    dataset = tf.data.experimental.make_csv_dataset(
        csv_file_path,
        batch_size=batch_size,
        column_names=CSV_HEADER,
        num_epochs=1,
        header=False,
        field_delim="|",
        shuffle=shuffle,
    ).map(process)

    return dataset
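Here is a minimal sketch (assuming train_data.csv from the previous step exists) to verify the pipeline's output shapes before training: with sequence_length = 4, each input sequence has length 3, and the last movie and rating become the prediction targets.

# Illustrative only: peek at one small batch.
peek_ds = get_dataset_from_csv("train_data.csv", shuffle=True, batch_size=2)
for features, target in peek_ds.take(1):
    print(features["sequence_movie_ids"].shape)  # (2, 3)
    print(features["target_movie_id"].shape)     # (2,)
    print(target.shape)                          # (2,)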
Create model inputs
def create_model_inputs():
    return {
        "user_id": keras.Input(name="user_id", shape=(1,), dtype="string"),
        "sequence_movie_ids": keras.Input(
            name="sequence_movie_ids", shape=(sequence_length - 1,), dtype="string"
        ),
        "target_movie_id": keras.Input(
            name="target_movie_id", shape=(1,), dtype="string"
        ),
        "sequence_ratings": keras.Input(
            name="sequence_ratings", shape=(sequence_length - 1,), dtype=tf.float32
        ),
        "sex": keras.Input(name="sex", shape=(1,), dtype="string"),
        "age_group": keras.Input(name="age_group", shape=(1,), dtype="string"),
        "occupation": keras.Input(name="occupation", shape=(1,), dtype="string"),
    }
Encode input features
The encode_input_features method works as follows:

1. Each categorical user feature is encoded using layers.Embedding, with an embedding dimension equal to the square root of the vocabulary size of the feature.
2. Each movie in the movie sequence, and the target movie, is encoded with layers.Embedding, where the dimension size is the square root of the number of movies.
3. A multi-hot genres vector for each movie is concatenated with its embedding vector, and processed using a non-linear layers.Dense to output a vector of the same movie embedding dimension.
4. A positional embedding is added to each movie embedding in the sequence, and then multiplied by its rating from the ratings sequence.
5. The target movie embedding is concatenated to the sequence movie embeddings, producing a tensor with the shape of [batch size, sequence length, embedding size], as expected by the attention layer of the transformer architecture.

The method returns a tuple of two elements: encoded_transformer_features and encoded_other_features.
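Before the full implementation below, here is a toy numpy sketch of step 4 (made-up numbers, illustrative shapes only): the positional embedding is added to each movie embedding, and the result is scaled by the corresponding rating.

import numpy as np

# Toy shapes: batch=1, sequence of 3 movies, embedding size 2.
movie_emb = np.ones((1, 3, 2))                   # encoded sequence movies
pos_emb = np.arange(6.0).reshape(3, 2)           # one embedding per position
ratings_seq = np.array([[[3.0], [4.0], [5.0]]])  # ratings expanded to (1, 3, 1)
updated = (movie_emb + pos_emb) * ratings_seq    # broadcasts over the batch
print(updated.shape)  # (1, 3, 2)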
def encode_input_features(
    inputs,
    include_user_id=True,
    include_user_features=True,
    include_movie_features=True,
):
    encoded_transformer_features = []
    encoded_other_features = []

    other_feature_names = []
    if include_user_id:
        other_feature_names.append("user_id")
    if include_user_features:
        other_feature_names.extend(USER_FEATURES)

    ## Encode user features
    for feature_name in other_feature_names:
        # Convert the string input values into integer indices.
        vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY[feature_name]
        idx = StringLookup(vocabulary=vocabulary, mask_token=None, num_oov_indices=0)(
            inputs[feature_name]
        )
        # Compute embedding dimensions
        embedding_dims = int(math.sqrt(len(vocabulary)))
        # Create an embedding layer with the specified dimensions.
        embedding_encoder = layers.Embedding(
            input_dim=len(vocabulary),
            output_dim=embedding_dims,
            name=f"{feature_name}_embedding",
        )
        # Convert the index values to embedding representations.
        encoded_other_features.append(embedding_encoder(idx))

    ## Create a single embedding vector for the user features
    if len(encoded_other_features) > 1:
        encoded_other_features = layers.concatenate(encoded_other_features)
    elif len(encoded_other_features) == 1:
        encoded_other_features = encoded_other_features[0]
    else:
        encoded_other_features = None

    ## Create a movie embedding encoder
    movie_vocabulary = CATEGORICAL_FEATURES_WITH_VOCABULARY["movie_id"]
    movie_embedding_dims = int(math.sqrt(len(movie_vocabulary)))
    # Create a lookup to convert string values to integer indices.
    movie_index_lookup = StringLookup(
        vocabulary=movie_vocabulary,
        mask_token=None,
        num_oov_indices=0,
        name="movie_index_lookup",
    )
    # Create an embedding layer with the specified dimensions.
    movie_embedding_encoder = layers.Embedding(
        input_dim=len(movie_vocabulary),
        output_dim=movie_embedding_dims,
        name="movie_embedding",
    )
    # Create a vector lookup for movie genres.
    genre_vectors = movies[genres].to_numpy()
    movie_genres_lookup = layers.Embedding(
        input_dim=genre_vectors.shape[0],
        output_dim=genre_vectors.shape[1],
        embeddings_initializer=keras.initializers.Constant(genre_vectors),
        trainable=False,
        name="genres_vector",
    )
    # Create a processing layer for genres.
    movie_embedding_processor = layers.Dense(
        units=movie_embedding_dims,
        activation="relu",
        name="process_movie_embedding_with_genres",
    )

    ## Define a function to encode a given movie id.
    def encode_movie(movie_id):
        # Convert the string input values into integer indices.
        movie_idx = movie_index_lookup(movie_id)
        movie_embedding = movie_embedding_encoder(movie_idx)
        encoded_movie = movie_embedding
        if include_movie_features:
            movie_genres_vector = movie_genres_lookup(movie_idx)
            encoded_movie = movie_embedding_processor(
                layers.concatenate([movie_embedding, movie_genres_vector])
            )
        return encoded_movie

    ## Encoding target_movie_id
    target_movie_id = inputs["target_movie_id"]
    encoded_target_movie = encode_movie(target_movie_id)

    ## Encoding sequence movie_ids.
    sequence_movies_ids = inputs["sequence_movie_ids"]
    encoded_sequence_movies = encode_movie(sequence_movies_ids)
    # Create positional embedding.
    position_embedding_encoder = layers.Embedding(
        input_dim=sequence_length,
        output_dim=movie_embedding_dims,
        name="position_embedding",
    )
    positions = tf.range(start=0, limit=sequence_length - 1, delta=1)
    encoded_positions = position_embedding_encoder(positions)
    # Retrieve sequence ratings to incorporate them into the encoding of the movie.
    sequence_ratings = inputs["sequence_ratings"]
    sequence_ratings = keras.ops.expand_dims(sequence_ratings, -1)
    # Add the positional encoding to the movie encodings and multiply them by rating.
    encoded_sequence_movies_with_position_and_rating = layers.Multiply()(
        [(encoded_sequence_movies + encoded_positions), sequence_ratings]
    )

    # Construct the transformer inputs.
    for i in range(sequence_length - 1):
        feature = encoded_sequence_movies_with_position_and_rating[:, i, ...]
        feature = keras.ops.expand_dims(feature, 1)
        encoded_transformer_features.append(feature)
    encoded_transformer_features.append(encoded_target_movie)
    encoded_transformer_features = layers.concatenate(
        encoded_transformer_features, axis=1
    )
    return encoded_transformer_features, encoded_other_features
Create the BST model
include_user_id = False
include_user_features = False
include_movie_features = False

hidden_units = [256, 128]
dropout_rate = 0.1
num_heads = 3


def create_model():
    inputs = create_model_inputs()
    transformer_features, other_features = encode_input_features(
        inputs, include_user_id, include_user_features, include_movie_features
    )

    # Create a multi-headed attention layer.
    attention_output = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=transformer_features.shape[2], dropout=dropout_rate
    )(transformer_features, transformer_features)

    # Transformer block.
    attention_output = layers.Dropout(dropout_rate)(attention_output)
    x1 = layers.Add()([transformer_features, attention_output])
    x1 = layers.LayerNormalization()(x1)
    x2 = layers.LeakyReLU()(x1)
    x2 = layers.Dense(units=x2.shape[-1])(x2)
    x2 = layers.Dropout(dropout_rate)(x2)
    transformer_features = layers.Add()([x1, x2])
    transformer_features = layers.LayerNormalization()(transformer_features)
    features = layers.Flatten()(transformer_features)

    # Include the other features.
    if other_features is not None:
        features = layers.concatenate(
            [features, layers.Reshape([other_features.shape[-1]])(other_features)]
        )

    # Fully-connected layers.
    for num_units in hidden_units:
        features = layers.Dense(num_units)(features)
        features = layers.BatchNormalization()(features)
        features = layers.LeakyReLU()(features)
        features = layers.Dropout(dropout_rate)(features)

    outputs = layers.Dense(units=1)(features)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model


model = create_model()
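As an optional sanity check before training (illustrative), building the model and printing a summary confirms that the inputs and the transformer block are wired together as described:

# Optional: confirm the model builds and inspect parameter counts.
model.summary()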
Run the training and evaluation experiment
# Compile the model.
model.compile(
    optimizer=keras.optimizers.Adagrad(learning_rate=0.01),
    loss=keras.losses.MeanSquaredError(),
    metrics=[keras.metrics.MeanAbsoluteError()],
)

# Read the training data.
train_dataset = get_dataset_from_csv("train_data.csv", shuffle=True, batch_size=265)

# Fit the model with the training data.
model.fit(train_dataset, epochs=5)

# Read the test data.
test_dataset = get_dataset_from_csv("test_data.csv", batch_size=265)

# Evaluate the model on the test data.
_, mae = model.evaluate(test_dataset, verbose=0)
print(f"Test MAE: {round(mae, 3)}")
Epoch 1/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 19s 11ms/step - loss: 1.5762 - mean_absolute_error: 0.9892
Epoch 2/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 17s 11ms/step - loss: 1.1263 - mean_absolute_error: 0.8502
Epoch 3/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 17s 11ms/step - loss: 1.0885 - mean_absolute_error: 0.8361
Epoch 4/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 17s 11ms/step - loss: 1.0943 - mean_absolute_error: 0.8388
Epoch 5/5
1600/1600 ━━━━━━━━━━━━━━━━━━━━ 17s 10ms/step - loss: 1.0360 - mean_absolute_error: 0.8142
Test MAE: 0.782
You should achieve a mean absolute error (MAE) at or around 0.7 on the test data.