Text Matching: An Intelligent Q&A Demo

Preface

This is just a record of my learning process; questions and discussion are welcome.

Steps:
* 1. Input a question
* 2. Match it against the question base (base resources, FAQ)
* 3. Return the answer
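
As a minimal sketch of these three steps (the FAQ dict and the similarity function below are made-up placeholders; the rest of this post is about choosing a real similarity function and a real matching model):

# toy pipeline: question in, best-matching FAQ entry's answer out
def similarity(a, b):
    # placeholder: character-overlap ratio, for illustration only
    return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

def answer(query, faq):
    # faq maps each standard question to its answer
    best_question = max(faq, key=lambda q: similarity(query, q))
    return faq[best_question]

faq = {"how do I check my balance": "Dial the service number to check your balance.",
       "how do I cancel my plan": "Cancellation can be done on the service page."}
print(answer("check my balance", faq))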

Text matching algorithms:

  • Edit distance (drawbacks; see the sketch after this list)

    • characters carry no semantic similarity;
      heavily affected by irrelevant words / stop words;
      heavily affected by word order
  • Jaccard similarity (intersection of the elements / union of the elements)

  • Word vectors (window-based; solves the semantic-similarity problem; text becomes numbers, and the cosine between vectors judges similarity)

  • Deep learning, representation-based (a good fit for question-to-question matching: both sides are questions, so vectorizing them is straightforward)
    The two sentences go through the same encoder; a semantically similar pair gets score = 1, much like binary classification

    • Triplet loss:
      pulls samples with the same label close together in embedding space (anchor close to positive, far from negative)
    • loss = max(D(a, p) - D(a, n) + margin, 0)
    • Advantage: the trained model can precompute vectors for every question in the knowledge base; at query time only the input text needs to be vectorized, once
    • Drawback: during vectorization the model has no notion of which parts of the text matter
  • Deep learning, interaction-based

    • the two samples are concatenated into one input, and attention decides whether they match (the Q and A are concatenated and learned jointly)
    • Advantage: by comparing the two directly, the model grasps what each sentence is about
    • Drawback: every similarity computation requires both inputs
  • Contrastive learning

    • take one sample and alter it with a transformation that preserves its meaning, yielding two similar samples; run both through a BERT encoder with pooling
  • Searching massive vector collections:

    • use an existing open-source library such as Faiss or Pinecone (see the sketch after this list)
    • avoid full traversal, i.e. computing the distance to every vector (partition the space with a KD-tree or k-means)
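
To make the first three metrics concrete, here is a small self-contained sketch (the word vectors are random stand-ins for trained embeddings, for illustration only):

import numpy as np

# classic dynamic-programming Levenshtein (edit) distance
def edit_distance(s1, s2):
    dp = [[i + j if i * j == 0 else 0 for j in range(len(s2) + 1)] for i in range(len(s1) + 1)]
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]))
    return dp[-1][-1]

# Jaccard similarity: intersection over union of the character sets
def jaccard(s1, s2):
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b)

# cosine similarity between two vectors
def cos_sim(v1, v2):
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

print(edit_distance("kitten", "sitting"))  # 3
print(jaccard("apple pie", "apple tart"))  # 0.625
v1, v2 = np.random.rand(8), np.random.rand(8)  # stand-ins for trained word vectors
print(cos_sim(v1, v2))

And a minimal sketch of the vector-search idea with Faiss (this assumes the faiss package is installed; IndexIVFFlat partitions the vectors with k-means, so a query probes only a few clusters instead of all n vectors):

import numpy as np
import faiss

d, n = 128, 10000
xb = np.random.rand(n, d).astype('float32')  # stand-in knowledge-base vectors
faiss.normalize_L2(xb)                       # normalize so inner product = cosine

quantizer = faiss.IndexFlatIP(d)             # coarse quantizer for the clusters
index = faiss.IndexIVFFlat(quantizer, d, 100, faiss.METRIC_INNER_PRODUCT)
index.train(xb)                              # k-means over the database vectors
index.add(xb)
index.nprobe = 10                            # number of clusters probed per query

xq = np.random.rand(1, d).astype('float32')  # query vector
faiss.normalize_L2(xq)
scores, ids = index.search(xq, 5)            # top-5 most similar questions
print(ids, scores)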

Code

Implementing an intelligent Q&A demo
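
Reading ahead to the loader code, the demo expects three data files in roughly the following shapes (the concrete values here are made-up placeholders, inferred from how loader.py parses each file):

# data.json: one JSON object per line (the knowledge base)
{"target": "查話費", "questions": ["我想查話費", "話費還剩多少", "幫我查下余額"]}

# valid.json: one [question, target] pair per line
["查一下我的話費", "查話費"]

# schema.json: a single JSON object mapping each target to a label index
{"查話費": 0, "辦流量包": 1}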

"""
配置參數(shù)信息
"""
Config = {"model_path": "./output/","model_name": "model.pt","schema_path": r"D:\NLP\video\第八周\week8 文本匹配問題\data\schema.json","train_data_path": r"D:\NLP\video\第八周\week8 文本匹配問題\data\data.json","valid_data_path": r"D:\NLP\video\第八周\week8 文本匹配問題\data\valid.json","vocab_path": r"D:\NLP\video\第七周\data\vocab.txt","model_type": "rnn",# 正樣本比例"positive_sample_rate": 0.5,"use_bert": False,# 文本向量大小"char_dim": 32,# 文本長度"max_len": 20,# 詞向量大小"hidden_size": 128,# 訓練 輪數(shù)"epoch_size": 15,# 批量大小"batch_size": 32,# 訓練集大小"simple_size": 300,# 學習率"lr": 1e-3,# dropout"dropout": 0.5,# 優(yōu)化器"optimizer": "adam",# 卷積核"kernel_size": 3,# 最大池 or 平均池"pooling_style": "max",# 模型層數(shù)"num_layers": 2,"bert_model_path": r"D:\NLP\video\第六周\bert-base-chinese",# 輸出層大小"output_size": 2,# 隨機數(shù)種子"seed": 987
}

loader.py: loads the data files (imported as loader in main.py)

"""
數(shù)據(jù)加載
"""
import json
from collections import defaultdict
import randomimport torch
import torch.utils.data as Data
from torch.utils.data import DataLoader
from transformers import BertTokenizer# 獲取字表集
def load_vocab(path):vocab = {}with open(path, 'r', encoding='utf-8') as f:for index, line in enumerate(f):word = line.strip()# 0留給padding位置,所以從1開始vocab[word] = index + 1vocab['unk'] = len(vocab) + 1return vocab# 數(shù)據(jù)預處理 裁剪or填充
def padding(input_ids, length):if len(input_ids) >= length:return input_ids[:length]else:padded_input_ids = input_ids + [0] * (length - len(input_ids))return padded_input_ids# 文本預處理
# 轉化為向量
def sentence_to_index(text, length, vocab):input_ids = []for char in text:input_ids.append(vocab.get(char, vocab['unk']))# 填充or裁剪input_ids = padding(input_ids, length)return input_idsclass DataGenerator:def __init__(self, data_path, config):# 加載json數(shù)據(jù)self.load_know_base(config["train_data_path"])# 加載schema 相當于答案集self.schema = self.load_schema(config["schema_path"])self.data_path = data_pathself.config = configif self.config["model_type"] == "bert":self.tokenizer = BertTokenizer.from_pretrained(config["bert_model_path"])self.vocab = load_vocab(config["vocab_path"])self.config["vocab_size"] = len(self.vocab)self.train_flag = Noneself.load_data()def __len__(self):if self.train_flag:return self.config["simple_size"]else:return len(self.data)# 這里需要返回隨機的樣本def __getitem__(self, idx):if self.train_flag:# return self.random_train_sample()  # 隨機生成一個訓練樣本# triplet loss:return self.random_train_sample_for_triplet_loss()else:return self.data[idx]# 針對獲取的文本 load_know_base = {target : [questions]} 做處理# 傳入兩個樣本 正樣本為相同target數(shù)據(jù) 負樣本為不同target數(shù)據(jù)# 訓練集和驗證集不一致def load_data(self):self.train_flag = self.config["train_flag"]dataset_x = []dataset_y = []self.knwb = defaultdict(list)if self.train_flag:for target, questions in self.target_to_questions.items():for question in questions:input_id = sentence_to_index(question, self.config["max_len"], self.vocab)input_id = torch.LongTensor(input_id)# self.schema[target] 下標 把每個question轉化為向量append放入一個target下self.knwb[self.schema[target]].append(input_id)else:with open(self.data_path, encoding="utf8") as f:for line in f:line = json.loads(line)assert isinstance(line, list)question, target = lineinput_id = sentence_to_index(question, self.config["max_len"], self.vocab)# input_id = torch.LongTensor(input_id)label_index = torch.LongTensor([self.schema[target]])# self.data.append([input_id, label_index])dataset_x.append(input_id)dataset_y.append(label_index)self.data = Data.TensorDataset(torch.tensor(dataset_x), torch.tensor(dataset_y))return# 加載知識庫def load_know_base(self, know_base_path):self.target_to_questions = {}with open(know_base_path, encoding="utf8") as f:for index, line in enumerate(f):content = json.loads(line)questions = content["questions"]target = content["target"]self.target_to_questions[target] = questionsreturn# 加載schema 相當于答案集def load_schema(self, param):with open(param, encoding="utf8") as f:return json.loads(f.read())# 訓練集隨機生成一個樣本# 依照一定概率生成負樣本或正樣本# 負樣本從隨機兩個不同的標準問題中各隨機選取一個# 正樣本從隨機一個標準問題中隨機選取兩個def random_train_sample(self):target = random.choice(list(self.knwb.keys()))# 隨機正樣本:# 隨機正樣本if random.random() <= self.config["positive_sample_rate"]:if len(self.knwb[target]) <= 1:return self.random_train_sample()else:question1 = random.choice(self.knwb[target])question2 = random.choice(self.knwb[target])# 一組# dataset_x.append([question1, question2])# # 二分類任務 同一組的question target = 1# dataset_y.append([1])return [question1, question2, torch.LongTensor([1])]else:# 隨機負樣本:p, n = random.sample(list(self.knwb.keys()), 2)question1 = random.choice(self.knwb[p])question2 = random.choice(self.knwb[n])# dataset_x.append([question1, question2])# dataset_y.append([-1])return [question1, question2, torch.LongTensor([-1])]# triplet_loss隨機生成3個樣本 錨樣本A, 正樣本P, 負樣本Ndef random_train_sample_for_triplet_loss(self):target = random.choice(list(self.knwb.keys()))# question1錨樣本 question2為同一個target下的正樣本 question3 為其他target下樣本question1 = random.choice(self.knwb[target])question2 = random.choice(self.knwb[target])question3 = 
random.choice(self.knwb[random.choice(list(self.knwb.keys()))])return [question1, question2, question3]# 用torch自帶的DataLoader類封裝數(shù)據(jù)
def load_data_batch(data_path, config, shuffle=True):dg = DataGenerator(data_path, config)if config["train_flag"]:dl = DataLoader(dg, batch_size=config["batch_size"], shuffle=shuffle)else:dl = DataLoader(dg.data, batch_size=config["batch_size"], shuffle=shuffle)return dlif __name__ == '__main__':from config import ConfigConfig["train_flag"] = True# dg = DataGenerator(Config["train_data_path"], Config)dataset = load_data_batch(Config["train_data_path"], Config)# print(len(dg))# print(dg[0])for index, dataset in enumerate(dataset):input_id1, input_id2, input_id3 = datasetprint(input_id1)print(input_id2)print(input_id3)

main.py: training entry point

import torch
import os
import random
import numpy as np
import logging
from config import Config
from model import TorchModel, choose_optimizer, SiameseNetwork
from loader import load_data_batch
from evaluate import Evaluator

# log levels: [DEBUG, INFO, WARNING, ERROR, CRITICAL]
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

"""
Model training entry point
"""

# Fix the random seeds so a run can be reproduced (avoids randomness between runs)
seed = Config["seed"]
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)


def main(config):
    # directory for saved models
    if not os.path.isdir(config["model_path"]):
        os.mkdir(config["model_path"])
    # load the data
    dataset = load_data_batch(config["train_data_path"], config)
    # build the model
    model = SiameseNetwork(config)
    # move to gpu if available
    if torch.cuda.is_available():
        logger.info("gpu available, moving model to gpu")
        model.cuda()
    # choose the optimizer
    optim = choose_optimizer(config, model)
    # evaluation helper
    evaluator = Evaluator(config, model, logger)
    for epoch in range(config["epoch_size"]):
        epoch += 1
        logger.info("epoch %d begin" % epoch)
        epoch_loss = []
        # train the model
        model.train()
        for batch_data in dataset:
            if torch.cuda.is_available():
                batch_data = [d.cuda() for d in batch_data]
            optim.zero_grad()
            s1, s2, s3 = batch_data  # adjust here if the model takes different inputs/outputs
            # compute the loss
            loss = model(s1, s2, s3)
            # backpropagate the gradients
            loss.backward()
            # let the optimizer update the model
            optim.step()
            # record the loss
            epoch_loss.append(loss.item())
        logger.info("epoch average loss: %f" % np.mean(epoch_loss))
        # evaluate the model
        acc = evaluator.eval(epoch)
        # the file name combines model_type, model_path and epoch
        model_path = os.path.join(config["model_path"], "epoch_%d_%s.pth" % (epoch, config["model_type"]))
        torch.save(model.state_dict(), model_path)  # save the model weights
    return


if __name__ == "__main__":
    Config["train_flag"] = True
    main(Config)

    # Compare several models; intermediate logging can be switched off to reduce output.
    # Grid search over hyperparameters, e.g.:
    # for model in ["gated_cnn"]:
    #     Config["model_type"] = model
    #     for lr in [1e-3, 1e-4]:
    #         Config["learning_rate"] = lr
    #         for hidden_size in [128]:
    #             Config["hidden_size"] = hidden_size
    #             for batch_size in [64, 128]:
    #                 Config["batch_size"] = batch_size
    #                 for pooling_style in ["avg"]:
    #                     Config["pooling_style"] = pooling_style
    #                     # results can be written to a file for easier inspection
    #                     print("last-epoch accuracy:", main(Config), "config:", Config)

evaluate.py: model evaluation

"""
模型效果測試
"""
import torch
from loader import load_data_batchclass Evaluator:def __init__(self, config, model, logger):self.config = configself.model = modelself.logger = logger# 選擇驗證集合config['train_flag'] = Falseself.valid_data = load_data_batch(config["valid_data_path"], config, shuffle=False)config['train_flag'] = Trueself.train_data = load_data_batch(config["train_data_path"], config)self.stats_dict = {"correct": 0, "wrong": 0}  # 用于存儲測試結果def eval(self, epoch):self.logger.info("開始測試第%d輪模型效果:" % epoch)self.stats_dict = {"correct": 0, "wrong": 0}  # 清空前一輪的測試結果self.model.eval()self.knwb_to_vector()for index, batch_data in enumerate(self.valid_data):if torch.cuda.is_available():batch_data = [d.cuda() for d in batch_data]input_id, labels = batch_data  # 輸入變化時這里需要修改,比如多輸入,多輸出的情況with torch.no_grad():test_question_vectors = self.model(input_id)  # 不輸入labels,使用模型當前參數(shù)進行預測self.write_stats(test_question_vectors, labels)self.show_stats()returndef write_stats(self, test_question_vectors, labels):assert len(labels) == len(test_question_vectors)for test_question_vector, label in zip(test_question_vectors, labels):# 通過一次矩陣乘法,計算輸入問題和知識庫中所有問題的相似度# test_question_vector shape [vec_size]   knwb_vectors shape = [n, vec_size]res = torch.mm(test_question_vector.unsqueeze(0), self.knwb_vectors.T)hit_index = int(torch.argmax(res.squeeze()))  # 命中問題標號hit_index = self.question_index_to_standard_question_index[hit_index]  # 轉化成標準問編號if int(hit_index) == int(label):self.stats_dict["correct"] += 1else:self.stats_dict["wrong"] += 1return# 將知識庫中的問題向量化,為匹配做準備# 每輪訓練的模型參數(shù)不一樣,生成的向量也不一樣,所以需要每輪測試都重新進行向量化def knwb_to_vector(self):self.question_index_to_standard_question_index = {}self.question_ids = []for standard_question_index, question_ids in self.train_data.dataset.knwb.items():for question_id in question_ids:# 記錄問題編號到標準問題標號的映射,用來確認答案是否正確self.question_index_to_standard_question_index[len(self.question_ids)] = standard_question_indexself.question_ids.append(question_id)with torch.no_grad():question_matrixs = torch.stack(self.question_ids, dim=0)if torch.cuda.is_available():question_matrixs = question_matrixs.cuda()self.knwb_vectors = self.model(question_matrixs)# 將所有向量都作歸一化 v / |v|self.knwb_vectors = torch.nn.functional.normalize(self.knwb_vectors, dim=-1)returndef show_stats(self):correct = self.stats_dict["correct"]wrong = self.stats_dict["wrong"]self.logger.info("預測集合條目總量:%d" % (correct + wrong))self.logger.info("預測正確條目:%d,預測錯誤條目:%d" % (correct, wrong))self.logger.info("預測準確率:%f" % (correct / (correct + wrong)))self.logger.info("--------------------")return correct / (correct + wrong)
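
One detail worth noting: knwb_to_vector L2-normalizes the knowledge-base vectors, so the single torch.mm in write_stats ranks candidates by cosine similarity; the query vector's norm is a constant factor and does not change the argmax. A minimal check of that equivalence:

import torch

a = torch.nn.functional.normalize(torch.randn(5, 8), dim=-1)  # normalized knowledge-base vectors
q = torch.randn(1, 8)                                         # unnormalized query vector
by_dot = torch.argmax(torch.mm(q, a.T))
by_cos = torch.argmax(torch.nn.functional.cosine_similarity(q, a))
assert by_dot == by_cos  # same winner: the query norm scales all scores equally
print(int(by_dot))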

model.py: network structures

import torch
import torch.nn as nn
from torch.optim import Adam, SGD
from transformers import BertModel

"""
Network model structures
"""


class TorchModel(nn.Module):
    def __init__(self, config):
        super(TorchModel, self).__init__()
        hidden_size = config["hidden_size"]
        vocab_size = config["vocab_size"] + 1
        output_size = config["output_size"]
        model_type = config["model_type"]
        num_layers = config["num_layers"]
        self.use_bert = config["use_bert"]
        self.emb = nn.Embedding(vocab_size + 1, hidden_size, padding_idx=0)
        if model_type == 'rnn':
            self.encoder = nn.RNN(input_size=hidden_size, hidden_size=hidden_size, num_layers=num_layers,
                                  batch_first=True)
        elif model_type == 'lstm':
            # for a bidirectional lstm the output would be hidden_size * 2 (set num_layers to 2)
            self.encoder = nn.LSTM(hidden_size, hidden_size, num_layers=num_layers)
        elif self.use_bert:
            self.encoder = BertModel.from_pretrained(config["bert_model_path"])
            # use the pretrained model's hidden_size
            hidden_size = self.encoder.config.hidden_size
        elif model_type == 'cnn':
            self.encoder = CNN(config)
        elif model_type == "gated_cnn":
            self.encoder = GatedCNN(config)
        elif model_type == "bert_lstm":
            self.encoder = BertLSTM(config)
            # use the pretrained model's hidden_size
            hidden_size = self.encoder.bert.config.hidden_size
        self.classify = nn.Linear(hidden_size, output_size)
        self.pooling_style = config["pooling_style"]
        self.loss = nn.functional.cross_entropy  # cross-entropy loss

    def forward(self, x, y=None):
        if self.use_bert:
            # input x: [batch_size, seq_len]
            # bert returns (sequence_output, pooler_output)
            # sequence_output: batch_size, max_len, hidden_size
            # pooler_output: batch_size, hidden_size
            x = self.encoder(x)[0]
        else:
            x = self.emb(x)
            x = self.encoder(x)
        # rnn-style encoders return a tuple
        if isinstance(x, tuple):
            x = x[0]
        # pooling layer (x.shape[1] is the sequence length)
        if self.pooling_style == "max":
            pooling_layer = nn.MaxPool1d(x.shape[1])
        else:
            pooling_layer = nn.AvgPool1d(x.shape[1])
        x = pooling_layer(x.transpose(1, 2)).squeeze()
        y_pred = self.classify(x)
        if y is not None:
            return self.loss(y_pred, y.squeeze())
        else:
            return y_pred


# Siamese network (computes the similarity between two sentences)
class SiameseNetwork(nn.Module):
    def __init__(self, config):
        super(SiameseNetwork, self).__init__()
        self.sentence_encoder = TorchModel(config)
        # cosine-based loss
        # self.loss = nn.CosineEmbeddingLoss()
        # use triplet loss instead
        self.triplet_loss = self.cosine_triplet_loss

    # Cosine distance: 1 - cos(a, b)
    # cos = 1 means identical vectors (distance 0); cos = 0 means orthogonal vectors (distance 1)
    def cosine_distance(self, tensor1, tensor2):
        tensor1 = torch.nn.functional.normalize(tensor1, dim=-1)
        tensor2 = torch.nn.functional.normalize(tensor2, dim=-1)
        cosine = torch.sum(torch.mul(tensor1, tensor2), axis=-1)
        return 1 - cosine

    # Three samples: anchor and positive from one class, negative from another
    def cosine_triplet_loss(self, a, p, n, margin=None):
        ap = self.cosine_distance(a, p)
        an = self.cosine_distance(a, n)
        if margin is None:
            diff = ap - an + 0.1
        else:
            diff = ap - an + margin.squeeze()
        return torch.mean(diff[diff.gt(0)])  # keep only the terms greater than 0

    # forward using triplet loss
    def forward(self, sentence1, sentence2=None, sentence3=None, margin=None):
        vector1 = self.sentence_encoder(sentence1)
        # up to three samples can be passed at once
        if sentence2 is None:
            if sentence3 is None:
                return vector1
            # cosine distance to the third sentence
            else:
                vector3 = self.sentence_encoder(sentence3)
                return self.cosine_distance(vector1, vector3)
        else:
            vector2 = self.sentence_encoder(sentence2)
            if sentence3 is None:
                return self.cosine_distance(vector1, vector2)
            else:
                vector3 = self.sentence_encoder(sentence3)
                return self.triplet_loss(vector1, vector2, vector3, margin)

    # forward for CosineEmbeddingLoss
    # def forward(self, sentence1, sentence2=None, target=None):
    #     # two sentences passed together
    #     if sentence2 is not None:
    #         vector1 = self.sentence_encoder(sentence1)  # vec: (batch_size, hidden_size)
    #         vector2 = self.sentence_encoder(sentence2)
    #         # with a label, compute the loss
    #         if target is not None:
    #             return self.loss(vector1, vector2, target.squeeze())
    #         # without a label, compute the cosine distance
    #         else:
    #             return self.cosine_distance(vector1, vector2)
    #     # a single sentence means the model is being used for vectorization
    #     else:
    #         return self.sentence_encoder(sentence1)


# Optimizer selection
def choose_optimizer(config, model):
    optimizer = config["optimizer"]
    learning_rate = config["lr"]
    if optimizer == "adam":
        return Adam(model.parameters(), lr=learning_rate)
    elif optimizer == "sgd":
        return SGD(model.parameters(), lr=learning_rate)


# CNN model
class CNN(nn.Module):
    def __init__(self, config):
        super(CNN, self).__init__()
        hidden_size = config["hidden_size"]
        kernel_size = config["kernel_size"]
        pad = int((kernel_size - 1) / 2)
        self.cnn = nn.Conv1d(hidden_size, hidden_size, kernel_size, bias=False, padding=pad)

    def forward(self, x):  # x: (batch_size, max_len, embedding_size)
        return self.cnn(x.transpose(1, 2)).transpose(1, 2)


# GatedCNN model
class GatedCNN(nn.Module):
    def __init__(self, config):
        super(GatedCNN, self).__init__()
        self.cnn = CNN(config)
        self.gate = CNN(config)

    # one more sigmoid than a plain cnn, then the two branches are multiplied
    def forward(self, x):
        a = self.cnn(x)
        b = self.gate(x)
        b = torch.sigmoid(b)
        return torch.mul(a, b)


# BERT-LSTM model
class BertLSTM(nn.Module):
    def __init__(self, config):
        super(BertLSTM, self).__init__()
        self.bert = BertModel.from_pretrained(config["bert_model_path"], return_dict=False)
        self.rnn = nn.LSTM(self.bert.config.hidden_size, self.bert.config.hidden_size, batch_first=True)

    def forward(self, x):
        x = self.bert(x)[0]
        x, _ = self.rnn(x)
        return x


if __name__ == "__main__":
    from config import Config
    Config["vocab_size"] = 10
    Config["max_length"] = 4
    model = SiameseNetwork(Config)
    s1 = torch.LongTensor([[1, 2, 3, 0], [2, 2, 0, 0]])
    s2 = torch.LongTensor([[1, 2, 3, 4], [3, 2, 3, 4]])
    l = torch.LongTensor([[1], [0]])
    y = model(s1, s2, l)  # with the triplet forward, l is treated as a third input
    print(y)
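
To close the loop, here is a hedged sketch of a prediction step that reuses the evaluator's matching logic (the file name predict.py, the checkpoint name, and the sample query are assumptions, not part of the original code):

# predict.py: answer a new question with a trained model (sketch)
import torch
from config import Config
from model import SiameseNetwork
from loader import DataGenerator, sentence_to_index

Config["train_flag"] = True
dg = DataGenerator(Config["train_data_path"], Config)  # builds dg.knwb
model = SiameseNetwork(Config)
model.load_state_dict(torch.load("./output/epoch_15_rnn.pth"))  # assumed checkpoint name
model.eval()

# vectorize the knowledge base once, as in Evaluator.knwb_to_vector
index_to_target, question_ids = {}, []
for target_index, ids in dg.knwb.items():
    for qid in ids:
        index_to_target[len(question_ids)] = target_index
        question_ids.append(qid)
with torch.no_grad():
    knwb_vectors = torch.nn.functional.normalize(model(torch.stack(question_ids)), dim=-1)

query = "我想查一下話費"  # made-up query
input_id = torch.LongTensor([sentence_to_index(query, Config["max_len"], dg.vocab)])
with torch.no_grad():
    query_vector = model(input_id).view(1, -1)  # a single sample may be squeezed to 1d
hit = int(torch.argmax(torch.mm(query_vector, knwb_vectors.T)))
print("matched standard question index:", index_to_target[hit])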
