news 2025/7/12 19:31:49

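Table of Contents

Natural Language Processing with Gensim: Building and Saving Models
- Gensim basics
- 1. Module imports
- 2. Internal parameters
- 3. Main entry point (`if __name__ == '__main__':`)
- 4. Loading the corpus mappings
- 5. Loading and preprocessing the corpus
- 6. Choosing the training method from the method argument
- 7. Saving the model and the transformed corpus
- 8. Code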

Natural Language Processing with Gensim: Building and Saving Models

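Gensim basics

Gensim is a Python library for topic modeling and similarity retrieval on large text collections.
MmCorpus is gensim's corpus class for the Matrix Market format, which stores large sparse matrices on disk and streams them one document at a time, so it suits datasets that do not fit in memory.
TF-IDF is a common text representation that reweights raw term frequencies so that terms concentrated in a few documents stand out over terms that occur everywhere.
LSI, LDA, and RP are dimensionality-reduction or topic-extraction methods, commonly used in information retrieval, text classification, and clustering.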
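
To make the TF-IDF idea concrete, here is a minimal pure-Python sketch. It is not gensim's implementation: the helper name `tfidf_weights`, the toy corpus, and the choice of `tf * log2(N / df)` followed by L2 normalization are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_weights(bow_corpus):
    """Return TF-IDF vectors for a corpus of bag-of-words documents.

    Each document is a list of (term_id, count) pairs, with each term
    listed at most once per document. The weighting tf * log2(N / df)
    with L2 normalization is an illustrative choice.
    """
    num_docs = len(bow_corpus)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term_id for doc in bow_corpus for term_id, _ in doc)

    weighted = []
    for doc in bow_corpus:
        vec = [(t, count * math.log2(num_docs / doc_freq[t])) for t, count in doc]
        # Drop zero weights (terms that occur in every document).
        vec = [(t, w) for t, w in vec if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted


# Toy corpus: term 0 occurs in every document, terms 1 and 2 in one each.
corpus = [[(0, 2), (1, 1)], [(0, 1), (2, 3)], [(0, 4)]]
vectors = tfidf_weights(corpus)
```

In this toy corpus, term 0 appears in all three documents, so its weight is zero and it drops out of every vector; the third document, which contains only term 0, ends up empty.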

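The script in section 8 uses gensim to generate a topic model: given a language and a method on the command line, it trains a model on the corresponding text corpus and saves the trained model to a file. The core logic is analyzed step by step in the sections that follow.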

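1. Module imports

logging is imported to record run-time log messages.
sys is imported to read the command-line arguments and the program name.
os.path is imported for file-path operations.
From gensim.corpora, the script imports dmlcorpus (a module for loading corpora in a specific format) and MmCorpus (a class that stores the document-term matrix in sparse form).
From gensim.models, it imports four modules: lsimodel, ldamodel, tfidfmodel, and rpmodel, which correspond to Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), the TF-IDF transformation, and Random Projections (RP).
The script also imports gensim_build, a companion module from which it reads PREFIX and RESULT_DIR.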

2. Internal parameters

DIM_RP, DIM_LSI, and DIM_LDA set the dimensionality of the RP, LSI, and LDA models (300, 200, and 100 in this script).

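3. Main entry point (`if __name__ == '__main__':`)

Configures the log output format and sets the log level to INFO.
Checks the number of input arguments: a language and a method are both required, so `len(sys.argv)` must be at least 3. Otherwise the script prints its usage message and exits with status 1.
Reads the language and method arguments; the method is stripped of surrounding whitespace and lowercased.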
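
The argument handling described above can be sketched as a small function, which makes it testable outside the script. `parse_args` is a hypothetical helper, not part of the original script, and it returns `None` instead of calling `sys.exit` so that the caller decides how to exit.

```python
import logging

# Same log format as the script; the level is passed directly to basicConfig.
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO,
)


def parse_args(argv):
    """Return (language, method) from an argv-style list, or None if too short."""
    if len(argv) < 3:
        return None
    language = argv[1]
    method = argv[2].strip().lower()  # normalize whitespace and case
    return language, method


parsed = parse_args(["gensim_genmodel.py", "any", " LSI "])
logging.info("parsed arguments: %s", parsed)
```

Normalizing the method name once, at the point where it is read, means later comparisons such as `method == 'lsi'` do not have to worry about case or stray spaces.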

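4. Loading the corpus mappings

Creates a DmlConfig object from the language argument. This object holds the corpus configuration, such as the directory where results are stored.
Loads the vocabulary from the wordids.txt file into the id2word dictionary, so that later model-building steps can map word IDs back to the actual words.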
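
As a rough picture of what loading a word-id mapping involves, the sketch below parses a small in-memory file into a dictionary. The file layout (one `id<TAB>word` pair per line) is an assumption, as are the helper name `load_word_ids` and the sample contents; the actual loading in the script is done by `DmlCorpus.loadDictionary`.

```python
import io


def load_word_ids(lines):
    """Parse 'id<TAB>word' lines into an {id: word} dictionary."""
    id2word = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word_id, word = line.split("\t", 1)
        id2word[int(word_id)] = word
    return id2word


# In-memory stand-in for a wordids.txt file (contents are made up).
sample = io.StringIO("0\tcorpus\n1\tmodel\n2\ttopic\n")
id2word = load_word_ids(sample)
```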

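5. Loading and preprocessing the corpus

Loads the bow.mm file with MmCorpus. The file is in Matrix Market format (a plain-text format, not binary) and stores the document-term matrix; each document is a sparse bag-of-words vector. The script applies no further preprocessing at this step.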
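
To show what a Matrix Market coordinate file looks like, here is a small reader written with the standard library only. It is a sketch, not a replacement for MmCorpus: the function name `read_mm` and the sample data are made up, and unlike MmCorpus it loads the whole matrix into memory instead of streaming it.

```python
import io


def read_mm(lines):
    """Read a Matrix Market coordinate file into a list of sparse documents.

    Rows are documents and columns are terms. Each document becomes a
    list of (term_id, value) pairs with 0-based term ids.
    """
    stripped = (ln.strip() for ln in lines)
    # Skip blank lines, the '%%MatrixMarket' banner and '%' comments.
    data = [ln for ln in stripped if ln and not ln.startswith("%")]
    # First data line: number of rows, number of columns, number of non-zeros.
    num_docs, num_terms, num_nnz = map(int, data[0].split())
    docs = [[] for _ in range(num_docs)]
    for entry in data[1:]:
        doc_id, term_id, value = entry.split()
        # Matrix Market indices are 1-based; convert to 0-based.
        docs[int(doc_id) - 1].append((int(term_id) - 1, float(value)))
    return docs, num_terms


sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "3 4 4\n"
    "1 1 2.0\n"
    "1 3 1.0\n"
    "2 2 1.0\n"
    "3 4 5.0\n"
)
docs, num_terms = read_mm(sample)
```

Only the non-zero entries are stored, which is why the format works well for document-term matrices, where each document uses a tiny fraction of the vocabulary.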

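6. Choosing the training method from the method argument

If the method is 'tfidf', the script trains and saves a TF-IDF model, which weights raw term counts by an inverse document frequency factor.
If the method is 'lda', it trains an LDA model, a probabilistic topic model that extracts topic structure through document-topic and topic-word distributions.
If the method is 'lsi', it first fits a TF-IDF model and then trains an LSI model on the TF-IDF-transformed corpus. LSI is a linear-algebra method, based on a truncated singular value decomposition, that finds a latent topic space in the text.
If the method is 'rp', it likewise converts the corpus to its TF-IDF representation first and then trains an RP model, which reduces dimensionality by random projection.
Any other method raises a ValueError.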
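
The branching above can also be written as a lookup table. The sketch below does no training at all; it only returns a description of the steps each method would run, to make the two-stage structure of 'lsi' and 'rp' explicit. `METHODS` and `plan_training` are hypothetical names, and the dimensions mirror DIM_LDA, DIM_LSI, and DIM_RP from the script.

```python
# Method name -> (needs a TF-IDF pass first, target dimensionality or None).
METHODS = {
    "tfidf": (False, None),
    "lda": (False, 100),
    "lsi": (True, 200),
    "rp": (True, 300),
}


def plan_training(method):
    """Describe the training steps for a method; raise ValueError if unknown."""
    if method not in METHODS:
        raise ValueError("unknown topic extraction method: %r" % method)
    needs_tfidf, dims = METHODS[method]
    steps = []
    if needs_tfidf:
        # Two-stage methods: weight the counts first, then reduce dimensions.
        steps.append("fit tfidf on bag-of-words corpus")
        steps.append("fit %s (%d dims) on tfidf-transformed corpus" % (method, dims))
    elif dims is None:
        # The TF-IDF model itself has no target dimensionality.
        steps.append("fit tfidf on bag-of-words corpus")
    else:
        steps.append("fit %s (%d dims) on bag-of-words corpus" % (method, dims))
    return steps


lsi_plan = plan_training("lsi")
lda_plan = plan_training("lda")
```

A table like this keeps the list of supported methods in one place, so the usage message and the error for unknown methods can be derived from the same source.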

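7. Saving the model and the transformed corpus

Each branch saves its trained model to a file in the results directory (model_tfidf.pkl, model_lda.pkl, model_lsi.pkl, or model_rp.pkl).
Finally, the original corpus is transformed by the trained model, and the resulting corpus (the topic representation) is saved as a new Matrix Market file named after the method, for example lsi.mm.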
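
The `.pkl` extension suggests pickle-based serialization. As a rough illustration of that save-and-reload round trip, the sketch below uses only the standard library `pickle` module on a plain dictionary. The dictionary is a made-up stand-in for a trained model, and the script itself calls the model's own `save` method instead.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; a real gensim model is a much richer object.
model_state = {
    "method": "lsi",
    "num_topics": 200,
    "id2word": {0: "corpus", 1: "model"},
}

with tempfile.TemporaryDirectory() as result_dir:
    # File name follows the model_<method>.pkl pattern used by the script.
    path = os.path.join(result_dir, "model_%s.pkl" % model_state["method"])
    with open(path, "wb") as fout:
        pickle.dump(model_state, fout, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as fin:
        restored = pickle.load(fin)
```

Saving the model separately from the transformed corpus means the model can later be applied to new documents without retraining.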

8. Code

    #!/usr/bin/env python
    #
    # Copyright (C) 2010 Radim Rehurek <radimrehurek@seznam.cz>
    # Licensed under the GNU LGPL v2.1 - https://www.gnu.org/licenses/old-licenses/lgpl-2.1.en.html

    """
    USAGE: %(program)s LANGUAGE METHOD

    Generate topic models for the specified subcorpus. METHOD is currently one \
    of 'tfidf', 'lsi', 'lda', 'rp'.

    Example: ./gensim_genmodel.py any lsi
    """

    import logging
    import sys
    import os.path

    from gensim.corpora import dmlcorpus, MmCorpus
    from gensim.models import lsimodel, ldamodel, tfidfmodel, rpmodel

    import gensim_build

    # internal method parameters
    DIM_RP = 300  # dimensionality for random projections
    DIM_LSI = 200  # for latent semantic indexing
    DIM_LDA = 100  # for latent dirichlet allocation

    if __name__ == '__main__':
        logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
        logging.root.setLevel(level=logging.INFO)
        logging.info("running %s", ' '.join(sys.argv))

        program = os.path.basename(sys.argv[0])

        # check and process input arguments
        if len(sys.argv) < 3:
            print(globals()['__doc__'] % locals())
            sys.exit(1)
        language = sys.argv[1]
        method = sys.argv[2].strip().lower()

        logging.info("loading corpus mappings")
        config = dmlcorpus.DmlConfig('%s_%s' % (gensim_build.PREFIX, language),
                                     resultDir=gensim_build.RESULT_DIR, acceptLangs=[language])

        logging.info("loading word id mapping from %s", config.resultFile('wordids.txt'))
        id2word = dmlcorpus.DmlCorpus.loadDictionary(config.resultFile('wordids.txt'))
        logging.info("loaded %i word ids", len(id2word))

        corpus = MmCorpus(config.resultFile('bow.mm'))

        if method == 'tfidf':
            model = tfidfmodel.TfidfModel(corpus, id2word=id2word, normalize=True)
            model.save(config.resultFile('model_tfidf.pkl'))
        elif method == 'lda':
            model = ldamodel.LdaModel(corpus, id2word=id2word, num_topics=DIM_LDA)
            model.save(config.resultFile('model_lda.pkl'))
        elif method == 'lsi':
            # first, transform word counts to tf-idf weights
            tfidf = tfidfmodel.TfidfModel(corpus, id2word=id2word, normalize=True)
            # then find the transformation from tf-idf to latent space
            model = lsimodel.LsiModel(tfidf[corpus], id2word=id2word, num_topics=DIM_LSI)
            model.save(config.resultFile('model_lsi.pkl'))
        elif method == 'rp':
            # first, transform word counts to tf-idf weights
            tfidf = tfidfmodel.TfidfModel(corpus, id2word=id2word, normalize=True)
            # then find the transformation from tf-idf to latent space
            model = rpmodel.RpModel(tfidf[corpus], id2word=id2word, num_topics=DIM_RP)
            model.save(config.resultFile('model_rp.pkl'))
        else:
            raise ValueError('unknown topic extraction method: %s' % repr(method))

        MmCorpus.saveCorpus(config.resultFile('%s.mm' % method), model[corpus])

        logging.info("finished running %s", program)