Speaker diarization with pyannote-audio
# GitHub address
https://github.com/MasonYyp/audio
1 Introduction
pyannote.audio is an open-source Python toolkit for speaker diarization. Built on PyTorch, it not only provides high-performing pretrained models and pipelines for diarization, but also lets you fine-tune those models further.
Its main features are:
- Speaker embedding: extract a voiceprint from an audio clip and convert it into a vector (embedding);
- Speaker recognition: identify the different speakers (multiple people) in an audio clip;
- Voice activity detection: detect which time regions of an audio clip contain speech;
- Overlapped speech detection: detect the regions of an audio clip where speakers overlap;
- Speaker segmentation: split an audio clip into segments.
pyannote.audio mainly provides three models: "segmentation", "embedding" and "speaker-diarization". "segmentation" splits the audio, "embedding" produces speaker embeddings (the same role as wespeaker-voxceleb-resnet34-LM), and "speaker-diarization" is a pipeline that combines the two models above.
pyannote-audio reference links
# Hugging Face address
https://hf-mirror.com/pyannote
# GitHub address
https://github.com/pyannote/pyannote-audio
Note: different versions of pyannote.audio behave somewhat differently.
2 Using pyannote.audio 3.1.1
2.1 Installing pyannote.audio
pip install pyannote.audio==3.1.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
To use the pipelines, you first need to download the models from Hugging Face. The models are:
Note: some pyannote.audio models are gated. You must log in to Hugging Face, fill in some information and accept the relevant license agreement before downloading; otherwise the download will fail.
# 1 Embedding model: pyannote/wespeaker-voxceleb-resnet34-LM
https://hf-mirror.com/pyannote/wespeaker-voxceleb-resnet34-LM
# 2 Segmentation model: pyannote/segmentation-3.0
https://hf-mirror.com/pyannote/segmentation-3.0
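If you would rather not manage the model files yourself, the gated pipeline can also be loaded directly from the Hub by passing an access token. A minimal sketch following the official pyannote usage; the token value is a placeholder you must replace with your own:
from pyannote.audio import Pipeline

# Load the gated pipeline directly from Hugging Face
# (requires a token from an account that has accepted the license agreement)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_****"  # placeholder: replace with your own token
)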
Commands for downloading the models with huggingface-cli:
# Note: create a Python environment first
# Install huggingface-cli
pip install -U huggingface_hub
# Example: download the pyannote/embedding model
# A Hugging Face token must be provided: --token hf_****
huggingface-cli download --resume-download pyannote/embedding --local-dir pyannote/embedding --local-dir-use-symlinks False --token hf_****
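The same download can also be done from Python via huggingface_hub. A small sketch, assuming the same pyannote/embedding repository and a placeholder token:
from huggingface_hub import snapshot_download

# Download the whole model repository into a local directory
snapshot_download(
    repo_id="pyannote/embedding",
    local_dir="pyannote/embedding",
    token="hf_****"  # placeholder: replace with your own token
)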
Note these two classes:
# Inference is mainly used for speaker embedding
from pyannote.audio import Inference
# Annotation is mainly used for diarization results
from pyannote.core import Annotation

# Main methods of Annotation, assuming the instance is called diarization
# Get the speaker labels in the audio
labels = diarization.labels()
# Get all active Segments in the audio (as a list)
segments = list(diarization.itertracks())
# Get the time segments of a given speaker (as a list); "SPEAKER_00" is the label of the first speaker
durations = diarization.label_timeline('SPEAKER_00')
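For example, itertracks(yield_label=True) yields each segment together with its speaker label, which is convenient for printing a readable timeline. A small sketch, assuming a diarization result as above:
# Print start time, end time and speaker for every segment
for segment, track, speaker in diarization.itertracks(yield_label=True):
    print(f"{segment.start:.1f}s - {segment.end:.1f}s: {speaker}")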
2.2 Speaker diarization
Note: pyannote/speaker-diarization-3.1 is very slow at diarization; I am not sure whether my approach is at fault (a 30-minute recording took more than 20 minutes to process). Using a single model on its own is fast. pyannote/speaker-diarization (version 2) is faster, so I recommend using pyannote/speaker-diarization (version 2).
Note: loading the models here differs from the usual approach. Normally you only pass the model name, but here the path must point all the way to the specific model file.
(1) Using the Python API
# Using pyannote-audio 3.1.1
import time

from pyannote.audio import Model
from pyannote.audio.pipelines import SpeakerDiarization
from pyannote.audio.pipelines.utils import PipelineModel
from pyannote.core import Annotation

# Speaker embedding model
embedding: PipelineModel = Model.from_pretrained("E:/model/pyannote/pyannote-audio-3.1.1/wespeaker-voxceleb-resnet34-LM/pytorch_model.bin")
# Speech segmentation model
segmentation: PipelineModel = Model.from_pretrained("E:/model/pyannote/pyannote-audio-3.1.1/segmentation-3.0/pytorch_model.bin")

# Speaker diarization pipeline
speaker_diarization: SpeakerDiarization = SpeakerDiarization(segmentation=segmentation, embedding=embedding)

# Instantiate the pipeline with its hyperparameters
HYPER_PARAMETERS = {
    "clustering": {"method": "centroid", "min_cluster_size": 12, "threshold": 0.7045654963945799},
    "segmentation": {"min_duration_off": 0.58}
}
speaker_diarization.instantiate(HYPER_PARAMETERS)

start_time = time.time()

# Run diarization
diarization: Annotation = speaker_diarization("E:/語音識別/數(shù)據(jù)/0-test-en.wav")

# Get the list of speakers
print(diarization.labels())
# Get the list of active segments
print(list(diarization.itertracks()))
print(diarization.label_timeline('SPEAKER_00'))

end_time = time.time()
print(end_time - start_time)
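Regarding the slowness mentioned above: in pyannote.audio 3.x the pipeline runs on the CPU by default. Moving it to a GPU, and passing the number of speakers when it is known, usually speeds things up considerably. A hedged sketch (the .to() call and the num_speakers argument belong to the 3.x pipeline API; num_speakers=2 is only an illustrative value):
import torch

# Move the pipeline to a GPU if one is available (pyannote.audio 3.x)
if torch.cuda.is_available():
    speaker_diarization.to(torch.device("cuda"))

# If the number of speakers is known in advance, pass it to skip estimating it
diarization = speaker_diarization("E:/語音識別/數(shù)據(jù)/0-test-en.wav", num_speakers=2)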
(2) Using a YAML config
# instantiate the pipeline
from pyannote.audio import Pipeline
from pyannote.core import Annotation

speaker_diarization = Pipeline.from_pretrained("E:/model/pyannote/speaker-diarization-3.1/config.yaml")

# Run diarization
diarization: Annotation = speaker_diarization("E:/語音識別/數(shù)據(jù)/0-test-en.wav")
print(type(diarization))
print(diarization.labels())
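The resulting Annotation can also be saved in the standard RTTM format, which most diarization tools can read. A minimal sketch using pyannote.core's write_rttm; the output file name is arbitrary:
# Write the diarization result to an RTTM file
with open("0-test-en.rttm", "w") as rttm:
    diarization.write_rttm(rttm)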
config.yaml
As the file shows, speaker diarization combines the embedding and segmentation models.
version: 3.1.0

pipeline:
  name: pyannote.audio.pipelines.SpeakerDiarization
  params:
    clustering: AgglomerativeClustering
    # embedding: pyannote/wespeaker-voxceleb-resnet34-LM
    embedding: E:/model/pyannote/speaker-diarization-3.1/wespeaker-voxceleb-resnet34-LM/pytorch_model.bin
    embedding_batch_size: 32
    embedding_exclude_overlap: true
    # segmentation: pyannote/segmentation-3.0
    segmentation: E:/model/pyannote/speaker-diarization-3.1/segmentation-3.0/pytorch_model.bin
    segmentation_batch_size: 32

params:
  clustering:
    method: centroid
    min_cluster_size: 12
    threshold: 0.7045654963945799
  segmentation:
    min_duration_off: 0.0
Model directory
The other files in the model directory can be deleted; only "pytorch_model.bin" needs to be kept.
Execution result
2.3 Speaker recognition
Compare the similarity between two audio clips.
from pyannote.audio import Model
from pyannote.audio import Inference
from scipy.spatial.distance import cdist

# Load the embedding model
embedding = Model.from_pretrained("E:/model/pyannote/speaker-diarization-3.1/wespeaker-voxceleb-resnet34-LM/pytorch_model.bin")

# Voiceprint extractor
inference: Inference = Inference(embedding, window="whole")

# Generate the voiceprints (1-D vectors)
embedding1 = inference("E:/語音識別/數(shù)據(jù)/0-test-en.wav")
embedding2 = inference("E:/語音識別/數(shù)據(jù)/0-test-en.wav")

# Compute the similarity (cosine distance) between the two voiceprints
distance = cdist([embedding1], [embedding2], metric="cosine")
print(distance)
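cdist returns a 1x1 matrix of cosine distances (0 means identical direction, 2 means opposite). To make an accept/reject decision you typically convert the distance into a similarity and compare it against a threshold; the threshold below is only an illustrative assumption and should be calibrated on your own data:
# Convert cosine distance to cosine similarity and apply a (tunable) threshold
similarity = 1 - distance[0][0]
THRESHOLD = 0.5  # assumed value; calibrate on your own data
print("same speaker" if similarity >= THRESHOLD else "different speakers")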
2.4 Voice activity detection
from pyannote.audio import Model
from pyannote.core import Annotation
from pyannote.audio.pipelines import VoiceActivityDetection

# Load the segmentation model
model = Model.from_pretrained("E:/model/pyannote/speaker-diarization-3.1/segmentation-3.0/pytorch_model.bin")

# Initialize the pipeline parameters
activity_detection = VoiceActivityDetection(segmentation=model)
HYPER_PARAMETERS = {
    # remove speech regions shorter than that many seconds.
    "min_duration_on": 1,
    # fill non-speech regions shorter than that many seconds.
    "min_duration_off": 0
}
activity_detection.instantiate(HYPER_PARAMETERS)

# Run voice activity detection
annotation: Annotation = activity_detection("E:/語音識別/數(shù)據(jù)/0-test-en.wav")

# Get the list of active segments
segments = list(annotation.itertracks())
print(segments)
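Overlapped speech detection, listed in the introduction, works the same way but with a different pipeline class built on the same segmentation model. A minimal sketch, assuming the same local segmentation-3.0 checkpoint and test file; the hyperparameter values are illustrative:
from pyannote.audio import Model
from pyannote.core import Annotation
from pyannote.audio.pipelines import OverlappedSpeechDetection

# Reuse the segmentation model for overlapped speech detection
model = Model.from_pretrained("E:/model/pyannote/speaker-diarization-3.1/segmentation-3.0/pytorch_model.bin")
overlap_detection = OverlappedSpeechDetection(segmentation=model)
overlap_detection.instantiate({"min_duration_on": 0.0, "min_duration_off": 0.0})

# Regions where at least two speakers talk at the same time
overlap: Annotation = overlap_detection("E:/語音識別/數(shù)據(jù)/0-test-en.wav")
print(list(overlap.itertracks()))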
3 Using pyannote.audio 2.1.1
Note: this is the recommended version.
3.1 Installing pyannote.audio
# Install the package
pip install pyannote.audio==2.1.1 -i https://pypi.tuna.tsinghua.edu.cn/simple

# 1 Embedding model: pyannote/embedding
https://hf-mirror.com/pyannote/embedding
# 2 Segmentation model: pyannote/segmentation
https://hf-mirror.com/pyannote/segmentation
3.2 Speaker diarization
# Using pyannote-audio 2.1.1
import time

from pyannote.audio.pipelines import SpeakerDiarization
from pyannote.audio import Model
from pyannote.audio.pipelines.utils import PipelineModel
from pyannote.core import Annotation

# Speaker embedding model
embedding: PipelineModel = Model.from_pretrained("E:/model/pyannote/pyannote-audio-2.1.1/embedding/pytorch_model.bin")
# Speech segmentation model
segmentation: PipelineModel = Model.from_pretrained("E:/model/pyannote/pyannote-audio-2.1.1/segmentation/pytorch_model.bin")

# Speaker diarization pipeline
speaker_diarization: SpeakerDiarization = SpeakerDiarization(
    segmentation=segmentation,
    embedding=embedding,
    clustering="AgglomerativeClustering"
)

HYPER_PARAMETERS = {
    "clustering": {"method": "centroid", "min_cluster_size": 15, "threshold": 0.7153814381597874},
    "segmentation": {"min_duration_off": 0.5817029604921046, "threshold": 0.4442333667381752}
}
speaker_diarization.instantiate(HYPER_PARAMETERS)

start_time = time.time()

# vad: Annotation = pipeline("E:/語音識別/數(shù)據(jù)/0-test-en.wav")
diarization: Annotation = speaker_diarization("E:/語音識別/數(shù)據(jù)/0-test-en.wav")

# Get the list of speakers
print(diarization.labels())

end_time = time.time()
print(end_time - start_time)
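As a small post-processing example, the total speaking time per speaker can be read straight from the Annotation object. A sketch that assumes the diarization result produced above:
# Total speaking time (in seconds) for each speaker
for speaker in diarization.labels():
    print(speaker, round(diarization.label_duration(speaker), 1))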
3.3 Other features
Everything shown for version 3.1.1 can also be done with 2.1.1; just refer to the 3.1.1 examples above.