Using speaker diarization results with a speech recognition API

Problem description

I'm trying to learn more about speaker diarization and speech recognition. I started following this tutorial and was able to get the tuples of audio labels.

According to the tutorial, you then use the Google Speech API: you send each audio segment to the API and it gets transcribed, and that is exactly where I'm stuck!

According to the tutorial, all you have to do is:

  1. Get a Google / IBM Watson speech-to-text API (done).

(I've completed this step and obtained the Watson API key and URL!)

  2. For each tuple element 'ele' in the labelling list, extract ele[0] as the speaker label, ele[1] as the start time, and ele[2] as the end time.

(I don't understand this step at all... I tried the following, but I'm not sure whether it's what they meant; see also the note after the snippet.)


for ele in labelling:
    speaker_label = ele[0]
    start_time = ele[1]
    end_time = ele[2]
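(From the create_labelling function in my full code below, each tuple seems to have the form (speaker_label, start_seconds, end_seconds), for example ('0', 0.0, 4.5), so I believe the loop at least unpacks the right fields.)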

  3. Trim your original audio file from the start time to the end time. You can use ffmpeg for this task.

(This step builds on the previous one, but I don't understand it either, because I don't know how to use ffmpeg or how to apply it to this project. My best guess, from the ffmpeg-python documentation, is sketched below.)
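(Here 'segment.wav' is just a placeholder output name I made up, and start_time / end_time come from one labelling tuple:)

# Untested guess: trim one segment out of the original file with ffmpeg-python.
# ss seeks to start_time; t limits the clip to the segment's duration.
(
    ffmpeg
    .input(audio_file_path, ss=start_time, t=end_time - start_time)
    .output('segment.wav')
    .overwrite_output()
    .run(quiet=True)
)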

  4. Pass the trimmed audio file obtained in the previous step to the Google API / IBM Watson API, and it will return a text transcript of this audio segment.

(I just need to understand the context, or how the code that passes a segmented audio file would look. My guess, based on the examples in the ibm-watson SDK, follows, untested.)
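(Again, 'segment.wav' is a placeholder name for one trimmed clip:)

# Untested guess: send one trimmed WAV segment to Watson speech-to-text.
with open('segment.wav', 'rb') as audio_file:
    response = speech_to_text.recognize(
        audio=audio_file,
        content_type='audio/wav'
    ).get_result()

# The response lists one result per detected utterance; take the
# top alternative of each and join them.
transcript = ' '.join(
    r['alternatives'][0]['transcript'] for r in response['results']
)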

  5. Write the transcript to a text file along with the speaker label, and save it.

Any help would be greatly appreciated!

My full code:

from resemblyzer import preprocess_wav, VoiceEncoder
from pathlib import Path

from resemblyzer.audio import sampling_rate

from spectralcluster import SpectralClusterer

import ffmpeg

from ibm_watson import SpeechToTextV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

# IBM Watson setup (not used yet, since the transcription part isn't implemented)
authenticator = IAMAuthenticator('Key here')
speech_to_text = SpeechToTextV1(
    authenticator=authenticator
)


speech_to_text.set_service_url('URL HERE')

#-------------------------------------------------------

# From the tutorial: load the audio file and preprocess it

# give the file path to your audio file
audio_file_path = 'Audio files/testForTheOthers.wav'
wav_fpath = Path(audio_file_path)

wav = preprocess_wav(wav_fpath)
encoder = VoiceEncoder("cpu")
_, cont_embeds, wav_splits = encoder.embed_utterance(wav, return_partials=True, rate=16)
print(cont_embeds.shape)



#-----------------------------------------------------------------------


# From the tutorial: the clustering part
# (some of the original arguments raised errors, so they are not included;
#  p_percentile=0.90 and gaussian_blur_sigma=1 were removed)
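# (I read that newer spectralcluster releases moved these arguments into a
# RefinementOptions object, roughly like:
#     from spectralcluster import RefinementOptions
#     refinement_options = RefinementOptions(p_percentile=0.90,
#                                            gaussian_blur_sigma=1)
#     clusterer = SpectralClusterer(min_clusters=2, max_clusters=100,
#                                   refinement_options=refinement_options)
# but I haven't verified this.)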

clusterer = SpectralClusterer(
    min_clusters=2,
    max_clusters=100,
)

labels = clusterer.predict(cont_embeds)
#-----------------------------------------------------------------------



# From the tutorial: turn the frame-level cluster labels into
# (speaker, start, end) segments


def create_labelling(labels, wav_splits):
    # sampling_rate is already imported at the top of the file.
    # Midpoint (in seconds) of each partial-embedding window:
    times = [((s.start + s.stop) / 2) / sampling_rate for s in wav_splits]
    labelling = []
    start_time = 0

    for i, time in enumerate(times):
        # Close a segment whenever the speaker label changes
        if i > 0 and labels[i] != labels[i - 1]:
            labelling.append((str(labels[i - 1]), start_time, time))
            start_time = time
        # Close the final segment at the last window
        if i == len(times) - 1:
            labelling.append((str(labels[i]), start_time, time))

    return labelling


labelling = create_labelling(labels, wav_splits)


print(labelling)
#----------------------

# My attempt at implementing step 2

for ele in labelling:
    speaker_label = ele[0]
    start_time = ele[1]
    end_time = ele[2]


#-----------------------------------------------------------------------------

# After this point you are supposed to implement the rest of the tutorial,
# but this is where I'm stuck.


Tags: python, ffmpeg, speech-recognition, ibm-watson, speech-to-text

Solution
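
The question's steps 2-5 can be wired together in one loop: trim each labelled segment with ffmpeg, transcribe it with Watson, and write each transcript with its speaker label to a text file. What follows is a minimal, untested sketch under those assumptions; it reuses labelling, audio_file_path, and speech_to_text from the question's code, and the segment/output file names are made up for illustration.

import ffmpeg

lines = []

for i, (speaker_label, start_time, end_time) in enumerate(labelling):
    segment_path = f'segment_{i}.wav'

    # Step 3: trim the original audio from start_time to end_time
    (
        ffmpeg
        .input(audio_file_path, ss=start_time, t=end_time - start_time)
        .output(segment_path)
        .overwrite_output()
        .run(quiet=True)
    )

    # Step 4: transcribe the trimmed segment with Watson
    with open(segment_path, 'rb') as audio_file:
        response = speech_to_text.recognize(
            audio=audio_file,
            content_type='audio/wav'
        ).get_result()

    transcript = ' '.join(
        r['alternatives'][0]['transcript'] for r in response['results']
    )

    # Step 5: collect the transcript together with its speaker label
    lines.append(f'Speaker {speaker_label}: {transcript}')

with open('transcript.txt', 'w') as f:
    f.write('\n'.join(lines))

Note that very short segments can come back with an empty response['results'] list, in which case the join above simply produces an empty transcript for that speaker turn.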

