Microsoft.CognitiveServices.Speech.SpeechRecognizer - getting the time offset of results in a file, continuous recognition

Question

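I'm testing the new unified speech engine on Azure and trying to transcribe a 10-minute audio file. I created a recognizer with CreateSpeechRecognizerWithFileInput, with detailed results enabled, and started continuous recognition with StartContinuousRecognitionAsync.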

In the FinalResultsReceived event, there doesn't seem to be a way to access the audio offset through SpeechRecognitionResult. If I do this:

string rawResult = ea.Result.ToString();  // the raw value is accessible this way
Regex r = new Regex(@"""Offset"":(\d+)");
// parse as long so that large offsets cannot overflow an int
long offset = long.Parse(r.Match(rawResult).Groups[1].Value);

then I can extract the offset. The raw result looks like this:

ResultId:4116b361141446a98f306fdc11c3a5bd Status:Recognized Recognized text:<OK, so what's your think it went well, let's look at number number is 104-828-1198.>. Json:{"Duration":129500000,"NBest":[{"Confidence":0.887861133,"Display":"OK, so what's your think it went well, let's look at number number is 104-828-1198.","ITN":"OK so what's your think it went well let's look at number number is 104-828-1198","Lexical":"OK so what's your think it went well let's look at number number is one zero four eight two eight one one nine eight","MaskedITN":"OK so what's your think it went well let's look at number number is 104-828-1198"}],"Offset":6900000,"RecognitionStatus":"Success"}

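The problem is that the offset is sometimes zero even for segments that are not at the start of the file, so I get zeros in the middle of a recognition stream.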

I also tried submitting the same file through the batch transcription API, which gave me a completely different result:

{
    "RecognitionStatus": "Success",
    "Offset": 531700000,
    "Duration": 91300000,
    "NBest": [{
            "Confidence": 0.87579143,
            "Lexical": "OK so what's your think it went well let's look at number number is one zero four eight two eight one",
            "ITN": "OK so what's your think it went well let's look at number number is 1048281",
            "MaskedITN": "OK so what's your think it went well let's look at number number is 1048281",
            "Display": "OK, so what's your think it went well, let's look at number number is 1048281."
        }
    ]
},

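So I have three questions about this:

  1. Is there a supported way to get the offset of a recognized section of the file through the recognizer API? SpeechRecognitionResult doesn't expose it, and neither does the Best() extension.
  2. Why does the offset come back as 0 for some segments in the middle of the file?
  3. What are the units of the offsets in the batch and file recognition APIs, and why do they differ? They don't appear to be milliseconds or frames, at least judging by what I see in Audacity. The result I posted is from about 59 seconds into the file, which is about 800k samples.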

Tags: c#, azure, speech-recognition, microsoft-cognitive

Answer


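Chris,

Thanks for your feedback. To your questions: 1) Offset and duration have been added to the API. The next release (coming very soon) will give you access to both properties, so please stay tuned. 2) This is probably caused by a different recognition mode being used. We will also fix that in the next release. 3) The time unit is 100 ns (one tick) for both APIs. Please also note that batch transcription uses a different model than online recognition, so the recognition results may differ slightly.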
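To make the tick unit concrete, here is a minimal sketch of how the Offset and Duration fields from the JSON payloads in the question could be read and converted to wall-clock time. It assumes System.Text.Json is available (.NET Core 3.0 or later); the SpeechTiming class and its ParseJson method are hypothetical helpers for illustration, not part of the Speech SDK. A .NET TimeSpan tick is also 100 ns, so the values convert directly: 6,900,000 ticks is 0.69 s and 531,700,000 ticks is 53.17 s.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper for illustration only; not part of the Speech SDK.
public static class SpeechTiming
{
    // Reads "Offset" and "Duration" (both in 100 ns ticks) from a result
    // JSON payload and converts them to TimeSpan values.
    public static (TimeSpan Offset, TimeSpan Duration) ParseJson(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        // Read as 64-bit integers: a 10-minute file spans 6,000,000,000
        // ticks, which does not fit in a 32-bit int.
        long offsetTicks = root.GetProperty("Offset").GetInt64();
        long durationTicks = root.GetProperty("Duration").GetInt64();

        // TimeSpan ticks are also 100 ns, so no scaling is required.
        return (TimeSpan.FromTicks(offsetTicks), TimeSpan.FromTicks(durationTicks));
    }
}
```

Calling SpeechTiming.ParseJson on the batch result shown in the question would give an offset of 53.17 s and a duration of 9.13 s. Parsing the JSON rather than matching it with a regular expression also avoids depending on the order of the fields in the payload.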

Sorry for the inconvenience!

Thanks,

