Phoneme-level pronunciation correctness with Microsoft Speech

Question

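I am using the Pronunciation Assessment service of Microsoft Cognitive Services (through the Python API). At the moment I can display the phoneme breakdown (with confidence scores) based on the reference text I pass in the request. My question is: is there any way to get the phoneme breakdown of what was actually said? In other words, is it possible to get the detected phonemes as output, rather than the phonemes the system expects to recognize based on the reference text?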

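This is the output I currently get. However, instead of the phonemes that make up the reference word "can't", I would like to get the phonemes of the word that was actually spoken.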

            {
                "Word": "can't", 
                "AccuracyScore": 85.0, 
                "ErrorType": "None", 
                "Offset": 39900000, 
                "Duration": 6500000, 
                "Phonemes": [
                    {
                        "Duration": 1300000, 
                        "Phoneme": "k", 
                        "AccuracyScore": 89.0, 
                        "Offset": 39900000
                    }, 
                    {
                        "Duration": 800000, 
                        "Phoneme": "aa", 
                        "AccuracyScore": 86.0, 
                        "Offset": 41300000
                    }, 
                    {
                        "Duration": 1600000, 
                        "Phoneme": "n", 
                        "AccuracyScore": 74.0, 
                        "Offset": 42200000
                    }, 
                    {
                        "Duration": 2500000, 
                        "Phoneme": "t", 
                        "AccuracyScore": 89.0, 
                        "Offset": 43900000
                    }
                ]
            }, 

Thanks in advance.

Tags: python, windows, microsoft-cognitive, voice-recognition, phonetics

Answer


After going through the Pronunciation Assessment documentation and the sample code on GitHub, it seems we can get what the speaker said by printing reference_text.


You can also call PronunciationAssessmentConfig.to_json() (that is, pronunciation_config.to_json()) to get all of the parameters, including reference_text.
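For inspecting the per-phoneme breakdown shown in the question, here is a minimal sketch that uses only the standard library. It parses one word entry copied from the question's output and lists each phoneme with its score and timing. The helper names `summarize_phonemes` and `weakest_phoneme` are hypothetical and not part of the Speech SDK, and the conversion assumes `Offset` and `Duration` are expressed in 100-nanosecond ticks. Note that this only reads what the service returned, which is aligned to the reference text; it does not recover phonemes the speaker produced that differ from the reference.

```python
import json

# One word entry copied from the output in the question.
word_json = """
{
  "Word": "can't",
  "AccuracyScore": 85.0,
  "ErrorType": "None",
  "Offset": 39900000,
  "Duration": 6500000,
  "Phonemes": [
    {"Phoneme": "k",  "AccuracyScore": 89.0, "Offset": 39900000, "Duration": 1300000},
    {"Phoneme": "aa", "AccuracyScore": 86.0, "Offset": 41300000, "Duration": 800000},
    {"Phoneme": "n",  "AccuracyScore": 74.0, "Offset": 42200000, "Duration": 1600000},
    {"Phoneme": "t",  "AccuracyScore": 89.0, "Offset": 43900000, "Duration": 2500000}
  ]
}
"""

# Assumption: offsets and durations are in 100-nanosecond ticks.
TICKS_PER_SECOND = 10_000_000


def summarize_phonemes(word):
    """Return (phoneme, score, start_seconds, duration_seconds) per phoneme."""
    return [
        (
            p["Phoneme"],
            p["AccuracyScore"],
            p["Offset"] / TICKS_PER_SECOND,
            p["Duration"] / TICKS_PER_SECOND,
        )
        for p in word.get("Phonemes", [])
    ]


def weakest_phoneme(word):
    """Return the phoneme label with the lowest AccuracyScore in the word."""
    return min(word["Phonemes"], key=lambda p: p["AccuracyScore"])["Phoneme"]


word = json.loads(word_json)
rows = summarize_phonemes(word)

print(f'{word["Word"]} (ErrorType: {word["ErrorType"]})')
for phoneme, score, start, duration in rows:
    print(f"  {phoneme:<3} score={score:5.1f} start={start:.2f}s duration={duration:.2f}s")
print("lowest-scoring phoneme:", weakest_phoneme(word))
```

Sorting or filtering `rows` by score is a quick way to see which expected phonemes were matched poorly, which is often the closest available hint about where the speaker deviated from the reference.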

