Saving AWS Transcribe JSON output

Question

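I'm trying to have an AWS Lambda function send an audio file to AWS Transcribe for transcription. I then want to bring the transcribed result back into the original AWS Lambda function. I know that sending it to an S3 bucket might be easier, but that isn't part of the brief.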

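Everything else works fine, but I need to be able to access the transcript for the next step (passing the text to AWS Comprehend). The AWS Transcribe job is created successfully, and when I click to download the transcript, a JSON file is downloaded with the following contents: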

{"jobName":"MY_FILE_NAME","accountId":"MY_ID","results":{"transcripts":[{"transcript":"MY TRANSCRIPT"}],"items":[{"start_time":"0.04","end_time":"0.35","alternatives":[{"confidence":"1.0","content":"better"}],"type":"pronunciation"},{"start_time":"0.35","end_time":"0.71","alternatives":[{"confidence":"1.0","content":"three"}],"type":"pronunciation"},{"start_time":"0.71","end_time":"1.09","alternatives":[{"confidence":"1.0","content":"hours"}],"type":"pronunciation"},{"start_time":"1.09","end_time":"1.29","alternatives":[{"confidence":"1.0","content":"too"}],"type":"pronunciation"},{"start_time":"1.29","end_time":"1.58","alternatives":[{"confidence":"1.0","content":"soon"}],"type":"pronunciation"},{"start_time":"1.58","end_time":"1.73","alternatives":[{"confidence":"0.9991","content":"than"}],"type":"pronunciation"},{"start_time":"1.73","end_time":"1.79","alternatives":[{"confidence":"0.9421","content":"a"}],"type":"pronunciation"},{"start_time":"1.79","end_time":"2.16","alternatives":[{"confidence":"0.9312","content":"minute"}],"type":"pronunciation"},{"start_time":"2.16","end_time":"2.37","alternatives":[{"confidence":"0.925","content":"too"}],"type":"pronunciation"},{"start_time":"2.37","end_time":"2.76","alternatives":[{"confidence":"0.9973","content":"late"}],"type":"pronunciation"},{"alternatives":[{"confidence":"0.0","content":"."}],"type":"punctuation"}]},"status":"COMPLETED"}

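This file would be ideal, since I could pull the transcript straight out of the JSON. However, even though I do get the transcription result (exactly the same URL as the one that downloads the JSON file), I can't seem to read it as JSON. Here is my code. I've included the transcription step; s3bucket and s3object come from an earlier part of the code.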

#CREATE TRANSCRIBE JOB
jobName = s3object + '-' + str(uuid.uuid4())

client = boto3.client('transcribe')

response = client.start_transcription_job(
    TranscriptionJobName=jobName,
    LanguageCode='en-US',
    MediaFormat='mp3',
    Media={
        'MediaFileUri': "s3://" + s3bucket + "/" + str(s3object)
    },
)

#TESTING
print(response['TranscriptionJob']['TranscriptionJobName'])
time.sleep(50)
print(response)

# GET TRANSCRIBE FILE
while True:
    result = client.get_transcription_job(TranscriptionJobName=jobName)
    if result['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
      break
    time.sleep(15)
if result['TranscriptionJob']['TranscriptionJobStatus'] == "COMPLETED":
    data = result['TranscriptionJob']['Transcript']['TranscriptFileUri']
    data = json.loads(data)
    print(data)

When I print data, I get the following (which is also the URL that downloads the file):

https://s3.eu-west-2.amazonaws.com/aws-transcribe-eu-west-2-prod/21040557774/MY_FILE_NAME/8127c3b7-dcdf-4f64-8331-e61c7219c942/asrOutput.json?X-Amz-Security-Token=LONG_SECURITY_TOKEN&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210326T193652Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=MY_CREDENTIAL%2Feu-west-2%2Fs3%2Faws4_request&X-Amz-Signature=AMZ_SIGNATURE

Since this file downloads as a JSON file, I assumed I could simply load it into my code as JSON. That is where data = json.loads(data) comes from.

However, when I run that line, I get:

[ERROR] JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 51, in lambda_handler
data = json.loads(data)

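I know pandas is an option, but I'm using the AWS CLI, and I've spent about 2 hours going through different tutorials. Each one suggests some lines for getting pandas to work, and each one breaks halfway through. If at all possible, I'd like to open this file without going beyond simple imports, but if there is no other way, I understand.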

Thanks!

Tags: json, amazon-web-services

Answer


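You can't use data = json.loads(data) here, because data is a URL, not a JSON-formatted string. json.loads parses text that is already JSON; it does not fetch anything over the network.

Try downloading the file first (for example with the requests library) and then parsing the response body: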
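To see why the traceback reports an error at char 0, here is a minimal sketch that passes a URL string to json.loads. The URL below is a made-up placeholder, not the real presigned link from the question:

```python
import json

# Placeholder URL (an assumption for illustration, not the real presigned link).
url = "https://example.com/asrOutput.json"

try:
    json.loads(url)
except json.JSONDecodeError as err:
    # The parser fails on the very first character: "h" cannot start a JSON value.
    message = str(err)
    print(message)
```

The message matches the one in the Lambda log, which suggests the string being parsed is the link itself rather than the file it points to.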

import requests
data = requests.get(data).json()
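Since requests is not part of the Python standard library, it has to be packaged with the Lambda function. If you would rather stay with plain imports, the following is a rough sketch of the same idea using only urllib.request and json. The helper names fetch_transcript and extract_transcript_text are hypothetical, and the key path assumes the JSON layout shown in the question:

```python
import json
import urllib.request


def fetch_transcript(uri: str, timeout: float = 30.0) -> dict:
    """Download the document behind a TranscriptFileUri and parse it as JSON."""
    with urllib.request.urlopen(uri, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def extract_transcript_text(payload: dict) -> str:
    """Return the full transcript string.

    Assumes the layout from the question:
    {"results": {"transcripts": [{"transcript": "..."}], ...}}
    """
    return payload["results"]["transcripts"][0]["transcript"]


# Usage inside the handler, where `data` is the URL from the question:
#   payload = fetch_transcript(data)
#   text = extract_transcript_text(payload)
```

Note that the presigned URL in the question carries X-Amz-Expires=900, so it should be fetched shortly after get_transcription_job returns it.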
