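json - Saving AWS Transcribe JSON output
Question
I am trying to have an AWS Lambda function send an audio document to AWS Transcribe for transcription. I then want to get the transcription back into the original AWS Lambda function. I understand that sending it to an S3 bucket might be easier, but that is not part of the brief.
Everything works so far, but I need to be able to access the transcript for the next step (passing the text to AWS Comprehend). The AWS Transcribe job is created successfully, and when I click "Download transcript", a JSON file is downloaded with the following content: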
{"jobName":"MY_FILE_NAME","accountId":"MY_ID","results":{"transcripts":[{"transcript":"MY TRANSCRIPT"}],"items":[{"start_time":"0.04","end_time":"0.35","alternatives":[{"confidence":"1.0","content":"better"}],"type":"pronunciation"},{"start_time":"0.35","end_time":"0.71","alternatives":[{"confidence":"1.0","content":"three"}],"type":"pronunciation"},{"start_time":"0.71","end_time":"1.09","alternatives":[{"confidence":"1.0","content":"hours"}],"type":"pronunciation"},{"start_time":"1.09","end_time":"1.29","alternatives":[{"confidence":"1.0","content":"too"}],"type":"pronunciation"},{"start_time":"1.29","end_time":"1.58","alternatives":[{"confidence":"1.0","content":"soon"}],"type":"pronunciation"},{"start_time":"1.58","end_time":"1.73","alternatives":[{"confidence":"0.9991","content":"than"}],"type":"pronunciation"},{"start_time":"1.73","end_time":"1.79","alternatives":[{"confidence":"0.9421","content":"a"}],"type":"pronunciation"},{"start_time":"1.79","end_time":"2.16","alternatives":[{"confidence":"0.9312","content":"minute"}],"type":"pronunciation"},{"start_time":"2.16","end_time":"2.37","alternatives":[{"confidence":"0.925","content":"too"}],"type":"pronunciation"},{"start_time":"2.37","end_time":"2.76","alternatives":[{"confidence":"0.9973","content":"late"}],"type":"pronunciation"},{"alternatives":[{"confidence":"0.0","content":"."}],"type":"punctuation"}]},"status":"COMPLETED"}
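This file would be ideal, because I could pull the transcript straight out of the JSON. However, even though I get the transcription result (which is exactly the same URL that downloads the JSON file), I cannot seem to read it as JSON. Here is my code. I have included the transcription step; s3bucket and s3object come from an earlier part of the code.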
# CREATE TRANSCRIBE JOB
jobName = s3object + '-' + str(uuid.uuid4())
client = boto3.client('transcribe')
response = client.start_transcription_job(
    TranscriptionJobName=jobName,
    LanguageCode='en-US',
    MediaFormat='mp3',
    Media={
        'MediaFileUri': "s3://" + s3bucket + "/" + str(s3object)
    },
)
# TESTING
print(response['TranscriptionJob']['TranscriptionJobName'])
time.sleep(50)
print(response)
# GET TRANSCRIBE FILE
while True:
    result = client.get_transcription_job(TranscriptionJobName=jobName)
    if result['TranscriptionJob']['TranscriptionJobStatus'] in ['COMPLETED', 'FAILED']:
        break
    time.sleep(15)
if result['TranscriptionJob']['TranscriptionJobStatus'] == "COMPLETED":
    data = result['TranscriptionJob']['Transcript']['TranscriptFileUri']
    data = json.loads(data)
    print(data)
When I print data, I get the following (which is also the URL that downloads the file):
https://s3.eu-west-2.amazonaws.com/aws-transcribe-eu-west-2-prod/21040557774/MY_FILE_NAME/8127c3b7-dcdf-4f64-8331-e61c7219c942/asrOutput.json?X-Amz-Security-Token=LONG_SECURITY_TOKEN&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20210326T193652Z&X-Amz-SignedHeaders=host&X-Amz-Expires=900&X-Amz-Credential=MY_CREDENTIAL%2Feu-west-2%2Fs3%2Faws4_request&X-Amz-Signature=AMZ_SIGNATURE
Since this downloads as a JSON file, I assumed I could simply load it into my code with the json module. That is where data = json.loads(data) comes from.
However, when I run that line, I get:
[ERROR] JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 51, in lambda_handler
    data = json.loads(data)
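I know pandas might be an option, but I am using the AWS CLI and have spent about 2 hours going through different tutorials, each with line-by-line advice on getting pandas to work, and each of them broke halfway through. So if at all avoidable, I would like to open the file without going beyond a simple import. If there is no other way, I understand.
Thanks!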
Answer
You cannot use data = json.loads(data), because data is a URL, not a JSON-formatted string.
Try downloading it first (for example, with the requests library):
import requests
data = requests.get(data).json()
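If bundling requests with the Lambda deployment is inconvenient, the same idea can be sketched with only the standard library, which fits the question's wish to stay with simple imports. This is a minimal sketch, not a definitive implementation: the helper names fetch_transcript_json and extract_transcript are hypothetical, and the dictionary path results -> transcripts -> [0] -> transcript is taken from the sample JSON shown in the question.

```python
import json
import urllib.request


def fetch_transcript_json(transcript_uri: str, timeout: float = 10.0) -> dict:
    """Download the document behind a URL and parse its body as JSON.

    Hypothetical helper: transcript_uri is expected to be the value of
    result['TranscriptionJob']['Transcript']['TranscriptFileUri'].
    """
    with urllib.request.urlopen(transcript_uri, timeout=timeout) as resp:
        # The URL itself is not JSON; the response body is.
        return json.loads(resp.read().decode("utf-8"))


def extract_transcript(payload: dict) -> str:
    """Return the first full transcript string.

    Assumes the layout of the sample file in the question:
    {"results": {"transcripts": [{"transcript": "..."}], ...}, ...}
    """
    return payload["results"]["transcripts"][0]["transcript"]


# Possible use inside the handler, once the job status is COMPLETED:
#   uri = result['TranscriptionJob']['Transcript']['TranscriptFileUri']
#   text = extract_transcript(fetch_transcript_json(uri))
```

The sample URL in the question carries X-Amz-Expires=900, so it is best fetched soon after get_transcription_job returns it rather than stored for later.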