Get results directly from Amazon Transcribe (serverless)

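Question

I am using a serverless Lambda function to transcribe speech to text with Amazon Transcribe. My current script can transcribe a file from S3 and store the result as a JSON file in S3.

Is it possible to get the result directly? I would like to store it in a database (PostgreSQL on AWS RDS).

Thanks for any hints.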

serverless.yml

...
provider:
  name: aws
  runtime: nodejs10.x
  region: eu-central-1
  memorySize: 128
  timeout: 30
  environment:
    S3_AUDIO_BUCKET: ${self:service}-${opt:stage, self:provider.stage}-records
    S3_TRANSCRIPTION_BUCKET: ${self:service}-${opt:stage, self:provider.stage}-transcriptions
    LANGUAGE_CODE: de-DE
  iamRoleStatements:
    - Effect: Allow
      Action:
        - s3:PutObject
        - s3:GetObject
      Resource:
        - 'arn:aws:s3:::${self:provider.environment.S3_AUDIO_BUCKET}/*'
        - 'arn:aws:s3:::${self:provider.environment.S3_TRANSCRIPTION_BUCKET}/*'
    - Effect: Allow
      Action:
        - transcribe:StartTranscriptionJob
      Resource: '*'

functions:

  transcribe:
    handler: handler.transcribe
    events:
      - s3:
          bucket: ${self:provider.environment.S3_AUDIO_BUCKET}
          event: s3:ObjectCreated:*

  createTextinput:
    handler: handler.createTextinput
    events:
      - http:
          path: textinputs
          method: post
          cors: true
...

resources:
  Resources:
    S3TranscriptionBucket:
      Type: 'AWS::S3::Bucket'
      Properties:
        BucketName: ${self:provider.environment.S3_TRANSCRIPTION_BUCKET}  
...

handler.js

const db = require('./db_connect');

const awsSdk = require('aws-sdk');

const transcribeService = new awsSdk.TranscribeService();

module.exports.transcribe = (event, context, callback) => {
  const records = event.Records;

  const transcribingPromises = records.map((record) => {
    const recordUrl = [
      'https://s3.amazonaws.com',
      process.env.S3_AUDIO_BUCKET,
      record.s3.object.key,
    ].join('/');

    // Create a random job name to avoid conflicts between Amazon Transcribe jobs.
    function makeid(length) {
      const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';
      let result = '';
      for (let i = 0; i < length; i++) {
        result += characters.charAt(Math.floor(Math.random() * characters.length));
      }
      return result;
    }

    const TranscriptionJobName = makeid(7);

    return transcribeService.startTranscriptionJob({
      LanguageCode: process.env.LANGUAGE_CODE,
      Media: { MediaFileUri: recordUrl },
      MediaFormat: 'wav',
      TranscriptionJobName,
      //MediaSampleRateHertz: 8000, // normally 8000 if you are using wav file
      OutputBucketName: process.env.S3_TRANSCRIPTION_BUCKET,
    }).promise();
  });

  Promise.all(transcribingPromises)
    .then(() => {
      callback(null, { message: 'Start transcription job successfully' });
    })
    .catch(err => callback(err, { message: 'Error start transcription job' }));
};

module.exports.createTextinput = (event, context, callback) => {
  context.callbackWaitsForEmptyEventLoop = false;
  const data = JSON.parse(event.body);
  db.insert('textinputs', data)
    .then(res => {
      callback(null,{
        statusCode: 200,
        body: "Textinput Created! id: " + res
      })
    })
    .catch(e => {
      callback(null,{
        statusCode: e.statusCode || 500,
        body: "Could not create a Textinput " + e
      })
    }) 
};

Tags: amazon-web-services, aws-lambda, aws-transcribe

Answer


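Amazon Transcribe currently only supports storing transcripts in S3, as described in the API definition of StartTranscriptionJob. There is one special case, though: if you don't want to manage your own S3 bucket for transcripts, you can omit OutputBucketName, and the transcript will be stored in an AWS-managed S3 bucket. In that case you get a pre-signed URL that allows you to download the transcript.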
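To illustrate the AWS-managed bucket case, here is a minimal sketch that looks up a finished job and downloads its transcript from the pre-signed URL. It assumes aws-sdk v2 (as in the question's handler.js), where the URL is returned in TranscriptionJob.Transcript.TranscriptFileUri of the GetTranscriptionJob response. The helper names download, extractTranscriptText and fetchTranscriptText are made up for this example, and the fetchBody parameter exists only so the function can be tried without network access.

```javascript
const https = require('https');

// Download a URL and resolve with the response body as a string.
function download(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body so the socket is released
        reject(new Error(`Unexpected status ${res.statusCode}`));
        return;
      }
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve(body));
    }).on('error', reject);
  });
}

// Pull the plain text out of a parsed Transcribe result document.
function extractTranscriptText(result) {
  return result.results.transcripts.map((t) => t.transcript).join(' ');
}

// transcribeClient: an instance of awsSdk.TranscribeService (aws-sdk v2).
// fetchBody: hypothetical injection point, defaults to a real HTTPS download.
async function fetchTranscriptText(transcribeClient, jobName, fetchBody = download) {
  const { TranscriptionJob: job } = await transcribeClient
    .getTranscriptionJob({ TranscriptionJobName: jobName })
    .promise();
  if (job.TranscriptionJobStatus !== 'COMPLETED') {
    throw new Error(`Job ${jobName} is ${job.TranscriptionJobStatus}`);
  }
  const body = await fetchBody(job.Transcript.TranscriptFileUri);
  return extractTranscriptText(JSON.parse(body));
}
```

It could be called as fetchTranscriptText(transcribeService, TranscriptionJobName) once the job has finished. Note that the role in the question's serverless.yml only allows transcribe:StartTranscriptionJob, so transcribe:GetTranscriptionJob would have to be added for this call to succeed.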

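Because transcription happens asynchronously, I suggest creating a second AWS Lambda function. It can be triggered either by the CloudWatch event that is emitted when the status of your transcription job changes (as described in "Using Amazon CloudWatch Events with Amazon Transcribe") or by an S3 notification ("Using AWS Lambda with Amazon S3"). This AWS Lambda function can then fetch the finished transcript from S3 and store its contents in PostgreSQL.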
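The second function might look roughly like the sketch below, which handles the CloudWatch event variant. It assumes the event carries detail.TranscriptionJobName and detail.TranscriptionJobStatus, that OutputBucketName is set as in the question so Transcribe writes the result to the job name plus ".json" in that bucket, and that db.insert(table, row) behaves like the question's db_connect module. The function name makeStoreTranscription and the column names job_name and text are hypothetical.

```javascript
// Build the Lambda handler from its dependencies so it can be tried with fakes.
// s3: an awsSdk.S3 instance (aws-sdk v2); db: the question's ./db_connect module.
function makeStoreTranscription({ s3, db, bucket }) {
  return async (event) => {
    const { TranscriptionJobName: jobName, TranscriptionJobStatus: status } = event.detail;
    if (status !== 'COMPLETED') {
      // Nothing to store for failed jobs.
      return { stored: false, jobName, status };
    }
    // With OutputBucketName set, Transcribe writes <job name>.json to that bucket.
    const object = await s3.getObject({ Bucket: bucket, Key: `${jobName}.json` }).promise();
    const result = JSON.parse(object.Body.toString('utf8'));
    const text = result.results.transcripts.map((t) => t.transcript).join(' ');
    // Column names are hypothetical; adapt them to the real table schema.
    const id = await db.insert('textinputs', { job_name: jobName, text });
    return { stored: true, jobName, id };
  };
}

// Possible wiring in handler.js (assumes aws-sdk v2 and ./db_connect from the question):
// const awsSdk = require('aws-sdk');
// module.exports.storeTranscription = makeStoreTranscription({
//   s3: new awsSdk.S3(),
//   db: require('./db_connect'),
//   bucket: process.env.S3_TRANSCRIPTION_BUCKET,
// });
```

In serverless.yml, this function would need a cloudwatchEvent trigger that matches source aws.transcribe and detail-type "Transcribe Job State Change". The existing s3:GetObject permission on the transcription bucket already covers the download.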

