Alpakka S3 `multipartUpload` does not upload files

Problem description

I have a question about integrating alpakka_kafka with alpakka_s3. When I use the Alpakka Kafka source, the Alpakka S3 `multipartUpload` sink does not seem to upload any files.

kafkaSource ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
    bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
    bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore

However, as soon as I add .take(100) after kafkaSource, everything works fine.

kafkaSource.take(100) ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
    bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
    bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore

Any help would be greatly appreciated. Thanks in advance!

Here is the full code snippet:

// Source
val kafkaSource: Source[(CommittableOffset, Array[Byte]), Consumer.Control] = {
    Consumer
      .committableSource(consumerSettings, Subscriptions.topics(prefixedTopics))
      .map(committableMessage => (committableMessage.committableOffset, committableMessage.record.value))
      .watchTermination() { (mat, f: Future[Done]) =>
        f.foreach { _ =>
          log.debug("consumer source shutdown, consumerId={}, group={}, topics={}", consumerId, group, prefixedTopics.mkString(", "))
        }

        mat
      }
  }

// Flow
val commitFlow: Flow[CommittableOffset, Done, NotUsed] = {
    Flow[CommittableOffset]
      .groupedWithin(batchingSize, batchingInterval)
      .map(group => group.foldLeft(CommittableOffsetBatch.empty) { (batch, elem) => batch.updated(elem) })
      .mapAsync(parallelism = 3) { msg =>
        log.debug("committing offset, msg={}", msg)

        msg.commitScaladsl().map { result =>
          log.debug("committed offset, msg={}", msg)
          result
        }
      }
  }

private val kafkaMsgToByteStringFlow = Flow[KafkaMessage[Any]].map(x => ByteString(x.msg + "\n"))

private val kafkaMsgToOffsetFlow = {
    implicit val askTimeout: Timeout = Timeout(5.seconds)
    Flow[KafkaMessage[Any]].mapAsync(parallelism = 5) { elem =>
      Future(elem.offset)
    }
  }


// Sink

val s3Sink = {
  val BUCKET = "test-data"
  s3Client.multipartUpload(BUCKET, "tmp/data.txt")
}

// Doesn't work (no files show up in S3)
kafkaSource ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
        bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
        bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore

// This one works...
kafkaSource.take(100) ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
        bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
        bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore

Tags: amazon-s3, apache-kafka, akka, akka-stream, alpakka

Solution


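Actually, it does upload. The problem is that a completion request has to be sent to S3 to finish the multipart upload, and only then does the file become available in the bucket. My guess is that, because the Kafka source without take(n) never stops producing data downstream, the stream never actually completes, so the sink never sends the completion request to S3: it always expects more data to upload first.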
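
The same effect can be reproduced without Kafka or S3. The following is only a minimal sketch, assuming Akka 2.6 or later (where an implicit ActorSystem is enough to run a stream). `CompletionDemo` and `countingSink` are made-up names, and `Sink.fold` merely stands in for the S3 sink because it also produces its result only once upstream completes.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object CompletionDemo {
  // Stand-in for the S3 sink (an assumption for illustration):
  // its materialized Future completes only when upstream completes.
  val countingSink: Sink[Int, Future[Int]] =
    Sink.fold[Int, Int](0)((count, _) => count + 1)

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("completion-demo")

    // Like a Kafka source: keeps emitting and never completes on its own.
    val endless = Source.tick(0.millis, 10.millis, 1)

    // No take(n): the sink never sees completion, so the Future stays pending.
    val neverDone: Future[Int] = endless.runWith(countingSink)

    // With take(100): upstream completes after 100 elements and the sink finishes.
    val done: Future[Int] = endless.take(100).runWith(countingSink)

    println(s"bounded stream counted: ${Await.result(done, 10.seconds)}")
    println(s"unbounded stream completed: ${neverDone.isCompleted}")

    system.terminate()
  }
}
```

The bounded run finishes and reports its count, while the unbounded one is still pending when it is inspected, which mirrors the file that never appears in the bucket.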

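There is no way to do what you want while uploading everything into one single file, so my suggestion is: group your kafkaSource messages and send the compressed Array[Byte] to the sink. The trick is that you have to create one sink per file instead of using a single sink.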
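
A rough sketch of that idea follows, again assuming Akka 2.6 or later. `PerFileUpload`, `uploadInChunks`, `sinkFor` and the `tmp/data-NNNNNN.txt` key pattern are hypothetical names chosen for illustration, not part of Alpakka. Compression and offset committing are left out to keep it short.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.Future
import scala.concurrent.duration._

object PerFileUpload {
  // Hypothetical helper: splits an endless stream into bounded chunks and
  // runs one short-lived sub-stream (with its own sink) per chunk.
  def uploadInChunks[M](
      source: Source[ByteString, M],
      maxElements: Int,
      maxWait: FiniteDuration,
      parallelism: Int
  )(sinkFor: String => Sink[ByteString, Future[Any]])(
      implicit system: ActorSystem
  ): Source[String, M] =
    source
      // Close a chunk after maxElements messages or maxWait, whichever comes first.
      .groupedWithin(maxElements, maxWait)
      .zipWithIndex
      .mapAsync(parallelism) { case (chunk, index) =>
        // Made-up key scheme: one object per chunk.
        val key = f"tmp/data-$index%06d.txt"
        // This inner stream is finite, so its sink sees completion
        // and can finish the upload for this one file.
        Source(chunk)
          .runWith(sinkFor(key))
          .map(_ => key)(system.dispatcher)
      }
}
```

With the client from the question, the factory could be something like `key => s3Client.multipartUpload(BUCKET, key)`, assuming that call returns a sink whose materialized value is a Future. The resulting source emits each key once its file has been written, so offsets could be committed downstream of it rather than on a separate broadcast branch.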

