Unable to get records from Avro data created by the Flume HDFS sink

Problem description

flume1.sources = kafka-source-1
flume1.channels = hdfs-channel-1
flume1.sinks = hdfs-sink-1
flume1.sources.kafka-source-1.type = org.apache.flume.source.kafka.KafkaSource
flume1.sources.kafka-source-1.kafka.bootstrap.servers= kafka bootstrap servers
flume1.sources.kafka-source-1.kafka.topics = topic
flume1.sources.kafka-source-1.batchSize = 1000
flume1.sources.kafka-source-1.channels = hdfs-channel-1
flume1.sources.kafka-source-1.kafka.consumer.group.id=group_id

flume1.channels.hdfs-channel-1.type = memory
flume1.sinks.hdfs-sink-1.channel = hdfs-channel-1
flume1.sinks.hdfs-sink-1.type = hdfs
flume1.sinks.hdfs-sink-1.hdfs.writeFormat = Text
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.hdfs.filePrefix = file_prefix
flume1.sinks.hdfs-sink-1.hdfs.fileSuffix = .avro
flume1.sinks.hdfs-sink-1.hdfs.inUsePrefix = tmp/
flume1.sinks.hdfs-sink-1.hdfs.useLocalTimeStamp = true
flume1.sinks.hdfs-sink-1.hdfs.path = /user/directory/ingest_date=%y-%m-%d
flume1.sinks.hdfs-sink-1.hdfs.rollCount=0
flume1.sinks.hdfs-sink-1.hdfs.rollSize=1000000
flume1.channels.hdfs-channel-1.capacity = 10000
flume1.channels.hdfs-channel-1.transactionCapacity = 10000

I am using Flume to consume Avro data from Kafka and store it in HDFS with the HDFS sink.

I am trying to create a Hive table on top of the Avro data in HDFS.

I can run msck repair table, and I can see the partitions added to the Hive metastore.

So when I run select * from table_name limit 1; it returns a record.

But when I try to fetch anything more than that, the query

fails with the exception java.io.IOException: org.apache.avro.AvroRuntimeException: java.io.IOException: Block size invalid or too large for this implementation: -40
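That exception usually means the Avro reader is not looking at a valid Avro object container file: it interprets arbitrary bytes as a block count or block size and gets a nonsense value such as -40. A quick stdlib-only check (a sketch; the file name below is hypothetical) can confirm whether the files Flume wrote actually begin with the Avro container magic:

```python
# Avro object container files start with the 4-byte magic: "Obj" followed by 0x01.
AVRO_MAGIC = b"Obj\x01"

def is_avro_container(first_bytes: bytes) -> bool:
    """Return True if the data begins like an Avro object container file."""
    return first_bytes[:4] == AVRO_MAGIC

# Usage sketch: copy one file out of HDFS first (path/name hypothetical), e.g.
#   hdfs dfs -get /user/directory/ingest_date=.../file_prefix.XXX.avro .
# then:
#   with open("file_prefix.XXX.avro", "rb") as f:
#       print(is_avro_container(f.read(4)))
print(is_avro_container(b"Obj\x01\x02rest"))       # container file: True
print(is_avro_container(b"\x00\x00\x00\x00\x01"))  # raw non-container bytes: False
```

If the check returns False, Hive's Avro reader has no container header or sync markers to work from, which matches the "Block size invalid" failure above.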

I have also tried setting the following properties:

flume1.sinks.hdfs-sink-1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
flume1.sinks.hdfs-sink-1.deserializer.schemaType = LITERAL
flume1.sinks.hdfs-sink-1.serializer.schemaURL = file:///schemadirectory

PS: I am using Kafka Connect to push the data to the Kafka topic.
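Because the data comes through Kafka Connect, it is worth inspecting what the message bodies actually contain. If Connect was configured with Confluent's AvroConverter, each Kafka message body is in the Confluent wire format rather than an Avro container file: one 0x00 magic byte, a 4-byte big-endian schema ID, then the bare Avro-encoded datum. A minimal sketch, assuming that format (the example bytes are made up):

```python
import struct

CONFLUENT_MAGIC = 0  # first byte of a Confluent wire-format message

def parse_wire_format(message: bytes):
    """Split a Confluent wire-format message into (schema_id, datum_bytes).

    Assumed layout (Confluent AvroConverter):
      byte 0      magic byte, always 0x00
      bytes 1-4   schema ID, big-endian unsigned int
      bytes 5...  Avro binary-encoded datum (NOT an Avro container file)
    """
    if len(message) < 5 or message[0] != CONFLUENT_MAGIC:
        raise ValueError("not a Confluent wire-format message")
    (schema_id,) = struct.unpack(">I", message[1:5])
    return schema_id, message[5:]

# Hypothetical example: magic 0x00, schema ID 42, then datum bytes.
schema_id, datum = parse_wire_format(b"\x00\x00\x00\x00\x2a" + b"datum-bytes")
print(schema_id, datum)
```

If the messages look like this, then the HDFS sink is writing these raw wire-format bytes straight into the .avro files, and no Avro container header ever gets written.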

Tags: apache-kafka, avro, flume

Solution
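A likely cause, hedged: with hdfs.fileType = DataStream the HDFS sink writes the raw event bodies to HDFS, and the .avro file suffix does not turn that output into a valid Avro object container file. (hdfs.writeFormat = Text applies to SequenceFile records and, to my understanding, has no effect with DataStream.) If the bodies are Confluent wire-format messages or bare Avro datums, Hive's Avro reader misreads the bytes as block counts and fails with "Block size invalid or too large"; a limit 1 query can appear to work if the reader only breaks once it tries to advance past the first bogus block. One commonly suggested direction is to let the sink write real container files through Flume's AvroEventSerializer, which reads the schema from the flume.avro.schema.literal or flume.avro.schema.url event header. A config sketch under that assumption (the deserializer.schemaType key tried in the question does not match this mechanism; the events' headers would have to carry the schema reference):

```properties
# Sketch only: write Avro container files instead of raw event bodies.
flume1.sinks.hdfs-sink-1.hdfs.fileType = DataStream
flume1.sinks.hdfs-sink-1.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
# Drop hdfs.writeFormat = Text: it is a SequenceFile setting.
# Each event must carry its schema in the flume.avro.schema.url
# (or flume.avro.schema.literal) header for this serializer to work.
```

If the upstream producer emits Confluent wire-format messages, the bodies would additionally need the 5-byte wire-format header stripped (or the pipeline changed to carry plain Avro datums) before this serializer can frame them into container blocks.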

