spark-redis exception: Caused by: redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out

Problem description

I am trying to insert data into Redis (Azure Cache for Redis) through Spark. There are roughly 700 million rows, and I am using the spark-redis connector to write them. The job fails after running for a while: some rows are inserted successfully, but eventually some tasks start failing with the error below. I am running everything from a Jupyter notebook.

Caused by: redis.clients.jedis.exceptions.JedisConnectionException: java.net.SocketTimeoutException: Read timed out
    at redis.clients.jedis.util.RedisInputStream.ensureFill(RedisInputStream.java:205)
    at redis.clients.jedis.util.RedisInputStream.readByte(RedisInputStream.java:43)
    at redis.clients.jedis.Protocol.process(Protocol.java:155)
    at redis.clients.jedis.Protocol.read(Protocol.java:220)
    at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:318)
    at redis.clients.jedis.Connection.getStatusCodeReply(Connection.java:236)
    at redis.clients.jedis.BinaryJedis.auth(BinaryJedis.java:2259)
    at redis.clients.jedis.JedisFactory.makeObject(JedisFactory.java:119)
    at org.apache.commons.pool2.impl.GenericObjectPool.create(GenericObjectPool.java:819)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:429)
    at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:360)
    at redis.clients.jedis.util.Pool.getResource(Pool.java:50)
    ... 27 more
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:127)
    at redis.clients.jedis.util.RedisInputStream.ensureFill(RedisInputStream.java:199)
    ... 38 more

This is how I am trying to write the data:

df.write
      .format("org.apache.spark.sql.redis")
      .option("host", REDIS_URL)
      .option("port", 6379)
      .option("auth", <PWD>)
      .option("timeout", 20000)
      .option("table", "testrediskeys")
      .option("key.column", "dummy")
      .mode("overwrite")
      .save()
Spark: 3.0
Scala: 2.12
spark-redis: com.redislabs:spark-redis_2.12:2.6.0
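
For reference, a minimal sketch of pulling that connector onto the classpath from a notebook, assuming no Spark session exists yet (spark.jars.packages resolves the coordinate from Maven Central; the app name is hypothetical, and this setup detail is assumed rather than taken from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
      .appName("spark-redis-load")  // hypothetical app name
      .config("spark.jars.packages", "com.redislabs:spark-redis_2.12:2.6.0")
      .getOrCreate()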

Tags: apache-spark, redis, spark-redis

Solution

I ran into the same problem, and the following configuration on my Spark session helped:

val spark = SparkSession.builder()
      .appName("My-lovely-app")
      .master(options.masterSpec)
      .config("spark.redis.host", redisHost)
      .config("spark.redis.port", redisPort)
      .config("spark.redis.auth", redisPass)
      .config("spark.redis.timeout", redisSparkTimeout)   // connector timeout, in ms
      .config("redis.timeout", redisTimeout)              // client-level timeout, in ms
      .config("spark.redis.max.pipeline.size", redisSparkMaxPipelineSize) // commands per pipeline batch
      .getOrCreate()

So you need to increase spark.redis.timeout and redis.timeout to larger values. Setting both to 3600000 ms (1 hour) allowed me to load more than 500 million lists into my Redis cluster. For a bulk load at your scale, it also helps to increase spark.redis.max.pipeline.size.
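
Putting both pieces together, here is a sketch of what the whole load could look like with the connection settings moved onto the session (REDIS_URL, REDIS_PASSWORD and df stand in for the question's values; the 1-hour timeouts come from this answer, while the pipeline size of 10000 is an assumed starting point to tune, not a prescribed value):

import org.apache.spark.sql.SparkSession

val REDIS_URL = "your-cache.redis.cache.windows.net"  // placeholder host
val REDIS_PASSWORD = sys.env("REDIS_PASSWORD")        // placeholder: secret from the environment

val spark = SparkSession.builder()
      .appName("redis-bulk-load")
      .config("spark.redis.host", REDIS_URL)
      .config("spark.redis.port", "6379")
      .config("spark.redis.auth", REDIS_PASSWORD)
      .config("spark.redis.timeout", "3600000")          // 1 hour, as suggested above
      .config("redis.timeout", "3600000")                // 1 hour, as suggested above
      .config("spark.redis.max.pipeline.size", "10000")  // assumed value; larger pipelines mean fewer round trips
      .getOrCreate()

// df: the DataFrame from the question. With the connection settings on
// the session, the write itself stays simple:
df.write
      .format("org.apache.spark.sql.redis")
      .option("table", "testrediskeys")
      .option("key.column", "dummy")
      .mode("overwrite")
      .save()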

