Problem connecting PySpark to a Cassandra database

Problem description

I'm having trouble connecting PySpark to a Cassandra DB.

I'm currently trying to use a SparkConf object:

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

pySparkConf = SparkConf().setAll([
    ['packages', 'com.datastax.spark:spark-cassandra-connector_2.12:3.0.0'],
    ['spark.cassandra.connection.host', '127.0.0.1'],
    ['spark.cassandra.connection.port', '9042']
])

sc = SparkContext(conf=pySparkConf)
sqlContext = SQLContext(sc)

sqlContext.read.format("org.apache.spark.sql.cassandra").options(
    table="emails", keyspace="lambda").load().show()

This throws:

java.net.BindException: Can't assign requested address: Service 'sparkDriver' failed after 16 retries (on a random free port)! Consider explicitly setting the appropriate binding address for the service 'sparkDriver' (for example spark.driver.bindAddress for SparkDriver) to the correct binding address.
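The message itself points to the usual workaround: explicitly tell the driver which address to bind. A minimal sketch, assuming everything runs on a single machine (SPARK_LOCAL_IP and spark.driver.bindAddress are standard Spark settings; 127.0.0.1 mirrors the local Cassandra host above):

import os
from pyspark import SparkConf

# Option 1: pin Spark to loopback before the JVM starts.
os.environ['SPARK_LOCAL_IP'] = '127.0.0.1'

# Option 2: the setting the error message names.
pySparkConf = SparkConf() \
    .set('spark.driver.bindAddress', '127.0.0.1')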

When the configuration is passed through os.environ['PYSPARK_SUBMIT_ARGS'], on the other hand, it works:

import os
from pyspark import SparkContext
from pyspark.sql import SQLContext

# PYSPARK_SUBMIT_ARGS must be set before the SparkContext is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.0 --conf spark.cassandra.connection.host=127.0.0.1 --conf spark.cassandra.connection.port=9042 pyspark-shell'

sc = SparkContext("local", "PySpark email processing")
sqlContext = SQLContext(sc)

sqlContext.read.format("org.apache.spark.sql.cassandra").options(
    table="emails", keyspace="lambda").load().show()
# Displays the table as expected

Packages installed in the virtual environment:

cassandra-driver==3.24.0
py4j==0.10.9
pyspark==3.0.1

So how can I connect to the DB using a SparkConf or SparkSession.builder object?

Tags: apache-spark, pyspark, cassandra

Solution
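A sketch of the SparkSession.builder route, not a verified fix: it assumes a single local machine and the Spark 3.0.x / Scala 2.12 connector versions from the question. Two things differ from the failing SparkConf snippet: the dependency is declared under the spark.jars.packages key ('packages' is not a SparkConf key; --packages is the spark-submit flag the working PYSPARK_SUBMIT_ARGS snippet relied on), and the driver bind address is pinned as the BindException advises:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .appName('PySpark email processing') \
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.12:3.0.0') \
    .config('spark.cassandra.connection.host', '127.0.0.1') \
    .config('spark.cassandra.connection.port', '9042') \
    .config('spark.driver.bindAddress', '127.0.0.1') \
    .getOrCreate()

# Same read as in the question; SparkSession supersedes SQLContext.
spark.read.format('org.apache.spark.sql.cassandra') \
    .options(table='emails', keyspace='lambda') \
    .load() \
    .show()

Note that spark.jars.packages resolves the connector from Maven Central on first run, so the machine needs network access at session startup.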

