How do I correctly set the variables for a PySpark - Snowflake connection?

Problem Description

I am following the documentation and trying to run the simple script found here: https://docs.snowflake.com/en/user-guide/spark-connector-use.html

Py4JJavaError: An error occurred while calling o37.load.
: java.lang.ClassNotFoundException: Failed to find data source: net.snowflake.spark.snowflake.

My code is below. I also tried setting the config option with the paths to the JDBC and spark-snowflake jars located in the /Users/Hana/spark-sf/ directory, but with no luck.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext  # SparkSession must be imported for the builder below
from pyspark.sql.types import *

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config('spark.jars','/Users/Hana/spark-sf/snowflake-jdbc-3.12.9.jar,/Users/Hana/spark-sf/spark-snowflake_2.12-2.8.1-spark_3.0.jar') \
    .getOrCreate()

# Set options below
sfOptions = {
  "sfURL" : "<account_name>.snowflakecomputing.com",
  "sfUser" : "<user_name>",
  "sfPassword" : "<password>",
  "sfDatabase" : "<database>",
  "sfSchema" : "<schema>",
  "sfWarehouse" : "<warehouse>"
}

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"


df = spark.read.format(SNOWFLAKE_SOURCE_NAME) \
  .options(**sfOptions) \
  .option("query",  "select * from table limit 200") \
  .load()

df.show()

How should I correctly set these variables, and which ones need to be set? I would appreciate it if someone could list out the steps!

Tags: scala, apache-spark, pyspark, snowflake-cloud-data-platform

Solution


Can you try setting the format to just "snowflake"?

So your DataFrame read would be:

df = spark.read.format("snowflake") \
  .options(**sfOptions) \
  .option("query",  "select * from table limit 200") \
  .load()

Or set the SNOWFLAKE_SOURCE_NAME variable to:

SNOWFLAKE_SOURCE_NAME = "snowflake"
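
Separately, the underlying ClassNotFoundException usually means the connector jar never actually made it onto the classpath, often because the Scala/Spark versions baked into the jar name do not match the cluster. As an alternative to listing local jar files in spark.jars, Spark can resolve the connector and its dependencies from Maven via the spark.jars.packages config. A minimal sketch, assuming the 2.12 / 2.8.1 / 3.0 versions from the question (the snowflake_connector_coordinate helper is illustrative, not part of any library API):

```python
# Illustrative helper: build the Maven coordinate for the Spark-Snowflake
# connector so the Scala and Spark versions match your cluster.
def snowflake_connector_coordinate(scala_version, connector_version, spark_version):
    # e.g. net.snowflake:spark-snowflake_2.12:2.8.1-spark_3.0
    return (f"net.snowflake:spark-snowflake_{scala_version}:"
            f"{connector_version}-spark_{spark_version}")

# Comma-separated list accepted by spark.jars.packages
packages = ",".join([
    snowflake_connector_coordinate("2.12", "2.8.1", "3.0"),
    "net.snowflake:snowflake-jdbc:3.12.9",
])

# Then build the session with it instead of local jar paths, e.g.:
#   spark = SparkSession.builder \
#       .appName("Python Spark SQL basic example") \
#       .config("spark.jars.packages", packages) \
#       .getOrCreate()
```

With spark.jars.packages, Spark downloads the jars at session startup, which avoids stale or version-mismatched local files.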
