Reading a BigQuery table into a Spark RDD on GCP Dataproc: why are classes missing for use in newAPIHadoopRDD?

Problem description

A little over a week ago I was able to read a BigQuery table into an RDD for a Spark job on a Dataproc cluster, using the guide at https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example as a template. Since then I have been hitting missing-class errors, even though the guide itself has not changed.

I have tried to track down the missing class, com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList, but I cannot find any information on whether that class is now excluded from gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar.
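
One way to check is to download the jar and list its entries. A minimal sketch in Python, assuming the jar has first been copied locally with gsutil cp gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar . (the file name below refers to that local copy):

# Hypothetical local check: list entries in the connector jar whose path
# mentions ImmutableList, to see whether the repackaged class is present.
import zipfile

jar_path = 'bigquery-connector-hadoop2-latest.jar'  # local copy of the jar

with zipfile.ZipFile(jar_path) as jar:
    hits = [name for name in jar.namelist() if 'ImmutableList' in name]

print('\n'.join(hits) or 'no ImmutableList entries found')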

The job submission request is as follows:

gcloud dataproc jobs submit pyspark \
--cluster $CLUSTER_NAME \
--jars gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
--bucket gs://$BUCKET_NAME \
--region europe-west2 \
--py-files $PYSPARK_PATH/main.py

The PySpark code breaks at the following point:

bq_table_rdd = spark_context.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)

where conf is a Python dict with the following structure:

conf = {
    'mapred.bq.project.id': project_id,
    'mapred.bq.gcs.bucket': gcs_staging_bucket,
    'mapred.bq.temp.gcs.path': input_staging_path,
    'mapred.bq.input.project.id': bq_input_project_id,
    'mapred.bq.input.dataset.id': bq_input_dataset_id,
    'mapred.bq.input.table.id': bq_input_table_id,
}
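
For reference, the tutorial this is based on consumes the resulting RDD by dropping the keys (record offsets) and parsing each value as a JSON string. A short sketch, assuming bq_table_rdd from the snippet above:

import json

# Keys are record offsets; values are table rows serialized as JSON strings.
rows = bq_table_rdd.values().map(json.loads)
print(rows.take(5))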

When my output shows that execution has reached the spark_context.newAPIHadoopRDD call above, the following is printed to stdout:

class com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.DefaultPlatform: cannot cast result of calling 'com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance' to 'com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory': java.lang.ClassCastException: Cannot cast com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory to com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory

Traceback (most recent call last):
  File "/tmp/0af805a2dd104e46b087037f0790691f/main.py", line 31, in <module>
    sc)
  File "/tmp/0af805a2dd104e46b087037f0790691f/extract.py", line 65, in bq_table_to_rdd
    conf=conf)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 749, in newAPIHadoopRDD
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList

As recently as last week this was not a problem, and I am concerned that even the hello-world example on the GCP site is unstable in the short term. If anyone can shed some light on this issue, it would be greatly appreciated. Thanks.

Tags: pyspark, google-bigquery, google-cloud-dataproc

Solution

I reproduced the problem:

$ gcloud dataproc clusters create test-cluster --image-version=1.4

$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
    --cluster test-cluster \
    --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
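
The wordcount_bq.py used for the reproduction is not shown in the answer; a hypothetical minimal version, modeled on the Dataproc connector tutorial (the staging path is a placeholder, and the fs.gs.* Hadoop properties are assumed to be set on the cluster), might look like this:

# wordcount_bq.py -- minimal reproduction sketch: count words in the public
# BigQuery shakespeare sample table via the Hadoop BigQuery connector.
import json
from pyspark import SparkContext

sc = SparkContext()

bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')

conf = {
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket),
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}

table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)

# Each record is (offset, json_string); parse the value and count words.
word_counts = (table_data
               .map(lambda record: json.loads(record[1]))
               .map(lambda row: (row['word'].lower(), int(row['word_count'])))
               .reduceByKey(lambda a, b: a + b))

print(word_counts.take(10))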

Exactly the same error occurred:

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList

I noticed that a new version, 1.0.0, had been released on August 23:

$ gsutil ls -l gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-**
   ...
   4038762  2018-10-03T20:59:35Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.8.jar
   4040566  2018-10-19T23:32:19Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar
  14104522  2019-06-28T21:08:57Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC1.jar
  14104520  2019-07-01T20:38:18Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC2.jar
  14149215  2019-08-23T21:08:03Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0.jar
  14149215  2019-08-24T00:27:49Z  gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar

I then tried version 0.13.9, and it worked:

$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
    --cluster test-cluster \
    --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar

This is a problem with 1.0.0, and there is already an issue for it on GitHub. We will fix it and improve testing. In the meantime, pinning an explicit connector version (such as the 0.13.9 jar above) instead of -latest avoids picking up a breaking release.

