Error when saving data from PySpark to HBase

Problem description

I am trying to write a Spark DataFrame to HBase using PySpark. I uploaded the Spark HBase dependency, and I am running the code from a Jupyter notebook. I also created a table in HBase, in the default namespace.

I started pyspark with the following command. My Spark version is 3.x and my HBase version is hbase-2.2.6.

pyspark --packages com.hortonworks:shc:1.0.0-1.6-s_2.10 --repositories http://repo.hortonworks.com/content/groups/public/ --files /home/vijee/hbase-2.2.6-bin/conf/hbase-site.xml

The dependencies were added successfully.

df = sc.parallelize([('a', 'def'), ('b', 'abc')]).toDF(schema=['col0', 'col1'])

catalog = ''.join("""{
    "table":{"namespace":"default", "name":"smTable"},
    "rowkey":"c1",
    "columns":{
        "col0":{"cf":"rowkey", "col":"c1", "type":"string"},
        "col1":{"cf":"t1", "col":"c2", "type":"string"}
    }
}""".split())

df.write.options(catalog=catalog).format('org.apache.spark.sql.execution.datasources.hbase').save()

When I run the statement above, I get the following error. Since I am new to this, I cannot make sense of it.

At first I tried with my own CSV file and hit the same ":java.lang.AbstractMethodError". Now, using this sample data, I still get the same error.

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-9-cfcf107b1f03> in <module>
----> 1 df.write.options(catalog=catalog,newtable=5).format('org.apache.spark.sql.execution.datasources.hbase').save()

~/spark-3.0.1-bin-hadoop2.7/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    823             self.format(format)
    824         if path is None:
--> 825             self._jwrite.save()
    826         else:
    827             self._jwrite.save(path)

~/spark-3.0.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302 
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306 

~/spark-3.0.1-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
    126     def deco(*a, **kw):
    127         try:
--> 128             return f(*a, **kw)
    129         except py4j.protocol.Py4JJavaError as e:
    130             converted = convert_exception(e.java_exception)

~/spark-3.0.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o114.save.
: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)Lorg/apache/spark/sql/sources/BaseRelation;
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)

Tags: apache-spark, pyspark, hbase

Solution
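A `java.lang.AbstractMethodError` at `createRelation` almost always signals a binary incompatibility between the connector and the running Spark. The artifact in the launch command, `com.hortonworks:shc:1.0.0-1.6-s_2.10`, is built for Spark 1.6 and Scala 2.10, while the traceback shows the driver is Spark 3.0.1 (Scala 2.12); the DataSource API's `createRelation` signature changed between those versions, so the old compiled class no longer implements the interface Spark 3 expects. One way forward, sketched here rather than verified against this exact cluster, is to launch with a connector compiled for Spark 3.x / Scala 2.12, such as the `hbase-spark` module from the Apache hbase-connectors project:

```shell
# Launch PySpark with an HBase connector compiled for Spark 3.x / Scala 2.12.
# The exact hbase-spark version below is an assumption -- check Maven Central
# under org.apache.hbase.connectors.spark for a build matching your Spark,
# Scala, and HBase 2.2.6 versions; you may need to build it from source.
pyspark \
  --packages org.apache.hbase.connectors.spark:hbase-spark:1.0.1 \
  --files /home/vijee/hbase-2.2.6-bin/conf/hbase-site.xml
```

Note that this connector uses the `org.apache.hadoop.hbase.spark` data source format rather than `org.apache.spark.sql.execution.datasources.hbase`, so the write call has to change accordingly.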

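Separately, and regardless of which connector is used, the catalog string is easier to get right if it is built with `json.dumps` instead of the `''.join(...split())` idiom, which strips every whitespace character, including any that would appear inside quoted values:

```python
import json

# Build the SHC catalog as a Python dict and serialize it.
# json.dumps guarantees valid JSON and leaves quoted values intact,
# unlike joining on split(), which deletes all whitespace.
catalog = json.dumps({
    "table": {"namespace": "default", "name": "smTable"},
    "rowkey": "c1",
    "columns": {
        "col0": {"cf": "rowkey", "col": "c1", "type": "string"},
        "col1": {"cf": "t1", "col": "c2", "type": "string"},
    },
})
```

The resulting string can be passed to `df.write.options(catalog=catalog)` exactly as before.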
