Google Cloud Data Fusion XML parsing - 'parse-xml-to-json': Mismatched close tag note at 6

Problem description

I am new to Google Cloud Data Fusion. I was able to process a CSV file and load it into BigQuery successfully. My requirement is to process an XML file and load it into BigQuery. To try this out, I used a very simple XML file.

XML file:

<?xml version="1.0" encoding="UTF-8"?> <note> <to>Tove</to <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
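The failure can be reproduced outside Data Fusion: any strict XML parser rejects the missing `>` in `</to`. A minimal Python sketch of the same well-formedness check (illustrative only, not the Wrangler code itself):

```python
# The close tag `</to` is missing its `>`, so strict XML parsers
# reject the document -- the same well-formedness check the
# org.json parser behind 'parse-xml-to-json' performs.
import xml.etree.ElementTree as ET

broken = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<note><to>Tove</to<from>Jani</from>"
    "<heading>Reminder</heading>"
    "<body>Don't forget me this weekend!</body></note>"
)

try:
    ET.fromstring(broken)
except ET.ParseError as err:
    print("not well-formed:", err)
```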

Error message 1

java.lang.Exception: Stage:Wrangler - Reached error threshold 1, terminating processing due to error : Error encountered while executing 'parse-xml-to-json' : Mismatched close tag note at 6 [character 7 line 1]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:404) ~[1601903767453-0/:na]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:83) ~[1601903767453-0/:na]
at io.cdap.cdap.etl.common.plugin.WrappedTransform.lambda$transform$5(WrappedTransform.java:90) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.plugin.Caller$1.call(Caller.java:30) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.plugin.StageLoggingCaller.call(StageLoggingCaller.java:40) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.plugin.WrappedTransform.transform(WrappedTransform.java:89) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.common.TrackedTransform.transform(TrackedTransform.java:74) ~[cdap-etl-core-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.function.TransformFunction.call(TransformFunction.java:50) ~[hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.Compat$FlatMapAdapter.call(Compat.java:126) ~[hydrator-spark-core2_2.11-6.2.0.jar:na]
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$1$1.apply(JavaRDDLike.scala:125) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.Task.run(Task.scala:109) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) [spark-core_2.11-2.3.3.jar:2.3.3]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]

Caused by: io.cdap.wrangler.api.RecipeException: Error encountered while executing 'parse-xml-to-json' : Mismatched close tag note at 6 [character 7 line 1]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:149) ~[wrangler-core-4.2.0.jar:na]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:97) ~[wrangler-core-4.2.0.jar:na]
at io.cdap.wrangler.Wrangler.transform(Wrangler.java:376) ~[1601903767453-0/:na]
... 26 common frames omitted
Caused by: io.cdap.wrangler.api.DirectiveExecutionException: Error encountered while executing 'parse-xml-to-json' : Mismatched close tag note at 6 [character 7 line 1]
at io.cdap.directives.xml.XmlToJson.execute(XmlToJson.java:106) ~[na:na]
at io.cdap.directives.xml.XmlToJson.execute(XmlToJson.java:49) ~[na:na]
at io.cdap.wrangler.executor.RecipePipelineExecutor.execute(RecipePipelineExecutor.java:129) ~[wrangler-core-4.2.0.jar:na]
... 28 common frames omitted
Caused by: org.json.JSONException: Mismatched close tag note at 6 [character 7 line 1]
at org.json.JSONTokener.syntaxError(JSONTokener.java:505) ~[org.json.json-20090211.jar:na]
at org.json.XML.parse(XML.java:311) ~[org.json.json-20090211.jar:na]
at org.json.XML.toJSONObject(XML.java:520) ~[org.json.json-20090211.jar:na]
at org.json.XML.toJSONObject(XML.java:548) ~[org.json.json-20090211.jar:na]
at org.json.XML.toJSONObject(XML.java:472) ~[org.json.json-20090211.jar:na]
at io.cdap.directives.xml.XmlToJson.execute(XmlToJson.java:96) ~[na:na]
... 30 common frames omitted

Error message 2:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): UnknownReason

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1661) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1649) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1648) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) ~[scala-library-2.11.8.jar:na]
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1648) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at scala.Option.foreach(Option.scala:257) ~[scala-library-2.11.8.jar:na]
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1882) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1831) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1820) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034) ~[na:2.3.3]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055) ~[na:2.3.3]
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087) ~[na:2.3.3]
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78) ~[spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1081) [spark-core_2.11-2.3.3.jar:2.3.3]
at org.apache.spark.api.java.JavaPairRDD.saveAsNewAPIHadoopDataset(JavaPairRDD.scala:831) [spark-core_2.11-2.3.3.jar:2.3.3]
at io.cdap.cdap.etl.spark.batch.SparkBatchSinkFactory.writeFromRDD(SparkBatchSinkFactory.java:98) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.batch.RDDCollection$1.run(RDDCollection.java:179) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.SparkPipelineRunner.runPipeline(SparkPipelineRunner.java:350) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.etl.spark.batch.BatchSparkPipelineDriver.run(BatchSparkPipelineDriver.java:148) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkTransactional$2.run(SparkTransactional.java:236) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkTransactional.execute(SparkTransactional.java:208) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkTransactional.execute(SparkTransactional.java:138) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.AbstractSparkExecutionContext...
at io.cdap.cdap.app.runtime.spark.SerializableSparkExecutionContext.execute(SerializableSparkExecutionContext.scala:61) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.DefaultJavaSparkExecutionContext.execute(DefaultJavaSparkExecutionContext.scala:89) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.api.Transactionals.execute(Transactionals.java:63) [na:na]
at io.cdap.cdap.etl.spark.batch.BatchSparkPipelineDriver.run(BatchSparkPipelineDriver.java:116) [hydrator-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkMainWrapper$.main(SparkMainWrapper.scala:86) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.SparkMainWrapper.main(SparkMainWrapper.scala) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:1.8.0_252]
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[na:1.8.0_252]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[na:1.8.0_252]
at java.lang.reflect.Method.invoke(Method.java:498) ~[na:1.8.0_252]
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:56) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137) [na:2.3.3]
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) [spark-core_2.11-2.3.3.jar:2.3.3]
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkSubmitter.submit(AbstractSparkSubmitter.java:172) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkSubmitter.access$000(AbstractSparkSubmitter.java:54) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at io.cdap.cdap.app.runtime.spark.submit.AbstractSparkSubmitter$5.run(AbstractSparkSubmitter.java:111) [io.cdap.cdap.cdap-spark-core2_2.11-6.2.0.jar:na]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_252]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_252]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_252]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]

Tags: google-cloud-platform, google-cloud-data-fusion, parse, xml

Solution


It looks like your XML is incorrect. Try using the XML below:

<?xml version="1.0" encoding="UTF-8"?> <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
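With every tag properly closed, the document parses. As a rough sketch of what the parse-xml-to-json directive produces for this record (the real directive uses org.json's XML.toJSONObject; this standard-library version is only illustrative):

```python
# Parse the corrected XML and flatten it into a JSON object,
# roughly mirroring the output of 'parse-xml-to-json' for this record.
import json
import xml.etree.ElementTree as ET

fixed = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    "<note><to>Tove</to><from>Jani</from>"
    "<heading>Reminder</heading>"
    "<body>Don't forget me this weekend!</body></note>"
)

root = ET.fromstring(fixed)  # succeeds: every tag is closed
note = {child.tag: child.text for child in root}
print(json.dumps({root.tag: note}))
# → {"note": {"to": "Tove", "from": "Jani", "heading": "Reminder", "body": "Don't forget me this weekend!"}}
```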
