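scala - How to read an sgm file using Scala
Problem description
I want to play with the 1987 Reuters dataset using Scala and possibly Spark. The files I downloaded are in .sgm format. I have never seen this format before, so I ran more on one of them: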
$ more reut2-003.sgm
<!DOCTYPE lewis SYSTEM "lewis.dtd">
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="19419" NEWID="3001">
<DATE> 9-MAR-1987 04:58:41.12</DATE>
<TOPICS><D>money-fx</D></TOPICS>
<PLACES><D>uk</D></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>
RM
f0416reute
b f BC-U.K.-MONEY-MARKET-SHO 03-09 0095</UNKNOWN>
<TEXT>
<TITLE>U.K. MONEY MARKET SHORTAGE FORECAST AT 250 MLN STG</TITLE>
<DATELINE> LONDON, March 9 - </DATELINE><BODY>The Bank of England said it forecast a
shortage of around 250 mln stg in the money market today.
Among the factors affecting liquidity, it said bills
maturing in official hands and the treasury bill take-up would
drain around 1.02 billion stg while below target bankers'
balances would take out a further 140 mln.
Against this, a fall in the note circulation would add 345
mln stg and the net effect of exchequer transactions would be
an inflow of some 545 mln stg, the Bank added.
REUTER
</BODY></TEXT>
</REUTERS>
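As we can see, it looks like fairly simple markup.
Since I would rather not write my own parser, my question is: is there a simple way to parse this in Scala/Spark using an existing library?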
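Solution
Q: Since I would rather not write my own parser, my question is: is there a simple way to parse this in Scala/Spark using an existing library?
AFAIK there is no such API. You have to map over the content and parse it yourself (cleaning out the special characters in it), then convert the result into multiple columns.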
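The map-and-parse route described above can be sketched in plain Scala with regular expressions. This is only a minimal sketch under assumptions: `ReutersDoc`, `SgmParser`, and the selected fields are hypothetical names chosen for illustration, the regexes rely on the tags being flat and balanced as in the sample record, and any entities or character references inside the text are left untouched.

```scala
import scala.util.matching.Regex

// Hypothetical record type; the field names are illustrative, not a standard schema.
final case class ReutersDoc(
  newId: String,
  date: String,
  topics: List[String],
  title: String,
  body: String
)

object SgmParser {
  // (?s) lets '.' match newlines; the lazy quantifier stops at the first closing tag.
  private val docRe: Regex = """(?s)<REUTERS([^>]*)>(.*?)</REUTERS>""".r
  private val itemRe: Regex = """(?s)<D>(.*?)</D>""".r

  // Value of an attribute such as NEWID="3001", or "" if it is missing.
  private def attr(attrs: String, name: String): String =
    (name + "=\"([^\"]*)\"").r.findFirstMatchIn(attrs).map(_.group(1)).getOrElse("")

  // Trimmed text of the first <name>...</name> pair, or "" if it is missing.
  private def tag(content: String, name: String): String =
    s"(?s)<$name>(.*?)</$name>".r.findFirstMatchIn(content).map(_.group(1).trim).getOrElse("")

  // Splits the text of a whole .sgm file into one ReutersDoc per <REUTERS> element.
  def parse(sgm: String): List[ReutersDoc] =
    docRe.findAllMatchIn(sgm).map { m =>
      val attrs = m.group(1)
      val content = m.group(2)
      ReutersDoc(
        newId = attr(attrs, "NEWID"),
        date = tag(content, "DATE"),
        topics = itemRe.findAllMatchIn(tag(content, "TOPICS")).map(_.group(1)).toList,
        title = tag(content, "TITLE"),
        body = tag(content, "BODY")
      )
    }.toList
}
```

Calling `SgmParser.parse` on the text of a file such as reut2-003.sgm should give one `ReutersDoc` per article. The DOCTYPE line is simply ignored because it falls outside every `<REUTERS>` element.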
I tried it with spark-xml, as shown in the program below, but your XML shows up as a corrupt record in the DataFrame.
Further pointer: https://github.com/databricks/spark-xml
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.SparkSession

/**
 * Created by Ram Ghadiyaram
 */
object SparkXmlWithDtd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName(this.getClass.getName)
      .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // One sample record from reut2-003.sgm, including the DOCTYPE line.
    val str =
      """
|<!DOCTYPE lewis SYSTEM "lewis.dtd">
|
|<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="19419" NEWID="3001">
|<DATE> 9-MAR-1987 04:58:41.12</DATE>
|<TOPICS><D>money-fx</D></TOPICS>
|<PLACES><D>uk</D></PLACES>
|<PEOPLE></PEOPLE>
|<ORGS></ORGS>
|<EXCHANGES></EXCHANGES>
|<COMPANIES></COMPANIES>
|<UNKNOWN>
|RM
|f0416reute
|b f BC-U.K.-MONEY-MARKET-SHO 03-09 0095</UNKNOWN>
|<TEXT>
|<TITLE>U.K. MONEY MARKET SHORTAGE FORECAST AT 250 MLN STG</TITLE>
|<DATELINE> LONDON, March 9 - </DATELINE><BODY>The Bank of England said it forecast a
|shortage of around 250 mln stg in the money market today.
| Among the factors affecting liquidity, it said bills
|maturing in official hands and the treasury bill take-up would
|drain around 1.02 billion stg while below target bankers'
|balances would take out a further 140 mln.
| Against this, a fall in the note circulation would add 345
|mln stg and the net effect of exchequer transactions would be
|an inflow of some 545 mln stg, the Bank added.
| REUTER
|</BODY></TEXT>
|</REUTERS>
""".stripMargin

    // Write the sample to a local file so that spark-xml can load it.
    val f = new File("sgmtest.sgm")
    FileUtils.writeStringToFile(f, str, "UTF-8")

    // Treat each <REUTERS> element as one row.
    val xml_df = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "REUTERS")
      .load(f.getAbsolutePath)

    xml_df.printSchema()
    xml_df.createOrReplaceTempView("XML_DATA")
    spark.sql("SELECT * FROM XML_DATA").show(false)
    xml_df.show(false)
  }
}
Result:
root
 |-- _corrupt_record: string (nullable = true)

+--------------------------------------------------+
|_corrupt_record                                   |
+--------------------------------------------------+
| 9-MAR-1987 04:58:41.12 money-fx uk RM f0416reute b f BC-U.K.-MONEY-MARKET-SHO 03-09 0095 U.K. MONEY MARKET SHORTAGE FORECAST AT 250 MLN STG LONDON, March 9 - The Bank of England said it forecast a shortage of around 250 mln stg in the money market today. Among the factors affecting liquidity, it said bills maturing in official hands and the treasury bill take-up would drain around 1.02 billion stg while below target bankers' balances would take out a further 140 mln. Against this, a fall in the note circulation would add 345 mln stg and the net effect of exchequer transactions would be an inflow of some 545 mln stg, the Bank added. REUTER |
+--------------------------------------------------+

(The same table is printed twice, once by spark.sql("SELECT * FROM XML_DATA").show(false) and once by xml_df.show(false).)
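If the goal is a DataFrame with multiple columns, one possible workaround is to skip the XML reader, load each file as a single string with `wholeTextFiles`, and split it into records by hand. The following is a sketch under assumptions, not a tested pipeline: `Article`, `ReutersToDataFrame`, the column names, and the default path `reuters21578/*.sgm` are all hypothetical, and the regexes assume every `<REUTERS>` tag carries a numeric `NEWID` attribute as in the sample record.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical row type; the column names are illustrative.
final case class Article(newId: String, topics: Seq[String], title: String, body: String)

object ReutersToDataFrame {
  // Trimmed text of the first <name>...</name> pair, or "" if it is missing.
  def tag(s: String, name: String): String =
    s"(?s)<$name>(.*?)</$name>".r.findFirstMatchIn(s).map(_.group(1).trim).getOrElse("")

  // Splits the text of one whole .sgm file into its <REUTERS> records.
  def parse(sgm: String): List[Article] =
    """(?s)<REUTERS[^>]*NEWID="(\d+)"[^>]*>(.*?)</REUTERS>""".r.findAllMatchIn(sgm).map { m =>
      val content = m.group(2)
      Article(
        newId = m.group(1),
        topics = """<D>(.*?)</D>""".r.findAllMatchIn(tag(content, "TOPICS")).map(_.group(1)).toList,
        title = tag(content, "TITLE"),
        body = tag(content, "BODY")
      )
    }.toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("ReutersToDataFrame")
      .getOrCreate()
    import spark.implicits._

    // Assumed location of the downloaded files; pass a different path as the first argument.
    val path = args.headOption.getOrElse("reuters21578/*.sgm")

    // wholeTextFiles yields (file path, file content) pairs, one per file.
    val articles = spark.sparkContext
      .wholeTextFiles(path)
      .flatMap { case (_, text) => parse(text) }
      .toDS()

    articles.printSchema()
    articles.select("newId", "title", "topics").show(5, truncate = false)
    spark.stop()
  }
}
```

Reading whole files keeps each multi-line record intact, which a line-oriented reader would not. The trade-off is that every file has to fit in memory on one executor, which seems reasonable for files of this size.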