apache-spark - Combine Spark output into a single file
Problem description
I'm wondering if there's a way to combine the final result into a single file when using Spark? Here's the code I have:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("logs").setMaster("local[*]")
sc = SparkContext(conf=conf)
logs_1 = sc.textFile('logs/logs_1.tsv')
logs_2 = sc.textFile('logs/logs_2.tsv')
# The URL is the third tab-separated column of each log line
urls_1 = logs_1.map(lambda line: line.split("\t")[2])
urls_2 = logs_2.map(lambda line: line.split("\t")[2])
# Keep only the URLs that appear in both logs, excluding localhost
all_urls = urls_1.intersection(urls_2)
all_urls = all_urls.filter(lambda url: url != "localhost")
all_urls.collect()
all_urls.saveAsTextFile('logs.csv')
The collect() method doesn't seem to be working (or I've misunderstood its purpose). Essentially, I need the 'saveAsTextFile' to output to a single file, instead of a folder with parts.
Solution
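Before saving, repartition the RDD into a single partition:
all_urls.repartition(1).saveAsTextFile(resultPath)
Spark writes one part file per partition, so the output directory at resultPath then contains just one result file (part-00000). Note that collect() only returns the elements to the driver as a Python list; it does not change how saveAsTextFile writes the RDD.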
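If a plain file at an exact path is needed rather than a directory, one option is to concatenate the part files after the job finishes. The following is a minimal sketch, assuming the output was written to the local filesystem (not HDFS or S3); merge_part_files is a hypothetical helper name, not part of Spark:

```python
import glob
import os
import shutil


def merge_part_files(output_dir, target_file):
    """Concatenate Spark's part-* files from output_dir into target_file.

    Hypothetical helper: assumes output_dir is a local directory written
    by saveAsTextFile. Returns the number of part files merged.
    """
    # Sort so that part-00000, part-00001, ... are appended in order.
    # The "part-*" pattern skips _SUCCESS and hidden .crc checksum files.
    part_paths = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(target_file, "wb") as merged:
        for part_path in part_paths:
            with open(part_path, "rb") as part:
                shutil.copyfileobj(part, merged)
    return len(part_paths)
```

For the code in the question, this could be called as merge_part_files('logs.csv', 'all_urls.csv'), where 'logs.csv' is the directory Spark created and 'all_urls.csv' is an example target name. This also works without repartition(1), since all part files are appended in order.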