Combine Spark output into a single file

Problem description

I'm wondering if there's a way to combine the final result into a single file when using Spark? Here's the code I have:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("logs").setMaster("local[*]")
sc = SparkContext(conf=conf)

logs_1 = sc.textFile('logs/logs_1.tsv')
logs_2 = sc.textFile('logs/logs_2.tsv')

# Take the third tab-separated field (the URL) from each line
urls_1 = logs_1.map(lambda line: line.split("\t")[2])
urls_2 = logs_2.map(lambda line: line.split("\t")[2])

all_urls = urls_1.intersection(urls_2)
all_urls = all_urls.filter(lambda url: url != "localhost")

all_urls.collect()

all_urls.saveAsTextFile('logs.csv')

The collect() method doesn't seem to be working (or I've misunderstood its purpose). Essentially, I need saveAsTextFile to write a single file instead of a folder of part files.

Tags: apache-spark, pyspark

Solution


Well, before you save, you can repartition once, like below:

all_urls.repartition(1).saveAsTextFile(resultPath)

Then you get just one part file. Note that saveAsTextFile always creates a directory; with a single partition it simply contains one part-00000 file rather than many. Also, collect() never writes anything to disk: it returns the RDD's elements as a Python list on the driver, which is why it appeared to do nothing here.
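If you need a single ordinary file rather than a directory holding one part file, here is a minimal self-contained sketch of both routes; the output paths 'logs_single' and 'logs.txt' are placeholders, not from the original post:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("logs").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Stand-in RDD so the sketch runs on its own
all_urls = sc.parallelize(["a.com", "b.com", "c.com"])

# Route 1: coalesce(1) merges partitions without a full shuffle
# (repartition(1) also works); the output is still a directory,
# but it contains a single part-00000 file.
all_urls.coalesce(1).saveAsTextFile('logs_single')

# Route 2: if the result fits in driver memory, collect() the
# elements and write one plain local file yourself.
with open('logs.txt', 'w') as f:
    for url in all_urls.collect():
        f.write(url + '\n')

coalesce(1) is usually preferred over repartition(1) here because it avoids a full shuffle, though either way all data flows through a single task, so neither is advisable for very large results.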

