Apache Spark - dataset presenting a csv to java.io.File

Problem description

A small question about Apache Spark: how can I get a Dataset's output as a java.io.File?

I would like to upload a java.io.File to several destinations. The destinations are not databases but file stores such as Dropbox, S3, and the like.

The good news is that utility packages for this have already been provided to me, and they work fine; they have been tested with non-Spark jobs.

public static void main(String[] args) {
        File myCSVfile = new File("/path/to/my/file.csv");
        SomeUtil.uploadfileToDropBox(myCSVfile);
        SomeOtherUtil.uploadFileToS3(myCSVfile);
        // this is working fine!
}

The above runs successfully.

Now I need to upload the file produced by a Spark job using the same utilities.

Therefore, I tried:

public static void main(String[] args) {
        final Dataset<Row> dataSetRow = sparkSession.read().[...].load();
        final Dataset<Row> dataSetRowTransformed = dataSetRow.map((MapFunction<Row, Row>) row -> doSomeComplexTransformation(row), getMyEncoder());
        // repartition(1) yields a single part file; the path passed to csv() is the output folder
        dataSetRowTransformed.repartition(1).write().csv("/path/to/where/to/save/the/csv");
}

And like magic, I can see the final CSV file that Spark generated in the folder, and I can open it.

However, I cannot get hold of it as a File in the code in order to upload it with the previous mechanism.

Question: How can I get the file I generated (I can see it, I can open it, and it is correct) as a File, so that it can be uploaded with the utility classes mentioned above, all within a single Spark job?

Thank you

Tags: java, apache-spark

Solution
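
One possible approach, offered as a sketch rather than a definitive answer: the path given to csv() is a folder, and the data Spark writes into it typically lands in a file whose name starts with part- and ends with .csv, next to marker files such as _SUCCESS. If that folder is on a filesystem the driver JVM can read directly (local mode or a shared mount; this is an assumption, and it would not hold for HDFS or S3 paths), the driver can look the part file up after the write has finished. The class and method names below (SparkCsvOutputLocator, findSinglePartFile) are hypothetical helpers, not part of Spark.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of Spark: locates the data file inside a CSV output folder.
final class SparkCsvOutputLocator {

    private SparkCsvOutputLocator() {
    }

    /**
     * Returns the single part-*.csv file inside the given output folder.
     * Assumes the folder is readable from the JVM that calls this method.
     */
    public static File findSinglePartFile(File outputDir) throws IOException {
        // Keep only data files; _SUCCESS and hidden .crc checksum files do not match.
        File[] parts = outputDir.listFiles(
                (dir, name) -> name.startsWith("part-") && name.endsWith(".csv"));
        if (parts == null) {
            throw new IOException("Not a readable directory: " + outputDir);
        }
        if (parts.length != 1) {
            throw new IOException("Expected exactly one part file in " + outputDir
                    + " but found " + parts.length);
        }
        return parts[0];
    }
}
```

Under those assumptions, the Spark job from the question could call SparkCsvOutputLocator.findSinglePartFile(new File("/path/to/where/to/save/the/csv")) right after the write() call returns, and pass the result to SomeUtil.uploadfileToDropBox and SomeOtherUtil.uploadFileToS3. The helper fails loudly when it finds zero or several part files, which would be the case if repartition(1) were dropped.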

