Is there a better way to quickly generate 5 million CSV files?

Problem description

I want to create 5 million CSV files. I have been waiting almost 3 hours and the program is still running. Can someone give me some advice on how to speed up the file generation?

After these 5 million files are generated, I have to upload them to an S3 bucket.

It would be even better if someone knows how to generate these files on AWS itself; that way we could move the files straight into the S3 bucket and sidestep the network-speed problem. (I have just started learning AWS, so there is a lot I still need to learn.)

My code is below.

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelCsvGenerate implements Runnable {
    private static AtomicLong baseID = new AtomicLong(8160123456L);
    private static String filePath = "C:\\5millionfiles\\";
    private static List<String> headList = null;
    private static String csvHeader = null;

    public ParallelCsvGenerate() {
        headList = generateHeadList();
        csvHeader = String.join(",", headList);
    }


    @Override
    public void run() {
        for (int i = 0; i < 1000000; i++) {
            generateCSV();
        }
    }


    private void generateCSV() {
        StringBuilder builder = new StringBuilder();
        builder.append(csvHeader).append(System.lineSeparator());
        for (int i = 0; i < headList.size(); i++) {
            builder.append(i % 2 == 0 ? generateRandomInteger() : generateRandomStr());
            if (i < headList.size() - 1) {
                builder.append(',');
            }
        }

        String fileName = String.valueOf(baseID.addAndGet(1));
        File csvFile = new File(filePath + fileName + ".csv");
        FileWriter fileWriter = null;
        try {
            fileWriter = new FileWriter(csvFile);
            fileWriter.write(builder.toString());
            fileWriter.flush();
        } catch (Exception e) {
            System.err.println(e);
        } finally {
            try {
                if (fileWriter != null) {
                    fileWriter.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }


    private static List<String> generateHeadList() {
        List<String> headList = new ArrayList<>(20);
        String baseFieldName = "Field";
        for (int i = 1; i <= 20; i++) {
            headList.add(baseFieldName + i);
        }
        return headList;
    }


    /**
     * Generates a number in [0, 50000).
     * Note: ThreadLocalRandom.current() must be called on the executing
     * thread at each use; caching it in a static field would share one
     * thread's instance across all threads.
     */
    private Integer generateRandomInteger() {
        return ThreadLocalRandom.current().nextInt(0, 50000);
    }


    /**
     * Generates a string of length 5-8.
     */
    private String generateRandomStr() {
        ThreadLocalRandom random = ThreadLocalRandom.current();
        int strLength = random.nextInt(5, 9); // upper bound is exclusive
        String str = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
        int length = str.length();
        StringBuilder builder = new StringBuilder();
        for (int i = 0; i < strLength; i++) {
            builder.append(str.charAt(random.nextInt(length)));
        }
        return builder.toString();
    }
}

Main

ParallelCsvGenerate generate = new ParallelCsvGenerate();


Thread a = new Thread(generate, "A");
Thread b = new Thread(generate, "B");
Thread c = new Thread(generate, "C");
Thread d = new Thread(generate, "D");
Thread e = new Thread(generate, "E");

// start() runs each loop on its own thread; calling run() directly would
// execute all five loops sequentially on the calling thread
a.start();
b.start();
c.start();
d.start();
e.start();

Thanks for the advice. I refactored the code accordingly and it generated 3.8 million files in 2.8 h, which is much better. The refactored code:

import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.Callable;

import com.opencsv.CSVWriter;

public class ParallelCsvGenerate implements Callable<Integer> {
    private static String filePath = "C:\\5millionfiles\\";
    private static String[] header = new String[]{
            "FIELD1","FIELD2","FIELD3","FIELD4","FIELD5",
            "FIELD6","FIELD7","FIELD8","FIELD9","FIELD10",
            "FIELD11","FIELD12","FIELD13","FIELD14","FIELD15",
            "FIELD16","FIELD17","FIELD18","FIELD19","FIELD20",
    };
    private String fileName;
    public ParallelCsvGenerate(String fileName) {
        this.fileName = fileName;
    }

    @Override
    public Integer call() throws Exception {
        try {
            generateCSV();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return 0;
    }

    private void generateCSV() throws IOException {

        CSVWriter writer = new CSVWriter(new FileWriter(filePath + fileName + ".csv"), CSVWriter.DEFAULT_SEPARATOR, CSVWriter.NO_QUOTE_CHARACTER);
        // RandomGenerator is the author's own helper class (not shown in the post)
        String[] content = new String[]{
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr(),
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr(),
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr(),
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr(),
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr(),
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr(),
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr(),
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr(),
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr(),
                RandomGenerator.generateRandomInteger(),
                RandomGenerator.generateRandomStr()
        };
        writer.writeNext(header);
        writer.writeNext(content);
        writer.close();
    }

}

Main

public static void main(String[] args) {
    System.out.println("Start generate");
    long start = System.currentTimeMillis();
    ThreadPoolExecutor threadPoolExecutor = new ThreadPoolExecutor(8, 8,
            0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>());
    List<ParallelCsvGenerate> taskList = new ArrayList<>(3800000);
    for (int i = 0; i < 3800000; i++) {
        taskList.add(new ParallelCsvGenerate(i + ""));
    }
    try {
        // invokeAll blocks until every task has completed
        List<Future<Integer>> futures = threadPoolExecutor.invokeAll(taskList);
    } catch (InterruptedException e) {
        e.printStackTrace();
    } finally {
        threadPoolExecutor.shutdown(); // otherwise the pool's threads keep the JVM alive
    }
    System.out.println("Success");
    long end = System.currentTimeMillis();
    System.out.println("Using time: " + (end - start));
}

Tags: java, amazon-web-services, amazon-s3

Solution


  1. You can write directly to the file (there is no need to build the whole file in one StringBuilder first). I think builder.toString() is the biggest time and memory bottleneck here.

  2. You can generate each file in parallel.

  3. (Minor tweak:) drop the if inside the loop.

    if (i < headList.size() - 1) is unnecessary once you loop a little more cleverly, with one extra step per iteration.

    i % 2 == 0 can be eliminated by better iteration (i += 2) and a bit more work inside the loop (index i -> the int field, i + 1 -> the string field).

  4. Prefer append(char) to append(String) where applicable (append(',') instead of append(",")!).
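For illustration, points 1, 3 and 4 can be combined into one small sketch. This is a hypothetical example, not the asker's exact code (the class name, field count and helper are made up): each row is streamed straight to a BufferedWriter instead of a full-file StringBuilder, the loop steps by two so both the i % 2 test and the trailing-comma if disappear, and the separators use append(char).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ThreadLocalRandom;

public class DirectCsvSketch {

    static final int FIELDS = 20; // even count: int, string, int, string, ...

    // Streams one row straight to the writer; no whole-file StringBuilder,
    // no i % 2 test, no trailing-comma if inside the loop.
    static void writeRow(BufferedWriter out) throws IOException {
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (int i = 0; i < FIELDS; i += 2) {
            if (i > 0) {
                out.append(','); // append(char), cheaper than append(",")
            }
            out.append(Integer.toString(rnd.nextInt(0, 50000))) // field i: int
               .append(',')
               .append(randomStr(rnd));                         // field i + 1: string
        }
        out.newLine();
    }

    static String randomStr(ThreadLocalRandom rnd) {
        String alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
        int len = rnd.nextInt(5, 9); // length 5-8, upper bound exclusive
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append(alphabet.charAt(rnd.nextInt(alphabet.length())));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("row", ".csv");
        try (BufferedWriter out = Files.newBufferedWriter(file)) {
            writeRow(out);
        }
        String line = Files.readAllLines(file).get(0);
        System.out.println(line.split(",", -1).length); // prints 20
        Files.delete(file);
    }
}
```

The BufferedWriter already batches small appends into larger OS writes, so the extra copy made by builder.toString() goes away entirely.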

...

