java - 有没有更好的方法快速生成 500 万个 csv 文件
问题描述
我想创建 500 万个 csv 文件,我已经等了将近 3 个小时,但程序仍在运行。有人可以给我一些建议,如何加快文件生成。
在这 500 万个文件生成完成后,我必须将它们上传到 s3 存储桶。
如果有人知道如何通过AWS生成这些文件会更好,这样,我们可以直接将文件移动到s3存储桶,而忽略网络速度问题。(刚开始学习AWS,有很多知识需要知道)
以下是我的代码。
public class ParallelCsvGenerate implements Runnable {
private static AtomicLong baseID = new AtomicLong(8160123456L);
private static ThreadLocalRandom random = ThreadLocalRandom.current();
private static ThreadLocalRandom random2 = ThreadLocalRandom.current();
private static String filePath = "C:\\5millionfiles\\";
private static List<String> headList = null;
private static String csvHeader = null;
public ParallelCsvGenerate() {
headList = generateHeadList();
csvHeader = String.join(",", headList);
}
@Override
public void run() {
for(int i = 0; i < 1000000; i++) {
generateCSV();
}s
}
private void generateCSV() {
StringBuilder builder = new StringBuilder();
builder.append(csvHeader).append(System.lineSeparator());
for (int i = 0; i < headList.size(); i++) {
if(i < headList.size() - 1) {
builder.append(i % 2 == 0 ? generateRandomInteger() : generateRandomStr()).append(",");
} else {
builder.append(i % 2 == 0 ? generateRandomInteger() : generateRandomStr());
}
}
String fileName = String.valueOf(baseID.addAndGet(1));
File csvFile = new File(filePath + fileName + ".csv");
FileWriter fileWriter = null;
try {
fileWriter = new FileWriter(csvFile);
fileWriter.write(builder.toString());
fileWriter.flush();
} catch (Exception e) {
System.err.println(e);
} finally {
try {
if(fileWriter != null) {
fileWriter.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
private static List<String> generateHeadList() {
List<String> headList = new ArrayList<>(20);
String baseFiledName = "Field";
for(int i = 1; i <=20; i++) {
headList.add(baseFiledName + i);
}
return headList;
}
/**
* generate a number in range of 0-50000
* @return
*/
private Integer generateRandomInteger() {
return random.nextInt(0,50000);
}
/**
* generate a string length is 5 - 8
* @return
*/
private String generateRandomStr() {
int strLength = random2.nextInt(5, 8);
String str="abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
int length = str.length();
StringBuilder builder = new StringBuilder();
for (int i = 0; i < strLength; i++) {
builder.append(str.charAt(random.nextInt(length)));
}
return builder.toString();
}
主要的
ParallelCsvGenerate generate = new ParallelCsvGenerate();
Thread a = new Thread(generate, "A");
Thread b = new Thread(generate, "B");
Thread c = new Thread(generate, "C");
Thread d = new Thread(generate, "D");
Thread e = new Thread(generate, "E");
a.run();
b.run();
c.run();
d.run();
e.run();
谢谢大佬的建议,直接重构代码,用2.8h生成380万个文件,好很多。重构代码:
public class ParallelCsvGenerate implements Callable<Integer> {
private static String filePath = "C:\\5millionfiles\\";
private static String[] header = new String[]{
"FIELD1","FIELD2","FIELD3","FIELD4","FIELD5",
"FIELD6","FIELD7","FIELD8","FIELD9","FIELD10",
"FIELD11","FIELD12","FIELD13","FIELD14","FIELD15",
"FIELD16","FIELD17","FIELD18","FIELD19","FIELD20",
};
private String fileName;
public ParallelCsvGenerate(String fileName) {
this.fileName = fileName;
}
@Override
public Integer call() throws Exception {
try {
generateCSV();
} catch (IOException e) {
e.printStackTrace();
}
return 0;
}
private void generateCSV() throws IOException {
CSVWriter writer = new CSVWriter(new FileWriter(filePath + fileName + ".csv"), CSVWriter.DEFAULT_SEPARATOR, CSVWriter.NO_QUOTE_CHARACTER);
String[] content = new String[]{
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr(),
RandomGenerator.generateRandomInteger(),
RandomGenerator.generateRandomStr()
};
writer.writeNext(header);
writer.writeNext(content);
writer.close();
}
}
主要的
public static void main(String[] args) {
System.out.println("Start generate");
long start = System.currentTimeMillis();
ThreadPoolExecutor threadPoolExecutor = new ThreadPoolExecutor(8, 8,
0L, TimeUnit.MILLISECONDS,
new LinkedBlockingQueue<Runnable>());
List<ParallelCsvGenerate> taskList = new ArrayList<>(3800000);
for(int i = 0; i < 3800000; i++) {
taskList.add(new ParallelCsvGenerate(i+""));
}
try {
List<Future<Integer>> futures = threadPoolExecutor.invokeAll(taskList);
} catch (InterruptedException e) {
e.printStackTrace();
}
System.out.println("Success");
long end = System.currentTimeMillis();
System.out.println("Using time: " + (end-start));
}
解决方案
您可以直接写入文件(无需在一个 StringBuilder 中分配整个文件)。(我认为这是这里最大的时间+内存瓶颈:)
builder.toString()
您可以并行生成每个文件。
(小调整:)省略 if 的内部循环。
if(i < headList.size() - 1)
不需要,当您进行更聪明的循环 + 1 次额外迭代时。i % 2 == 0
可以通过更好的迭代()来消除......i+=2
以及循环内的更多劳动(i -> int, i + 1 -> string
)如果适用,更
append(char)
喜欢append(String)
. (append(',')
比append(",")
!)
...
推荐阅读
- python - 您如何使用 Kivy GUI 访问其他类方法和函数?
- typescript - ReferenceError:未定义 WebGLRenderingContext
- python - 导入 TensorFlow 时出错 - 无法加载原生 TensorFlow 运行时
- javascript - typescript:在 typescript 项目中使用声明文件来全局公开类型
- c - 如何在 MSYS2 中建立 GMP?
- sql - 使用插入和更新模拟 SQL 合并
- spring - 在多对多中只获取 id 而不是整个对象
- java - 为什么当我打开一个新的 JFrame 时,我的组件会改变格式?
- arrays - 分配时对数组使用 typedef 是否比数组本身占用更多内存?
- python - 无法在命令提示符下使用 pip 下载 streamlit