javascript - Error: Cannot create a string longer than 0x3fffffe7 characters when using createReadStream
Problem description
I am parsing very large CSV files (~37 GB) using fs.createReadStream
and csv-parser. I batch the rows into chunks of 5000 and then insert them into MongoDB. However, the error occurs even with the Mongo part commented out.
Here is the function that parses the file:
import * as fs from 'fs';
import csvParser from 'csv-parser';
import { Db } from 'mongodb';

function parseCsv(fileName: string, db: Db): Promise<void> {
    let parsedData: any[] = [];
    let counter = 0;
    return new Promise((resolve, reject) => {
        const stream = fs.createReadStream(fileName)
            .pipe(csvParser())
            .on('data', async (row) => {
                const data = parseData(row);
                parsedData.push(data);
                if (parsedData.length > 5000) {
                    stream.pause();
                    // insert to mongo
                    counter++;
                    console.log('counter - ', counter, parsedData[0].personfirstname, parsedData[23].personfirstname);
                    parsedData = [];
                    // try {
                    //     await db.collection('people').insertMany(parsedData, { ordered: false });
                    //     parsedData = [];
                    // }
                    // catch (e) {
                    //     console.log('error happened', e, parsedData.length);
                    //     process.exit();
                    // }
                    stream.resume();
                }
            })
            .on('error', (error) => {
                console.error('There was an error reading the csv file', error);
                reject(error);
            })
            .on('end', () => {
                console.log('CSV file successfully processed');
                resolve();
            });
    });
}
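As an aside, note that in the snippet above parsedData is cleared before the commented-out insertMany would run, so if the Mongo block were re-enabled as-is it would insert an empty array. A minimal sketch of the pause/insert/resume ordering that avoids this, assuming the same pipeline (flushBatch is a hypothetical helper name, not from the original code):

    // Sketch only: detach the batch first, then insert while paused.
    async function flushBatch(db: Db, batch: any[]): Promise<void> {
        // ordered: false lets Mongo continue past individual document failures
        await db.collection('people').insertMany(batch, { ordered: false });
    }

    // Inside the 'data' handler:
    //     if (parsedData.length >= 5000) {
    //         stream.pause();            // stop 'data' events during the insert
    //         const batch = parsedData;  // detach the full batch...
    //         parsedData = [];           // ...and start collecting the next one
    //         await flushBatch(db, batch);
    //         stream.resume();
    //     }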
Here is the function that parses each row. The data is a bit messy (all the values come through in a single pipe-delimited cell), so I just split them apart:
function parseData(data: any) {
    // Concatenate every cell of the csv-parser row object into one string
    let values = '';
    for (const key in data) {
        if (data.hasOwnProperty(key)) {
            values += data[key];
        }
    }
    const splitValues = values.split('|');
    // Deep-copy the template (defined elsewhere; one key per column) so rows don't share state
    let parsedData: any = JSON.parse(JSON.stringify(template));
    let keyCounter = 0;
    for (const key in parsedData) {
        if (parsedData.hasOwnProperty(key)) {
            try {
                parsedData[key] = splitValues[keyCounter].trim();
            }
            catch (e) {
                console.log('error probably trimming', key, splitValues[keyCounter], splitValues, data);
                throw '';
            }
            keyCounter++;
        }
    }
    const now = new Date();
    parsedData.createdAt = now;
    parsedData.updatedAt = now;
    return parsedData;
}
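To illustrate what this does, here is a hypothetical example with a two-key template (the real template has 415 keys; neither this template nor the row values are from the original post):

    // Hypothetical illustration only.
    const template = { personfirstname: '', personlastname: '' };

    // A csv-parser row object where all values landed in one cell:
    const row = { name_cell: 'John|Smith' };

    const person = parseData(row);
    // person => { personfirstname: 'John', personlastname: 'Smith',
    //             createdAt: <Date>, updatedAt: <Date> }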
It parses fine (up to ~2 million rows) and then hangs. After it had hung overnight, I checked in the morning and saw the following error:
buffer.js:580
if (encoding === 'utf-8') return buf.utf8Slice(start, end);
^
Error: Cannot create a string longer than 0x3fffffe7 characters
at stringSlice (buffer.js:580:44)
at Buffer.toString (buffer.js:643:10)
at CsvParser.parseValue (C:\js_scripts\csv-worker\node_modules\csv-parser\index.js:175:19)
at CsvParser.parseCell (C:\js_scripts\csv-worker\node_modules\csv-parser\index.js:86:17)
at CsvParser.parseLine (C:\js_scripts\csv-worker\node_modules\csv-parser\index.js:142:24)
at CsvParser._flush (C:\js_scripts\csv-worker\node_modules\csv-parser\index.js:196:10)
at CsvParser.prefinish (_stream_transform.js:140:10)
at CsvParser.emit (events.js:200:13)
at prefinish (_stream_writable.js:633:14)
at finishMaybe (_stream_writable.js:641:5) {
code: 'ERR_STRING_TOO_LONG'
}
Shouldn't createReadStream make sure this doesn't happen? Each row has 415 columns. Could a single row be too large? It always stops at the same place, so that seems likely. The files are too big for me to open them. If so, how can I detect this and skip that row, or handle it differently?
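For illustration, one way to detect an oversized row before csv-parser buffers the whole thing is a pass-through Transform that counts bytes since the last newline and fails fast instead of hanging. This is a sketch, not a confirmed fix for this question; the class name and the 256 MB threshold are assumptions:

    import { Transform, TransformCallback } from 'stream';

    // Arbitrary assumed threshold; tune to the expected maximum row size.
    const MAX_ROW_BYTES = 256 * 1024 * 1024;

    class RowLengthGuard extends Transform {
        private bytesSinceNewline = 0;

        _transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback) {
            const lastNewline = chunk.lastIndexOf(0x0a); // byte value of '\n'
            if (lastNewline === -1) {
                // No newline in this chunk: the current row keeps growing
                this.bytesSinceNewline += chunk.length;
            } else {
                // Row boundary seen: only count bytes after the last newline
                this.bytesSinceNewline = chunk.length - lastNewline - 1;
            }
            if (this.bytesSinceNewline > MAX_ROW_BYTES) {
                return callback(new Error(
                    'row exceeds ' + MAX_ROW_BYTES + ' bytes; the file may be missing line breaks here'));
            }
            callback(null, chunk);
        }
    }

    // Usage: fs.createReadStream(fileName).pipe(new RowLengthGuard()).pipe(csvParser())

With the guard in place, the pipeline errors out at the offending offset with a clear message rather than accumulating one giant string overnight.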