How to read a large CSV file in batches in Node.js?

Problem description

I have a CSV file containing more than 500,000 records. The fields of the CSV are

I need to process all of the records in the file without loading the whole thing into memory. The idea is to read a small batch of records, insert them into a collection and work on them, then continue reading the remaining records. Since I'm new to this, I can't figure out how it works. When I try to print a batch, it prints buffered data. Will the code below do what I need? And from that buffered value, how do I get at the CSV records so I can insert and manipulate the file's data?

var fs = require('fs')
var csv = require('csv-parser') // or whichever streaming csv parser is in use

var batch = []
var counter = 0

var stream = fs.createReadStream(csvFilePath)
    .pipe(csv())
    .on('data', (data) => {
        batch.push(data)
        counter++
        if (counter == 100) {
            // pause the stream while this batch is being handled
            stream.pause()
            setTimeout(() => {
                console.log('batch in ', data)
                counter = 0
                batch = []
                stream.resume()
            }, 5000)
        }
    })
    .on('error', (e) => {
        console.log('er ', e)
    })
    .on('end', () => {
        console.log('end')
    })
    

Tags: node.js, csv

Solution


I've written some example code for you showing how to use streams. You basically create a stream and keep processing its chunks. A chunk is an object of type Buffer; call toString() on it to handle it as text.

I don't have much time to explain more, but the comments should help.

Also consider using a module for this, since existing CSV parsers already handle a lot of the edge cases for you; a sketch of that approach follows after the code below. Hope this helps.

import * as fs from 'fs'

// end of line delimiter, system specific.
import { EOL } from 'os'

// the delimiter used in the csv
var delimiter = ','

// a chunk may end in the middle of a row; this holds the incomplete
// trailing line carried over from one chunk to the next.
var remainder = ''

// add your own implementation of parsing a portion of the text here.
const parseChunk = (text,  index) => {

    // first chunk, the header is included here. 
    if(index === 0) {
        // The first row will be the header. So take it
        var headerLine = text.substring(0, text.indexOf(EOL))
        
        // remove the header from the text for further processing,
        // including its trailing newline character.
        text = text.replace(headerLine + EOL, '')

        // Do something with the header here.
        
    }


    // Now you have a part of the file to process without headers.
    // The csv parse function you need to figure out yourself. Best
    // is to use some module for that. There are plenty of edge cases
    // when parsing csv.

    // custom csv parser here => https://stackoverflow.com/questions/1293147/example-javascript-code-to-parse-csv-data

    // if the csv is well formatted, it could be enough to use this
    var lines = text.split(EOL)

    for (var line of lines) {
        // skip empty lines, e.g. the trailing one after the final EOL
        if (line === '') continue
        var values = line.split(delimiter)
        console.log('line received', values)
        // StoreToDb(values)
    }
}

// create the stream
const stream = fs.createReadStream('file.csv')

// variable counting the chunks, to know whether the header is included.
var chunkCount = 0

// handle data event of stream
stream.on('data', chunk => {

    // the stream sends you a Buffer;
    // to have it as text, convert it to a string, and prepend any
    // partial line left over from the previous chunk.
    let text = remainder + chunk.toString()

    // the chunk boundary usually falls in the middle of a row, so hold
    // everything after the last complete line back for the next chunk.
    const lastEol = text.lastIndexOf(EOL)
    if (lastEol === -1) {
        // no complete line yet, wait for more data
        remainder = text
        return
    }
    remainder = text.substring(lastEol + EOL.length)

    // Note that chunks are a fixed size
    // but mostly consist of multiple lines.
    parseChunk(text.substring(0, lastEol + EOL.length), chunkCount)

    // increment the count.
    chunkCount++
})

stream.on('end', () => {
    // the file may not end with a newline; flush any leftover text.
    if (remainder !== '') {
        parseChunk(remainder, chunkCount)
    }
    console.log('parsing finished')
})

stream.on('error', (err) => {
    // error: handle it properly here, maybe roll back changes already made
    // to the db and parse again. You could also use chunkCount to start the
    // parsing again and skip the first x chunks, so you can restart at a given point.
    console.log('parsing error ', err)
})
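
If you go with a module as suggested above, the batching from your question becomes straightforward. Below is a minimal sketch, assuming the csv() in your question comes from the csv-parser package; insertBatch is a hypothetical stand-in for your own database logic (e.g. an insertMany on a collection), so adjust both to your setup.

import * as fs from 'fs'

// assumption: the csv() used in the question is the csv-parser package
import csv from 'csv-parser'

const BATCH_SIZE = 100

// hypothetical placeholder for your own logic, e.g. a bulk insert
// into a MongoDB collection; it must return a Promise.
const insertBatch = async (records) => {
    // await collection.insertMany(records)
    console.log('inserting batch of', records.length, 'records')
}

let batch = []

const stream = fs.createReadStream('file.csv').pipe(csv())

stream
    .on('data', (row) => {
        batch.push(row)
        if (batch.length >= BATCH_SIZE) {
            // stop 'data' events while this batch is written away
            stream.pause()
            const current = batch
            batch = []
            insertBatch(current)
                .then(() => stream.resume()) // continue with the next rows
                .catch((err) => stream.destroy(err)) // surfaces as an 'error' event
        }
    })
    .on('end', () => {
        // flush the last, possibly smaller, batch
        if (batch.length > 0) {
            insertBatch(batch).then(() => console.log('parsing finished'))
        } else {
            console.log('parsing finished')
        }
    })
    .on('error', (err) => console.log('parsing error ', err))

Pausing before the asynchronous insert and resuming once it resolves is what keeps memory bounded: at most one batch of parsed rows is held at a time, and the pause propagates backpressure back through the pipe to fs.createReadStream.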
