PowerShell to Break up CSV by Number of Rows

Problem description

So I am now tasked with getting constant reports that are more than 1 Million lines long.

My last question did not explain everything, so I'm trying to ask a better one.

I'm getting a dozen + daily reports that are coming in as CSV files. I don't know what the headers are or anything like that as I get them.

They are huge. I can't open them in Excel.

I wanted to basically break them up into the same report, just each report maybe 100,000 lines long.

The code I wrote below does not work, as I keep getting:

Exception of type 'System.OutOfMemoryException' was thrown.

I am guessing I need a better way to do this.

I just need this file broken down to a more manageable size. It does not matter how long it takes, as I can run it overnight.

I found this on the internet and tried to adapt it, but I can't get it to work.

$PSScriptRoot

write-host $PSScriptRoot

$loc = $PSScriptRoot

$location = $loc

# how many rows per CSV?
$rowsMax = 10000; 

# Get all CSV under current folder
$allCSVs = Get-ChildItem "$location\Split.csv"


# Read and split all of them
$allCSVs | ForEach-Object {
    Write-Host $_.Name;
    $content = Import-Csv "$location\Split.csv"
    $insertLocation = ($_.Name.Length - 4);
    for($i=1; $i -le $content.length ;$i+=$rowsMax){
    $newName = $_.Name.Insert($insertLocation, "splitted_"+$i)
    $content|select -first $i|select -last $rowsMax | convertto-csv -NoTypeInformation | % { $_ -replace '"', ""} | out-file $location\$newName -fo -en ascii
    }
}

Tags: powershell

Solution


The key is not to read large files into memory in full, which is what you are doing by capturing the output of Import-Csv in a variable ($content = Import-Csv "$location\Split.csv").

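That said, while a single pipeline would solve your memory problem, performance is likely to be poor, because you would be converting from CSV and back to CSV, which incurs a lot of overhead.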

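However, even reading and writing the files as text with Get-Content and Set-Content is slow.
Therefore, I suggest a .NET-based approach that processes the files as text, which should speed up processing substantially.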

The following code demonstrates this technique:

Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {

    $csvFile = $_.FullName

    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., "...\file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'

    # Set how many lines make up a chunk.
    $chunkLineCount = 10000

    # Read the file lazily and save every chunk of $chunkLineCount
    # lines to a new file.
    $i = 0; $chunkNdx = 0
    foreach ($line in [IO.File]::ReadLines($csvFile)) {
        if ($i -eq 0) { ++$i; $header = $line; continue } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Close previous file, if any.
            if (++$chunkNdx -gt 1) { $fileWriter.Dispose() }

            # Construct the file path for the next chunk, by
            # instantiating the template with the next sequence number.
            $csvFileChunk = $csvFileChunkTemplate -f $chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"

            # Create the next chunk file and write the header.
            $fileWriter = [IO.File]::CreateText($csvFileChunk)
            $fileWriter.WriteLine($header)
        }
        # Write a data row to the current chunk file.
        $fileWriter.WriteLine($line)
    }
    $fileWriter.Dispose() # Close the last file.

}

Note that the code above creates BOM-less UTF-8 files; if your input contains only ASCII-range characters, these files are effectively ASCII files.
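
If a different output encoding is needed, one option is to construct the writer explicitly instead of calling [IO.File]::CreateText(). The following is only a minimal sketch, assuming PowerShell 5 or later (for the ::new() syntax); the helper name New-ChunkWriter and the temp-file path are hypothetical and chosen purely for illustration.

```powershell
# Hypothetical helper (not part of the answer's code): opens a text writer
# for a chunk file with an explicitly chosen encoding.
function New-ChunkWriter {
    param(
        [Parameter(Mandatory)] [string] $Path,
        # Default: UTF-8 without a BOM, matching what CreateText() produces.
        [Text.Encoding] $Encoding = [Text.UTF8Encoding]::new($false)
    )
    # StreamWriter(path, append, encoding); $false means overwrite an existing file.
    [IO.StreamWriter]::new($Path, $false, $Encoding)
}

# Usage: write a header line as UTF-8 *with* a BOM.
# A full path is used because .NET resolves relative paths against its own
# working directory, which can differ from PowerShell's current location.
$demoPath = Join-Path ([IO.Path]::GetTempPath()) 'chunk_demo.csv'
$writer = New-ChunkWriter -Path $demoPath -Encoding ([Text.UTF8Encoding]::new($true))
try {
    $writer.WriteLine('Id,Name')
}
finally {
    $writer.Dispose()
}
```

In the loop above, the [IO.File]::CreateText($csvFileChunk) call could be swapped for such a helper; passing [Text.Encoding]::ASCII instead would produce strict ASCII output, with non-ASCII characters replaced by "?".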


Here is the equivalent single-pipeline solution, which is likely to be substantially slower:

Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {

    $csvFile = $_.FullName

    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., ".../file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'

    # Set how many lines make up a chunk.
    $chunkLineCount = 10000

    $i = 0; $chunkNdx = 0
    Get-Content -LiteralPath $csvFile | ForEach-Object {
        if ($i -eq 0) { ++$i; $header = $_; return } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Construct the file path for the next chunk.
            $csvFileChunk = $csvFileChunkTemplate -f ++$chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"
            # Create the next chunk file and write the header.
            Set-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $header
        }
        # Write data row to the current chunk file.
        Add-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $_
    }

}
