powershell - PowerShell to Break Up a CSV by Number of Rows
Question
I am now tasked with processing recurring reports that are each more than 1 million lines long.
My last question did not explain everything, so I'm trying to ask a better one.
I'm getting a dozen or more daily reports that come in as CSV files. I don't know what the headers are or anything like that when I get them.
They are huge. I can't open them in Excel.
I basically want to break each one up into smaller copies of the same report, each maybe 100,000 lines long.
The code I wrote below does not work, as I keep getting:
Exception of type 'System.OutOfMemoryException' was thrown.
I am guessing I need a better way to do this.
I just need this file broken down to a more manageable size. It does not matter how long it takes, as I can run it overnight.
I found this on the internet and tried to adapt it, but I can't get it to work.
$PSScriptRoot
write-host $PSScriptRoot
$loc = $PSScriptRoot
$location = $loc
# how many rows per CSV?
$rowsMax = 10000;
# Get all CSV under current folder
$allCSVs = Get-ChildItem "$location\Split.csv"
# Read and split all of them
$allCSVs | ForEach-Object {
    Write-Host $_.Name;
    $content = Import-Csv "$location\Split.csv"
    $insertLocation = ($_.Name.Length - 4);
    for($i=1; $i -le $content.length ;$i+=$rowsMax){
        $newName = $_.Name.Insert($insertLocation, "splitted_"+$i)
        $content|select -first $i|select -last $rowsMax | convertto-csv -NoTypeInformation | % { $_ -replace '"', ""} | out-file $location\$newName -fo -en ascii
    }
}
Solution
The key is not to read the large files into memory in full, which is what you're doing by capturing the output from Import-Csv in a variable ($content = Import-Csv "$location\Split.csv").
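To illustrate the difference, here is a minimal sketch that streams rows through the pipeline instead of collecting them in a variable first. The sample file and its path are made up for the demonstration, and the per-row work is just a counter; it is not part of the splitting code itself.

```powershell
# Hypothetical sample file, created only for this demonstration.
$sample = Join-Path ([IO.Path]::GetTempPath()) 'stream_demo.csv'
'id,name', '1,a', '2,b', '3,c' | Set-Content -LiteralPath $sample

# Streaming: each row object is handed to ForEach-Object as soon as it
# is parsed, so the full set of rows is never held in memory at once.
$rowCount = 0
Import-Csv -LiteralPath $sample | ForEach-Object { $rowCount++ }
"Rows: $rowCount"

# Clean up the demonstration file.
Remove-Item -LiteralPath $sample
```

By contrast, assigning the output of Import-Csv to a variable forces every row object to exist at the same time, which is what exhausts memory with million-line inputs.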
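That said, while using a single pipeline would solve your memory problem, performance will likely be poor, because you're converting from CSV and back to CSV, which incurs a lot of overhead.
Even reading and writing the files as text with Get-Content and Set-Content is slow, however.
Therefore, I suggest a .NET-based approach that processes the files as text, which should speed up processing substantially.
The following code demonstrates this technique: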
Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {
    $csvFile = $_.FullName
    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., "...\file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'
    # Set how many lines make up a chunk.
    $chunkLineCount = 10000
    # Read the file lazily and save every chunk of $chunkLineCount
    # lines to a new file.
    $i = 0; $chunkNdx = 0
    foreach ($line in [IO.File]::ReadLines($csvFile)) {
        if ($i -eq 0) { ++$i; $header = $line; continue } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Close previous file, if any.
            if (++$chunkNdx -gt 1) { $fileWriter.Dispose() }
            # Construct the file path for the next chunk, by
            # instantiating the template with the next sequence number.
            $csvFileChunk = $csvFileChunkTemplate -f $chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"
            # Create the next chunk file and write the header.
            $fileWriter = [IO.File]::CreateText($csvFileChunk)
            $fileWriter.WriteLine($header)
        }
        # Write a data row to the current chunk file.
        $fileWriter.WriteLine($line)
    }
    $fileWriter.Dispose() # Close the last file.
}
Note that the code above creates BOM-less UTF-8 files; if your input contains only ASCII-range characters, these files will effectively be ASCII files.
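If whatever consumes the chunks needs a specific encoding, one option is to construct the writer explicitly rather than rely on the CreateText() default. The following is only a sketch: the output path is hypothetical, ASCII is an arbitrary choice, and the ::new() syntax assumes PowerShell 5 or later.

```powershell
# Hypothetical output path; .NET methods need a full path, because .NET's
# working directory usually differs from PowerShell's current location.
$outFile = Join-Path ([IO.Path]::GetTempPath()) 'encoding_demo.csv'

# StreamWriter(path, append, encoding): $false means "overwrite".
# [Text.UTF8Encoding]::new($true) would instead produce UTF-8 with a BOM.
$writer = [IO.StreamWriter]::new($outFile, $false, [Text.Encoding]::ASCII)
try {
    $writer.WriteLine('id,name')
    $writer.WriteLine('1,a')
}
finally {
    $writer.Dispose()   # Flush and close even if a write fails.
}

# Inspect the first bytes: a UTF-8 BOM would show up as EF BB BF.
$firstBytes = [IO.File]::ReadAllBytes($outFile)[0..2]
($firstBytes | ForEach-Object { $_.ToString('X2') }) -join ' '

Remove-Item -LiteralPath $outFile
```

In the splitting code, the same constructor call could stand in for [IO.File]::CreateText($csvFileChunk).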
Here is the equivalent single-pipeline solution, which is likely to be substantially slower:
Get-ChildItem $PSScriptRoot/*.csv | ForEach-Object {
    $csvFile = $_.FullName
    # Construct a file-path template for the sequentially numbered chunk
    # files; e.g., ".../file_split_001.csv"
    $csvFileChunkTemplate = $csvFile -replace '(.+)\.(.+)', '$1_split_{0:000}.$2'
    # Set how many lines make up a chunk.
    $chunkLineCount = 10000
    $i = 0; $chunkNdx = 0
    Get-Content -LiteralPath $csvFile | ForEach-Object {
        if ($i -eq 0) { ++$i; $header = $_; return } # Save header line.
        if ($i++ % $chunkLineCount -eq 1) { # Create new chunk file.
            # Construct the file path for the next chunk.
            $csvFileChunk = $csvFileChunkTemplate -f ++$chunkNdx
            Write-Verbose "Creating chunk: $csvFileChunk"
            # Create the next chunk file and write the header.
            Set-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $header
        }
        # Write data row to the current chunk file.
        Add-Content -Encoding ASCII -LiteralPath $csvFileChunk -Value $_
    }
}
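After a split, a quick sanity check is to confirm that the data rows in the chunks add up to the data rows in the original. The sketch below uses a hypothetical helper, Get-LineCount, and small made-up files in a temporary folder that mimic the naming scheme above. It counts physical lines, so, like the splitting code itself, it assumes every CSV record sits on a single line (no quoted fields with embedded newlines).

```powershell
# Hypothetical helper: counts lines lazily, without loading the whole file.
function Get-LineCount([string] $Path) {
    $n = 0
    foreach ($line in [IO.File]::ReadLines($Path)) { $n++ }
    $n
}

# Made-up demonstration files: one "original" and two "chunks".
$dir = Join-Path ([IO.Path]::GetTempPath()) 'split_check_demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
'h', 'r1', 'r2', 'r3', 'r4', 'r5' | Set-Content -LiteralPath (Join-Path $dir 'file.csv')
'h', 'r1', 'r2', 'r3' | Set-Content -LiteralPath (Join-Path $dir 'file_split_001.csv')
'h', 'r4', 'r5' | Set-Content -LiteralPath (Join-Path $dir 'file_split_002.csv')

# Subtract 1 per file to exclude the header line.
$originalRows = (Get-LineCount (Join-Path $dir 'file.csv')) - 1
$chunkRows = 0
Get-ChildItem -LiteralPath $dir -Filter 'file_split_*.csv' | ForEach-Object {
    $chunkRows += (Get-LineCount $_.FullName) - 1
}
$rowsMatch = $originalRows -eq $chunkRows
"Original: $originalRows, chunks: $chunkRows, match: $rowsMatch"

Remove-Item -LiteralPath $dir -Recurse
```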