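json - How can I optimize this PowerShell script that converts JSON to CSV?
Problem description
I have a very large JSON Lines file with 4,000,000 rows, and I need to convert several events from each row. The resulting CSV file contains 15,000,000 rows. How can I optimize this script?
I'm using PowerShell Core 7, and the conversion takes about 50 hours to complete.
My PowerShell script: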
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$output = @()
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (test-path $Exportfile) {
    Remove-Item -path $Exportfile
}
foreach ($line in [System.IO.File]::ReadLines($Importfile, $encoding)) {
    $json = $line | ConvertFrom-Json
    foreach ($item in $json.events.items) {
        $CSVLine = [pscustomobject]@{
            Key = $json.Register.Key
            CompanyID = $json.id
            Eventtype = $item.type
            Eventdate = $item.date
            Eventdescription = $item.description
        }
        $output += $CSVLine
    }
    $i++
    $ig++
    if ($i -ge 30000) {
        $output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
        $i = 0
        $output = @()
        $minutes = $stopwatch.elapsed.TotalMinutes
        $percentage = $ig / $totalrows * 100
        $totalestimatedtime = $minutes * (100/$percentage)
        $timeremaining = $totalestimatedtime - $minutes
        Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
    }
}
$output | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
This is the structure of the JSON:
{
"id": "111111111",
"name": {
"name": "Test Company GmbH",
"legalForm": "GmbH"
},
"address": {
"street": "Berlinstr.",
"postalCode": "11111",
"city": "Berlin"
},
"status": "liquidation",
"events": {
"items": [{
"type": "Liquidation",
"date": "2001-01-01",
"description": "Liquidation"
}, {
"type": "NewCompany",
"date": "2000-01-01",
"description": "Neueintragung"
}, {
"type": "ControlChange",
"date": "2002-01-01",
"description": "Tested Company GmbH"
}]
},
"relatedCompanies": {
"items": [{
"company": {
"id": "2222222",
"name": {
"name": "Test GmbH",
"legalForm": "GmbH"
},
"address": {
"city": "Berlin",
"country": "DE",
"formattedValue": "Berlin, Deutschland"
},
"status": "active"
},
"roles": [{
"date": "2002-01-01",
"name": "Komplementär",
"type": "Komplementaer",
"demotion": true,
"group": "Control",
"dir": "Source"
}, {
"date": "2001-01-01",
"name": "Komplementär",
"type": "Komplementaer",
"group": "Control",
"dir": "Source"
}]
}, {
"company": {
"id": "33333",
"name": {
"name": "Test2 GmbH",
"legalForm": "GmbH"
},
"address": {
"city": "Berlin",
"country": "DE",
"formattedValue": "Berlin, Deutschland"
},
"status": "active"
},
"roles": [{
"date": "2002-01-01",
"name": "Komplementär",
"type": "Komplementaer",
"demotion": true,
"group": "Control",
"dir": "Source"
}, {
"date": "2001-01-01",
"name": "Komplementär",
"type": "Komplementaer",
"group": "Control",
"dir": "Source"
}]
}]
}
}
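Solution
As mentioned in the comments: try to avoid using the increase assignment operator (+=) to create a collection. A PowerShell array has a fixed size, so each += creates a new array and copies every existing element into it, which gets slower as the collection grows.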
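A rough way to see this on your own machine is to time both patterns on a small collection. This is only a sketch: the element count in $count is an arbitrary value chosen for illustration, not a figure from the question, and the size of the gap depends on the PowerShell version and on how large the collection gets.

```powershell
# Arbitrary illustrative size (an assumption, not taken from the question).
$count = 20000

# Pattern 1: grow an array with += inside the loop.
$slow = Measure-Command {
    $output = @()
    foreach ($n in 1..$count) { $output += [pscustomobject]@{ Id = $n } }
}

# Pattern 2: let PowerShell collect the loop's output in a single assignment.
$fast = Measure-Command {
    $output = foreach ($n in 1..$count) { [pscustomobject]@{ Id = $n } }
}

'+= took {0:N0} ms, collected output took {1:N0} ms' -f $slow.TotalMilliseconds, $fast.TotalMilliseconds
```

Both loops build the same objects; only the way they are gathered differs.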
Use the PowerShell pipeline instead, for example:
$stopwatch = [system.diagnostics.stopwatch]::StartNew()
$totalrows = 4000000
$encoding = [System.Text.Encoding]::UTF8
$i = 0
$ig = 0
$Importfile = "C:\file.jsonl"
$Exportfile = "C:\file.csv"
if (Test-Path $Exportfile) {
    Remove-Item -Path $Exportfile
}
Get-Content $Importfile -Encoding $encoding | ForEach-Object {
    # Parse the current line once
    $json = $_ | ConvertFrom-Json
    # Emit one row per event straight into the pipeline; nothing is accumulated in memory
    foreach ($item in $json.events.items) {
        [pscustomobject]@{
            Key = $json.Register.Key
            CompanyID = $json.id
            Eventtype = $item.type
            Eventdate = $item.date
            Eventdescription = $item.description
        }
    }
    $i++
    $ig++
    # Progress report every 30000 input lines (Write-Host does not write to the pipeline)
    if ($i -ge 30000) {
        $i = 0
        $minutes = $stopwatch.elapsed.TotalMinutes
        $percentage = $ig / $totalrows * 100
        $totalestimatedtime = $minutes * (100/$percentage)
        $timeremaining = $totalestimatedtime - $minutes
        Write-Host "Events: Total minutes passed: $minutes. Total minutes remaining: $timeremaining. Percentage: $percentage"
    }
} | Export-Csv -Path $Exportfile -NoTypeInformation -Delimiter ";" -Encoding UTF8 -Append
Write-Output $ig
$stopwatch.Stop()
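Before starting a run over the full file, it may be worth checking the pipeline approach on a few lines. The following is a minimal sketch under assumptions: the two sample records and the temporary file names are made up for illustration, and the Key column is left out because the sample JSON in the question has no Register property.

```powershell
# Hypothetical sample data: two one-line JSON records written to a temporary .jsonl file.
$importFile = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-events.jsonl'
$exportFile = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-events.csv'
@(
    '{"id":"1","events":{"items":[{"type":"A","date":"2001-01-01","description":"first"},{"type":"B","date":"2002-01-01","description":"second"}]}}'
    '{"id":"2","events":{"items":[{"type":"C","date":"2003-01-01","description":"third"}]}}'
) | Set-Content -Path $importFile -Encoding UTF8

# Stream line by line: parse, emit one row per event, and let Export-Csv write as rows arrive.
Get-Content -Path $importFile -Encoding UTF8 | ForEach-Object {
    $json = $_ | ConvertFrom-Json
    foreach ($item in $json.events.items) {
        [pscustomobject]@{
            CompanyID        = $json.id
            Eventtype        = $item.type
            Eventdate        = $item.date
            Eventdescription = $item.description
        }
    }
} | Export-Csv -Path $exportFile -NoTypeInformation -Delimiter ';' -Encoding UTF8

# Read the result back to inspect it.
$rows = Import-Csv -Path $exportFile -Delimiter ';'
$rows.Count   # one CSV row per event (3 for this sample)
```

Because Export-Csv receives the rows through the pipeline, it writes the header once and the file never has to be reopened with -Append.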
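Update 2020-05-07
Based on the comments and the extra information in the question, I have written a small reusable cmdlet that uses the PowerShell pipeline to read a .jsonl (JSON Lines) file. It collects each line until it finds one that ends with a closing '}' character, then checks whether the collected text is a valid JSON string (using Test-Json, because there might be embedded objects). If it is valid, it immediately releases the extracted object into the pipeline and starts collecting lines again: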
Function ConvertFrom-JsonLines {
    [CmdletBinding()][OutputType([Object[]])]Param (
        [Parameter(ValueFromPipeLine = $True, Mandatory = $True)][String]$Line
    )
    Begin { $JsonLines = [System.Collections.Generic.List[String]]@() }
    Process {
        $JsonLines.Add($Line)
        If ( $Line.Trim().EndsWith('}') ) {
            $Json = $JsonLines -Join [Environment]::NewLine
            If ( Test-Json $Json -ErrorAction SilentlyContinue ) {
                $Json | ConvertFrom-Json
                $JsonLines.Clear()
            }
        }
    }
}
You can use it like this:
Get-Content .\file.jsonl | ConvertFrom-JsonLines | ForEach-Object { $_.events.items } |
    Export-Csv -Path $Exportfile -NoTypeInformation -Encoding UTF8
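Note that this usage passes only the event objects on, so the CSV contains the type, date and description columns but not the company id. If the CompanyID column from the question's script is also wanted, one option is to build the rows inside ForEach-Object. The sketch below is an assumption-laden illustration: to stay runnable on its own, it feeds an inline sample through ConvertFrom-Json in place of ConvertFrom-JsonLines, and the sample is a shortened copy of the JSON from the question.

```powershell
# Inline sample standing in for one object emitted by ConvertFrom-JsonLines (assumed data).
$sample = @'
{
  "id": "111111111",
  "events": {
    "items": [
      { "type": "Liquidation", "date": "2001-01-01", "description": "Liquidation" },
      { "type": "NewCompany",  "date": "2000-01-01", "description": "Neueintragung" }
    ]
  }
}
'@

$rows = $sample | ConvertFrom-Json | ForEach-Object {
    # $_ is the company object; the foreach statement leaves $_ untouched.
    foreach ($item in $_.events.items) {
        [pscustomobject]@{
            CompanyID        = $_.id
            Eventtype        = $item.type
            Eventdate        = $item.date
            Eventdescription = $item.description
        }
    }
}

# Show the rows as CSV text; pipe to Export-Csv instead to write a file.
$rows | ConvertTo-Csv -NoTypeInformation -Delimiter ';'
```

In the real pipeline, the ForEach-Object block would take the place of `ForEach-Object { $_.events.items }` in the usage line.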