powershell - Converting an HTML table to CSV
Problem description
I am very new to powershell, and I am using this code by JohnLBevan to convert an HTML table to CSV:
function ConvertFrom-HtmlTableRow {
[CmdletBinding()]
param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true)]
$htmlTableRow
,
[Parameter(Mandatory = $false, ValueFromPipeline = $false)]
$headers
,
[Parameter(Mandatory = $false, ValueFromPipeline = $false)]
[switch]$isHeader
)
process {
$cols = $htmlTableRow | select -expandproperty td
if($isHeader.IsPresent) {
0..($cols.Count - 1) | %{$x=$cols[$_] | out-string; if(($x) -and ($x.Trim() -gt [string]::Empty)) {$x} else {("Column_{0:0000}" -f $_)}} #clean the headers to ensure each col has a name
} else {
$colCount = ($cols | Measure-Object).Count - 1
$result = new-object -TypeName PSObject
0..$colCount | %{
$colName = if($headers[$_]){$headers[$_]}else{("Column_{0:00000}" -f $_)} #in case we have more columns than headers
$colValue = $cols[$_]
$result | Add-Member NoteProperty $colName $colValue
}
write-output $result
}
}
}
function ConvertFrom-HtmlTable {
[CmdletBinding()]
param (
[Parameter(Mandatory = $true, ValueFromPipeline = $true)]
$htmlTable
)
process {
#currently only very basic <table><tr><td>...</td></tr></table> structure supported
#could be improved to better understand tbody, th, nested tables, etc
#$htmlTable.childNodes | ?{ $_.tagName -eq 'tr' } | ConvertFrom-HtmlTableRow
#remove any tags that aren't td or tr (simplifies our parsing of the data)
[xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>{0}</root>" -f ($htmlTable | select -ExpandProperty innerHTML | %{(($_ | out-string) -replace '(</?t[rdh])[^>]*(/?>)|(?:<[^>]*>)','$1$2') -replace '(</?)(?:th)([^>]*/?>)','$1td$2'}))
[string[]]$headers = $cleanedHtml.root.tr | select -first 1 | ConvertFrom-HtmlTableRow -isHeader
if ($headers.Count -gt 0) {
$cleanedHtml.root.tr | select -skip 1 | ConvertFrom-HtmlTableRow -Headers $headers | select $headers
}
}
}
However, whenever I run it on the "table" elements that I get from the parsedHTML variable via getElementsByTagName, I get this error:
Cannot convert value "<!DOCTYPE doctypeName [<!ENTITY nbsp ' '>]><root>
</root>" to type "System.Xml.XmlDocument". Error: "The 'Tr' start tag on line 16 position 124 does not match the end tag of 'td'. Line 20, position 3."
At line:108 char:9
+ [xml]$cleanedHtml = ("<!DOCTYPE doctypeName [<!ENTITY nbsp ' ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidArgument: (:) [], RuntimeException
+ FullyQualifiedErrorId : InvalidCastToXmlDocument
I hope someone can help me. Thanks in advance.
I am trying to use this against an external website. Here is the HTML code of the table:
<table class="organization-admin__table table">
<thead>
<tr>
<th colspan="2">Name</th>
<th>Email address</th>
<th>Timezone</th>
<th>Last logged in</th>
<th>Actions</th>
</tr>
</thead>
<tbody>
<tr>
<td width="48px">
<a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user1 <user1@site.com>" title="user1 <user1@site.com>">
<img src="https://portal.website.com/avatar/0fd7f51cee04789c617b1cc973e0b245.jpg?s=64&r=g&d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F87d37c%2Ffff%26text%3DTM" alt="user1 <user1@site.com>" width="32" height="32">
</a>
</td>
<td><a href="/site/users/samluser">user1</a></td>
<td><a href="mailto:user1@site.com">user1@site.com</a></td>
<td>Canada/Eastern</td>
<td>05 Aug 2021</td>
<td>
<ul class="button-group">
<li>
<a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
<i class="fa fa-pencil-alt"></i>
Edit
</a>
</li>
<li>
<a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
<i class="fa fa-trash-alt"></i>
Delete
</a>
</li>
</ul>
</td>
</tr>
<tr>
<td width="48px">
<a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user2 <user2@site.ca>" title="user2 <user2@site.ca>">
<img src="https://portal.website.com/avatar/481355c93fa79e47ca56110da63d6da5.jpg?s=64&r=g&d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2F044f67%2Ffff%26text%3DVS" alt="user2 <user2@site.ca>" width="32" height="32">
</a>
</td>
<td><a href="/site/users/samluser">user2</a></td>
<td><a href="mailto:user2@site.ca">user2@site.ca</a></td>
<td>Canada/Eastern</td>
<td>16 Jul 2021</td>
<td>
<ul class="button-group">
<li>
<a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
<i class="fa fa-pencil-alt"></i>
Edit
</a>
</li>
<li>
<a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
<i class="fa fa-trash-alt"></i>
Delete
</a>
</li>
</ul>
</td>
</tr>
<tr>
<td width="48px">
<a href="/site/users/samluser" class="avatar hz-hint hz-hint--bottom" data-hint="user3 <user3@site.com>" title="user3 <user3@site.com>">
<img src="https://portal.website.com/avatar/450f564aaba30e75fe70dc5f4bbefaf6.jpg?s=64&r=g&d=https%3A%2F%2Fportal.website.com%2Fplaceholders%2F64%2Fffb61e%2Ffff%26text%3DWP" alt="Wilfred <user3@site.com>" width="32" height="32">
</a>
</td>
<td><a href="/site/users/samluser">Wilfred</a></td>
<td><a href="mailto:user3@site.com">Wilfred@site.com</a></td>
<td>UTC</td>
<td>26 Jul 2021</td>
<td>
<ul class="button-group">
<li>
<a href="/site/users/samluser/edit" class="btn btn-sm btn-primary">
<i class="fa fa-pencil-alt"></i>
Edit
</a>
</li>
<li>
<a href="/site/users/samluser/delete" class="btn btn-sm btn-danger">
<i class="fa fa-trash-alt"></i>
Delete
</a>
</li>
</ul>
</td>
</tr>
</tbody>
</table>
Solution
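As commented, converting to XML has strict rules: when the HTML omits a closing tag such as </tr>, loading it as xml will fail. The same goes for <img> tags that have no closing </img> tag.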
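The strictness described above can be reproduced in isolation. The following is only a minimal sketch: the helper name Test-WellFormedXml and the two sample strings are made up for illustration and are not part of the original code.

```powershell
# Hypothetical helper: returns $true when the text can be cast to [xml], $false otherwise
function Test-WellFormedXml {
    param([string]$Text)
    try {
        $null = [xml]$Text   # the cast throws when the markup is not well-formed XML
        return $true
    }
    catch {
        return $false
    }
}

# <img /> is self-closed here, so this fragment is well-formed XML
$closed   = '<root><tr><td><img src="a.jpg" /></td></tr></root>'
# typical HTML: <img> has no closing tag, which the XML parser rejects
$unclosed = '<root><tr><td><img src="a.jpg"></td></tr></root>'

Test-WellFormedXml $closed
Test-WellFormedXml $unclosed
```

The first call should report that the text loads and the second that it does not, which is the same kind of failure as the InvalidCastToXmlDocument error in the question.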
I don't have the complete html you are loading, but perhaps try the function below:
function ConvertFrom_HtmlTable {
# adapted from: https://www.leeholmes.com/blog/2015/01/05/extracting-tables-from-powershells-invoke-webrequest/
[CmdletBinding(DefaultParameterSetName = 'ByIndex')]
param(
[Parameter(Mandatory = $true, Position = 0)]
[Microsoft.PowerShell.Commands.HtmlWebResponseObject]$WebRequest,
[Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
[int]$TableIndex = 0,
[Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
[string]$TableId,
[Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
[string]$TableName,
[Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
[string]$TableClassName
)
# Extract the table out of the web request
switch ($PSCmdlet.ParameterSetName) {
'ById' { $table = $WebRequest.ParsedHtml.getElementByID($TableId) }
'ByIndex' { $table = @($WebRequest.ParsedHtml.getElementsByTagName('table'))[$TableIndex]}
'ByName' { $table = @($WebRequest.ParsedHtml.getElementsByName($TableName))[0] }
'ByClass' { $table = @($WebRequest.ParsedHtml.getElementsByClassName($TableClassName))[0] }
}
if (!$table) {
Write-Warning "Could not find the given table."
return $null
}
# load the System.Web assembly to be able to decode HTML entities
Add-Type -AssemblyName System.Web
$headers = @()
# Go through all of the rows in the table
foreach ($row in $table.Rows) {
$cells = @($row.Cells)
# If there is a table header, remember its titles
if($cells[0].tagName -eq "TH") {
$i = 0
$headers = @($cells | ForEach-Object {
$i++
# decode HTML entities and double-up quotes that the value may contain
$th = ([System.Web.HttpUtility]::HtmlDecode($_.InnerText) -replace '"', '""').Trim()
# if the table header is empty, create it
if ([string]::IsNullOrEmpty($th)) { "H$i" } else { $th }
})
# proceed with the next row
continue
}
# if we haven't found any table headers, make up names "H1", "H2", etc.
if(-not $headers) {
$headers = @(1..($cells.Count + 2) | ForEach-Object { "H$_" })
}
# Now go through the cells in the row. For each, try to find the
# title that represents that column and create a hashtable mapping those
# titles to content
$hash = [Ordered]@{}
for ($i = 0; $i -lt $cells.Count; $i++) {
# decode HTML entities and double-up quotes that the value may contain
$value = ([System.Web.HttpUtility]::HtmlDecode($cells[$i].InnerText) -replace '"', '""').Trim()
$th = $headers[$i]
$hash[$th] = $value.Trim()
}
# And finally cast that hashtable to a PSCustomObject
[PSCustomObject]$hash
}
}
Call it like this:
$request = Invoke-WebRequest $uri
$table = ConvertFrom_HtmlTable -WebRequest $request -TableClassName 'organization-admin__table table'
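or, if you know it is the first or the x-th table in the html, use the TableIndex parameter, since the table apparently has no id or name.
If that succeeds, you can simply write to csv: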
$table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation
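Before writing the file, it can help to preview what the CSV lines will look like. ConvertTo-Csv (like Export-Csv) quotes each field and doubles embedded double quotes on its own. A minimal sketch, using a made-up sample row rather than data from the site:

```powershell
# Hypothetical sample row; the values are assumptions for illustration only
$row = [PSCustomObject]@{
    Name  = 'user "one"'
    Email = 'user1@site.com'
}

# ConvertTo-Csv returns the CSV text as strings instead of writing a file,
# so the quoting can be inspected first
$lines = @($row | ConvertTo-Csv -NoTypeInformation)
$lines
```

If the values in the resulting file show more quotes than expected, the manual -replace '"', '""' inside the function may be worth revisiting, since the cmdlet already escapes them.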
From your comments, it appears that for some reason you cannot use Invoke-WebRequest and have to parse using the IE COM object.
Try this version of the function:
function ConvertFrom_HtmlTable {
[CmdletBinding(DefaultParameterSetName = 'ByIndex')]
param(
[Parameter(ValueFromPipeline = $true, Mandatory = $true, Position = 0)]
[string]$Url,
[Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByIndex')]
[int]$TableIndex = 0,
[Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ById')]
[string]$TableId,
[Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByName')]
[string]$TableName,
[Parameter(Mandatory = $false, Position = 1, ParameterSetName = 'ByClass')]
[string]$TableClassName,
[switch]$FirstRowHasHeaders
)
$ie = New-Object -ComObject 'InternetExplorer.Application'
$ie.Visible = $false
$ie.Silent = $true
$ie.Navigate($Url)
# wait for IE to fully load the document
while($ie.Busy) { Start-Sleep -Milliseconds 100 }
$doc = $ie.Document
switch ($PSCmdlet.ParameterSetName) {
'ById' { $table = $doc.IHTMLDocument3_getElementByID($TableId) }
'ByIndex' { $table = @($doc.IHTMLDocument3_getElementsByTagName('table'))[$TableIndex]}
'ByName' { $table = @($doc.IHTMLDocument3_getElementsByName($TableName))[0] }
'ByClass' { $table = @($doc.IHTMLDocument3_getElementsByClassName($TableClassName))[0] }
}
if ($table) {
# Extracting table rows as a collection.
$tbody = $table.childNodes | Where-Object { $_.tagName -eq "tbody" }
if ($tbody) {
$rows = $tbody.childNodes | Where-Object { $_.tagName -eq "tr" }
}
else {
$rows = $table.childNodes | Where-Object { $_.tagName -eq "tr" }
}
# read or create table headers
# assume the first row has headers either in <th> or <td> tags
$firstRow = 1
$headers = @($rows[0].childNodes | Where-Object { $_.tagName -eq "th" } | Foreach-Object { $_.innerHTML })
if (!($headers)) {
# there were no <th> tags found, so either use the first row as headers or create from scratch
$values = @($rows[0].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
if ($FirstRowHasHeaders) {
# the headers are considered to be the values from the first row
$headers = $values
}
else {
# the table has no headers, so dynamically create them
$firstRow = 0
$headers = for ($i = 1; $i -le $values.Count; $i++) { "Column_$i" }
}
}
# create a List object to store the values found as PSObjects
$result = [System.Collections.Generic.List[object]]::new()
for ($i = $firstRow; $i -lt $rows.Count; $i++) {
$values = @($rows[$i].childNodes | Where-Object { $_.tagName -eq "td" } | Foreach-Object { $_.innerHTML })
$valuesCount = $values.Count
while ($headers.Count -lt $valuesCount) {
$colName = "Column_{0}" -f ($headers.Count + 1)
$headers += $colName
# we have just added a new header column. Make sure the first item also has this new column
if ($result.Count) {
$result[0] | Add-Member -MemberType NoteProperty -Name $colName -Value $null
}
}
# create an ordered Hashtable to store the values
$data = [ordered]@{}
for ($j = 0; $j -lt $valuesCount; $j++) { $data[$headers[$j]] = $values[$j] }
# add the hash cast to PsCustomObject to the list
$result.Add(([PsCustomObject]$data))
}
}
else { Write-Warning "Could not find the given table." }
# quit IE and clean up
$ie.Quit()
$null = [System.Runtime.Interopservices.Marshal]::ReleaseComObject($ie)
[System.GC]::Collect()
[System.GC]::WaitForPendingFinalizers()
return $result
}
$table = ConvertFrom_HtmlTable -Url 'your URL here' -TableClassName 'organization-admin__table'
$table | Export-Csv -Path 'X:\path\to\theTable.csv' -NoTypeInformation
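Note that this second function stores the innerHTML of each cell, so values from a table like the one in the question may still contain markup such as <a href="..."> tags. As a rough sketch only, the cells could be reduced to plain text before exporting. The helper name ConvertTo-PlainCellText and the sample row are made up for illustration, and the regex is a crude tag stripper that would be confused by attribute values containing '>'.

```powershell
# Hypothetical helper: strips tags, decodes HTML entities and collapses whitespace
function ConvertTo-PlainCellText {
    param([string]$Html)
    $text = $Html -replace '<[^>]+>', ''                  # crude removal of anything that looks like a tag
    $text = [System.Net.WebUtility]::HtmlDecode($text)    # &amp; becomes &, &nbsp; becomes a non-breaking space, etc.
    ($text -replace '\s+', ' ').Trim()                    # collapse line breaks and runs of whitespace
}

# Sample row shaped like the output of the function above (assumed values)
$sample = [PSCustomObject]@{
    Name  = '<a href="/site/users/samluser">user1</a>'
    Email = '<a href="mailto:user1@site.com">user1@site.com</a>'
}

# Rebuild the row with every property value cleaned
$clean = [ordered]@{}
foreach ($property in $sample.PSObject.Properties) {
    $clean[$property.Name] = ConvertTo-PlainCellText $property.Value
}
$cleanRow = [PSCustomObject]$clean
$cleanRow
```

The same loop could be applied to each object in $table before piping the result to Export-Csv.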
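The second function, which uses the COM object InternetExplorer.Application, needs to find the table object through the DOM. For that, the function currently uses the IHTMLDocument3 interface, which works when I test on Windows 10 Pro, PowerShell 5.1 and IE version 11.789.19041.0 with, for example: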
ConvertFrom_HtmlTable -Url 'https://www.w3schools.com/html/html_tables.asp' -TableId 'customers'
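According to your comment, you are receiving this error message:
Method invocation failed because [mshtml.HTMLDocumentClass] does not contain a method named 'IHTMLDocument3_getElementsByClassName'.
This means you have a different (not updated or corrupted) version on your machine, and you will have to find out for yourself which method works:
- First, check which IE version you have by typing this in a PowerShell console: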
(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').SvcVersion
If that returns blank, try
(Get-ItemProperty 'HKLM:\SOFTWARE\Microsoft\Internet Explorer').Version
- Next, inside the switch, change the method used from
$doc.IHTMLDocument3_getElementsByClassName($TableClassName)
to
$doc.getElementsByClassName($TableClassName)
or
$doc.documentElement.getElementsByClassName($TableClassName)
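Instead of editing the switch by hand for each attempt, the three variants listed above could be tried in order. This is only a sketch: the function name Get-TableByClassName is made up, and which call succeeds depends on the IE/mshtml build on the machine.

```powershell
# Hypothetical helper: tries each known way of calling getElementsByClassName
# and returns the first matching element, or $null when none of them work
function Get-TableByClassName {
    param(
        $Document,
        [string]$ClassName
    )
    $attempts = @(
        { param($d, $c) $d.IHTMLDocument3_getElementsByClassName($c) },
        { param($d, $c) $d.getElementsByClassName($c) },
        { param($d, $c) $d.documentElement.getElementsByClassName($c) }
    )
    foreach ($attempt in $attempts) {
        try {
            $found = @(& $attempt $Document $ClassName)
            if ($found.Count -gt 0) { return $found[0] }
        }
        catch {
            # this method is not available on this object; fall through to the next variant
        }
    }
    return $null
}
```

Inside the function, the 'ByClass' branch could then read $table = Get-TableByClassName -Document $doc -ClassName $TableClassName.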
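If all of that fails, I'm afraid something is seriously wrong on your computer (perhaps that is also the reason why Invoke-WebRequest does not work?). Try repairing it with sfc /scannow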