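vba - Is there a loop type, function, or method in VBA to clean up nested For Each loops over the nodes of an HTML file?
Problem description
I am trying to find a way to simplify multiple nested For Each loops. Once I started adding code to actually process the parsed data, I realized the current structure needs work.
I already have a working version that uses the Internet Explorer reference, but my goal is to avoid any additional references because that is faster. I also hope to use this on a Mac someday. I am coding in Excel so I can see what I get while working on it. The final version will actually run in PowerPoint.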
Sub TestHTML()
  'Load Document
  Set objDocument = CreateObject("MSXML2.DOMDocument")
  objDocument.async = False: objDocument.validateOnParse = False
  'ThisWorkbook.Path has no trailing separator, so add one before the file name
  objDocument.Load (ThisWorkbook.Path & "\ThisFile.html")
  Set ZeroNode = objDocument.DocumentElement
  'Set Rows and Columns
  intRow = 0
  intColAttribute = 1
  intColTag = 2
  intColText = 3
  'Loop through Nodes
  For Each OneNode In ZeroNode.ChildNodes
    If OneNode.HasChildNodes() Then
      For Each TwoNode In OneNode.ChildNodes
        If TwoNode.HasChildNodes() Then
          For Each ThreeNode In TwoNode.ChildNodes
            If ThreeNode.HasChildNodes() Then
              For Each FourNode In ThreeNode.ChildNodes
                If FourNode.HasChildNodes() Then
                  For Each FiveNode In FourNode.ChildNodes
                    If FiveNode.HasChildNodes() Then
                      For Each SixNode In FiveNode.ChildNodes
                        If SixNode.HasChildNodes() Then
                          For Each SevenNode In SixNode.ChildNodes
                            intRow = intRow + 1
                            If SixNode.Attributes.Length > 0 Then Worksheets("Test").Cells(intRow, intColAttribute) = SixNode.Attributes(0).Text
                            Worksheets("Test").Cells(intRow, intColTag) = SevenNode.BaseName
                            Worksheets("Test").Cells(intRow, intColText) = SevenNode.Text
                          Next SevenNode
                        Else 'SixNode.HasChildNodes()
                          intRow = intRow + 1
                          If FiveNode.Attributes.Length > 0 Then Worksheets("Test").Cells(intRow, intColAttribute) = FiveNode.Attributes(0).Text
                          Worksheets("Test").Cells(intRow, intColTag) = SixNode.BaseName
                          Worksheets("Test").Cells(intRow, intColText) = SixNode.Text
                        End If 'SixNode.HasChildNodes()
                      Next SixNode
                    Else 'FiveNode.HasChildNodes()
                      intRow = intRow + 1
                      If FourNode.Attributes.Length > 0 Then Worksheets("Test").Cells(intRow, intColAttribute) = FourNode.Attributes(0).Text
                      Worksheets("Test").Cells(intRow, intColTag) = FiveNode.BaseName
                      Worksheets("Test").Cells(intRow, intColText) = FiveNode.Text
                    End If 'FiveNode.HasChildNodes()
                  Next FiveNode
                Else 'FourNode.HasChildNodes()
                  intRow = intRow + 1
                  If ThreeNode.Attributes.Length > 0 Then Worksheets("Test").Cells(intRow, intColAttribute) = ThreeNode.Attributes(0).Text
                  Worksheets("Test").Cells(intRow, intColTag) = FourNode.BaseName
                  Worksheets("Test").Cells(intRow, intColText) = FourNode.Text
                End If 'FourNode.HasChildNodes()
              Next FourNode
            Else 'ThreeNode.HasChildNodes()
              intRow = intRow + 1
              If TwoNode.Attributes.Length > 0 Then Worksheets("Test").Cells(intRow, intColAttribute) = TwoNode.Attributes(0).Text
              Worksheets("Test").Cells(intRow, intColTag) = ThreeNode.BaseName
              Worksheets("Test").Cells(intRow, intColText) = ThreeNode.Text
            End If 'ThreeNode.HasChildNodes()
          Next ThreeNode
        Else 'TwoNode.HasChildNodes()
          intRow = intRow + 1
          If OneNode.Attributes.Length > 0 Then Worksheets("Test").Cells(intRow, intColAttribute) = OneNode.Attributes(0).Text
          Worksheets("Test").Cells(intRow, intColTag) = TwoNode.BaseName
          Worksheets("Test").Cells(intRow, intColText) = TwoNode.Text
        End If 'TwoNode.HasChildNodes()
      Next TwoNode
    Else 'OneNode.HasChildNodes()
      intRow = intRow + 1
      Worksheets("Test").Cells(intRow, intColTag) = OneNode.BaseName
      Worksheets("Test").Cells(intRow, intColText) = OneNode.Text
    End If 'OneNode.HasChildNodes()
  Next OneNode
  Set objDocument = Nothing
End Sub
Here is a sample HTML file:
<?xml version='1.0' encoding='utf-8'?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Title</title>
<meta content="http://www.w3.org/1999/xhtml; charset=utf-8" http-equiv="Content-Type"/>
<link href="stylesheet.css" type="text/css" rel="stylesheet"/></head>
<body class="c0">
<div class="sheader" id="c_pb_21">
<span class="snumber">1</span>
<span class="stitle">Title</span>
<div class="sinfo">
InfoLine1 <br class="c1"/>
InfoLine2
</div>
</div>
<div class="sbody">
<p class="left">Intro</p>
<dl class="v">
<dt class="vnumber">1.</dt>
<dd class="vbody">
VLine1<br class="c1"/>
VLine2<br class="c1"/>
VLine3<br class="c1"/>
VLine4<p class="c6"/>
<p class="c6">VLine6<br class="c1"/>
VLine7<br class="c1"/>
VLine8<br class="c1"/>
VLine9</p>
<p class="c6">VLine11<br class="c1"/>
VLine12<br class="c1"/>
VLine13<br class="c1"/>
VLine14<br class="c1"/>
VLine15<br class="c1"/>
VLine16</p></dd>
</dl>
<dl class="v">
<dt class="vnumber">2.</dt>
<dd class="vbody">
VLine1<br class="c1"/>
VLine2<br class="c1"/>
VLine3<br class="c1"/>
VLine4<p class="c6"/>
<p class="c6">VLine6<br class="c1"/>
VLine7<br class="c1"/>
VLine8<br class="c1"/>
VLine9</p>
<p class="c6">VLine11<br class="c1"/>
VLine12<br class="c1"/>
VLine13<br class="c1"/>
VLine14<br class="c1"/>
VLine15<br class="c1"/>
VLine16</p></dd>
</dl>
<dl class="v">
<dt class="vnumber"> </dt>
<dd class="cs">
CLine1<br class="c1"/>
CLine2<br class="c1"/>
CLine3<br class="c1"/>
CLine4</dd>
</dl>
</div>
</body></html>
Here is what I am trying to extract from this HTML:
snumber: 1
stitle: Title
sinfo[Line1]: InfoLine1
sinfo[Line2]: InfoLine2
left: Intro
v[1](vnumber): 1
v[1](TYPE): vbody << TYPE is from the class name
v[1](Line1): VLine1 << vbody is split at the <br class="c1"/>
v[1](Line2): VLine2
v[1](Line3): VLine3
v[1](Line4): VLine4
v[1][1](Line1): VLine6 << <p class="c6"> needs to be identified, yet <dd class="vbody"> continues
v[1][1](Line2): VLine7
v[1][1](Line3): VLine8
v[1][1](Line4): VLine9
v[1][2](Line1): VLine11
v[1][2](Line2): VLine12
...
v[2][2](Line6): VLine16
v[3](vnumber):
v[3](TYPE): cs << TYPE is from the class name
v[3](Line1): CLine1
v[3](Line2): CLine2
v[3](Line3): CLine3
v[3](Line4): CLine4
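The TestHTML code works; I am just trying to clean it up so that it is easier to work with.
My end goal is to be able to take several kinds of HTML files and "convert" them to PowerPoint. I have already done this another way for this sample document. The code here helps me see what can be extracted, but the next step, actually using that information, is where it gets difficult.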
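One possible way to get the Line1, Line2, ... split shown in the desired output is to walk a node's direct children and start a new line at every <br/> element. The following is only a sketch under assumptions, not code from the post: SplitAtBreaks, CleanText, and DemoSplit are hypothetical names, the inline XML string stands in for the real file, and it assumes Windows with MSXML available through late binding. Nested elements such as <p class="c6"> are skipped here so that they could be handled by a separate call on that element.

```vba
' Hypothetical helper: replaces line breaks and tabs with spaces, then trims.
Function CleanText(ByVal strText As String) As String
    strText = Replace(strText, vbCr, " ")
    strText = Replace(strText, vbLf, " ")
    strText = Replace(strText, vbTab, " ")
    CleanText = Trim$(strText)
End Function

' Hypothetical helper: returns the text of Node's direct text children,
' split into separate entries wherever a <br/> element occurs.
Function SplitAtBreaks(ByVal Node As Object) As Collection
    Const NODE_ELEMENT As Long = 1   ' DOM node type of an element
    Const NODE_TEXT As Long = 3      ' DOM node type of a text node
    Dim colLines As New Collection
    Dim ChildNode As Object
    Dim strLine As String

    For Each ChildNode In Node.childNodes
        If ChildNode.nodeType = NODE_TEXT Then
            strLine = strLine & ChildNode.nodeValue
        ElseIf ChildNode.nodeType = NODE_ELEMENT Then
            If ChildNode.baseName = "br" Then
                ' A <br/> closes the current line; empty lines are dropped.
                If Len(CleanText(strLine)) > 0 Then colLines.Add CleanText(strLine)
                strLine = ""
            End If
        End If
    Next ChildNode
    ' Keep whatever text follows the last <br/>.
    If Len(CleanText(strLine)) > 0 Then colLines.Add CleanText(strLine)
    Set SplitAtBreaks = colLines
End Function

' Hypothetical demo on an inline fragment modeled on the <dd class="cs"> block.
Sub DemoSplit()
    Dim objDocument As Object
    Dim varLine As Variant
    Dim lngLine As Long

    Set objDocument = CreateObject("MSXML2.DOMDocument")
    objDocument.async = False
    If Not objDocument.loadXML("<dd class=""cs"">CLine1<br class=""c1""/>CLine2<br class=""c1""/>CLine3</dd>") Then
        Debug.Print "Parse error: " & objDocument.parseError.reason
        Exit Sub
    End If

    For Each varLine In SplitAtBreaks(objDocument.documentElement)
        lngLine = lngLine + 1
        Debug.Print "Line" & lngLine & ": " & varLine
    Next varLine
End Sub
```

Returning a Collection keeps the splitting separate from whatever is done with the lines afterwards, whether that is writing to a worksheet or building a slide.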
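I am new to programming, but I have written a lot of code. This is my first forum post.
Solution
I figured out what I originally wanted to do. While doing more research I found an example that traverses folders. What I learned from it is that a Sub can call itself (recursion). That makes it possible to clean up the code. See the code below: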
Public intRow As Integer
Public intColAttribute As Integer
Public intColTag As Integer
Public intColText As Integer

Sub TestHTML()
  'Load Document
  Set objDocument = CreateObject("MSXML2.DOMDocument")
  objDocument.async = False: objDocument.validateOnParse = False
  objDocument.Load (ThisWorkbook.Path & "\ThisFile.html")
  Set ParentNode = objDocument.DocumentElement
  'Set Rows and Columns
  intRow = 1
  intColAttribute = 1
  intColTag = 2
  intColText = 3
  'Loop through Nodes
  If Not ParentNode Is Nothing Then
    TraverseNodes ParentNode
  End If 'Not ParentNode
End Sub

Sub TraverseNodes(ParentNode)
  For Each ChildNode In ParentNode.ChildNodes
    If ChildNode.HasChildNodes() Then
      'Recurse: the Sub calls itself for every node that has children
      TraverseNodes ChildNode
    Else 'ChildNode.HasChildNodes()
      intRow = intRow + 1
      Debug.Print intRow
      If ParentNode.Attributes.Length > 0 Then
        ' Here is where I can decide what to do with the Class Name
        Worksheets("Test").Cells(intRow, intColAttribute) _
          = ParentNode.Attributes(0).Text
      End If 'ParentNode.Attributes.Length
      ' Here is where I can decide what to do with the Tag Name and Text
      Worksheets("Test").Cells(intRow, intColTag) = ChildNode.BaseName
      Worksheets("Test").Cells(intRow, intColText) = ChildNode.Text
    End If 'ChildNode.HasChildNodes()
  Next ChildNode
End Sub
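The same recursion can also be written without the Public counters by passing the depth and an output collection as arguments. This is a minimal sketch rather than the code from the post: DemoTraverse, CollectLeaves, and the inline XML string are hypothetical, and it assumes Windows with MSXML available through late binding.

```vba
Option Explicit

' Hypothetical demo: walks a small inline document and prints each leaf node.
Sub DemoTraverse()
    Dim objDocument As Object
    Dim colLeaves As Collection
    Dim varLeaf As Variant

    Set objDocument = CreateObject("MSXML2.DOMDocument")
    objDocument.async = False
    objDocument.validateOnParse = False

    ' loadXML parses a string, so no file is needed for this sketch.
    If Not objDocument.loadXML("<div class=""a""><span class=""b"">1</span><p>Intro</p></div>") Then
        Debug.Print "Parse error: " & objDocument.parseError.reason
        Exit Sub
    End If

    Set colLeaves = New Collection
    CollectLeaves objDocument.documentElement, 0, colLeaves

    For Each varLeaf In colLeaves
        Debug.Print varLeaf
    Next varLeaf
End Sub

' Recursive helper: it calls itself once per node that has children, so the
' nesting depth of the document no longer dictates the nesting of the code.
Sub CollectLeaves(ByVal ParentNode As Object, ByVal lngDepth As Long, ByVal colLeaves As Collection)
    Dim ChildNode As Object
    Dim strClass As String

    For Each ChildNode In ParentNode.childNodes
        If ChildNode.hasChildNodes() Then
            CollectLeaves ChildNode, lngDepth + 1, colLeaves
        Else
            ' As in the post, the first attribute of the parent is used as the label.
            strClass = ""
            If ParentNode.Attributes.Length > 0 Then strClass = ParentNode.Attributes(0).Text
            ' One tab-separated entry per leaf: depth, parent attribute, text.
            colLeaves.Add lngDepth & vbTab & strClass & vbTab & ChildNode.Text
        End If
    Next ChildNode
End Sub
```

Passing the depth as a parameter makes the nesting level available at each leaf, which may help when building indexed labels such as v[1][2] from the question. Using Long instead of Integer for counters also avoids an overflow once a count passes 32,767.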