首页 > 解决方案 > VBA 中的网页抓取,其中某些 HTML 信息无法引用

问题描述

我有这个 VBA 脚本从这个 URL https://accessgudid.nlm.nih.gov/devices/10806378034350

我想要下图中的 LOT、SERIAL 和 EXPIRATION 信息,在 HTML 中具有“是”或“否”。

我如何只返回是或否信息?

HTML 剪辑

Sub GetInnerInformation(HTMLPage As MSHTML.HTMLDocument)
    Dim HTMLResult As MSHTML.IHTMLElement
    Dim HTMLResults As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer

    Set HTMLResults = HTMLPage.getElementsByClassName("device-attribute")

    For Each HTMLResult In HTMLResults
        If (HTMLResult.innerText Like "*Lot*") = True Then
            Debug.Print HTMLResult.innerText, HTMLResult.outerText, HTMLResult.innerHTML
        End If
    Next HTMLResult

End Sub

在我的即时窗口中,我得到:

Lot or Batch Number:        Lot or Batch Number:        Lot or Batch Number:

所以不要引用 HTML 中的 Yes 或 No。

标签: htmlexcelvbaweb-scraping

解决方案


HTML解析器:

您可以使用 css属性 = 值选择器来定位感兴趣的span[?] 之前的目标div。然后爬到共享的父节点parentElement,并移动到div感兴趣的节点NextSibling。然后,您可以使用getElementsByTagName来获取labels节点,并循环该 nodeList 以写出所需的信息。要获取与标签关联的值,您再次需要使用NextSibling来处理brparent 中的子项div

我使用 xmlhttp 发出比打开浏览器更快的请求。

Option Explicit   
Public Sub WriteOutYesNos()
    Dim html As MSHTML.HTMLDocument

    Set html = New MSHTML.HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://accessgudid.nlm.nih.gov/devices/10806378034350", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim nodes As Object, i As Long

    Set nodes = html.querySelector("[title*='A production identifier (PI) is a variable']").parentElement.NextSibling.getElementsByTagName("LABEL")

    For i = 0 To nodes.Length - 3
        With ActiveSheet
            .Cells(i + 1, 1) = nodes(i).innerText
            .Cells(i + 1, 2) = nodes(i).NextSibling.NodeValue
        End With
    Next
End Sub

JSON解析器:

数据也可以作为 json 使用,这意味着您可以使用 json 解析器来处理。我使用 jsonconverter.bas 作为 json 解析器来处理响应。从这里下载原始代码并添加到名为JsonConverter. 然后,您需要转到 VBE > 工具 > 参考 > 添加对Microsoft Scripting Runtime. Attribute从复制的代码中删除第一行。

Option Explicit   
Public Sub WriteOutYesNos()
    Dim json As Object, ws As Worksheet, results(), i As Long, s As String

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    results = Array("lotBatch", "serialNumber", "manufacturingDate")

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://accessgudid.nlm.nih.gov/devices/10806378034350.json", False
        .send
        Set json = JsonConverter.ParseJson(.responseText)
    End With

    With ws
        For i = LBound(results) To UBound(results)
            .Cells(i + 1, 1) = results(i)
            .Cells(i + 1, 2).Value = IIf(json(results(i)), "Yes", "No")
        Next
    End With
End Sub

XML解析器:

结果也以 xml 形式出现,只要您适当地处理默认命名空间,您就可以使用 xml 解析器对其进行解析:

Option Explicit
Public Sub WriteOutYesNos()
    Dim xmlDoc As Object, ws As Worksheet, results(), i As Long

    Set xmlDoc = CreateObject("MSXML2.DOMDocument")
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    results = Array("lotBatch", "serialNumber", "manufacturingDate")

    With xmlDoc
        .validateOnParse = True
        .setProperty "SelectionLanguage", "XPath"
        .setProperty "SelectionNamespaces", "xmlns:i='http://www.fda.gov/cdrh/gudid'"
        .async = False

        If Not .Load("https://accessgudid.nlm.nih.gov/devices/10806378034350.xml") Then
            Err.Raise .parseError.ErrorCode, , .parseError.reason
            Exit Sub
        End If
    End With

    With ws
        For i = LBound(results) To UBound(results)
            .Cells(i + 1, 1) = results(i)
            .Cells(i + 1, 2).Value = IIf(xmlDoc.SelectSingleNode("//i:" & results(i)).Text, "Yes", "No")
        Next
    End With
End Sub

推荐阅读