首页 > 解决方案 > 如何使用 Excel 宏抓取嵌套在 Div id 中的 Table 类的第一列

问题描述

好的,这是目标网页:http: //dnd.arkalseif.info/items/index.html_page=27

这是我当前的代码:

Sub GetItemsList()
' This macro uses manually entered links to scrap the content of the target page.
' It does not (yet) capture hyperlinks, it only grabs text.
Dim ie As Object
Dim retStr As String
Dim sht As Worksheet
Dim LastRow As Long
Dim rCell As Range
Dim rRng As Range
Dim Count As Long
Dim Status As String
Dim BadCount As Long


Set sht = ThisWorkbook.Worksheets("List")
BadCount = 0

LastRow = sht.Cells(sht.Rows.Count, "A").End(xlUp).Row
    Set ie = CreateObject("internetexplorer.application")
    Set rRng = sht.Range("b1:b" & LastRow)
    Status = "Starting at row "
    For Each rCell In rRng.Cells
        Count = rCell.Row
        Application.StatusBar = BadCount & " dead links so far. " & Status & Count & "of " & LastRow & "."
        Wait 1
        If rCell = "" Then
            With ie
                .Navigate rCell.Offset(0, -1).Value
                .Visible = False
            End With
            Do While ie.Busy
                DoEvents
            Loop
            Wait 1

            On Error GoTo ErrHandler
'            rCell.Value = ie.Document.getElementById("content").innerText
            rCell.Value = ie.Document.getElementsByClassName("common").innerText
            rCell.WrapText = False
            Status = "This row successfully scraped. Moving on to row "
            Application.StatusBar = BadCount & " dead links so far. " & Status & Count + 1 & "of " & LastRow & "."
            Status = "Previous row succeded. Now at row "
98            Wait 1
        End If
    Next rCell
    If BadCount > 0 Then
        Application.StatusBar = "Macro finshed running with " & BadCount & " errors."
        Else
        Application.StatusBar = "Finished."
    End If
    Exit Sub
ErrHandler:
    rCell.Value = ""
    Status = "Previous row failed. Moving on to row "
    BadCount = BadCount + 1
    Application.StatusBar = "This row is a dead link. " & BadCount & " dead links so far. Moving on to row " & Count + 1 & "of " & LastRow & "."
    Resume 98
End Sub

(尝试忽略我所有的 StatusBar 更新,此代码最初是用于超链接的 looooong 列表,我需要(当时)知道什么时候出现问题)

现在,注释掉的行起作用了,因为它从div id Content中获取了整个文本。但是我想获取嵌套在表的第一列中的超链接,该表嵌套在表的第一列中div id(这就是下一行的用途)。但它只是失败了。Excel 什么也不做,将其视为错误,然后继续执行下一个链接。

我想我需要告诉ExcelTable class . Div id但我不知道该怎么做,我也无法弄清楚。

感谢大家。

标签: htmlexcelvbaweb-scraping

解决方案


我会使用 CSS 选择器来定位链接,并使用XMLHTTP作为比启动浏览器更快的检索方法。


CSS 选择器:

以下:

td:first-child [href]

td:first-child 是标记元素的CSS伪:first-child类选择器;后代组合选择器,是属性选择器。基本上,在这种情况下,它选择每一行中的第一个元素,即第一列,然后选择其中的属性元素。td" "[]tdhref

:first-child CSS 伪类表示一组兄弟元素中的第一个元素。

遗憾的是,VBA 实现不支持:not选择器,因为确切的元素也可以与 .common tr + tr td :not([href*='rule'],br). 对伪选择器的支持非常有限。在这种情况下,如果后代组合:nth-child()支持. 我一直想写下支持和不支持的内容,以防有人想要作为参考。如果您随后选择切换到支持的语言,请注意甚至不支持 VBA 的方法。td:nth-child(1)td:nth-child(1) [href]

querySelectorAll在这种情况下,选择器是通过 的方法应用的HTMLDocument。它将所有匹配项返回为可以通过索引访问单个匹配元素的匹配项nodeList.Length

节点列表项:

在此处输入图像描述


Option Explicit
Public Sub GetLinks()
    Dim sResponse As String, html As HTMLDocument, nodeList As Object, i As Long
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://dnd.arkalseif.info/items/index.html_page=27", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    Set html = New HTMLDocument
    With html
        .body.innerHTML = sResponse
        Set nodeList = .querySelectorAll("td:first-child [href]")
        For i = 0 To nodeList.Length - 1
            Debug.Print Replace$(nodeList.item(i), "about:", "http://dnd.arkalseif.info/items/")
        Next
    End With
End Sub

参考资料(VBE > 工具 > 参考资料):

  1. Microsoft HTML 对象库

推荐阅读