Website with multiple tables not recognized during web scraping

Question

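I have been trying to learn web scraping and have hit a wall on this problem. I know VBA is not the best tool for web scraping, but at my workplace we cannot use Python or other tools, only Excel and VBA. The main goal is to scrape table data from various websites. For most sites, which use a single table, this works fine, but I have come across a site with multiple tables, and I can neither isolate the table I want nor pull all of the tables into an Excel sheet. My current code only gets the first table. I tried modifying it to get the per game table, but I could not figure it out.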

If anyone can help, I would appreciate it; I have been at this for a few days.

Thanks in advance.

Sub GetHTMLDocumentXML()

    Dim XMLPage As New MSXML2.XMLHTTP60
    Dim HTMLDOC As New MSHTML.HTMLDocument

    XMLPage.Open "GET", "https://www.basketball-reference.com/teams/TOR/2020.html", False
    XMLPage.send

    HTMLDOC.body.innerHTML = XMLPage.responseText

    ProcessHTMLPage HTMLDOC

End Sub

Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

    Dim HTMLTable As MSHTML.IHTMLElement
    Dim HTMLTables As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer

    Set HTMLTables = HTMLPage.getElementsByTagName("table")

    For Each HTMLTable In HTMLTables

        With Worksheets("Sheet1")
            .Range("A1").Value = HTMLTable.className
            .Range("B1").Value = Now
        End With

        RowNum = 2

        For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
            'Debug.Print vbTab & HTMLRow.innerText
            ColNum = 1
            For Each HTMLCell In HTMLRow.Children
                Cells(RowNum, ColNum) = HTMLCell.innerText
                ColNum = ColNum + 1
            Next HTMLCell

            RowNum = RowNum + 1
        Next HTMLRow

    Next HTMLTable

End Sub

Tags: excel, vba, web-scraping

Answer


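Most of the tables on that page are not dynamically generated; they are wrapped in HTML comments instead. To work with them, you first need to strip out the `<!--` and `-->` markers, which are what keep your script from seeing those tables as elements.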
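One way to confirm this is to count the `table` elements the parser finds before and after the comment markers are removed. The following is only a diagnostic sketch: the routine name `CountCommentedTables` is made up for illustration, and it assumes the page still serves its tables inside comments as described above.

```vba
Sub CountCommentedTables()
    ' Hypothetical helper, not part of the original answer.
    Dim html As String
    Dim doc As Object

    ' Download the raw page source.
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.basketball-reference.com/teams/TOR/2020.html", False
        .send
        html = .responseText
    End With

    ' Parse the source unchanged: tables inside comments are not elements.
    Set doc = CreateObject("HTMLFile")
    doc.body.innerHTML = html
    Debug.Print "Tables found as served: " & doc.getElementsByTagName("table").Length

    ' Parse again after removing the comment markers.
    Set doc = CreateObject("HTMLFile")
    doc.body.innerHTML = Replace(Replace(html, "<!--", ""), "-->", "")
    Debug.Print "Tables found after stripping comment markers: " & doc.getElementsByTagName("table").Length
End Sub
```

If the second count printed in the Immediate window is larger than the first, the missing tables were sitting inside comments.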

Here is how to scrape the table titled Per Game using XHR:

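Sub FetchData()
    Dim tableRow As Object, tableCell As Object
    Dim S$, R&, C&

    ' Download the page and strip the HTML comment markers so that the
    ' commented-out tables become part of the parsed document.
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.basketball-reference.com/teams/TOR/2020.html", False
        .send
        S = Replace(Replace(.responseText, "<!--", ""), "-->", "")
    End With

    ' Parse the cleaned HTML and copy the "per_game" table, cell by cell,
    ' to the active sheet starting at A1.
    With CreateObject("HTMLFile")
        .body.innerHTML = S
        For Each tableRow In .getElementById("per_game").Rows
            For Each tableCell In tableRow.Cells
                C = C + 1: Cells(R + 1, C) = tableCell.innerText
            Next tableCell
            C = 0: R = R + 1
        Next tableRow
    End With
End Sub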
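
The question also asks about pulling every table into a sheet. The same comment-stripping step can be combined with a loop over all `table` elements, roughly as sketched below. Note that the loop in the question resets `RowNum` to 2 for each table, so each table would overwrite the previous one; here a single row counter runs across all tables instead. The routine name `FetchAllTables` is made up for illustration, the sheet name `Sheet1` is taken from the question, and the sketch assumes each table's `id` attribute is a useful label.

```vba
Sub FetchAllTables()
    ' Hypothetical extension of FetchData, not part of the original answer.
    Dim html As String
    Dim doc As Object, tbl As Object, tblRow As Object, tblCell As Object
    Dim ws As Worksheet
    Dim r As Long, c As Long

    ' Download the page and remove the comment markers, as in FetchData.
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.basketball-reference.com/teams/TOR/2020.html", False
        .send
        html = Replace(Replace(.responseText, "<!--", ""), "-->", "")
    End With

    Set doc = CreateObject("HTMLFile")
    doc.body.innerHTML = html

    ' Assumption: output goes to "Sheet1", the sheet used in the question.
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    ws.Cells.ClearContents
    r = 1

    For Each tbl In doc.getElementsByTagName("table")
        ' Label each block with the table's id so the tables can be told apart.
        ws.Cells(r, 1).Value = tbl.ID
        r = r + 1

        For Each tblRow In tbl.Rows
            c = 1
            For Each tblCell In tblRow.Cells
                ws.Cells(r, c).Value = tblCell.innerText
                c = c + 1
            Next tblCell
            r = r + 1
        Next tblRow

        r = r + 1   ' leave a blank row between tables
    Next tbl
End Sub
```

Writing the tables one below another keeps everything on a single sheet; writing each table to its own worksheet would be an equally reasonable choice.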
