Excel VBA Web Scraping - Ignoring one of several tables via an XML HTTP request

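Problem description

I could really use some help figuring out a piece of web scraping code that I can't seem to get working.

The long version of my problem: the page has 10 tables of football players (some have a few rows, some have several ... each "small" table represents a tier). The last table on the page - table id = "table_10" - is a large comprehensive table that covers all positions ... not just quarterbacks (which the page and the smaller tables are dedicated to).

With the code below, I only get "table_10" in my Excel sheet: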

Option Explicit

Sub ETR_QB_Tiers_XMLHTTP()

   Dim XMLPage As New MSXML2.XMLHTTP60
   Dim HTMLDoc As New MSHTML.HTMLDocument

   XMLPage.Open "GET", "https://establishtherun.com/2020-tiers-of-evan-quarterbacks/", False
   XMLPage.send

   If XMLPage.Status <> 200 Then
      MsgBox XMLPage.Status & " - " & XMLPage.statusText
      Exit Sub
   End If

   HTMLDoc.body.innerHTML = XMLPage.responseText

   ProcessHTMLPage HTMLDoc

End Sub

Option Explicit

Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

   Dim HTMLTable As MSHTML.IHTMLElement
   Dim HTMLTables As MSHTML.IHTMLElementCollection
   Dim HTMLRow As MSHTML.IHTMLElement
   Dim HTMLCell As MSHTML.IHTMLElement
   Dim RowNum As Long, ColNum As Integer

   Set HTMLTables = HTMLPage.getElementsByTagName("table")

   For Each HTMLTable In HTMLTables
      'Debug.Print HTMLTable.ID

      Sheets("XMLHTTP").Select

      RowNum = 1
      For Each HTMLRow In HTMLTable.getElementsByTagName("tr")
         'Debug.Print vbTab & HTMLRow.innerText

         ColNum = 1
         For Each HTMLCell In HTMLRow.Children
            'Debug.Print vbTab & HTMLCell.innerText
            Cells(RowNum, ColNum) = HTMLCell.innerText
            ColNum = ColNum + 1
         Next HTMLCell

         RowNum = RowNum + 1

      Next HTMLRow

   Next HTMLTable

End Sub

When I enable Debug.Print HTMLTable.ID in ProcessHTMLPage, right after the For Each HTMLTable In HTMLTables line, I see all 10 table IDs in the Immediate window:

table_1
table_2
table_3
. . .
table_10

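When I use Debug.Print vbTab & HTMLRow.innerText together with the For Each HTMLTable In HTMLTables line, the Immediate window shows results for both the smaller tables (tables 1 through 9) and the big table (table 10):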

table_1
   TierOne
   Patrick Mahomes (QB1)Lamar Jackson (QB2)
table_2
   TierTwo
   Dak Prescott (QB3)Josh Allen (QB4)
   Deshaun Watson (QB5)Russell Wilson (QB6)
   Kyler Murray (QB7)
. . .
table_10
   RankWRRBTEQB
   1Michael Thomas (1)Christian McCaffrey (1)Travis Kelce (1)Patrick Mahomes (1)
   2Davante Adams (1)Ezekiel Elliott (1)George Kittle (1)Lamar Jackson (1)
   3Tyreek Hill (1)Saquon Barkley (1)Zach Ertz (1)Dak Prescott (2)
   ...

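So - I know those "smaller" tables are there and accessible, but the code only spits out the comprehensive "table_10", whereas I really want the individual tables 1 through 9 - and not table 10 at all.

Again ... is there a way to ignore "table_10" and make sure I get tables 1 through 9 (and not just "table_10")? I have tried incorporating "If" statements in so many ways that I have lost count.

Bonus question - the "smaller" tables are laid out in a Z pattern (i.e. cell A1 is player #1 > cell B1 is player #2 > cell A2 is player #3 > cell B2 is player #4, etc.). Is there a way to get the players from column B into column A in rank order? Basically, to convert the two columns into one?

Tags: excel, vba, web-scraping, xmlhttprequest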

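Solution

I think you are overwriting the tables, so you only see the last one (the largest table probably overwrites everything written before it).

Try moving RowNum = 1 to before the loop over the tables; otherwise RowNum is reset for every table and each table is written over the previous one.

You may also want to add RowNum = RowNum + 1 before Next HTMLTable so that there is some space between the tables.

Here is a simple example to demonstrate:

Your logic: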

Option Explicit

Public Sub Demo_XMLHTTP()

   Dim XMLPage As New MSXML2.XMLHTTP60
   Dim HTMLDoc As New MSHTML.HTMLDocument

   XMLPage.Open "GET", "https://www.w3schools.com/html/html_tables.asp", False
   XMLPage.send

   If XMLPage.Status <> 200 Then
      MsgBox XMLPage.Status & " - " & XMLPage.statusText
      Exit Sub
   End If

   HTMLDoc.body.innerHTML = XMLPage.responseText

   ProcessHTMLPage HTMLDoc

End Sub


Public Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

   Dim HTMLTable As MSHTML.IHTMLElement
   Dim HTMLTables()
   Dim HTMLRow As MSHTML.IHTMLElement
   Dim HTMLCell As MSHTML.IHTMLElement
   Dim RowNum As Long, ColNum As Long
   Dim hTable As Variant

   HTMLTables = Array(1, 2, 3)

   For Each hTable In HTMLTables

      Set HTMLTable = HTMLPage.getElementById("customers") '<== yeah same table but imagine it is different

      ThisWorkbook.Worksheets("XMLHTTP").Select

      RowNum = 1

      With ActiveSheet

      For Each HTMLRow In HTMLTable.getElementsByTagName("tr")

         ColNum = 1

         For Each HTMLCell In HTMLRow.Children
            .Cells(RowNum, ColNum) = HTMLCell.innerText
            ColNum = ColNum + 1
         Next HTMLCell

         RowNum = RowNum + 1

      Next HTMLRow
      Set HTMLTable = Nothing
      .Cells(RowNum, ColNum + 1) = hTable '< note which iteration we are viewing
      End With

   Next hTable

End Sub

Versus:

Option Explicit

Public Sub Demo_XMLHTTP()

   Dim XMLPage As New MSXML2.XMLHTTP60
   Dim HTMLDoc As New MSHTML.HTMLDocument

   XMLPage.Open "GET", "https://www.w3schools.com/html/html_tables.asp", False
   XMLPage.send

   If XMLPage.Status <> 200 Then
      MsgBox XMLPage.Status & " - " & XMLPage.statusText
      Exit Sub
   End If

   HTMLDoc.body.innerHTML = XMLPage.responseText

   ProcessHTMLPage HTMLDoc

End Sub


Public Sub ProcessHTMLPage(HTMLPage As MSHTML.HTMLDocument)

   Dim HTMLTable As MSHTML.IHTMLElement
   Dim HTMLTables()
   Dim HTMLRow As MSHTML.IHTMLElement
   Dim HTMLCell As MSHTML.IHTMLElement
   Dim RowNum As Long, ColNum As Long
   Dim hTable As Variant

   HTMLTables = Array(1, 2, 3)

   RowNum = 1

   For Each hTable In HTMLTables

      Set HTMLTable = HTMLPage.getElementById("customers") '<== yeah same table but imagine it is different

      ThisWorkbook.Worksheets("XMLHTTP").Select

      With ActiveSheet

      For Each HTMLRow In HTMLTable.getElementsByTagName("tr")

         ColNum = 1

         For Each HTMLCell In HTMLRow.Children
            .Cells(RowNum, ColNum) = HTMLCell.innerText
            ColNum = ColNum + 1
         Next HTMLCell

         RowNum = RowNum + 1

      Next HTMLRow
      Set HTMLTable = Nothing
      .Cells(RowNum, ColNum + 1) = hTable '< note which iteration we are viewing
      End With

   Next hTable

End Sub

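Ignoring table 10:

You can use For i = 0 To HTMLTables.Length - 2 instead of For Each to ignore the last table. Access any given table with HTMLTables.item(i). Otherwise, you can test the id and skip based on that, or even skip based on the index (remember the -1, since the collection is zero-based). I would probably use the id, as it is more reliable. Normally, you would run to .Length - 1.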
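As a minimal sketch of the id test described above: it assumes the tables keep the ids table_1 through table_10 shown in the question and that the output sheet is named XMLHTTP, as in the original code. The sub name ProcessHTMLPageSkippingTable10 and the constant SKIP_ID are made up for illustration, and the routine has not been run against the actual page.

```vba
Option Explicit

' Requires the same reference as the code above (Tools > References):
'   Microsoft HTML Object Library

' Hypothetical constant: id of the table to leave out (taken from the question).
Private Const SKIP_ID As String = "table_10"

Public Sub ProcessHTMLPageSkippingTable10(HTMLPage As MSHTML.HTMLDocument)

   Dim HTMLTables As MSHTML.IHTMLElementCollection
   Dim HTMLTable As MSHTML.IHTMLElement
   Dim HTMLRow As MSHTML.IHTMLElement
   Dim HTMLCell As MSHTML.IHTMLElement
   Dim RowNum As Long, ColNum As Long, i As Long

   Set HTMLTables = HTMLPage.getElementsByTagName("table")

   ' Initialise once, outside the table loop, so tables are not overwritten
   RowNum = 1

   With ThisWorkbook.Worksheets("XMLHTTP")

      ' The collection is zero-based, so the last index is Length - 1
      For i = 0 To HTMLTables.Length - 1

         Set HTMLTable = HTMLTables.Item(i)

         ' Skip by id rather than by position
         If HTMLTable.ID <> SKIP_ID Then

            For Each HTMLRow In HTMLTable.getElementsByTagName("tr")

               ColNum = 1

               For Each HTMLCell In HTMLRow.Children
                  .Cells(RowNum, ColNum) = HTMLCell.innerText
                  ColNum = ColNum + 1
               Next HTMLCell

               RowNum = RowNum + 1

            Next HTMLRow

            ' Leave one blank row between tables
            RowNum = RowNum + 1

         End If

      Next i

   End With

End Sub
```

To try it, call ProcessHTMLPageSkippingTable10 HTMLDoc from ETR_QB_Tiers_XMLHTTP in place of ProcessHTMLPage HTMLDoc. Testing the id keeps working if the page later adds or reorders tables, which is why it may be preferable to stopping at Length - 2.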


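Bonus:

I can't run against your test page, but if you use a For loop, you can adjust which column you write to depending on whether i is odd or even (using Mod, for example): an odd number Mod 2 = 1; an even number Mod 2 = 0. Then adjust ColNum by -1, or as appropriate.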
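One possible way to flatten the Z pattern, sketched here as a simpler variation on the Mod adjustment: since the cells are already visited left to right and top to bottom, writing every cell into column A and advancing the row after each cell should give the players in rank order. The function name WriteTableAsOneColumn and its parameters are hypothetical, and the sketch assumes empty filler cells should be skipped; it has not been run against the actual page.

```vba
Option Explicit

' Requires the same reference as the code above (Tools > References):
'   Microsoft HTML Object Library

' Hypothetical helper: writes every non-empty cell of one table into
' column A of ws, starting at StartRow, and returns the next free row.
Public Function WriteTableAsOneColumn(ByVal HTMLTable As MSHTML.IHTMLElement, _
                                      ByVal ws As Worksheet, _
                                      ByVal StartRow As Long) As Long

   Dim HTMLRow As MSHTML.IHTMLElement
   Dim HTMLCell As MSHTML.IHTMLElement
   Dim RowNum As Long

   RowNum = StartRow

   For Each HTMLRow In HTMLTable.getElementsByTagName("tr")

      For Each HTMLCell In HTMLRow.Children

         ' Assumption: blank cells (e.g. the empty half of an odd last row)
         ' carry no player and can be skipped
         If Len(Trim$(HTMLCell.innerText)) > 0 Then
            ws.Cells(RowNum, 1).Value = HTMLCell.innerText
            RowNum = RowNum + 1   ' advance per cell, not per HTML row
         End If

      Next HTMLCell

   Next HTMLRow

   WriteTableAsOneColumn = RowNum

End Function
```

Inside the loop over the tables, this would be used as RowNum = WriteTableAsOneColumn(HTMLTable, ws, RowNum), with ws set to the output worksheet. Note that header cells such as the tier name are also written to column A, directly above that tier's players.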

