Can I recycle a webscraping Excel VBA script for other sites?

Question

In my previous post, everyone who chimed in was a big help, but unfortunately I didn't learn much from it. Is it possible to recycle one of those scripts to scrape this page and pull the confirmed/projected lineups into Excel? Looking at the HTML, I see that they are housed in a div with class "lineups is-compact" and then separated into divs with class "lineup is-nba".

I am trying to get Team name, Player Name and expected/confirmed.

Here are other sites providing the same info, if they are easier to pull from.

RotoGrinders (the same place the other code was created for)
BB Monster

This is the code I ended up using because it seemed simpler to modify for other tasks. Boy was I wrong.

Option Explicit 
Public Sub GetInfo()

Dim IE As New InternetExplorer, iColumns As Object, iRow As Object, i As Long, j As Long, r As Long, c As Long

Application.ScreenUpdating = False

With IE
    .Visible = True
    .navigate "https://rotogrinders.com/team-stats/nba-earned?site=draftkings"

    While .Busy Or .readyState < 4: DoEvents: Wend

    Set iColumns = .document.querySelectorAll(".rgt-col")

    With ThisWorkbook.Worksheets("Sheet1")
        For i = 0 To iColumns.Length - 1
            c = c + 1: r = 0
            Set iRow = iColumns.item(i).getElementsByTagName("div")
            For j = 0 To iRow.Length - 1
                r = r + 1
                .Cells(r, c) = iRow(j).innerText
            Next
        Next
    End With
    Application.ScreenUpdating = True
    .Quit
End With
End Sub

Please keep in mind I have exactly 4 days of experience. Noob in every way.

Tags: excel, vba, web-scraping

Answer


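One of the joys and challenges of web scraping is that every site is generally different, and even pages belonging to the same site can differ. I acknowledge that you only have a little experience, so I'm afraid the following involves a bit of a learning curve. The script from your other answer was pretty basic: it loops over columns, then rows, in a table format.

The transferable part of all this is learning how to read HTML and deciding when to use XMLHTTP versus a browser-based solution. I use XMLHTTP below because it is a faster retrieval method, but it won't retrieve everything from a page, particularly if the page is JavaScript heavy. Practice using the inspect/dev tools to select information.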

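Then there are the common bits of code that you will typically use every time. For example, when using IE you almost always have the same connection lines and the same wait line. With XMLHTTP, you will also typically reuse the opening lines. However, because websites are generally very different, you will need to explore how to parse the DOM each time to get the info you want. For pages belonging to the same site/host, you may be able to reuse more code if their developers are consistent in their page design. Just don't count on that being the case.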
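As a rough illustration of those reusable opening lines, the request code from this answer could be wrapped in a small helper. This is only a sketch: the function name GetHtmlDocument is hypothetical (it is not part of the original answer), and it assumes a reference to the Microsoft HTML Object Library and a page whose content is present in the raw HTML rather than built by JavaScript.

```vba
Option Explicit

' Hypothetical helper (name is an assumption, not from the original answer).
' Fetches a URL with XMLHTTP and returns the response loaded into an HTMLDocument.
' Requires: VBE > Tools > References > Microsoft HTML Object Library.
Public Function GetHtmlDocument(ByVal url As String) As HTMLDocument
    Dim html As HTMLDocument
    Dim sResponse As String

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", url, False                  ' False = synchronous request
        ' A long-past date is sent to avoid being served a cached copy
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    Set html = New HTMLDocument
    html.body.innerHTML = sResponse
    Set GetHtmlDocument = html
End Function
```

With something like this in place, only the parsing code would need to change from site to site, e.g. Set html = GetHtmlDocument("https://www.rotowire.com/basketball/nba-lineups.php").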

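The script below uses querySelectorAll (in this case a method of HTMLDocument) to generate nodeLists by matching on element class names.

The lines below generate what you can think of as lists. Each element in a list has the same class name.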

Set teamsVisitors = .querySelectorAll(".lineup__team.is-visit")
Set teamsHomies = .querySelectorAll(".lineup__team.is-home")
Set nickNamesVisitors = .querySelectorAll(".lineup__mteam.is-visit")
Set nickNamesHomies = .querySelectorAll(".lineup__mteam.is-home")
Set visitors = .querySelectorAll(".lineup__list.is-visit") '  then by li
Set homies = .querySelectorAll(".lineup__list.is-home") ' then by li

So, let's look at one of those lists: the nodeList returned by

Set teamsVisitors = .querySelectorAll(".lineup__team.is-visit")

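You can see how this gathers the abbreviated names (e.g. GS, BKN) of the 4 visiting teams into a nodeList. You can think of a nodeList as a collection, but you can't For Each over it; it behaves more like an array that you access by index.

I have given the variables fairly descriptive names so that you have some idea of what is in each list. If you are unsure, you can go into your developer tools (F12 in Chrome and Firefox), highlight any HTML in the Elements tab, press Ctrl+F to bring up the search box, and enter the text between the "" of a querySelectorAll call, e.g. .lineup__team.is-visit

You can see that it returns the number of matches for that CSS selector in the HTML. You can cycle through them with Enter.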

So, I have a series of nodeLists. A given index in each nodeList, e.g. index 0, relates to the same match. So, at index 0 I have GS v BKN, i.e. Warriors v Nets.
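To see that index alignment for yourself, something along these lines could print each pairing to the Immediate window. It is a minimal sketch: the Sub name PrintMatchups is hypothetical, it expects an HTMLDocument that has already been loaded with the page's HTML, and it assumes the visitor and home lists come back in the same order, as described above.

```vba
Option Explicit

' Hypothetical helper (not from the original answer).
' Prints "visitor v home" for every match found in an already-loaded document.
' Requires a reference to the Microsoft HTML Object Library.
Public Sub PrintMatchups(ByVal html As HTMLDocument)
    Dim teamsVisitors As Object, teamsHomies As Object
    Dim i As Long, lastIndex As Long

    Set teamsVisitors = html.querySelectorAll(".lineup__team.is-visit")
    Set teamsHomies = html.querySelectorAll(".lineup__team.is-home")

    ' Stay within both lists in case their lengths ever differ
    lastIndex = IIf(teamsVisitors.Length < teamsHomies.Length, _
                    teamsVisitors.Length, teamsHomies.Length) - 1

    For i = 0 To lastIndex
        ' The same index in both lists refers to the same match
        Debug.Print i, teamsVisitors.item(i).innerText & " v " & teamsHomies.item(i).innerText
    Next
End Sub
```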

I loop over the nodeLists, writing out to the sheet. To get the confirmation/player info, I need to further subdivide these nodeLists:

Set visitors = .querySelectorAll(".lineup__list.is-visit") '  then by li
Set homies = .querySelectorAll(".lineup__list.is-home") ' then by li

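Take index 0 of the visitors nodeList.

We need to break this info down further; using the class name alone is not sufficient. If we look at the HTML, we can see that the individual items are in fact split into li (list item) elements.

This means that we can use the .getElementsByTagName method to return those items. For example: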

homies.item(i).getElementsByTagName("li")

Each element returned is one li, i.e. one entry in that team's lineup list.
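A quick way to inspect what those li elements contain is to print them for a single match. The following is a sketch only: PrintVisitorLineup and its matchIndex parameter are hypothetical names, and the document passed in is assumed to be already loaded with the lineups page.

```vba
Option Explicit

' Hypothetical helper (not from the original answer).
' Prints every li entry of the visiting team's list for one match.
' Requires a reference to the Microsoft HTML Object Library.
Public Sub PrintVisitorLineup(ByVal html As HTMLDocument, ByVal matchIndex As Long)
    Dim visitors As Object, items As Object, j As Long

    Set visitors = html.querySelectorAll(".lineup__list.is-visit")

    ' Do nothing if the requested match does not exist
    If matchIndex < 0 Or matchIndex >= visitors.Length Then Exit Sub

    ' Narrow the matched list down to its individual li elements
    Set items = visitors.item(matchIndex).getElementsByTagName("li")

    For j = 0 To items.Length - 1
        Debug.Print j, items.item(j).innerText
    Next
End Sub
```

Calling it with matchIndex 0 would list the entries for the first match's visiting team in the Immediate window.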

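In my loop, I write the visitors to the left-hand column and the home team to the right-hand column. As I loop over the indices of the original nodeLists (i.e. each match), I add 3 to the output column number so that each match's table is written out with a spacer column between it and the next.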
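The column arithmetic can be written out explicitly. In this small sketch the function name VisitorColumn is hypothetical; it simply restates the answer's pattern of starting at column 1 and stepping by 3 per match.

```vba
Option Explicit

' Hypothetical helper (not from the original answer).
' Returns the worksheet column for the visiting team of a given match index.
' Match 0 -> column 1, match 1 -> column 4, match 2 -> column 7, and so on.
' The home team goes in the next column, leaving one empty spacer column.
Public Function VisitorColumn(ByVal matchIndex As Long) As Long
    VisitorColumn = 1 + 3 * matchIndex
End Function
```

So the third match (index 2) would occupy columns 7 and 8, with column 9 left blank before the next match.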




VBA:

Option Explicit
Public Sub GetMatchInfo()
    Dim sResponse As String, html As HTMLDocument
    Application.ScreenUpdating = False

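    ' Fetch the page HTML with a synchronous GET (no browser needed)
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.rotowire.com/basketball/nba-lineups.php", False
        ' A long-past If-Modified-Since date is sent to avoid being served a cached copy
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        ' Convert the response bytes into a VBA string
        sResponse = StrConv(.responseBody, vbUnicode)
    End With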

    Set html = New HTMLDocument

    Dim visitors As Object, teamsVisitors As Object, nickNamesVisitors As Object
    Dim homies As Object, teamsHomies As Object, nickNamesHomies As Object
    Dim i As Long, r As Long, c As Long, j As Long

    With html
        .body.innerHTML = sResponse
        Set teamsVisitors = .querySelectorAll(".lineup__team.is-visit")
        Set teamsHomies = .querySelectorAll(".lineup__team.is-home")
        Set nickNamesVisitors = .querySelectorAll(".lineup__mteam.is-visit")
        Set nickNamesHomies = .querySelectorAll(".lineup__mteam.is-home")
        Set visitors = .querySelectorAll(".lineup__list.is-visit") '  then by li
        Set homies = .querySelectorAll(".lineup__list.is-home") ' then by li
    End With

    With ThisWorkbook.Worksheets("Sheet1")
        r = 1: c = 1

        For i = 0 To teamsHomies.Length - 1
            .Cells(r, c) = teamsVisitors.item(i).innerText
            .Cells(r, c + 1) = teamsHomies.item(i).innerText

            r = r + 1
            .Cells(r, c) = nickNamesVisitors.item(i).innerText
            .Cells(r, c + 1) = nickNamesHomies.item(i).innerText

            Dim visitorItems As Object, homeItems As Object, maxItems As Long

            ' Fetch each team's li elements once, rather than on every pass of the inner loop
            Set visitorItems = visitors.item(i).getElementsByTagName("li")
            Set homeItems = homies.item(i).getElementsByTagName("li")

            ' The two lists can differ in length, so loop to the longer one
            maxItems = IIf(homeItems.Length > visitorItems.Length, homeItems.Length, visitorItems.Length)
            For j = 0 To maxItems - 1
                r = r + 1
                ' Write a cell only when that side still has an item at index j
                If j < visitorItems.Length Then .Cells(r, c) = visitorItems.item(j).innerText
                If j < homeItems.Length Then .Cells(r, c + 1) = homeItems.item(j).innerText
            Next

            r = 1: c = c + 3
        Next

    End With

    Application.ScreenUpdating = True

End Sub

References (VBE > Tools > References):

  1. Microsoft HTML Object Library

Resources to help you:

  1. getElementsByTagName
  2. CSS class selectors
  3. XMLHTTP request

For an improved Python-based script, see here:

https://stackoverflow.com/a/55626217/6241235

