How to select <tbody> when scraping data from a web page

Problem description

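I have been learning how to scrape data from web pages with Excel and VBA. I ran into some obstacles with one data source, so I switched to https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures. The problem is that the table with id "flight-status-board-arrivals" contains a <thead> and a <tbody>. I can extract the header data fine, but when I try to read the body data, the loop just ends. For testing and debugging I switched to .Children(1), and its Length returns 0. That suggests the object contains nothing, but I don't understand why. I hope I have covered everything here; any help would be appreciated. Also, I know this could be done in another language, but since I am learning VBA I thought it best to get comfortable with VBA before diving into a new language.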

Sub GrabWebData()

    Dim ie As InternetExplorer 'refer to the running copy of internet explorer
    Dim html As HTMLDocument 'refer to the HTML document returned
    Dim ele As Object
    Dim y As Integer
    Dim fSht As Worksheet

    Set fSht = Sheets("Sheet1")
    Set ie = New InternetExplorer
    ie.Visible = False
    ie.navigate "https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures"

    'wait until IE is done loading page
    Do While ie.READYSTATE <> READYSTATE_COMPLETE
        Application.StatusBar = "Loading Flight Times"
        DoEvents
    Loop

    y = 1

    'Children(1) is the table body; this prints 0
    Debug.Print ie.document.getElementById("flight-status-board-arrivals").Children(1) _
        .getElementsByTagName("td").Length


    For Each ele In ie.document.getElementById("flight-status-board-arrivals").Children(1) _
        .getElementsByTagName("tr")
        Debug.Print ele.textContent
        fSht.Range("A" & y).Value = ele.Children(0).textContent
        'On Error GoTo skip1:
        fSht.Range("B" & y).Value = ele.Children(1).textContent
        'On Error GoTo skip1:
        fSht.Range("C" & y).Value = ele.Children(2).textContent
        'On Error GoTo skip1:
        fSht.Range("D" & y).Value = ele.Children(3).textContent
        'On Error GoTo skip1:
        fSht.Range("E" & y).Value = ele.Children(4).textContent
        'On Error GoTo skip1:
        fSht.Cells.WrapText = False
        fSht.Rows.AutoFit
        fSht.Columns.AutoFit
        'skip1:
        y = y + 1
    Next

    'Rows(2).Select
    'Selection.Delete shift:=xlUp

End Sub

Tags: vba, excel, web-scraping

Solution


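Try the following code to get the data from that table. I defined an Explicit Wait in the script so that it waits until the table data is available on the page. Although I strongly recommend doing it the way QHarr has already shown, you can give this a try as an alternative. The script runs headless, so you will not see any browser window. The only catch is that you may occasionally hit a stale element error somewhere during the run, though not always.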

This is what the script looks like:

Sub FetchData()
    Const link As String = "https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures"
    Dim posts As Object, post As Object, elem As Object, R&, C&

    With New ChromeDriver
        .AddArgument "--headless"
        .Get link
        'explicit wait: block for up to 30000 ms until a body row exists
        Set posts = .FindElementByCss("#flight-status-board-arrivals tbody tr", Timeout:=30000)
        'walk every row (header and body) and write each cell to the active sheet
        For Each post In .FindElementsByCss("#flight-status-board-arrivals tr")
            For Each elem In post.FindElementsByCss("th,td")
                C = C + 1: Cells(R + 1, C) = elem.Text
            Next elem
            C = 0: R = R + 1
        Next post
    End With
End Sub
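
The occasional stale element error mentioned above can be worked around by repeating the whole table read when it fails. The following is only a minimal sketch under assumptions, not part of the original answer: FetchDataWithRetry, TryReadTable and the retry budget of 3 are hypothetical names and values chosen for illustration, and the handler treats any runtime error during the read as a reason to retry.

```vba
Option Explicit

' Hypothetical variant of the script above: repeats the table read when a
' runtime error (for example a stale element) interrupts it.
' Assumes a reference to the Selenium Type Library is set.
Sub FetchDataWithRetry()
    Const link As String = "https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures"
    Const MAX_ATTEMPTS As Long = 3   ' assumed retry budget, tune as needed
    Dim driver As New ChromeDriver
    Dim attempt As Long

    driver.AddArgument "--headless"
    driver.Get link
    ' Same explicit wait as in the script above (up to 30000 ms)
    driver.FindElementByCss "#flight-status-board-arrivals tbody tr", Timeout:=30000

    For attempt = 1 To MAX_ATTEMPTS
        If TryReadTable(driver) Then Exit For
    Next attempt

    driver.Quit
End Sub

' Returns True when every row was written without a runtime error.
Private Function TryReadTable(ByVal driver As ChromeDriver) As Boolean
    Dim post As Object, elem As Object
    Dim R As Long, C As Long

    On Error GoTo Failed
    For Each post In driver.FindElementsByCss("#flight-status-board-arrivals tr")
        C = 0
        For Each elem In post.FindElementsByCss("th,td")
            C = C + 1
            ActiveSheet.Cells(R + 1, C).Value = elem.Text
        Next elem
        R = R + 1
    Next post
    TryReadTable = True
    Exit Function

Failed:
    ' Any error aborts this attempt; the caller decides whether to retry.
    TryReadTable = False
End Function
```

Each retry starts again from the first row and overwrites whatever a failed attempt had already written, so a partial read does not leave mixed data behind as long as the table does not shrink between attempts.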

To make execution faster, and to go back to the HTMLDocument parser by reusing .PageSource, you could try the following.

Sub FetchDataFromPageSource()
    Const link As String = "https://www.skyscanner.net/news/airports/heathrow-airport-live-flight-arrivals-and-departures"
    Dim posts As Object, post As Object, elem As Object, R&, C&
    Dim Html As New HTMLDocument

    With New ChromeDriver
        .AddArgument "--headless"
        .Get link
        'explicit wait: block for up to 30000 ms until a body row exists
        Set posts = .FindElementByCss("#flight-status-board-arrivals tbody tr", Timeout:=30000)
        Html.body.innerHTML = .PageSource  'hand the rendered page to the HTMLDocument parser
    End With

    'the browser is no longer needed; parse the static copy
    For Each post In Html.getElementById("flight-status-board-arrivals").Rows
        For Each elem In post.Cells
            C = C + 1: Cells(R + 1, C) = elem.innerText
        Next elem
        C = 0: R = R + 1
    Next post
End Sub

References to add before running (Tools > References in the VBA editor):

Selenium Type Library
Microsoft HTML Object Library
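
For anyone who wants to stay with the InternetExplorer approach from the question, the same idea of waiting for the table rows can be sketched as a polling loop. This is a hedged sketch rather than a confirmed fix: WaitForTableRows is a hypothetical helper, the timeout is an arbitrary choice, and it assumes both that the body rows are filled in by script after the page reports READYSTATE_COMPLETE and that the page still renders in Internet Explorer at all.

```vba
Option Explicit

' Hypothetical helper: polls the document until the table with the given id
' has at least one row in its first <tbody>, or until the timeout elapses.
' Returns True if rows appeared in time. Late-bound, so it needs no references.
Function WaitForTableRows(ByVal doc As Object, ByVal tableId As String, _
                          ByVal timeoutSeconds As Double) As Boolean
    Dim tbl As Object
    Dim started As Double

    started = Timer   ' seconds since midnight; a run across midnight may exit early
    Do
        DoEvents      ' keep Excel responsive while waiting
        Set tbl = doc.getElementById(tableId)
        If Not tbl Is Nothing Then
            If tbl.tBodies.Length > 0 Then
                If tbl.tBodies(0).Rows.Length > 0 Then
                    WaitForTableRows = True
                    Exit Function
                End If
            End If
        End If
    Loop While Timer - started < timeoutSeconds
End Function
```

In the question's code this could be called right after the READYSTATE loop, for example with WaitForTableRows(ie.document, "flight-status-board-arrivals", 30), and the row loop would run only when it returns True.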
