首页 > 解决方案 > Webscrape VBA - 列表

问题描述

我正在尝试设置一个 webscraping VBA 代码以将数据从该网站导入 Excel:https ://www.thewindpower.net/windfarms_list_en.php

我希望启动这个网页,选择一个国家,然后从下表中抓取数据(包括名称列中的 url)。

然而,我坚持以下几点:

这是我已经准备好的代码(基于网络上的一些研究:

Sub Grabdata()

'dimension (set aside memory for) our variables
Dim objIE As InternetExplorer
Dim ele As Object
Dim y As Integer

'start a new browser instance
Set objIE = New InternetExplorer
'make browser visible
objIE.Visible = True

'navigate to page with needed data
objIE.navigate "https://www.thewindpower.net/windfarms_list_en.php"
'wait for page to load
Do While objIE.Busy = True Or objIE.readyState <> 4: DoEvents: Loop

'we will output data to excel, starting on row 1
y = 1

'look at all the 'tr' elements in the 'table' with id 'myTable',
'and evaluate each, one at a time, using 'ele' variable
For Each ele In objIE.document.getElementById("myTable").getElementsByTagName("tr")
    'show the text content of 'tr' element being looked at
    Debug.Print ele.textContent
    'each 'tr' (table row) element contains 4 children ('td') elements
    'put text of 1st 'td' in col A
    Sheets("Sheet1").Range("A" & y).Value = ele.Children(0).textContent
    'put text of 2nd 'td' in col B
    Sheets("Sheet1").Range("B" & y).Value = ele.Children(1).textContent
    'put text of 3rd 'td' in col C
    Sheets("Sheet1").Range("C" & y).Value = ele.Children(2).textContent
    'put text of 4th 'td' in col D
    Sheets("Sheet1").Range("D" & y).Value = ele.Children(3).textContent
    'increment row counter by 1
    y = y + 1
'repeat until last ele has been evaluated
Next

'save the Excel workbook
ActiveWorkbook.Save

结束子

标签: excelvbaweb-scraping

解决方案


大多数可抓取页面几乎总是具有静态页面布局,因此使用它们的索引选择元素是相当安全的。

下面的代码选择带有 id 的容器元素,bloc_texte然后选择里面的第二个表。

如果您计划按照您的评论建议执行大量请求,则应添加一些代码以减慢您的请求(Application.wait类型交易)。在请求之后触发请求可能会惹恼主机。

' Required References
' Microsoft HTML Object Library
' Microsoft XML, v6.0

Sub Main()
    GetData ("GB")
End Sub

Sub GetData(ByVal Location As String)

Dim Request As MSXML2.ServerXMLHTTP60: Set Request = New MSXML2.ServerXMLHTTP60

Dim Result As HTMLDocument: Set Result = New HTMLDocument

Request.Open "POST", "https://www.thewindpower.net/windfarms_list_en.php", False
Request.setRequestHeader "content-type", "application/x-www-form-urlencoded"
Request.send "action=submit&pays=" & Location

Result.body.innerHTML = Request.responseText

Dim oRows As MSHTML.IHTMLElementCollection
Dim oRow As MSHTML.IHTMLElement

Dim oCells As MSHTML.IHTMLElementCollection
Dim oCell As MSHTML.IHTMLElement

Dim oLinks As MSHTML.IHTMLElementCollection

Set oRows = Result.getElementById("bloc_texte").getElementsByTagName("table")(2).getElementsByTagName("tr")

Dim iRow As Integer 'output row counter
Dim iColumn As Integer 'output column counter
Dim Sheet As Worksheet 'output sheet

Set Sheet = ThisWorkbook.Worksheets("Sheet1")
iRow = 1
iColumn = 1

For Each oRow In oRows
    If Not oRow.className = "puce_texte" Then
        Set oCells = oRow.getElementsByTagName("td")
        For Each oCell In oCells
            Set oLinks = oCell.getElementsByTagName("a")
            If oLinks.Length = 0 Then
                Sheet.Cells(iRow, iColumn).Value = oCell.innerText
            Else
                Sheet.Cells(iRow, iColumn).Value = Replace(oLinks(0).getAttribute("href"), "about:", "")
            End If
            iColumn = iColumn + 1
        Next oCell
        iRow = iRow + 1
        iColumn = 1
    End If
Next oRow

End Sub

推荐阅读