首页 > 解决方案 > VBA 抓取:需要对象

问题描述

      Dim oIE As InternetExplorer
        Dim wwwAdd$, NazwaArk$, AdrStrony$
        Dim doc As HTMLDocument

        Dim wElement As HTMLAnchorElement
        Dim row As Long, col%, n%, pozDat% ' co to są te procenty??
        Dim PoprzedniElement As HTMLAnchorElement
        Dim Tekst As String
        Dim NrOferty As Variant
        Dim i, j As Integer
        Dim IleSekund As Integer
        Dim newHour As Variant
        Dim newMinute As Variant
        Dim newSecond As Variant
        Dim waitTime As Variant
        Dim link As Integer
        Dim maxlink As Integer  

Set oIE = New SHDocVw.InternetExplorer
oIE.Visible = True
oIE.Navigate AdrStrony, 2

row = row + 1

Do Until oIE.ReadyState = tagREADYSTATE.READYSTATE_COMPLETE: Loop
Set doc = oIE.Document

      For Each wElement In doc.getElementsByClassName("info")(0).getElementsByTagName("a")(3) ' "a" = a href
        MsgBox (wElement)
        Sheets("Oferty").Cells(1, 1).Value = wElement
        Next wElement
        oIE.Quit

MS VBA 说该对象是必需的,但是当我在第一行(实际上是第 25 行:)中删除(3)时,“一切都很好”。它给了我三个 MsgBoxes(1. 我想要的链接,2. 空的,3. 我想要的链接)。所以最简单的方法是采取第三种选择。


我通过 QHarr 实现了代码,但我想从网站的一页中获取 20 个链接(下一步它将在多页上执行),所以我做了一个简单的 For Next:

Set oIE = New SHDocVw.InternetExplorer
oIE.Visible = True
oIE.Navigate AdrStrony, 2 'AdrStrony is the the www site address

For iNo = 0 To 20
 With oIE
        While .Busy Or .readyState < 4: DoEvents: Wend 'I personally don't understand this line
        Set doc = oIE.document
        For Each wElement In doc.getElementsByClassName("info")(iNo).getElementsByTagName("a")
        'MsgBox (wElement)
            If iRow <> 2 Then 'Works like a charm, but how the magic works? What it does actually?
                iRow = iRow + 1
                On Error Resume Next '<=In case no href attribute
                Worksheets("Oferty").Cells(iRow, 1) = wElement.getAttribute("href")
                'Worksheets("Oferty").Cells(iRow, 1) = wElement.innerText
                On Error GoTo 0
            End If
            .Quit
        Next wElement

        'other code

    End With
Next iNo

但它显示自动化错误(未知错误)。

标签: vbaexcelweb-scraping

解决方案


新信息:

根据您的新信息,这里是您如何使用 XHR 在没有打开浏览器的情况下在第一页上获得 20 个列出的属性:

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, html As New HTMLDocument, i As Long
    Const BASE_URL As String = "http://www.kontrakt.szczecin.pl"
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "http://www.kontrakt.szczecin.pl/lista-ofert/?f_location_locality[0]=szczecin&submit=Szukaj&offset=0", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
        sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
        Dim propertyList As Object
        With html
            .body.innerHTML = sResponse
            Set propertyList = html.querySelectorAll(".info")
            For i = 0 To propertyList.Length - 1
                Debug.Print BASE_URL & Split(Split(propertyList.item(i).innerHTML, "about:/")(1), Chr$(34) & ">")(0)
            Next
        End With
    End With
End Sub

您可以通过以下方式获取结果的页数(如果您想稍后循环它们):

Dim numPages As Long
numPages = html.querySelectorAll(".pagination li")(6).innerText

它不如 健壮.pagination li:nth-last-child(3),但似乎 VBA 不支持该选择器。


浏览器也一样:

Option Explicit
Public Sub SubCentalineAutomation()
    Dim myIE As InternetExplorer, propertyList As Object, i As Long
    Const BASE_URL As String = "http://www.kontrakt.szczecin.pl"
    Set myIE = New InternetExplorer

    With myIE
        .navigate "http://www.kontrakt.szczecin.pl/lista-ofert/?f_location_locality[0]=szczecin&submit=Szukaj&offset=0"
        .Visible = True
        While .Busy Or .readyState < 4: DoEvents: Wend
        Set propertyList = .document.querySelectorAll("#properties header > a")
        For i = 0 To propertyList.Length - 1
                 Debug.Print BASE_URL & propertyList.item(i).getAttribute("href")
        Next i

        '.Quit '<== Remember to quit application
    End With

End Sub

旧信息(您的原始问题):

您想在一个集合中获得 3 个链接中的 2 个。这意味着你不想要doc.getElementsByClassName("info")(0).getElementsByTagName("a")(3). 集合中的 3 个索引并返回单个元素。显示相关的 HTML 和预期的链接,将有助于澄清这里的问题。

您没有显示 tagREADYSTATE. 我会去 While oIE.Busy Or oIE.readyState < 4: DoEvents: Wend确保页面加载。

For Each您显示的循环中,如果选择有效,则对象将返回为wElement. 然后你尝试Sheets("Oferty").Cells(1, 1).Value = wElement. 我希望,您要么使用 . innerText的属性wElement,或使用.getAttribute("href"), 来获得a标签的链接。

此外,我将使用 Worksheets 集合,以及循环内的递增行变量,并使用明确的名称,例如 iRow。要仅获取第 1 和第 3 个链接,您可以测试 iRow 的值以确定是否检索链接,例如If iRow <> 2 Then.

一个可能看起来像的例子:

Option Explicit
Public Sub test1()
    'Other code
    Dim iRow
    With oIE
        While .Busy Or .readyState < 4: DoEvents: Wend
        Set doc = oIE.document
        For Each wElement In doc.getElementsByClassName("info")(0).getElementsByTagName("a")
            If iRow <> 2 Then
                iRow = iRow + 1
                On Error Resume Next '<=In case no href attribute
                Worksheets("Oferty").Cells(iRow, 1) = wElement.getAttribute("href")
                ' Worksheets("Oferty").Cells(iRow, 1) = wElement.innerText
                On Error GoTo 0
            End If
        Next wElement

        'other code
        .Quit
    End With
End Sub

推荐阅读