How to get data from the following websites using an XML query

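Problem description

Hello, I would like to get the patent number and abstract data from the following two websites:

I know how to scrape data from these websites using an HTML query. I would like to know whether there is a way to get the data using an XML query.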

Sub google()

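    ' Requires references to "Microsoft Internet Controls" and "Microsoft HTML Object Library"
    Dim IE As New SHDocVw.InternetExplorer
    Dim HTMLDoc As MSHTML.HTMLDocument
    ' Each variable needs its own type; otherwise only the last one in the list is typed
    Dim pageText As String, pageclaim As String
    Dim HTMLTable As MSHTML.IHTMLElement, HTMLp As MSHTML.IHTMLElement
    Dim HTMLTables As MSHTML.IHTMLElementCollection, HTMLps As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer
    Dim pointer As Integer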
    
    IE.Visible = True
    IE.navigate ""
    
    ' Wait until the page has finished loading; DoEvents keeps Excel responsive
    Do While IE.Busy Or IE.readyState <> READYSTATE_COMPLETE
        DoEvents
    Loop
    
    Set HTMLDoc = IE.Document

End Sub

Thank you for your help.

Tags: excel, xml, vba, web-scraping

Solution


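Edit

I thought about it again and remembered that you can also send a User-Agent header with the request. That way you can get the HTML source of the page behind the Google link: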

Sub google()
  Dim http As New MSXML2.XMLHTTP60
  Dim htmlDoc As New MSHTML.HTMLDocument
  Dim url As String
  
  url = "https://patents.google.com/patent/US8805587B1/en?oq=US8805587B1"
  
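  http.Open "GET", url, False
  ' Identify as Internet Explorer 11 so Google Patents does not reject the request
  http.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
  http.Send
  htmlDoc.body.innerHTML = http.responseText
  
  ' Write the fetched HTML to a text file for inspection
  Dim fileNum As Integer
  fileNum = FreeFile
  Open "D:\httpRequest.txt" For Output As #fileNum
  Print #fileNum, htmlDoc.body.outerHTML
  Close #fileNum
  'Debug.Print htmlDoc.body.outerHTML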
End Sub
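
Once the HTML source is available as a string, the abstract could be read out of it without a browser. The following is only a sketch: the helper name MetaContent is made up for illustration, and it assumes that the page exposes the abstract in a <meta name="DC.description" content="..."> tag with the name attribute before content and double-quoted values. Check the saved file to confirm this before relying on it.

```vba
' Hypothetical helper: returns the content attribute of the first
' <meta name="metaName" content="..."> tag found in html, or "" if none is found.
' Assumption: name comes before content and both values are double-quoted.
Function MetaContent(ByVal html As String, ByVal metaName As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    re.IgnoreCase = True
    re.Global = False
    ' Escape dots so that a name such as "DC.description" is matched literally
    re.Pattern = "<meta\s+name=""" & Replace(metaName, ".", "\.") & _
                 """\s+content=""([^""]*)"""

    Set matches = re.Execute(html)
    If matches.Count > 0 Then MetaContent = matches(0).SubMatches(0)
End Function
```

In the Sub above it could be called as Debug.Print MetaContent(http.responseText, "DC.description"). It is applied to http.responseText rather than htmlDoc.body because tags from the page head are not guaranteed to survive the assignment to body.innerHTML.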

First post (the part about the second link is still valid)

~~Bad news. The first page does not work with an XML request:~~

Sub google()
  Dim http As New MSXML2.XMLHTTP60
  Dim htmlDoc As New MSHTML.HTMLDocument
  Dim url As String
  
  url = "https://patents.google.com/patent/US8805587B1/en?oq=US8805587B1"
  
  http.Open "GET", url, False
  http.Send
  htmlDoc.body.innerHTML = http.responseText
  
  Debug.Print htmlDoc.body.outerHTML
End Sub

This is the result:

<BODY>
  <DIV style="MAX-WIDTH: 590px; MARGIN: 64px auto 0px">
    <H2>Your Browser Isn't Supported By Google Patents</H2>
    <P>It looks like you're using an old browser which isn't supported by Google Patents. To use Google Patents, you'll need an up-to-date browser.
      <A href="https://support.google.com/faqs/answer/6261372">Learn more</A>.
    </P>
  </DIV>
</BODY>

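The second page does not work with an XML request because its content is generated dynamically. An XML request only sees static HTML, that is, the HTML the server delivers first in response to the URL call. Anything that scripts add to the page afterwards is not part of the response.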
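
A quick way to find out whether a page is suitable for an XML request is to check whether a piece of text you can see in the browser also appears in the first delivered HTML. The sketch below assumes this approach. The function name StaticHtmlContains and its parameters are made up for illustration, and the User-Agent string is the one used in the edit above.

```vba
' Hypothetical helper: True if marker occurs in the HTML first delivered for url.
' False means the request failed or the text is probably added later by scripts.
Function StaticHtmlContains(ByVal url As String, ByVal marker As String) As Boolean
    Dim http As Object

    Set http = CreateObject("MSXML2.XMLHTTP.6.0")
    http.Open "GET", url, False
    http.setRequestHeader "User-Agent", _
        "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
    http.send

    ' Only inspect the body when the server answered successfully
    If http.Status = 200 Then
        StaticHtmlContains = (InStr(1, http.responseText, marker, vbTextCompare) > 0)
    End If
End Function
```

If the function returns False for text that is visible in the browser, the content is most likely loaded dynamically, and browser automation as in the question is the safer route.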

