Using classic ASP, how can I get or screen-scrape the meta tags of an HTML page?

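Question

With the code below I can reach the site and fetch the data, but I can't extract the meta title tag. Surprisingly, when I searched for ways to get meta tags while screen scraping with classic ASP, I found only a few examples, and I couldn't get any of them to work.

Any help?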

rss_url = "https://www.nationalgeographic.com/science/2019/06/opal-fossils-reveal-new-species-dinosaur-australia-fostoria/"

Set objHTTP = CreateObject("Microsoft.XMLHTTP")
'--- Open must be called before setRequestHeader
objHTTP.Open "GET", rss_url, False
objHTTP.setRequestHeader "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
objHTTP.Send

if objHTTP.Status = 200 Then sdata = BinaryToString(objHTTP.ResponseBody)

Set objHTTP = Nothing      

Set regEx = New RegExp
regEx.Pattern = "<meta.*property=""og:image"".*content=""(.*)"".*\/>"
regEx.IgnoreCase = True
Set matches = regEx.Execute(sdata)
if matches.Count > 0 then
KeywordAl = matches(0).SubMatches(0)
response.write "Image = " & KeywordAl&"<hr>"
end if

I'm including the BinaryToString function just for completeness:

Function BinaryToString(byVal Binary)
    '--- Converts the binary content to text using ADODB Stream

    '--- Set the return value in case of error
    BinaryToString = ""

    '--- Creates ADODB Stream
    Dim BinaryStream
    Set BinaryStream = CreateObject("ADODB.Stream")

    '--- Specify stream type.
    BinaryStream.Type = 1 '--- adTypeBinary

    '--- Open the stream and write the binary data to the object
    BinaryStream.Open
    BinaryStream.Write Binary

    '--- Change stream type to text
    BinaryStream.Position = 0
    BinaryStream.Type = 2 '--- adTypeText

    '--- Specify charset for the source text (unicode) data.
    BinaryStream.CharSet = "UTF-8"

    '--- Return converted text from the object
    BinaryToString = BinaryStream.ReadText
End Function 

Tags: asp-classic

Solution


Try this:

Function GetTextFromUrl(url)
  '--- Returns the response body as text, or "" if the status is not 200
  Dim oXMLHTTP
  GetTextFromUrl = ""
  Set oXMLHTTP = CreateObject("MSXML2.ServerXMLHTTP.3.0")
  oXMLHTTP.Open "GET", url, False
  oXMLHTTP.Send
  If oXMLHTTP.Status = 200 Then
    GetTextFromUrl = oXMLHTTP.responseText
  End If
  Set oXMLHTTP = Nothing
End Function

Dim sResult : sResult = GetTextFromUrl("https://www.nationalgeographic.com/science/2019/06/opal-fossils-reveal-new-species-dinosaur-australia-fostoria/")

Set regEx = New RegExp
regEx.Pattern = "<meta.*property=""og:image"".*content=""(.*)"".*\/>"
regEx.IgnoreCase = True
Set matches = regEx.Execute(sResult)
if matches.Count > 0 then
  KeywordAl = matches(0).SubMatches(0)
  response.write "Image = " & KeywordAl&"<hr>"
end if

For me, the output for that page is:

Image = https://www.nationalgeographic.com/content/dam/science/2019/05/22/gemstone-dino/og-fostoria_final.ngsversion.1559624211907.adapt.1900.1.jpg
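
The pattern above uses greedy `.*`, so on a page where several tags share one line the capture can run past the intended closing quote. A minimal sketch of a tighter variant follows. `GetMetaProperty` is a hypothetical helper name, not part of the original answer, and it assumes that `property` appears before `content`, that both attributes use double quotes, and that the property name contains no regex metacharacters.

```vbscript
' Hypothetical helper: returns the content attribute of the first
' <meta property="..."> tag found in html, or "" if there is no match.
Function GetMetaProperty(html, propName)
  Dim re, m
  GetMetaProperty = ""
  Set re = New RegExp
  re.IgnoreCase = True
  re.Global = False
  ' [^>]* keeps the match inside a single tag;
  ' [^""]* stops the capture at the closing quote of content.
  re.Pattern = "<meta[^>]*property=""" & propName & """[^>]*content=""([^""]*)"""
  Set m = re.Execute(html)
  If m.Count > 0 Then GetMetaProperty = m(0).SubMatches(0)
End Function

' Usage with the sResult string fetched above
Response.Write "Image = " & Server.HTMLEncode(GetMetaProperty(sResult, "og:image")) & "<hr>"
Response.Write "Title = " & Server.HTMLEncode(GetMetaProperty(sResult, "og:title")) & "<hr>"
```

A regular expression is still a fragile way to read HTML; if a page writes `content` before `property`, or uses single quotes, this sketch returns an empty string.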

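Edit: added some debugging information here. Try this snippet to see what it reports about your TLS version; the site may reject connections below a certain TLS level.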

Set objHttp = Server.CreateObject("WinHTTP.WinHTTPRequest.5.1") 
objHttp.open "GET", "https://howsmyssl.com/a/check", False 
objHttp.Send 
Response.Write objHttp.responseText 
Set objHttp = Nothing 
Response.End 
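
If that check reports a version below TLS 1.2, one thing to try is asking WinHTTP for TLS 1.2 explicitly and reading back the `tls_version` field from the JSON response. This is a sketch, not a confirmed fix for the asker's server: it assumes TLS 1.2 is available and enabled in the operating system, that the option may be set before `Open`, and that the response contains a `tls_version` string field. The constant names and the variable `objHttp12` are chosen here for illustration.

```vbscript
' WinHttpRequest option index for the allowed secure protocols
Const WinHttpRequestOption_SecureProtocols = 9
' Flag value for TLS 1.2 (&H0800)
Const SecureProtocol_TLS1_2 = 2048

Dim objHttp12, reTls, tlsMatches
Set objHttp12 = Server.CreateObject("WinHttp.WinHttpRequest.5.1")
' Assumption: the OS supports TLS 1.2; otherwise Send is expected to fail
objHttp12.Option(WinHttpRequestOption_SecureProtocols) = SecureProtocol_TLS1_2
objHttp12.Open "GET", "https://howsmyssl.com/a/check", False
objHttp12.Send

' Pull the "tls_version" value out of the JSON text with a regex
Set reTls = New RegExp
reTls.Pattern = """tls_version""\s*:\s*""([^""]*)"""
Set tlsMatches = reTls.Execute(objHttp12.ResponseText)
If tlsMatches.Count > 0 Then
  Response.Write "Negotiated: " & Server.HTMLEncode(tlsMatches(0).SubMatches(0))
Else
  Response.Write "tls_version not found in response"
End If
Set objHttp12 = Nothing
```

If this reports TLS 1.2 while the earlier snippet did not, the same option can be set on the request that fetches the target page, using `WinHttp.WinHttpRequest.5.1` in place of `MSXML2.ServerXMLHTTP.3.0`.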
