excel - 网页抓取 ETF 每日数据 VBA
问题描述
我正在尝试通过网络抓取不同 ETF 的一些日常信息。我发现https://www.marketwatch.com/有准确的信息。最相关的信息是 ETF 的开盘价、流通股、资产净值、总资产。这是 IVV 美国股票的链接:https ://www.marketwatch.com/investing/fund/ivv
我之前用 VBA 抓取过网页,但我使用的页面的 HTML 不同,我不知道这是不是因为 ETF 的某些值(例如价格和 Taded Volume)不断变化。这个想法是创建一个代码来提取相关信息并创建一个数据库来分析宏观经济因素,使用 ETF 作为国家、地区等之间流动的市场指标......
我的第一种方法是使用 VBA,但在我更深入地了解数据之后,我想尝试使用 Python(在我对它更加自信之后)来每天自动执行网络抓取过程。
我对任何可能有用的建议或任何其他网站持开放态度(我曾尝试使用 Yahoo Finance 和 Morningstar,但我在 HTML 代码中遇到了同样的问题)。
这是我糟糕的代码:
Sub Get_Data()
Dim ticker As String, enlace As String
ticker = ThisWorkbook.Worksheets("ETFs").Cells(2, 2).Value 'IVV
'link = "https://www.morningstar.com/etfs/arcx/" & ticker & "/quote.html"
'link = "https://finance.yahoo.com/quote/" & ticker & "?p=" & ticker
link = "https://www.marketwatch.com/investing/fund/" & ticker
Application.ScreenUpdating = False
Dim x As Integer
x = ThisWorkbook.Worksheets("ETFs").Cells(Rows.Count, 1).End(xlUp).Row
'Dim i As Integer
'For i = 2 To x
Dim total_net_assets As Variant, open_price As Variant, NAV As Variant, shares_out
Set ie = CreateObject("InternetExplorer.application")
With ie
.Visible = False
.navigate link
While .Busy Or .readyState < 4: DoEvents: Wend
Do
DoEvents
On Error Resume Next
' Here is where I get the problem of not knowing how to reference the values I need because the class name appears repeatedly
total_net_assets = .document.getElementsByClassName("").Value
open_price = .document.getElementByClassName("price").Value
NAV = .document.getElementByClassName("").Value
shares_out = .document.getElementByClassName("kv__value kv__primary ").Value
On Error GoTo 0
Loop
End With
ThisWorkbook.Worksheets("ETFs").Cells(2, 13).Value = total_net_assets
ThisWorkbook.Worksheets("ETFs").Cells(2, 14).Value = NAV
ThisWorkbook.Worksheets("ETFs").Cells(2, 15).Value = open_price
ThisWorkbook.Worksheets("ETFs").Cells(2, 16).Value = shares_out
ie.Quit
'Next i
Application.ScreenUpdating = True
End Sub
解决方案
访问方法:
我使用XMLHTTP请求的速度比打开 IE 快得多。
代码注释:
下面从 Sheet1 列 A 中读取基金短代码,从 开始A2
,到一个数组中。您可以轻松地将其扩展为 A 列添加更多资金。
通过将基金代码连接到BASE_URL
变量中,该数组循环发出 XMLHTTP 请求。
我使用一个类 ,clsHTTP
来保存 XMLHTTP 对象以提高效率 - 无需继续创建和销毁该对象。
我为这个类提供了两种方法。一个用于检索目标页面 innerHTML ( GetString
),另一个用于提取所需信息(如果可用)(GetInfo
)。我使用字典来测试搜索的标签是否存在。如果存在,我会获取相关的值。如果没有,我vbNullString
在字典中有一个占位符。
我将每个抓取的结果添加到一个名为results
. 最后,我将这段文字循环到纸上。通过将大部分工作保留在内存中,这提供了更快的抓取速度。
从 HTML 中检索信息:
labels
egOpen
和成对出现values
。
您可以通过使用方法应用类 CSS 选择器来生成一个nodeList
(将集合视为 with ),以通过它们的类名称来收集标签元素。是类选择器。getElementsByClassName
querySelectorAll
kv__label
"."
Set labels = .querySelectorAll(".kv__label") '<== nodeList of labels
您可以执行相同的操作来获取关联的值:
Set values = .querySelectorAll(".kv__value.kv__primary") '<== nodeList of associated values. Same length as labels nodeList so can use same index to retrieve associated label/value pairs from each nodeList.
clsHTTP
您在方法中使用字典循环标签.GetInfo
以查看您搜索的标签是否存在,如果存在,则使用与在 中找到标签的位置相同的索引从值中检索关联的值nodeList
labels
,并且字典vbNullString
值为该标签将使用实际检索到的值进行更新,否则将保留为vbNullString
.
样本结果:
VBA:
类模块 clsHTTP:
Option Explicit
Private http As Object
Private Sub Class_Initialize()
Set http = CreateObject("MSXML2.XMLHTTP")
End Sub
Public Function GetString(ByVal url As String) As String
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function
标准模块1:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, http As clsHTTP, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set http = New clsHTTP
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = http.GetString(url)
results.Add http.GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
设置:
不使用类:
Option Explicit
Public Sub GetFundInfo()
Dim sResponse As String, html As HTMLDocument, i As Long
Dim headers(), funds(), url As String, results As Collection, ws As Worksheet
Const BASE_URL As String = "https://www.marketwatch.com/investing/fund/"
Application.ScreenUpdating = False
headers = Array("Open", "Shares Outstanding", "Total Net Assets", "NAV")
Set results = New Collection
Set ws = ThisWorkbook.Worksheets("Sheet1")
Set html = New HTMLDocument
funds = Application.Transpose(ws.Range("A2:A3").Value) '<== Change the range here to the single column range containing your dotNums.
For i = LBound(funds) To UBound(funds)
If Not IsEmpty(funds(i)) Then
url = BASE_URL & funds(i)
html.body.innerHTML = GetString(url)
results.Add GetInfo(html).Items
End If
Next
If results.Count > 0 Then
Dim item As Variant, r As Long, c As Long
r = 2: c = 2
With ws
.Cells(1, c).Resize(1, UBound(headers) + 1) = headers
For Each item In results
.Cells(r, c).Resize(1, UBound(item) + 1) = item
r = r + 1
Next
End With
End If
Application.ScreenUpdating = True
End Sub
Public Function GetString(ByVal url As String) As String
Dim http As Object
Set http = CreateObject("MSXML2.XMLHTTP")
Dim sResponse As String
With http
.Open "GET", url, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
GetString = sResponse
End With
End Function
Public Function GetInfo(ByVal html As HTMLDocument) As Object
Dim dict As Object, i As Long
Set dict = CreateObject("Scripting.Dictionary")
dict.Add "Open", vbNullString
dict.Add "Shares Outstanding", vbNullString
dict.Add "Total Net Assets", vbNullString
dict.Add "NAV", vbNullString
Dim values As Object, labels As Object
With html
Set values = .querySelectorAll(".kv__value.kv__primary")
Set labels = .querySelectorAll(".kv__label")
For i = 0 To labels.Length - 1
If dict.Exists(labels.item(i).innerText) Then dict(labels.item(i).innerText) = values.item(i).innerText
Next
End With
Set GetInfo = dict
End Function
推荐阅读
- javascript - 如何将 Prop 从类组件传递到功能组件
- mysql - 添加自定义自动增量值
- c - cs50 pset1 cash.c 预期表达式
- python - 如何使用 Keras 实现 CNN-LSTM
- python - Python if 语句结果未打印
- r - dplyr mutate_at 并一起重命名
- python - 如何显示配置文件中的值
- python - GST_ELEMENT_FACTORY gstelementfactory.c:467:gst_element_factory_make:没有这样的元素工厂“voaacenc”
- c# - 如何在 .NET Framework Web 应用程序中为 Web Api 进行自定义模型绑定?
- cakephp-3.x - 片段缓存 cakephp 4