User-agent/cookie workaround for web scraping in MATLAB

Question

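I've been trying for a few days (using other answers on this site and on MathWorks) to get around the crumb that Yahoo Finance appends to the link for downloading a CSV file. For example, for a CSV with Nasdaq100 data, Chrome gives the link https://query1.finance.yahoo.com/v7/finance/download/%5ENDX?period1=496969200&period2=1519513200&interval=1d&events=history&crumb=dnhBC8SRS9G (obtained by clicking the "Download Data" button on the Yahoo Finance page).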

The crumb=dnhBC8SRS9G apparently changes depending on the cookie and the user agent, so I tried to configure MATLAB to pose as a Chrome browser (copying the cookie and user agent from Chrome):

useragent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36';

cookie ='PRF=t%3D%255ENDX; expires=Thu, 11-Jun-2020 09:06:31 GMT; path=/; domain=.finance.yahoo.com';

opts = weboptions('UserAgent',useragent,'KeyName','WWW_Authenticate','KeyValue','dnhBC8SRS9G','KeyName','Cookie','KeyValue',cookie)

url = 'https://query1.finance.yahoo.com/v7/finance/download/^NDX?period1=496969200&period2=1519513200&interval=1d&events=history&crumb=dnhBC8SRS9G' ;

response = webread(url,opts)

But whatever I do (using either webread or the add-on function urlread2), I get an "Unauthorized" response. The MATLAB code above gives me this response:

Error using readContentFromWebService (line 45)
The server returned the status 401 with message "Unauthorized" in response to the request to URL
https://query1.finance.yahoo.com/v7/finance/download/%5ENDX?period1=496969200&period2=1519513200&interval=1d&events=history&crumb=dnhBC8SRS9G.

Error in webread (line 122)
[varargout{1:nargout}] = readContentFromWebService(connection, options);

Error in TEST2 (line 22)
response = webread(url,opts)

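Any help would be much appreciated. I just want to get the basics working, even if that means I have to manually copy the crumb from the Chrome browser into MATLAB before the first request. (I've seen this solved in Python, C#, etc., and I followed those solutions as closely as I could, so it should be doable in MATLAB too, right?)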

EDIT: In case it helps, when I run urlread2 instead of webread at the end of the code, i.e.:

[output,extras] = urlread2(url,'GET');
extras.firstHeaders

I get the following output from MATLAB:

ans = 

  struct with fields:

                   Response: 'HTTP/1.1 401 Unauthorized'
     X_Content_Type_Options: 'nosniff'
           WWW_Authenticate: 'crumb'
               Content_Type: 'application/json;charset=utf-8'
             Content_Length: '136'
                       Date: 'Tue, 12 Jun 2018 13:07:38 GMT'
                        Age: '0'
                        Via: 'http/1.1 media-router-omega4.prod.media.ir2.yahoo.com (ApacheTrafficServer [cMsSf ]), http/1.1 media-ncache-api17.prod.media.ir2.yahoo.com (ApacheTrafficServer [cMsSf ]), http/1.1 media-ncache-api15.prod.media.ir2.yahoo.com (ApacheTrafficServer [cMsSf ]), http/1.1 media-router-api12.prod.media.ir2.yahoo.com (ApacheTrafficServer [cMsSf ]), https/1.1 e3.ycpi.seb.yahoo.com (ApacheTrafficServer [cMsSf ])'
                     Server: 'ATS'
                    Expires: '-1'
              Cache_Control: 'max-age=0, private'
  Strict_Transport_Security: 'max-age=15552000'
                 Connection: 'keep-alive'
                  Expect_CT: 'max-age=31536000, report-uri="http://csp.yahoo.com/beacon/csp?src=yahoocom-expect-ct-report-only"'
Public_Key_Pins_Report_Only: 'max-age=2592000; pin-sha256="2fRAUXyxl4A1/XHrKNBmc8bTkzA7y4FB/GLJuNAzCqY="; pin-sha256="2oALgLKofTmeZvoZ1y/fSZg7R9jPMix8eVA6DH4o/q8="; pin-sha256="Gtk3r1evlBrs0hG3fm3VoM19daHexDWP//OCmeeMr5M="; pin-sha256="I/Lt/z7ekCWanjD0Cvj5EqXls2lOaThEA0H2Bg4BT/o="; pin-sha256="JbQbUG5JMJUoI6brnx0x3vZF6jilxsapbXGVfjhN8Fg="; pin-sha256="SVqWumuteCQHvVIaALrOZXuzVVVeS7f4FGxxu6V+es4="; pin-sha256="UZJDjsNp1+4M5x9cbbdflB779y5YRBcV6Z6rBMLIrO4="; pin-sha256="Wd8xe/qfTwq3ylFNd3IpaqLHZbh2ZNCLluVzmeNkcpw="; pin-sha256="WoiWRyIOVNa9ihaBciRSC7XHjliYS9VwUGOIud4PB18="; pin-sha256="cAajgxHlj7GTSEIzIYIQxmEloOSoJq7VOaxWHfv72QM="; pin-sha256="dolnbtzEBnELx/9lOEQ22e6OZO/QNb6VSSX2XHA3E7A="; pin-sha256="i7WTqTvh0OioIruIfFR4kMPnBqrS2rdiVPl/s2uC/CY="; pin-sha256="iduNzFNKpwYZ3se/XV+hXcbUonlLw09QPa6AYUwpu4M="; pin-sha256="lnsM2T/O9/J84sJFdnrpsFp3awZJ+ZZbYpCWhGloaHI="; pin-sha256="r/mIkG3eEpVdm+u/ko/cwxzOMo1bk4TyHIlByibiA5E="; pin-sha256="uUwZgwDOxcBXrQcntwu+kYFpkiVkOaezL0WYEZ3anJc="; includeSubdomains; report-uri="http://csp.yahoo.com/beacon/csp?src=yahoocom-hpkp-report-only"'

My weboptions output is:

opts = 

  weboptions with properties:

  CharacterEncoding: 'auto'
          UserAgent: 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36'
            Timeout: 5
           Username: ''
           Password: ''
            KeyName: ''
           KeyValue: ''
        ContentType: 'auto'
      ContentReader: []
          MediaType: 'application/x-www-form-urlencoded'
      RequestMethod: 'auto'
        ArrayFormat: 'csv'
       HeaderFields: {'Cookie'  'PRF=t%3D%255ENDX; expires=Thu, 11-Jun-2020 09:06:31 GMT; path=/; domain=.finance.yahoo.com'}
CertificateFilename: '/opt/matlab/r2017a/sys/certificates/ca/rootcerts.pem'
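
One detail of the setup above may be worth checking. The cookie string includes expires, path and domain, which are attributes of a Set-Cookie response header; a Cookie request header carries only name=value pairs. The following is a minimal sketch, assuming a MATLAB release whose weboptions supports HeaderFields (as the output above suggests), of stripping those attributes before building the options. The variable names setCookie and cookiePair are illustrative, and this alone is not claimed to satisfy Yahoo's crumb check.

```matlab
% User agent and cookie string copied from the question.
useragent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36';
setCookie = 'PRF=t%3D%255ENDX; expires=Thu, 11-Jun-2020 09:06:31 GMT; path=/; domain=.finance.yahoo.com';

% Keep only the leading name=value pair; the rest are Set-Cookie attributes.
parts = strsplit(setCookie, ';');
cookiePair = strtrim(parts{1});

% Send the pair as a Cookie request header (no request is made here).
opts = weboptions('UserAgent', useragent, 'HeaderFields', {'Cookie', cookiePair});
disp(cookiePair);
```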

Tags: matlab, csv, cookies, web-scraping, user-agent

Answer


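OK, after experimenting with curl for a while, it appears that what you are trying to do is not possible at the specified URL. Note that the crumb and the cookie change frequently, so every time I ran the script I had to parse the responses of two GET requests to obtain their values.

I'll walk you through my attempt.

  1. Send a GET request and save the cookie file.
  2. Parse the cookie file for the cookie.
  3. Print the cookie to a file.
  4. Send a GET request and save the HTML.
  5. Parse the HTML and obtain the crumb.
  6. Form the URL.
  7. Form the curl request.
  8. Execute the request.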
The code:

%Get cookie.
command = 'curl -s --cookie-jar cookie.txt https://finance.yahoo.com/quote/GOOG?p=GOOG';
%Execute request.
system(command);
%Read file.
cookie_file = fileread('cookie.txt');
%regexp the cookie.
cookie = regexp(cookie_file,'B\s*(.*)','tokens');
cookie = cell2mat(cookie{1});

%Print cookie to file (for curl purposes only).
file = fopen('mycookie.txt','w');
fprintf(file,'%s',cookie);
%Close the file so its contents are flushed before curl reads it.
fclose(file);

%Get request.
command = 'curl https://finance.yahoo.com/quote/GOOG?p=GOOG > goog.txt';
%Execute request.
system(command);
%Read file.
crumb_file = fileread('goog.txt');
%regexp the crumb.
crumb = regexp(crumb_file,'(?<="CrumbStore":{"crumb":")(.*)(?="},"UserStore":)','tokens');
crumb = crumb{:};

%Form the URL.
url = 'https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1492524105&period2=1495116105&interval=1d&events=history&crumb=';
url = strcat(url,crumb);

%Form the curl command.
command = strcat('curl',{' '},'-v -L -b',{' '},'mycookie.txt',{' '},'-H',{' '},'"User-Agent:',{' '},'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36','"',{' '},'"',url,'"');
command = command{1};
system(command);
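
The crumb extraction and command assembly above can also be sketched more compactly. The snippet below is an illustration only: sampleHtml is a hypothetical stand-in for the contents of goog.txt, the variable names are mine, and no request is sent. It uses a non-greedy token so the match stops at the first closing quote, and sprintf instead of strcat, since strcat strips trailing whitespace from char inputs (which is why the code above needs the {' '} cells).

```matlab
% Hypothetical page fragment standing in for the contents of goog.txt.
sampleHtml = '{"CrumbStore":{"crumb":"dSpwQstrQDp"},"UserStore":{}}';

% Non-greedy token: capture everything up to the first closing quote.
tok = regexp(sampleHtml, '"CrumbStore":\{"crumb":"(.*?)"\}', 'tokens', 'once');
if isempty(tok)
    error('Crumb not found in the page source.');
end
crumb = tok{1};   % char vector

baseUrl = 'https://query1.finance.yahoo.com/v7/finance/download/AAPL?period1=1492524105&period2=1495116105&interval=1d&events=history&crumb=';
userAgent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36';

% Build the whole command in one call; only the format string is interpreted.
command = sprintf('curl -v -L -b mycookie.txt -H "User-Agent: %s" "%s%s"', ...
    userAgent, baseUrl, crumb);
disp(command);
```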

The final curl request:

curl -v -L -b mycookie.txt -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.79 Safari/537.36" "https://query1.finance.yahoo.com/v7/finance/download/^NDX?period1=496969200&period2=1519513200&interval=1d&events=history&crumb=dSpwQstrQDp"

In the final curl request, I used the following flags:

-v: verbosity
-L: follow redirects
-b: use cookie file
-H: user agent header field (tried spoofing it with my browser)
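
On the -b flag: curl treats its argument as a file name when it contains no "=", and as literal cookie data (name=value) when it does. The jar written by --cookie-jar uses the Netscape cookie-file format, with seven tab-separated fields per line, the last two being the cookie name and value. A rough sketch, assuming a made-up jar line (real values would come from cookie.txt), of turning one such line into an inline -b argument:

```matlab
% Hypothetical jar line: domain, subdomain flag, path, secure, expiry, name, value.
jarLine = sprintf('.yahoo.com\tTRUE\t/\tFALSE\t1560000000\tB\tabc123');

% Split on tabs without collapsing, so empty fields keep their position.
fields = strsplit(jarLine, sprintf('\t'), 'CollapseDelimiters', false);
assert(numel(fields) == 7, 'Unexpected cookie-jar line format.');

% Fields 6 and 7 hold the cookie name and value.
cookiePair = sprintf('%s=%s', fields{6}, fields{7});
cookieFlag = sprintf('-b "%s"', cookiePair);
disp(cookieFlag);
```

This only shows how the argument could be formed; it is not a claim that the server would accept the cookie.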

For every attempt, the response was the following:

 {
    "finance": {
        "error": {
            "code": "Unauthorized",
            "description": "Invalid cookie"
        }
    }
}
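
If the response body is captured in MATLAB (for example through the second output of system), a rejection like the one above can be detected programmatically. A minimal sketch, assuming jsondecode is available (R2016b or later) and using the response text as a literal string; the variable names body and reply are illustrative:

```matlab
% Response body as shown above, written as a literal for illustration.
body = '{"finance":{"error":{"code":"Unauthorized","description":"Invalid cookie"}}}';

% Decode the JSON into a nested struct.
reply = jsondecode(body);

% Report the error if the server returned one instead of CSV data.
hasError = isfield(reply, 'finance') && isfield(reply.finance, 'error') ...
    && ~isempty(reply.finance.error);
if hasError
    fprintf('Request rejected: %s (%s)\n', ...
        reply.finance.error.code, reply.finance.error.description);
end
```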

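I studied the server response: the client sent every header value successfully, yet the request always resulted in the same error. I now suspect that you simply cannot do this anymore, as explained here. As a user there pointed out, you may need to do your web scraping from a different location. If you find a URL that works, you could open a new question and I'd be happy to help.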

