Can I accept or ignore the Google privacy notice when web scraping with BeautifulSoup?

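Problem description

When I run the following code from the console, I cannot see the HTML of the Google News page. The HTML I get back is the Google privacy notice instead (the one that starts with "Before you continue").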

from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get("https://www.google.com/news", headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
print(soup.prettify())

Is there a way to prevent the privacy notice from appearing at all?

A snippet of what I get:

  <title>
   Before you continue
  </title>
  <meta content="initial-scale=1, maximum-scale=5, width=device-width" name="viewport"/>
  <link href="//www.google.com/favicon.ico" rel="shortcut icon"/>
 </head>
 <body>
  <div class="signin">
   <a class="button" href="https://accounts.google.com/ServiceLogin?hl=en-US&amp;continue=https://news.google.com/topics/CAAqBwgKMKHQ9Qowlc7cAg&amp;gae=cb-">
    Sign in
   </a>
  </div>
  <div class="box">
   <img alt="Google" height="28" src="//www.gstatic.com/images/branding/googlelogo/1x/googlelogo_color_68x28dp.png" srcset="//www.gstatic.com/images/branding/googlelogo/2x/googlelogo_color_68x28dp.png 2x" width="68"/>
   <div class="productLogoContainer">
    <img alt="" aria-hidden="true" class="image" height="100%" src="https://www.gstatic.com/ac/cb/scene_cookie_wall_search_v2.svg" width="100%"/>
   </div>

Tags: python, web-scraping, beautifulsoup

Solution


You can set the CONSENT cookie so that you do not get the "Before you continue" page:

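import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}
# Send a CONSENT cookie with the request so that the
# "Before you continue" page is not returned.
cookies = {"CONSENT": "YES+cb.20210720-07-p0.en+FX+410"}
r = requests.get(
    "https://www.google.com/news", headers=headers, cookies=cookies
)
# Parse the response body and print the formatted HTML.
soup = BeautifulSoup(r.content, "html.parser")
print(soup.prettify())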
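
Because the cookie value above may stop being accepted at some point, it can be useful to check whether the response is still the consent page before parsing it. The following is only a minimal sketch, not part of the original answer: the helper name is_consent_page is hypothetical, it uses the standard library parser so it runs without extra dependencies, and it assumes the page can be recognized by the English title "Before you continue" shown in the question (the title may differ in other locales).

```python
from html.parser import HTMLParser


class _TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def is_consent_page(html):
    """Return True if the HTML looks like the consent page.

    Assumption: the page is identified only by its English title,
    "Before you continue", as seen in the question's snippet.
    """
    parser = _TitleParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.title.strip().lower().startswith("before you continue")
```

With the code from the answer, this could be called as is_consent_page(r.text) right after the request; a True result would suggest that the cookie was not accepted and the consent page came back anyway.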
