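Python requests like a browser?

Question

I want to fetch the web document at "https://www.fanfiction.net/s/5218118/1/", but unfortunately I can't replicate the browser's behaviour: the server always sends me something like "Please enable cookies" or "Complete the captcha". Is there a way to send requests like a browser, so that the server serves me the same document it would serve a browser? I have already googled and tried to integrate cookies and a fake user agent. This is my code: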

import requests
from fake_useragent import UserAgent

url = 'https://www.fanfiction.net/s/5218118/1/'

ua = UserAgent()
S = requests.Session()

header = {'User-Agent':str(ua.chrome)}
res = S.get(url, headers=header)
cookies = dict(res.cookies)


response = S.get(url, headers=header, cookies=cookies)

Thanks in advance! Edit: I know I could use Selenium, but I don't want to keep updating my chromedriver, and I don't want to spend performance on Selenium either.

Tags: python, web-scraping, cookies, python-requests

Answer

I saw your edit, but just in case...

A simple Selenium example that gives you the story text:

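from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Selenium 4 takes the driver path via a Service object; the raw string
# keeps the backslashes in the Windows path literal.
service = Service(r'C:\Program Files\ChromeDriver\chromedriver.exe')
browser = webdriver.Chrome(service=service)
browser.get('https://www.fanfiction.net/s/5218118/1/')

# Parse the rendered page and print the story text.
soup = BeautifulSoup(browser.page_source, 'lxml')
print(soup.select_one('#storytext').get_text())

# quit() ends the whole driver session, not just the current window.
browser.quit()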

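Edit

Edited based on your question and the fact that the site is protected by Cloudflare against DDoS attacks.

You can extract the tag text with Selenium alone, but as in the example above, I use BeautifulSoup.

You are right. Inspected with the developer tools, the HTML of the tags section looks like this: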

<span class="xgray xcontrast_txt">
  Rated: <a class="xcontrast_txt" href="https://www.fictionratings.com/" target="rating">Fiction T</a> - English - Romance/Adventure - Naruto U., Hinata H. - Chapters: 6 - Words: 14,894 - Reviews: <a href="/r/13747729/">5</a>
  - Favs: 29 - Follows: 24 - Updated:
  <span data-xutime="1610096566">33m ago</span>
  - Published: 
  <span data-xutime="1605552788">Nov 16, 2020</span>
  - id: 13747729 
  </span>

It is a span with the classes xgray and xcontrast_txt, so we select it like this:

tags = soup.select_one('span.xgray.xcontrast_txt').get_text(strip=True)
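To try the selector without opening a browser, here is a minimal sketch that parses a shortened copy of the HTML fragment above. It uses the built-in 'html.parser' so it does not need lxml. Reading the data-xutime attribute as a Unix timestamp is an assumption, based only on its value matching the "Nov 16, 2020" date shown next to it.

```python
from datetime import datetime, timezone

from bs4 import BeautifulSoup

# Shortened copy of the fragment shown above.
html = """
<span class="xgray xcontrast_txt">
  Rated: <a class="xcontrast_txt" href="https://www.fictionratings.com/" target="rating">Fiction T</a> - English - Chapters: 6 - Words: 14,894
  - Updated: <span data-xutime="1610096566">33m ago</span>
  - Published: <span data-xutime="1605552788">Nov 16, 2020</span>
  - id: 13747729
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# "span.xgray.xcontrast_txt" matches a <span> that carries both classes.
meta_span = soup.select_one("span.xgray.xcontrast_txt")

# Joining the text nodes with a space keeps the words from running together.
text = meta_span.get_text(" ", strip=True)

# Assumption: data-xutime holds seconds since the Unix epoch.
stamps = [int(s["data-xutime"]) for s in meta_span.select("span[data-xutime]")]
updated, published = (datetime.fromtimestamp(t, tz=timezone.utc) for t in stamps)

print(text)
print(published.date())
```

Passing a separator to get_text is optional; with strip=True alone, as in the line above, the pieces are concatenated without spaces between them.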

You may want to read more about BeautifulSoup.

Example

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

# Selenium 4 takes the driver path via a Service object; the raw string
# keeps the backslashes in the Windows path literal.
service = Service(r'C:\Program Files\ChromeDriver\chromedriver.exe')
browser = webdriver.Chrome(service=service)
browser.get('https://www.fanfiction.net/s/5218118/4/Yet-again-with-a-little-extra-help')

try:
    # Wait up to 10 seconds for the element with id 'storytext' to appear.
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.ID, 'storytext'))
    )

    soup = BeautifulSoup(browser.page_source, 'lxml')

    storytext = soup.select_one('#storytext').get_text()
    tags = soup.select_one('span.xgray.xcontrast_txt').get_text(strip=True)

    print(tags)
    print(storytext)

finally:
    # quit() ends the whole driver session, not just the current window.
    browser.quit()
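
If you want the individual fields instead of one long string, the tags text can be split further. The following is only a sketch: parse_story_meta is a hypothetical helper, and the sample string is roughly what get_text(strip=True) would return for the HTML fragment shown earlier. It assumes that fields are separated by a hyphen followed by whitespace, and that labelled fields contain a colon.

```python
import re


def parse_story_meta(text):
    """Split a metadata line into its labelled fields (hypothetical helper)."""
    fields = {}
    # Assumption: fields are separated by a hyphen followed by whitespace.
    for part in re.split(r"\s*-\s+", text.strip()):
        key, sep, value = part.partition(":")
        if sep:  # keep "Label: value" parts; unlabelled parts such as the language are skipped
            fields[key.strip()] = value.strip()
    return fields


sample = (
    "Rated:Fiction T- English - Romance/Adventure - Naruto U., Hinata H. "
    "- Chapters: 6 - Words: 14,894 - Reviews:5- Favs: 29 - Follows: 24 "
    "- Updated:33m ago- Published:Nov 16, 2020- id: 13747729"
)

meta = parse_story_meta(sample)
word_count = int(meta["Words"].replace(",", ""))

print(meta["Rated"], meta["Chapters"], word_count)
```

A value that itself contains a hyphen followed by a space would be split incorrectly, so treat this as a starting point, not a robust parser.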
