python - BeautifulSoup's find method returns None instead of the link
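Problem description
Thanks for taking a look at my question. I'm trying to get the next-page link from an old Reddit page, but for some reason the find method returns a None object. Code: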
def crawl(self):
    curr_page_url = self.start_url
    curr_page = requests.get(curr_page_url)
    bs = BeautifulSoup(curr_page.text, 'lxml')
    # all_links = GetAllLinks(self.start_url)
    nxtlink = bs.find('a', attrs={'rel': 'nofollow next'})['href']
    print(nxtlink)
The page is an old Reddit page, and the next-page link I'm trying to get sits inside a span tag:
<span class="next-button">
    <a href="https://old.reddit.com/r/learnprogramming/?count=25&after=t3_j54ae2" rel="nofollow next">next ›
    </a>
</span>
Solution
I think you need to add headers to the request; otherwise the server treats you as a bot (which, in this case, is accurate).
Try this:
import requests
from bs4 import BeautifulSoup
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Accept-Language": "en-GB,en;q=0.5",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:81.0) Gecko/20100101 Firefox/81.0",
}
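# Send browser-like headers so the server does not reject the request as a bot.
response = requests.get("https://old.reddit.com/r/learnprogramming/", headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
next_link = soup.find('a', attrs={'rel': 'nofollow next'})['href']
print(next_link)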
Output:
https://old.reddit.com/r/learnprogramming/?count=25&after=t3_j5ezm8
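As a side note, find returns None whenever no tag matches, and indexing None with ['href'] then raises a TypeError, so it may help to check the result before reading the attribute. Below is a minimal sketch of that guard. The helper name find_next_link is hypothetical (it is not part of BeautifulSoup), and the selector assumes the span.next-button markup shown in the question:

```python
from typing import Optional

from bs4 import BeautifulSoup


def find_next_link(html: str) -> Optional[str]:
    """Return the href of the "next" pagination link, or None if it is absent.

    Hypothetical helper: assumes the link sits inside <span class="next-button">,
    as in the HTML quoted in the question.
    """
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns None when nothing matches, so check before reading href.
    anchor = soup.select_one("span.next-button a")
    if anchor is None:
        return None
    # Tag.get returns None instead of raising if the attribute is missing.
    return anchor.get("href")
```

Calling find_next_link(response.text) on the result of the request above should then give either the URL or None. A None result usually means the server returned a different page than expected (for example a bot-block page), which is easier to diagnose than a TypeError.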