How to get the full href link with BeautifulSoup in Python

Problem description

I'm trying to get the top movie names by genre. I can't get the full href link; I'm stuck with only half of it.

With the code below, I get the following:

https://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=adventure&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=animation&sort=user_rating,desc&title_type=feature&num_votes=25000,
https://www.imdb.com/search/title?genres=biography&sort=user_rating,desc&title_type=feature&num_votes=25000,
.........

Like that, but what I actually want is the names of the top 100 movies for each genre: Action, Adventure, Animation, Biography, and so on.

I tried the following code:

from bs4 import BeautifulSoup
import requests

url = 'https://www.imdb.com'
main_url = url + '/chart/top'
res = requests.get(main_url)
soup = BeautifulSoup(res.text, 'html.parser')

# Each genre entry in the chart's subnav holds an <a> with a relative href
for href in soup.find_all(class_='subnav_item_main'):
    all_links = url + href.find('a').get('href')
    print(all_links)

I want the full link, like the one below:

/search/title?genres=action&sort=user_rating,desc&title_type=feature&num_votes=25000,&pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=5aab685f-35eb-40f3-95f7-c53f09d542c3&pf_rd_r=FM1ZEBQ7E9KGQSDD441H&pf_rd_s=right-6&pf_rd_t=15506&pf_rd_i=top&ref_=chttp_gnr_1
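For turning a relative href like the one above into an absolute URL, the standard library's `urllib.parse.urljoin` is a safer choice than string concatenation, since it handles leading slashes and already-absolute hrefs correctly. A minimal sketch (the relative path here is shortened for illustration):

```python
from urllib.parse import urljoin

base = 'https://www.imdb.com/chart/top'
# A shortened relative href of the kind the subnav links carry
relative = '/search/title?genres=action&sort=user_rating,desc&title_type=feature'

# urljoin resolves the leading slash against the scheme and host of `base`
full = urljoin(base, relative)
print(full)
# https://www.imdb.com/search/title?genres=action&sort=user_rating,desc&title_type=feature
```

If the href were already absolute, `urljoin` would return it unchanged, which naive concatenation would mangle.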

Tags: python, web-scraping, beautifulsoup, requests

Solution


You need another loop over those URLs, with the results limited to 100 per genre. I store the results in a dictionary where each key is a genre and each value is a list of films. Note that original titles may appear, e.g. The Mountain II (2016) shows up as Dag II (original title).

`links` is a list of tuples in which I store the genre as the first item and the URL as the second.

import requests, pprint
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

url = 'https://www.imdb.com/chart/top'
genres = {}

with requests.Session() as s:
    r = s.get(url)
    soup = bs(r.content, 'lxml')
    # Build (genre, absolute url) pairs; urljoin resolves the relative hrefs
    links = [(i.text, urljoin(url, i['href'])) for i in soup.select('.subnav_item_main a')]

    for link in links:
        r = s.get(link[1])
        soup = bs(r.content, 'lxml')
        # The lazy-loaded poster images (.loadlate) carry the film title in their alt attribute
        genres[link[0].strip()] = [i['alt'] for i in soup.select('.loadlate', limit=100)]

pprint.pprint(genres)

Sample output: a dictionary mapping each genre name to a list of up to 100 film titles.
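Once built, `genres` is an ordinary dict keyed by genre name, so it can be consumed like any other dict. A minimal sketch of the resulting structure, using invented titles in place of live scrape results:

```python
import pprint

# Invented stand-in for the scraped result (titles here are hypothetical)
genres = {
    'Action': ['The Dark Knight', 'Inception'],
    'Adventure': ['Interstellar', 'North by Northwest'],
}

# Summarize each genre: film count and the top-ranked entry
for genre, films in genres.items():
    print(f'{genre}: {len(films)} films, top film: {films[0]}')

pprint.pprint(genres)
```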
