Web scraping with Python: the loop gets stuck

Problem description

This is just a simple program that scrapes the top 250 movies on IMDb. But when I try to visit each movie's link to get more information, the loop gets stuck.

import requests
from bs4 import BeautifulSoup

website="https://www.imdb.com/chart/top/"
d={}
r=requests.get(website, headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"})
c=r.content
soup=BeautifulSoup(c, "html.parser")

all=soup.find_all("td",{"class": "titleColumn"})

for item in all:
    d["Name"]=item.find("a").text
    #print(d["Name"])
    d["Links"]="https://www.imdb.com"+item.find("a").get("href")
    #print(d["Links"])
    r2=requests.get(d["Links"], headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac 
    OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 
    Safari/537.36"})
    c2=r2.content
    soup2=BeautifulSoup(c2, "html.parser")
    d["Info"]=soup2.find("div", class=False)
    print(d["Info"])

Can't we scrape more than one page at a time? I am using a Jupyter notebook.

I was trying to get the movie summary, but then realized the loop gets stuck on this statement (I found this using print statements).

Tags: python-3.x, loops, web-scraping

Solution


I'm not entirely sure why your loop gets stuck, but it looks like you have a syntax error in class=False; at least that's what my IDE says. In any case, you can scrape multiple pages, but to do that you first need to collect the URLs.
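
For reference, class is a reserved keyword in Python, which is why BeautifulSoup expects class_ instead. Below is a minimal, self-contained sketch of the inner part of your loop with that fixed; the URL is just the first title from the chart used as an example, and the timeout value is only a suggestion so a slow response cannot hang the loop forever.

import requests
from bs4 import BeautifulSoup

# example detail-page URL (the first title from the chart)
url = "https://www.imdb.com/title/tt0111161/"
ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36")

# timeout is a suggested guard so a slow response cannot block the loop forever
r2 = requests.get(url, headers={"User-Agent": ua}, timeout=10)
soup2 = BeautifulSoup(r2.content, "html.parser")

# class_ is BeautifulSoup's escape for the reserved keyword "class"
print(soup2.find("div", class_=False))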

Here is a working example that builds a list of dictionaries shaped like this:

[{url: movie_url, title: movie_title, summary: movie_summary}]

The code:

import csv

import requests
import unicodedata

from bs4 import BeautifulSoup


# collect the detail-page URL for each title in the Top 250 chart
def get_movie_urls(main_url="https://www.imdb.com/chart/top/") -> list:
    r = requests.get(main_url).content
    soup = BeautifulSoup(r, "html.parser")
    titles = soup.find_all("td", {"class": "titleColumn"})
    return [f"https://imdb.com{title.find('a').get('href')}" for title in titles]


# fetch one movie page and pull out its title and summary text
def get_movie_info(url: str) -> dict:
    s = BeautifulSoup(requests.get(url).content, "html.parser")
    title_wrapper = s.find("div", {"class": "title_wrapper"}).find("h1")
    summary = s.find("div", {"class": "summary_text"})
    title = title_wrapper.text.strip().split('(')[0].rstrip()
    return {
        "url": url,
        "title": unicodedata.normalize("NFKD", title),
        "summary": summary.text.strip(),
    }


# limit=None scrapes every movie; pass a small limit while testing
def scrape_imdb(limit=None) -> list:
    return [get_movie_info(url) for url in get_movie_urls()[:limit]]


# write the scraped dictionaries to imdb_top250.csv, one row per movie
def dump_to_csv(imdb_data: list):
    keys = imdb_data[0].keys()
    with open('imdb_top250.csv', 'w', newline='') as f:
        w = csv.DictWriter(f, keys)
        w.writeheader()
        w.writerows(imdb_data)


data = scrape_imdb(limit=2)
print(data)
dump_to_csv(data)

This outputs:

[{'url': 'https://imdb.com/title/tt0111161/', 'title': 'The Shawshank Redemption', 'summary': 'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.'}, {'url': 'https://imdb.com/title/tt0068646/', 'title': 'The Godfather', 'summary': 'The aging patriarch of an organized crime dynasty transfers control of his clandestine empire to his reluctant son.'}]

To scrape all top 250 movies, just drop the limit argument, i.e. change scrape_imdb(limit=2) to scrape_imdb().
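
For example:

data = scrape_imdb()   # no limit: one request per movie, so expect this to take a while
dump_to_csv(data)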

As a bonus, I've also added dumping the results to a .csv file.
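
And since you asked about scraping more than one page at a time: the example above fetches the detail pages one after another, but a thread pool is one straightforward way to fetch several concurrently. Here is a rough sketch that reuses get_movie_urls and get_movie_info from above; the worker count is just an assumption, and keep in mind that IMDb may throttle requests that come in too fast.

from concurrent.futures import ThreadPoolExecutor

def scrape_imdb_concurrently(limit=None, workers=8) -> list:
    # fetch the detail pages in parallel instead of one request at a time
    urls = get_movie_urls()[:limit]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(get_movie_info, urls))

print(scrape_imdb_concurrently(limit=5))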

