Web scraping with Beautiful Soup: scraping multiple elements that have no class

Problem description

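I want to scrape the directors from the markup below. Looking at the page, I know this movie has two directors, Danny Boyle and Loveleen Tandan. I can't get just those two with find_all('a'), though, because it also picks up the names of actors such as Dev Patel and Freida Pinto.

I can't use find_all('a')[1] and find_all('a')[2] either, because other movies may have only one director. The only thing that separates the directors from the actors is the span tag with class ghost. Given that there may be one, two, or three directors, how should I collect this data?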

<p class="">
             Directors:
             <a href="/name/nm0000965/">
              Danny Boyle
             </a>
             ,
             <a href="/name/nm0849164/">
              Loveleen Tandan
             </a>
             <span class="ghost">
              |
             </span>
             Stars:
             <a href="/name/nm2353862/">
              Dev Patel
             </a>
             ,
             <a href="/name/nm2951768/">
              Freida Pinto
             </a>
             ,
             <a href="/name/nm0795661/">
              Saurabh Shukla
             </a>
             ,
             <a href="/name/nm0438463/">
              Anil Kapoor
             </a>
            </p>

The URL of the page is: https://www.imdb.com/search/title/?count=100&groups=oscar_best_picture_winners&sort=year%2Cdesc&ref_=nv_ch_osc

Tags: python, web-scraping, beautifulsoup

Solution

This should help:

from bs4 import BeautifulSoup

html = """
<p class="">
             Directors:
             <a href="/name/nm0000965/">
              Danny Boyle
             </a>
             ,
             <a href="/name/nm0849164/">
              Loveleen Tandan
             </a>
             <span class="ghost">
              |
             </span>
             Stars:
             <a href="/name/nm2353862/">
              Dev Patel
             </a>
             ,
             <a href="/name/nm2951768/">
              Freida Pinto
             </a>
             ,
             <a href="/name/nm0795661/">
              Saurabh Shukla
             </a>
             ,
             <a href="/name/nm0438463/">
              Anil Kapoor
             </a>
            </p>
""" # The HTML snippet from the question

soup = BeautifulSoup(html, 'html5lib')

p_tag = soup.find('p')

# The <span class="ghost"> separator sits between the directors and the stars
span = p_tag.find('span', class_='ghost')

# Collect every node (tags and text) that precedes the span inside the <p>;
# previous_siblings walks backward from the span
prev = list(span.previous_siblings)

prev = [str(x) for x in prev]

prev = ''.join(prev)  # Join the serialized nodes into one HTML string

soup2 = BeautifulSoup(prev, 'html5lib')  # Re-parse that string as a new BeautifulSoup object

a_tags = soup2.find_all('a')

for a in a_tags:
    txt = a.text.strip()
    print(txt)

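Output (previous_siblings iterates backward from the span, so the directors appear in reverse page order):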

Loveleen Tandan
Danny Boyle
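
If the page order of the directors matters, a variation on the same idea may be simpler. The sketch below is not part of the original answer: it uses find_previous_siblings('a') to collect only the links before the ghost span, then reverses them, which avoids the second parse. The helper name get_directors, the shortened HTML strings, the placeholder single-director entry, and the choice of the built-in 'html.parser' are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def get_directors(p_tag):
    """Return the director names in page order (hypothetical helper)."""
    span = p_tag.find('span', class_='ghost')
    if span is None:
        # Assumption: without the separator we cannot tell directors from
        # stars, so return an empty list instead of guessing.
        return []
    # find_previous_siblings returns the nearest sibling first,
    # so reverse the result to restore page order.
    links = span.find_previous_siblings('a')
    return [a.get_text(strip=True) for a in reversed(links)]


# Shortened version of the markup from the question
two_directors = BeautifulSoup(
    '<p class="">Directors: '
    '<a href="/name/nm0000965/">Danny Boyle</a>, '
    '<a href="/name/nm0849164/">Loveleen Tandan</a> '
    '<span class="ghost">|</span> Stars: '
    '<a href="/name/nm2353862/">Dev Patel</a>, '
    '<a href="/name/nm2951768/">Freida Pinto</a></p>',
    'html.parser',
)

# Placeholder entry with a single director (made-up name and href)
one_director = BeautifulSoup(
    '<p class="">Director: '
    '<a href="/name/placeholder/">Example Director</a> '
    '<span class="ghost">|</span> Stars: '
    '<a href="/name/nm2353862/">Dev Patel</a></p>',
    'html.parser',
)

print(get_directors(two_directors.find('p')))  # ['Danny Boyle', 'Loveleen Tandan']
print(get_directors(one_director.find('p')))   # ['Example Director']
```

Because the helper only looks at what comes before the span, it returns one, two, or three names without any index arithmetic.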

Hope this helps!

