Scraping a certain range from a table on a web page

Problem description

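I am trying to scrape data from this website, which has a table of game credits grouped into different categories. There are 24 categories in total, and I want to split them into 24 columns. The example page has 5 (production, design, engineering and thanks).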

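It would be easy if the categories had different classes, but they all share the same h3 class: "clean". Different pages have different categories, and the order changes from page to page. On top of that, the information I need is actually in the next row of the table, under a different class.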

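So my idea was to write 24 if statements, one per category, that check whether any h3 class "clean" contains that category. If it does, I scrape the class I need; otherwise I put nothing. The problem is that all the rows share the same class. So I thought I could use td colspan="5" as a marker to let Python know where each category starts and ends.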

My question is: is there a way to program it so that it starts collecting when it reaches a td colspan="5" and stops at the next one?

import bs4 as bs
import urllib.request


gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"

req = urllib.request.Request(gameurl,headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce,'lxml')
infopage = soup.find_all("div", {"class":"col-md-8 col-lg-8"})
core_list =[]

for credits in infopage:
        niceHeaderTitle = credits.find_all("h1", {"class":"niceHeaderTitle"})
        name = niceHeaderTitle[0].text

        Titles = credits.find_all("h3", {"class":"clean"})

        Titles = [title.get_text() for title in Titles]

        if 'Business' in Titles:

            businessinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            business = businessinfo[0].get_text(strip=True)


        else:
            business = 'none'


        if 'Production' in Titles:

            productioninfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            production = productioninfo[0].get_text(strip=True)


        else:
            production = 'none'

        if 'Design' in Titles:

            designinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            design = designinfo[0].get_text(strip=True)


        else:
            design = 'none'

        if 'Writers' in Titles:

            writersinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            writers = writersinfo[0].get_text(strip=True)


        else:
            writers = 'none'            

        if 'Programming/Engineering' in Titles:

            programinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            program = programinfo[0].get_text(strip=True)


        else:
            program = 'none'

        if 'Video/Cinematics' in Titles:

            videoinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            video = videoinfo[0].get_text(strip=True)


        else:
            video = 'none'   

        if 'Audio' in Titles:

            Audioinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            audio = Audioinfo[0].get_text(strip=True)


        else:
            audio = 'none' 

        if 'Art/Graphics' in Titles:

            artinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            art = artinfo[0].get_text(strip=True)


        else:
            art = 'none'             


        if 'Support' in Titles:

            supportinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            support = supportinfo[0].get_text(strip=True)


        else:
            support = 'none' 

        if 'Thanks' in Titles:

            thanksinfo = credits.find_all("tr", {"class":"devCreditsHighlight"})
            thanks = thanksinfo[0].get_text(strip=True)


        else:
            thanks = 'none'             

        games=[name,business,production,design,writers,video,audio,art,support,program,thanks]

        core_list.append(games)            

print (core_list)

Tags: python, web-scraping

Solution
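
One possible approach, sketched under assumptions and not tested against the live page: walk the table rows in document order and treat every row that contains a td colspan="5" with an h3 class "clean" as the start of a new category. Each tr class "devCreditsHighlight" that follows belongs to that category until the next marker row appears. SAMPLE_HTML and group_rows_by_category are hypothetical names, and the markup is a guess based on the description in the question, so the selectors may need adjusting.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the description in the question;
# the real page may be structured differently.
SAMPLE_HTML = """
<table>
  <tr><td colspan="5"><h3 class="clean">Production</h3></td></tr>
  <tr class="devCreditsHighlight"><td>Game A</td><td>Producer</td></tr>
  <tr class="devCreditsHighlight"><td>Game B</td><td>Executive Producer</td></tr>
  <tr><td colspan="5"><h3 class="clean">Design</h3></td></tr>
  <tr class="devCreditsHighlight"><td>Game C</td><td>Lead Designer</td></tr>
</table>
"""


def group_rows_by_category(soup):
    """Return {category title: [row text, ...]} in document order."""
    groups = {}
    current = None  # title of the category we are currently inside
    for row in soup.find_all("tr"):
        marker = row.find("td", attrs={"colspan": "5"})
        header = marker.find("h3", class_="clean") if marker else None
        if header is not None:
            # A marker row ends the previous category and starts a new one.
            current = header.get_text(strip=True)
            groups[current] = []
        elif current is not None and "devCreditsHighlight" in row.get("class", []):
            # A credit row belongs to the most recent category header.
            groups[current].append(row.get_text(" ", strip=True))
    return groups


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
groups = group_rows_by_category(soup)
print(groups)
```

For the sample markup this prints {'Production': ['Game A Producer', 'Game B Executive Producer'], 'Design': ['Game C Lead Designer']}. Because the grouping follows the marker rows and not a fixed list of names, it does not depend on which categories a page has or on the order they appear in.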


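To get one column per category regardless of page order, the repeated if/else blocks in the question could be replaced by a loop over a fixed list of category names. This is only a sketch: to_row and found are hypothetical names, found stands for a mapping from category title to the strings scraped for it, and CATEGORIES holds just the ten names that appear in the question's code, since the rest of the 24 are not given.

```python
# Category names taken from the question's code; the full list of 24
# is not given there, so extend this list as needed.
CATEGORIES = [
    "Business", "Production", "Design", "Writers",
    "Programming/Engineering", "Video/Cinematics", "Audio",
    "Art/Graphics", "Support", "Thanks",
]


def to_row(name, found, categories=CATEGORIES, missing="none"):
    """Build [name, col1, col2, ...] with one column per category."""
    row = [name]
    for category in categories:
        entries = found.get(category)
        # Join multiple credits into one cell; use a placeholder if absent.
        row.append("; ".join(entries) if entries else missing)
    return row


# Hypothetical scraped data for one developer page.
found = {
    "Design": ["Game C Lead Designer"],
    "Production": ["Game A Producer", "Game B Executive Producer"],
}
row = to_row("Example Developer", found)
print(row)
```

The column order comes from CATEGORIES, so every page produces a row of the same length and a category missing from a page is filled with 'none', as in the original code.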