python - Scraping a specific range of rows from a table on a web page
Question
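I am trying to scrape data from this website, which has a table of game credits split into different categories. There are 24 categories in total, and I want to put them into 24 columns. The example page has 5 of them (including Production, Design, Engineering and Thanks).
It would be easy if each category had its own class, but they all share the same h3 class: "clean". Different pages have different categories, and the order changes from page to page. On top of that, the information I need is actually in the next row of the table, under a different class.
So my idea was to write 24 if statements, one per category, that check whether any h3 class:"clean" contains that category. If it does, I scrape the class I need; otherwise I put nothing in that column. The problem is that all of the rows I need share the same class. So I thought I could use td colspan="5" as a marker that tells Python where each category starts and ends.
My question is: is there a way to program it so that it starts scraping when it reaches a td colspan="5" and stops at the next one?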
import bs4 as bs
import urllib.request

gameurl = "https://www.mobygames.com/developer/sheet/view/developerId,1"
req = urllib.request.Request(gameurl, headers={'User-Agent': 'Mozilla/5.0'})
sauce = urllib.request.urlopen(req).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
infopage = soup.find_all("div", {"class": "col-md-8 col-lg-8"})
core_list = []
for credits in infopage:
    niceHeaderTitle = credits.find_all("h1", {"class": "niceHeaderTitle"})
    name = niceHeaderTitle[0].text
    # Collect the text of every category heading on the page.
    Titles = credits.find_all("h3", {"class": "clean"})
    Titles = [title.get_text() for title in Titles]
    # Every block below reads the first devCreditsHighlight row on the page,
    # regardless of category. This is the problem described above.
    if 'Business' in Titles:
        businessinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        business = businessinfo[0].get_text(strip=True)
    else:
        business = 'none'
    if 'Production' in Titles:
        productioninfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        production = productioninfo[0].get_text(strip=True)
    else:
        production = 'none'
    if 'Design' in Titles:
        designinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        design = designinfo[0].get_text(strip=True)
    else:
        design = 'none'
    if 'Writers' in Titles:
        writersinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        writers = writersinfo[0].get_text(strip=True)
    else:
        writers = 'none'
    if 'Programming/Engineering' in Titles:
        programinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        program = programinfo[0].get_text(strip=True)
    else:
        program = 'none'
    if 'Video/Cinematics' in Titles:
        videoinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        video = videoinfo[0].get_text(strip=True)
    else:
        video = 'none'
    if 'Audio' in Titles:
        Audioinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        audio = Audioinfo[0].get_text(strip=True)
    else:
        audio = 'none'
    if 'Art/Graphics' in Titles:
        artinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        art = artinfo[0].get_text(strip=True)
    else:
        art = 'none'
    if 'Support' in Titles:
        supportinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        support = supportinfo[0].get_text(strip=True)
    else:
        support = 'none'
    if 'Thanks' in Titles:
        thanksinfo = credits.find_all("tr", {"class": "devCreditsHighlight"})
        thanks = thanksinfo[0].get_text(strip=True)
    else:
        thanks = 'none'
    games = [name, business, production, design, writers, video, audio, art, support, program, thanks]
    core_list.append(games)
print(core_list)
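The repeated if/else blocks above could be collapsed into a single lookup over a fixed list of category names. The following is a minimal sketch under stated assumptions: the `grouped` argument is hypothetical input (a dict mapping each category name to the credit strings found under it) and is not something the code above produces. `CATEGORIES` and `build_row` are illustrative names, and the list holds only the ten categories named in the question rather than all 24.

```python
# Illustrative subset of category names taken from the question;
# a real run would list all 24 in the desired column order.
CATEGORIES = [
    "Business", "Production", "Design", "Writers",
    "Programming/Engineering", "Video/Cinematics", "Audio",
    "Art/Graphics", "Support", "Thanks",
]


def build_row(name, grouped, categories=CATEGORIES, missing="none"):
    """Return [name, col1, col2, ...] with one column per category.

    `grouped` is assumed to map a category name to a list of strings.
    Categories absent from `grouped` (or mapped to an empty list)
    get the `missing` placeholder, so every row has the same width.
    """
    row = [name]
    for category in categories:
        found = grouped.get(category)
        row.append("; ".join(found) if found else missing)
    return row


# Hypothetical input: a developer with credits in two categories only.
example = build_row(
    "Jane Doe",
    {"Design": ["Game C Lead Designer"], "Production": ["Game A Producer"]},
)
```

Because the column order comes from `CATEGORIES` and not from the page, the order in which categories appear on a given page no longer matters.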
Solution
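One possible approach, offered as a sketch and not as a tested answer against the live site: walk the table rows in document order, treat every row that contains a td colspan="5" with an h3 class "clean" as the start of a new category, and collect the devCreditsHighlight rows that follow under that category until the next marker row. The HTML fragment below is a made-up stand-in for the real page, whose exact markup is assumed from the description in the question. `SAMPLE_HTML` and `group_rows_by_category` are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the structure described in the question:
# a header row with td colspan="5" > h3.clean, then the credit rows.
SAMPLE_HTML = """
<table>
  <tr><td colspan="5"><h3 class="clean">Production</h3></td></tr>
  <tr class="devCreditsHighlight"><td>Game A</td><td>Producer</td></tr>
  <tr><td>Game B</td><td>Executive Producer</td></tr>
  <tr><td colspan="5"><h3 class="clean">Design</h3></td></tr>
  <tr class="devCreditsHighlight"><td>Game C</td><td>Lead Designer</td></tr>
</table>
"""


def group_rows_by_category(html):
    """Map each category heading to the highlighted rows that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    current = None  # category currently being collected, if any
    for row in soup.find_all("tr"):
        marker = row.find("td", attrs={"colspan": "5"})
        header = marker.find("h3", class_="clean") if marker is not None else None
        if header is not None:
            # A marker row ends the previous category and starts a new one.
            current = header.get_text(strip=True)
            grouped[current] = []
        elif current is not None and "devCreditsHighlight" in row.get("class", []):
            # Only highlighted rows are kept, as in the question's code.
            grouped[current].append(row.get_text(" ", strip=True))
    return grouped


result = group_rows_by_category(SAMPLE_HTML)
print(result)
```

Since the category is tracked while iterating, no per-category if statement is needed, and pages that list the categories in a different order or omit some of them are handled the same way. To run this against the real page, the `sauce` bytes from the question could be passed in place of `SAMPLE_HTML`, assuming the markup matches the structure shown here.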