Cleaning a scraped HTML list

Problem Description

I'm trying to extract names from a wiki page. Using BeautifulSoup, I get a very dirty list (it includes many extraneous items) that I want to clean up, but my attempt to "clean" the list leaves it unchanged.

#1).
#Retrieve the page
import requests
from bs4 import BeautifulSoup
weapons_url = 'https://escapefromtarkov.gamepedia.com/Weapons'
weapons_page = requests.get(weapons_url)
weapons_soup = BeautifulSoup(weapons_page.content, 'html.parser')

#2).    
#Extract the data I need, plus a lot of unhelpful data
flithy_scraped_weapon_names = weapons_soup.find_all('td', href="", title="")

#3a).
#Identify keywords that recur in unhelpful/extraneous list items
dirt = ["mm", "predecessor", "File", "image"]
#3b). - Fails
#Remove extraneous data containing above-defined keywords
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

#4).
#Check data
print(weapon_names_sanitised)
#Returns a list identical to flithy_scraped_weapon_names

Tags: python, web-scraping, beautifulsoup

Solution


The problem is in this section:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in s for xs in dirt)]

It should be:

weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
    if not any(xs in str(s) for xs in dirt)]

The reason is that flithy_scraped_weapon_names contains Tag objects, not strings. A Tag is converted to a string when printed, but its "in" operator tests membership among the tag's children rather than substring membership, so xs in s is always False and nothing gets filtered out. The explicit conversion xs in str(s) turns it into a substring test, which works as expected.
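The difference can be reproduced without scraping anything. Below is a minimal, self-contained sketch: FakeTag is a stand-in for bs4's Tag (a hypothetical class, not the real bs4 API) whose "in" operator checks child membership the way Tag does, so the substring filter only works after an explicit str() conversion.

```python
# FakeTag mimics the relevant behaviour of bs4.element.Tag:
# "x in tag" checks tag.contents for membership, NOT substrings.
class FakeTag:
    def __init__(self, text):
        self.text = text
        self.contents = [text]  # children, roughly as bs4 stores them

    def __contains__(self, item):
        # Exact membership among children -- a substring like "mm"
        # never equals a whole child string, so this is False here.
        return item in self.contents

    def __str__(self):
        return f"<td>{self.text}</td>"

tags = [FakeTag("AK-74N"), FakeTag("9x19mm ammo"), FakeTag("File:icon.png")]
dirt = ["mm", "File"]

# Filtering against the Tag object itself: any(...) is always False,
# so the "cleaned" list is identical to the input.
unchanged = [t for t in tags if not any(xs in t for xs in dirt)]

# Converting to a string first makes "in" a substring test,
# and the dirty entries are removed.
cleaned = [t for t in tags if not any(xs in str(t) for xs in dirt)]
```

With real BeautifulSoup tags, filtering on str(s) matches against the full markup (tag name and attributes included); filtering on s.get_text() instead would restrict the test to the visible text, which may be closer to what you want.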

