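python - How do I filter a list of strings by a keyword?
Problem description
I have a list of strings, which are links I scraped with BeautifulSoup. I can't figure out how to return only the strings that contain the word "The". The solution probably uses a regular expression, but it isn't working for me.
I tried: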
for i in links_list:
    if re.match('^The', i) is not None:
        eps_only.append(i)
But I get an error like this:
File "/opt/homebrew/Caskroom/miniconda/base/lib/python3.8/re.py", line 191, in match
return _compile(pattern, flags).match(string)
TypeError: expected string or bytes-like object
The list looks like this:
['index.html', 'seinfeld-scripts.html', 'episodes_oveview.html', 'seinfeld-characters.html', 'buy-seinfeld.html', 'http://addthis.com/bookmark.php?v=250&username=doctoroids', None, None, None, None, 'http://community.seinfeldscripts.com', 'buy-seinfeld.html', 'seinfeld-t-shirt.html', 'seinfeld-dvd.html', 'episodes_oveview.html', 'alpha.html', ' http://www.shareasale.com/r.cfm?u=439896&b=119192&m=16934&afftrack=seinfeldScriptsTop&urllink=search%2E80stees%2Ecom%2F%3Fcategory%3D80s%2BTV%26i%3D1%26theme%3DSeinfeld%26u1%3Dcategory%26u2%3Dtheme', ' TheSeinfeldChronicles.htm', ' TheStakeout.htm', ' TheRobbery.htm', ' MaleUnbonding.htm', ' TheStockTip.htm', ' TheExGirlfriend.htm', ' ThePonyRemark.htm', ' TheJacket.htm', ' ThePhoneMessage.htm', ' TheApartment.htm', ' TheStatue.htm', ' TheRevenge.htm', ' TheHeartAttack.htm', ' TheDeal.htm', ' TheBabyShower.htm', ' TheChineseRestaurant.htm', ' TheBusboy.htm', 'TheNote.html', ' TheTruth.htm', 'ThePen.html', ' TheDog.htm', ' TheLibrary.htm', ' TheParkingGarage.htm', 'TheCafe.html', ' TheTape.htm', 'TheNoseJob.html', 'TheStranded.html', ...]
Update: full code
import requests
import re
from bs4 import BeautifulSoup
##################
##--user agent--##
##################
user_agent_desktop = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '\
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 '\
'Safari/537.36'
headers = {'User-Agent': user_agent_desktop}
#########################
##--fetching the page--##
#########################
URL = 'https://www.seinfeldscripts.com/seinfeld-scripts.html'
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
############################################################
##--scraping the links to the scripts from the main page--##
############################################################
links_list = []
eps_only = []
for link in soup.find_all('a'):
    links_list.append(link.get('href'))
### sorting for links that contain 'the' ###
for i in filter(None, links_list):
    if re.match('^The', str(i)) is not None:
        eps_only.append(i)
print(eps_only)
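Solution
Python's re.match fails if it is passed None as its argument, which is the error you are getting.
Some of your list elements are None, so check for None before passing them to re.match.
For example: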
for i in links_list:
    if i is not None and re.match('^The', i) is not None:
        eps_only.append(i)
Alternatively, you can filter them out beforehand, like this:
links_list = [l for l in links_list if l is not None]
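One more thing worth checking: many entries in the list from the question start with a space (for example ' TheStakeout.htm'), so a pattern anchored with '^The' will not match them even after the None entries are gone. Below is a minimal sketch that skips None and strips whitespace before matching. The helper name episode_links is hypothetical, and the sample list is a small subset of the one posted in the question.

```python
import re

# A small sample taken from the list in the question. Note the None entries
# and the leading spaces on some of the file names.
links_list = ['index.html', None, ' TheStakeout.htm', 'TheNote.html',
              ' MaleUnbonding.htm', 'ThePen.html']


def episode_links(links):
    """Hypothetical helper: return stripped links that start with 'The'."""
    result = []
    for link in links:
        if link is None:
            # re.match raises TypeError on None, so skip these entries.
            continue
        cleaned = link.strip()  # drop the leading space seen in the scraped hrefs
        if re.match(r'The', cleaned):  # re.match already anchors at the start
            result.append(cleaned)
    return result


eps_only = episode_links(links_list)
print(eps_only)  # ['TheStakeout.htm', 'TheNote.html', 'ThePen.html']
```

If you want links that contain "The" anywhere rather than only at the start, re.search('The', cleaned) or a plain 'The' in cleaned test should work in place of re.match.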