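python - Referencing only the rows of a Pandas DataFrame where a condition is True
Problem description
Similar to this question, but somewhat different (that answer did not work). I am trying to reference only the rows of a DataFrame where a condition is true. In my case, the condition is whether a string contains a word from a word bank. If a word is in the string, I want to be able to use that particular row later (e.g. if true, pull out the link and continue searching). So I have: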
import feedparser
import pandas as pd

wordBank = ['bomb', 'explosion', 'protest',
            'port delay', 'port closure', 'hijack',
            'tropical storm', 'tropical depression']
rss = pd.read_csv('RSSfeed2019.csv')
# print(rss.head())
feeds = []  # list of feed objects
for url in rss['URL'].head(5):
    feeds.append(feedparser.parse(url))
# print(feeds)
posts = []  # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        if hasattr(post, 'summary'):
            posts.append((post.title, post.link, post.summary))
        else:
            posts.append((post.title, post.link))
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
if (df['summary'].str.find(wordBank)) or (df['title'].str.find(wordBank)):
    print(df['title'])
I also tried this, from the other question...
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
for word in wordBank:
    mask = (df['summary'].str.find(word)) or (df['title'].str.find(word))
    df.loc[mask, 'summary'] = word
    df.loc[mask, 'title'] = word
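How can I get it to print the titles of the rows whose summary or title contains a word from the word bank? I want to be able to work further with only those rows. With the current code, it prints every title in the DataFrame, I think because once one of them is true it decides to print all of them. How can I reference only the rows where the condition is true?
Solution
Given the following setup: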
posts = [["Global protest Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "The world’s total cellular containership fleet has passed 23 million TEU for the first time, according to shipping experts Alphaliner."],
["Global TEU Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "The world’s total cellular containership fleet has passed 23 million TEU for the first time, according to shipping experts Alphaliner."],
["Global TEU Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "There is a tropical depression"]]
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
print(df)
Setup (printed DataFrame)
title ... summary
0 Global protest Breaks Record ... The world’s total cellular containership fleet...
1 Global TEU Breaks Record ... The world’s total cellular containership fleet...
2 Global TEU Breaks Record ... There is a tropical depression
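You could:
# build the pattern once; the (?:...) group makes \b apply to every word, not only the first and last
pattern = rf"\b(?:{'|'.join(wordBank)})\b"
# create mask (na=False treats a missing summary or title as "no match")
mask = df['summary'].str.contains(pattern, case=False, na=False) | df['title'].str.contains(pattern, case=False, na=False)
# extract titles
titles = df['title'].values
# print them
for title in titles[mask]:
    print(title)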
Output
Global protest Breaks Record
Global TEU Breaks Record
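Note that the first row has protest in the title, and the last row has tropical depression in the summary. The key idea is to use a regular expression to match the words in wordBank, and to combine the two element-wise results with | rather than or. For more, see the documentation on regular expressions and on str.contains.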
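To keep working with only the matching rows (for example, to pull out their links), the same kind of boolean mask can be passed to df.loc. The following is a minimal sketch and not part of the original answer: the names text, matched_word and hits are illustrative, the summaries are shortened, and using re.escape and str.extract to record which word matched is an added assumption about what might be useful.

```python
import re
import pandas as pd

wordBank = ['bomb', 'explosion', 'protest',
            'port delay', 'port closure', 'hijack',
            'tropical storm', 'tropical depression']

# Shortened version of the setup data used above
posts = [
    ["Global protest Breaks Record", "porttechnology.org/news/global-teu-breaks-record/",
     "The total containership fleet has passed 23 million TEU."],
    ["Global TEU Breaks Record", "porttechnology.org/news/global-teu-breaks-record/",
     "The total containership fleet has passed 23 million TEU."],
    ["Global TEU Breaks Record", "porttechnology.org/news/global-teu-breaks-record/",
     "There is a tropical depression"],
]
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])

# One capturing group around all alternatives, so \b applies to each word
pattern = r"\b(" + "|".join(map(re.escape, wordBank)) + r")\b"

# Search title and summary together; fillna('') guards against missing values
text = df['title'].fillna('') + ' ' + df['summary'].fillna('')

# First matching word per row, or NaN when nothing matches
matched_word = text.str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Boolean mask: True only for rows that contain a word from wordBank
mask = matched_word.notna()

# Only the matching rows, with the columns needed for further processing
hits = df.loc[mask, ['title', 'link']].assign(word=matched_word[mask].str.lower())
print(hits)
```

Here hits is an ordinary DataFrame holding only the matching rows, so later steps (such as fetching each link) can iterate over hits['link'] without touching the rows that did not match.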