Referencing only the DataFrame rows where a condition is True (Python, pandas)

Problem description

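This is similar to this question, but slightly different (that answer did not work). I am trying to reference only the rows of a DataFrame where a condition is true. In my case, the condition is whether a string contains a word from a word bank. If the word is in the string, I want to be able to use that particular row later (for example, if true, pull out the link and continue searching). So I have: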

import feedparser
import pandas as pd

wordBank = ['bomb', 'explosion', 'protest',
            'port delay', 'port closure', 'hijack',
            'tropical storm', 'tropical depression']

rss = pd.read_csv('RSSfeed2019.csv')
# print(rss.head())

feeds = []  # list of feed objects
for url in rss['URL'].head(5):
    feeds.append(feedparser.parse(url))
    # print(feeds)

posts = []  # list of posts [(title1, link1, summary1), (title2, link2, summary2) ... ]
for feed in feeds:
    for post in feed.entries:
        if hasattr(post, 'summary'):
            posts.append((post.title, post.link, post.summary))
        else:
            posts.append((post.title, post.link))



df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])

if (df['summary'].str.find(wordBank)) or (df['title'].str.find(wordBank)):
    print(df['title'])

I also tried this, based on another question...

df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])

for word in wordBank:
    mask = (df['summary'].str.find(word)) or (df['title'].str.find(word))
    df.loc[mask, 'summary'] = word
    df.loc[mask, 'title'] = word

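How can I get it to print the title of only those rows whose summary or title contains one of the words? I want to be able to further manipulate just those rows. With the current code, it prints every title in the DataFrame, I think because once one row is true it decides to print all the titles. How can I reference only the rows where the condition is true?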

Tags: python, pandas, dataframe

Solution


Given the following setup:

posts = [["Global protest Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "The world’s total cellular containership fleet has passed 23 million TEU for the first time, according to shipping experts Alphaliner."],
         ["Global TEU Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "The world’s total cellular containership fleet has passed 23 million TEU for the first time, according to shipping experts Alphaliner."],
         ["Global TEU Breaks Record", 'porttechnology.org/news/global-teu-breaks-record/', "There is a tropical depression"]]

df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])
print(df)

Setup output

                          title  ...                                            summary
0  Global protest Breaks Record  ...  The world’s total cellular containership fleet...
1      Global TEU Breaks Record  ...  The world’s total cellular containership fleet...
2      Global TEU Breaks Record  ...                     There is a tropical depression

You could:

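# build one regex; the (?:...) group makes \b apply to every word-bank entry,
# not only to the first and last alternatives
pattern = rf"\b(?:{'|'.join(wordBank)})\b"

# create mask: True where the summary or the title contains a match
mask = df['summary'].str.contains(pattern, case=False) | df['title'].str.contains(pattern, case=False)

# extract titles
titles = df['title'].values

# print them
for title in titles[mask]:
    print(title)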

Output

Global protest Breaks Record
Global TEU Breaks Record

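Note that the first row has protest in its title and the last row has tropical depression in its summary. The key idea is to use a regular expression that matches any of the entries in wordBank. For more detail, see the documentation on regular expressions and on str.contains.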
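The question also asks how to keep working with only the matching rows, so here is a minimal sketch along the same lines. It assumes a `df` and `wordBank` shaped like the ones above. The use of `re.escape` and `na=False`, the row with a missing summary, and the `matched` column name are additions of this sketch, not part of the original answer. The two masks are combined with `|`, which works element by element. Python's `or` asks for the truth value of a whole Series and raises a ValueError.

```python
import re

import pandas as pd

wordBank = ['bomb', 'explosion', 'protest',
            'port delay', 'port closure', 'hijack',
            'tropical storm', 'tropical depression']

# Sample data; the second post has no summary (assumed case for illustration).
posts = [
    ["Global protest Breaks Record", "porttechnology.org/news/global-teu-breaks-record/",
     "The world's total cellular containership fleet has passed 23 million TEU."],
    ["Global TEU Breaks Record", "porttechnology.org/news/global-teu-breaks-record/", None],
    ["Global TEU Breaks Record", "porttechnology.org/news/global-teu-breaks-record/",
     "There is a tropical depression"],
]
df = pd.DataFrame(posts, columns=['title', 'link', 'summary'])

# Escape each entry so it is matched literally, and group the alternatives
# so that \b applies to every entry.
pattern = rf"\b(?:{'|'.join(re.escape(w) for w in wordBank)})\b"

# na=False treats a missing summary as "no match".
in_summary = df['summary'].str.contains(pattern, case=False, na=False)
in_title = df['title'].str.contains(pattern, case=False, na=False)

# Element-wise OR of the two boolean Series.
mask = in_summary | in_title

# Keep only the matching rows as their own DataFrame for later use.
matches = df[mask].copy()

# Record the first word-bank entry found in the title or the summary.
text = matches['title'] + ' ' + matches['summary'].fillna('')
matches['matched'] = (
    text.str.extract(f"({pattern})", flags=re.IGNORECASE, expand=False).str.lower()
)

print(matches[['title', 'matched']])

# The links of the matching posts, ready for further processing.
links = matches['link'].tolist()
```

Because `matches` is an ordinary DataFrame, anything that works on `df` (for example `matches['link']` or `matches.iterrows()`) now operates on the matching posts only.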

