Remove rows from a DataFrame that do not contain words from a list

Problem description

I have a DataFrame:

29                             tech is a fucking bloodbath.
219       only 3 things guaranteed in life ATH taxes a...
255       market is at ath in zombie economy\n\nmarket c...
276       my aapl watch reminding me to breathe while i...

I have a list:

names = ['ATH', 'CRSR', 'GME', 'AMC', 'TSLA', 'MVIS', 'SPCE', 'CLNE', 'AAPL', 'WKHS']

My code looks like this:

for ticker in top_tickers:
    df_ticker_lower = item[item.text.str.contains(ticker.lower())]
    df_ticker_upper = item[item.text.str.contains(ticker.upper())]
    df_ticker = pd.concat([df_ticker_lower, df_ticker_upper], axis=0)
    df_ticker['dt'] = pd.to_datetime(df_ticker.dt)


def dedup(sentence, to_dedup):
    for word in to_dedup:
        while sentence.split().count(word) > 3:
            sentence = ''.join(sentence.rsplit(word, 1)).replace('  ', ' ')
    return sentence

def foo(row):
    global names
    sentence = row['text']
    return dedup(sentence, names)
df_ticker['text'] = df_ticker.apply(foo, axis=1)

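What I want is to keep only the rows that contain any word from the list. The important part is that if a word from the list has anything attached to it, that is, it is only part of a longer word, the row should be removed. In this case row 29 needs to be removed, because "bloodbath" is a word that merely contains "ath". If "ath" stood on its own as a separate word I would keep the row, but here I want it removed. Thanks for your help.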

Tags: python, pandas, dataframe

Solution


Use Series.str.contains with the word boundaries \b...\b around each name, so that words such as bloodbath are not matched:

pat = '|'.join(r"\b{}\b".format(x) for x in names)
# na=False treats missing text as "no match", so such rows are dropped
df = df[df['text'].str.contains(pat, case=False, na=False)]
print(df)
                                                text
1         only 3 things guaranteed in life ATH taxes
2                 market is at ath in zombie economy
4   my aapl watch reminding me to breathe while i...
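
To see why the word boundary matters, here is a minimal self-contained sketch that compares a plain substring search with the whole-word pattern. The sample rows are reconstructed from the question for illustration (the truncated texts are shortened, not restored), and `re.escape` plus the non-capturing group `(?:...)` are additions that the answer above does not use; they are only a precaution in case a name ever contains regex metacharacters.

```python
import re

import pandas as pd

names = ['ATH', 'CRSR', 'GME', 'AMC', 'TSLA', 'MVIS', 'SPCE', 'CLNE', 'AAPL', 'WKHS']

# Sample data assumed for illustration; the index mimics the question's row labels.
df = pd.DataFrame(
    {'text': ['tech is a bloodbath.',
              'only 3 things guaranteed in life ATH taxes',
              'market is at ath in zombie economy',
              'my aapl watch reminding me to breathe']},
    index=[29, 219, 255, 276],
)

# Plain substring search: 'ath' is also found inside 'bloodbath' and 'breathe'.
substring_hit = df['text'].str.contains('ath', case=False, regex=False)

# Whole-word search: \b requires a word boundary on both sides of the name.
pat = r'\b(?:' + '|'.join(re.escape(x) for x in names) + r')\b'
word_hit = df['text'].str.contains(pat, case=False, na=False)

kept = df[word_hit]
print(substring_hit.tolist())  # [True, True, True, True]
print(kept.index.tolist())     # [219, 255, 276]
```

Row 29 is dropped because the only occurrence of "ath" sits inside "bloodbath", where it is preceded by another word character and therefore has no boundary in front of it.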

If you need to extract the first matched value:

import re

names = ['ATH', 'CRSR', 'GME', 'AMC', 'TSLA', 'MVIS', 'SPCE', 'CLNE', 'AAPL', 'WKHS']

pat = '|'.join(r"\b{}\b".format(x) for x in names)
df['new'] = df['text'].str.extract(f'({pat})', flags=re.I)
print (df)
                                                text   new
0                       tech is a fucking bloodbath.   NaN
1         only 3 things guaranteed in life ATH taxes   ATH
2                 market is at ath in zombie economy   ath
3                                          market c.   NaN
4   my aapl watch reminding me to breathe while i...  aapl

Or all matched values as a list:

import re

names = ['ATH', 'CRSR', 'GME', 'AMC', 'TSLA', 'MVIS', 'SPCE', 'CLNE', 'AAPL', 'WKHS']

pat = '|'.join(r"\b{}\b".format(x) for x in names)
df['new'] = df['text'].str.findall(pat, flags=re.I)
print (df)
                                                text     new
0                       tech is a fucking bloodbath.      []
1         only 3 things guaranteed in life ATH taxes   [ATH]
2                 market is at ath in zombie economy   [ath]
3                                          market c.      []
4   my aapl watch reminding me to breathe while i...  [aapl]
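
The match lists can also drive the row filter asked for in the question: a row whose list is empty contains none of the names as a whole word. The following is only a sketch under assumptions: the sample rows, the shortened `names` list, and the `matches` column name are made up for illustration, and `re.escape` is an added precaution.

```python
import re

import pandas as pd

names = ['ATH', 'AAPL', 'GME']  # shortened list, for illustration only

# Hypothetical sample rows, including one with missing text.
df = pd.DataFrame({'text': ['tech is a bloodbath.',
                            'ATH taxes and ath again',
                            'my aapl watch',
                            None]})

pat = r'\b(?:' + '|'.join(re.escape(x) for x in names) + r')\b'

# findall returns one list of matches per row; missing text stays missing.
df['matches'] = df['text'].str.findall(pat, flags=re.I)

# Keep rows with at least one whole-word match. str.len() yields NaN for
# missing rows, and NaN > 0 is False, so those rows are dropped as well.
kept = df[df['matches'].str.len() > 0]
print(kept)
```

This does the filtering and the extraction in one pass, which may be convenient when both the kept rows and the matched names are needed afterwards.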
