首页 > 解决方案 > 如何在熊猫数据框中跨多行搜索文本?

问题描述

所以我对 Python 还是很陌生,我只是想知道我是否可以使用它来跨多行搜索文本。这是我的数据框的屏幕截图:

https://i.stack.imgur.com/jeqpv.png

为了更清楚,我想做的是搜索包含多个单词的短语或表达,例如“New Jersey”,但是,每个单词组成一个单独的行,所以我不知道如何去包含更多多于查询中的一行。如果可能的话,我还想创建一个新列,将任何匹配标记为“M”和没有“N”的匹配。感谢所有帮助,让我更轻松!

标签: pythonpandasrows

解决方案


这个想法是连接所有行以便能够搜索多个连续的单词。

例如,我们想在整个数据框中找到短语“她想要”:

>>> df
   subtitle
0       She  # <- start here (1)
1     wants  #
2        to  # <- end here (1)
3      sing
4       she  # <- start here (2)
5     wants  #
6        to  # <- end here (2)
7       act
8       she  # <- start here (3)
9     wants  # 
10       to  # <- end here (3)
11    dance
import re

search = "she wants to"
text = " ".join(df["subtitle"])

# index of start / end position of the word in text
end = df["subtitle"].apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1

# create additional columns
df["start"] = start.tolist()
df["end"] = end.tolist()
df["match"] = False

# find all iteration of the search text
for match in re.finditer(search, text, re.IGNORECASE):
    idx1 = df[df["start"] == match.start()].index[0]
    idx2 = df[df["end"] == match.end()].index[0]
    df.loc[idx1:idx2, "match"] = True
>>> df
   subtitle  start  end  match
0       She      0    3   True
1     wants      4    9   True
2        to     10   12   True
3      sing     13   17  False
4       she     18   21   True
5     wants     22   27   True
6        to     28   30   True
7       act     31   34  False
8       she     35   38   True
9     wants     39   44   True
10       to     45   47   True
11    dance     48   53  False

更新:搜索多个词:

仅更改:

# search = "she wants to"
search = ["she wants to", "if you", "I will"]
search = fr"({'|'.join(search)})"
# df = pd.DataFrame({'subtitle': ['She', 'wants', 'to', 'sing', 'she', 'wants', 'to', 'act', 'she', 'wants', 'to', 'dance', 'If', 'you', 'sing', 'I', 'will', 'smile', 'if', 'you', 'laugh', 'I', 'will', 'smile', 'if', 'you', 'love', 'I', 'will', 'smile']})
>>> df
   subtitle  start  end  match
0       She      0    3   True
1     wants      4    9   True
2        to     10   12   True
3      sing     13   17  False
4       she     18   21   True
5     wants     22   27   True
6        to     28   30   True
7       act     31   34  False
8       she     35   38   True
9     wants     39   44   True
10       to     45   47   True
11    dance     48   53  False
12       If     54   56   True
13      you     57   60   True
14     sing     61   65  False
15        I     66   67   True
16     will     68   72   True
17    smile     73   78  False
18       if     79   81   True
19      you     82   85   True
20    laugh     86   91  False
21        I     92   93   True
22     will     94   98   True
23    smile     99  104  False
24       if    105  107   True
25      you    108  111   True
26     love    112  116  False
27        I    117  118   True
28     will    119  123   True
29    smile    124  129  False

更新 2:将条款写入文本文件:

$ cat terms.txt
she wants to
if you
I will
search = [term.strip() for term in open("terms.txt").readlines()]
search = fr"({'|'.join(search)})"

推荐阅读