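python - How to search for text across multiple rows in a pandas DataFrame?
Question
I'm still quite new to Python, and I'm wondering whether I can use it to search for text across multiple rows. Here is a screenshot of my DataFrame: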
https://i.stack.imgur.com/jeqpv.png
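To be clearer, what I want to do is search for phrases or expressions made up of several words, such as "New Jersey". However, each word sits in its own row, so I don't know how to include more than one row in the query. If possible, I would also like to create a new column that marks every match with "M" and every non-match with "N". Any help that makes this easier is appreciated!
Answer
The idea is to join all rows into a single string so that you can search for several consecutive words.
For example, say we want to find the phrase "she wants to" in the whole DataFrame: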
>>> df
subtitle
0 She # <- start here (1)
1 wants #
2 to # <- end here (1)
3 sing
4 she # <- start here (2)
5 wants #
6 to # <- end here (2)
7 act
8 she # <- start here (3)
9 wants #
10 to # <- end here (3)
11 dance
import re
import pandas as pd

search = "she wants to"
text = " ".join(df["subtitle"])

# character offset where each row's word starts / ends within text
# (assumes df has the default 0..n-1 index)
end = df["subtitle"].apply(len).cumsum() + pd.RangeIndex(len(df))
start = end.shift(fill_value=-1) + 1

# create additional columns
df["start"] = start.tolist()
df["end"] = end.tolist()
df["match"] = False

# find every occurrence of the search text and flag the rows it spans
# (assumes each match begins and ends exactly on a row boundary)
for match in re.finditer(search, text, re.IGNORECASE):
    idx1 = df[df["start"] == match.start()].index[0]
    idx2 = df[df["end"] == match.end()].index[0]
    df.loc[idx1:idx2, "match"] = True
>>> df
subtitle start end match
0 She 0 3 True
1 wants 4 9 True
2 to 10 12 True
3 sing 13 17 False
4 she 18 21 True
5 wants 22 27 True
6 to 28 30 True
7 act 31 34 False
8 she 35 38 True
9 wants 39 44 True
10 to 45 47 True
11 dance 48 53 False
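The question also asked for an "M"/"N" column, and the loop above raises an IndexError if a match ends in the middle of a word (for example "she wants to" inside "she wants tomorrow"). Here is a minimal sketch of one way to handle both, assuming every row holds non-empty text. The function name `flag_phrase` and its parameters are made up for illustration and are not part of pandas:

```python
import re

import numpy as np
import pandas as pd


def flag_phrase(df, phrase, column="subtitle"):
    """Return a Series of "M"/"N" marking the rows covered by `phrase`."""
    words = df[column].astype(str)
    text = " ".join(words)

    # Character offset where each row starts and ends within `text`.
    lengths = words.str.len().to_numpy()
    end = np.cumsum(lengths) + np.arange(len(words))
    start = end - lengths

    flags = np.full(len(words), "N", dtype=object)
    # (?<!\S) and (?!\S) restrict matches to whole whitespace-separated words;
    # re.escape makes the phrase match literally.
    pattern = rf"(?<!\S){re.escape(phrase)}(?!\S)"
    for m in re.finditer(pattern, text, re.IGNORECASE):
        # Row containing the first matched character.
        first = np.searchsorted(start, m.start(), side="right") - 1
        # First row whose end offset reaches the end of the match.
        last = np.searchsorted(end, m.end(), side="left")
        flags[first:last + 1] = "M"
    return pd.Series(flags, index=df.index)
```

Usage would look like `df["match"] = flag_phrase(df, "New Jersey")`. Because the rows are located by position rather than by label, this version should not depend on the DataFrame having the default index.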
Update: searching for multiple terms:
Change only:
# search = "she wants to"
search = ["she wants to", "if you", "I will"]
search = fr"({'|'.join(search)})"
# df = pd.DataFrame({'subtitle': ['She', 'wants', 'to', 'sing', 'she', 'wants', 'to', 'act', 'she', 'wants', 'to', 'dance', 'If', 'you', 'sing', 'I', 'will', 'smile', 'if', 'you', 'laugh', 'I', 'will', 'smile', 'if', 'you', 'love', 'I', 'will', 'smile']})
>>> df
subtitle start end match
0 She 0 3 True
1 wants 4 9 True
2 to 10 12 True
3 sing 13 17 False
4 she 18 21 True
5 wants 22 27 True
6 to 28 30 True
7 act 31 34 False
8 she 35 38 True
9 wants 39 44 True
10 to 45 47 True
11 dance 48 53 False
12 If 54 56 True
13 you 57 60 True
14 sing 61 65 False
15 I 66 67 True
16 will 68 72 True
17 smile 73 78 False
18 if 79 81 True
19 you 82 85 True
20 laugh 86 91 False
21 I 92 93 True
22 will 94 98 True
23 smile 99 104 False
24 if 105 107 True
25 you 108 111 True
26 love 112 116 False
27 I 117 118 True
28 will 119 123 True
29 smile 124 129 False
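One caveat with joining the terms with "|": each term is then interpreted as a regular expression, so a term containing characters such as "+", "(" or "." may fail to compile or match something unintended. A small sketch of building the pattern from literal terms instead; the helper name `build_pattern` is hypothetical:

```python
import re


def build_pattern(terms):
    """Compile a case-insensitive alternation that matches the terms literally."""
    # Longest terms first, so a longer term wins over a shorter one that
    # shares its prefix (e.g. "I will not" before "I will").
    ordered = sorted(set(terms), key=lambda t: (-len(t), t))
    return re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
```

The compiled pattern can then be used as `pattern.finditer(text)` in place of `re.finditer(search, text, re.IGNORECASE)`.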
Update 2: keeping the terms in a text file:
$ cat terms.txt
she wants to
if you
I will
with open("terms.txt") as f:
    # skip blank lines: an empty term would match at every position
    search = [term.strip() for term in f if term.strip()]
search = fr"({'|'.join(search)})"