How to exclude or remove specific parts in Python

Problem description

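I want to analyze the chat log below to find the most frequently used words. The only part I need is therefore the text that comes after the [time] stamp, e.g. [01:25]. How should I change my code?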

+++

John, Max, Tracey with SuperChats

Date Saved : 2019-11-22 19:29:46

--------------- Tuesday, 9 July 2019 ---------------

[John] [00:27] Hi

[Max] [01:25] No

[Tracey] [02:31] Anybody has some bananas?

[Max] [04:39] No

[John] [20:58] Oh my goodness

--------------- Wednesday, 10 July 2019 ---------------

[Tracey] [14:33] Anybody has a mug?

[Max] [14:45] No

[John] [14:45] Oh my buddha

+++
from collections import Counter
import re

wordDict = Counter()
with open(r'C:chatlog.txt', 'r', encoding='utf-8') as f:
    chatline = f.readlines()
    chatline = [x.strip() for x in chatline]
    chatline = [x for x in chatline if x]

    for count in range(len(chatline)):
        if count < 2:
            continue
        elif '---------------' in chatline:
            continue

        re.split(r"\[\d{2}[:]\d{2}\]", x for x in chatline) #Maybe need to modify this part

print('Word', 'Frequency')
for word, freq in wordDict.most_common(50):
    print('{0:10s} : {1:3d}'.format(word, freq))

Tags: python-3.x

Solution


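You can use the pattern /^.*?\[\d\d:\d\d\]\s*(.+)$/ to capture the text that follows the timestamp on the relevant lines. (I would work line by line rather than slurping the whole file with f.readlines(), which is not memory-friendly.) Because the timestamp is quite distinctive, the header and date-separator lines need no special handling: they simply fail to match. If you prefer, you can additionally test for the brackets around the username at the start of the line.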

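import re
from collections import Counter

words = []

with open("chatlog.txt", "r", encoding="utf-8") as f:
    # Iterate over the file object so only one line is held in memory at a time
    for line in f:
        # Group 1 captures everything after the first [HH:MM] timestamp
        m = re.search(r"^.*?\[\d\d:\d\d\]\s*(.+)$", line)

        # Header and date-separator lines have no timestamp and are skipped
        if m:
            words.extend(re.split(r"\s+", m.group(1)))

# Print the 50 most common words with their counts
for word, freq in Counter(words).most_common(50):
    print("{0:10s} : {1:3d}".format(word, freq))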

Output:

No         :   3
Anybody    :   2
has        :   2
Oh         :   2
my         :   2
Hi         :   1
some       :   1
bananas?   :   1
goodness   :   1
a          :   1
mug?       :   1
buddha     :   1
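
If you do want to test for the brackets around the username as well, a minimal sketch could look like the following. It assumes every message line has the form "[username] [HH:MM] message"; the names MESSAGE_RE and extract_message are made up for illustration and are not part of the original answer.

```python
import re

# Assumed stricter pattern: "[username] [HH:MM] message".
# MESSAGE_RE and extract_message are illustrative names only.
MESSAGE_RE = re.compile(r"^\[[^\]]+\]\s*\[\d\d:\d\d\]\s*(.+)$")


def extract_message(line):
    """Return the message text of a chat line, or None for any other line."""
    m = MESSAGE_RE.search(line)
    return m.group(1) if m else None


sample = [
    "John, Max, Tracey with SuperChats",
    "Date Saved : 2019-11-22 19:29:46",
    "--------------- Tuesday, 9 July 2019 ---------------",
    "[Tracey] [02:31] Anybody has some bananas?",
    "[Max] [04:39] No",
]

# Header and separator lines give None; chat lines give their message text
for line in sample:
    print(repr(extract_message(line)))
```

The stricter pattern only matters if a non-chat line could ever contain something that looks like [HH:MM]; for the log shown in the question, both patterns select the same lines.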

As the output shows (bananas?, mug?), stripping punctuation may also be worthwhile. You could use something like:

# ...
if m:
    no_punc = re.split(r"\W+", m.group(1))
    words.extend([x for x in no_punc if x])
# ...
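
Putting the pieces together, here is one way the punctuation-stripping variant might look as a self-contained script. This is only a sketch: the function name count_words and the in-memory sample list are assumptions made for illustration, and the sample reuses lines from the chat log in the question so that the script runs without a file.

```python
import re
from collections import Counter

# Same timestamp pattern as in the answer above
TIMESTAMP_RE = re.compile(r"^.*?\[\d\d:\d\d\]\s*(.+)$")


def count_words(lines):
    """Count the words that follow the [HH:MM] timestamp on each line."""
    counts = Counter()
    for line in lines:
        m = TIMESTAMP_RE.search(line)
        if m:
            # Split on runs of non-word characters and drop empty strings
            counts.update(w for w in re.split(r"\W+", m.group(1)) if w)
    return counts


# Lines taken from the chat log in the question (illustrative sample)
sample = [
    "John, Max, Tracey with SuperChats",
    "Date Saved : 2019-11-22 19:29:46",
    "--------------- Tuesday, 9 July 2019 ---------------",
    "[John] [00:27] Hi",
    "[Max] [01:25] No",
    "[Tracey] [02:31] Anybody has some bananas?",
    "[Max] [04:39] No",
    "[John] [20:58] Oh my goodness",
]

for word, freq in count_words(sample).most_common(50):
    print("{0:10s} : {1:3d}".format(word, freq))
```

Because count_words accepts any iterable of lines, an open file object can be passed in place of the list, which keeps the line-by-line reading recommended above. Note that splitting on \W+ also breaks contractions apart ("don't" becomes "don" and "t"), so a different split pattern may suit logs where that matters.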
