python - Regex: remove duplicate lines that are next to each other

Problem description

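I've extracted some closed captions (CC) from YouTube and I'm stuck with the values below; I can't figure out how to deal with them. I'm fine at replacing strings and that kind of thing, but I'm really bad once things get serious :(

This: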

 we
 all
 have
 a
 unique
 perspective
 on
 the
 we all have a unique perspective on the

 we all have a unique perspective on the
 world
 around
 us
 and
 believe
 it
 or
 not
 world around us and believe it or not

 world around us and believe it or not

should be replaced with:

we all have a unique perspective on the
world around us and believe it or not

Tags: python, regex

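With this regex you can get rid of every line that contains just one word; lines with more than one word that repeat exactly are collapsed into a single line: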

\w+\s*\n|([\w ]+)\n*(\1\n+)*

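Here the first alternative, \w+\s*\n, matches a single-word line, which gets replaced with an empty string. The second alternative, ([\w ]+)\n*(\1\n+)*, captures a line in group 1; (\1\n+)* then consumes any repetitions of that line, and the whole match is finally replaced with group 2, which holds the repeated copy of that same line.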
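A minimal sketch of how the pattern might be applied with re.sub in Python. The names collapse_captions, caption_block and SAMPLE are made up for illustration. Stripping each line's leading space and appending a final newline before matching is my own assumption, not something the answer states.

```python
import re

# Pattern quoted from the answer, unchanged.
PATTERN = re.compile(r"\w+\s*\n|([\w ]+)\n*(\1\n+)*")


def collapse_captions(text):
    """Drop single-word lines and collapse adjacent repeated lines."""
    # Assumption: strip each line's surrounding whitespace and make sure the
    # text ends with a newline, so the pattern sees "line\n" everywhere.
    normalized = "\n".join(line.strip() for line in text.splitlines()) + "\n"
    # Replace every match with group 2 (empty when the group did not take part).
    return PATTERN.sub(lambda m: m.group(2) or "", normalized)


def caption_block(sentence):
    """Hypothetical helper that rebuilds the shape of the question's input."""
    words = [" " + word for word in sentence.split()]
    return words + [" " + sentence, "", " " + sentence]


SAMPLE = "\n".join(
    caption_block("we all have a unique perspective on the")
    + caption_block("world around us and believe it or not")
)

if __name__ == "__main__":
    print(collapse_captions(SAMPLE), end="")
```

One thing to watch for: because the replacement is group 2, a multi-word line that shows up only once (with no adjacent repeat) leaves group 2 empty and is dropped as well, so this only fits input where every line you want to keep is repeated, as in the question.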

