Excluding with a regular expression and handling very large files

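Problem description

I have a text file that needs to be corrected. The words listed in the file "exclude.txt" should be removed from the original text.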

original.txt

<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="wark" block-list:name="wrok" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />

The exclude file looks like this:

exclude.txt
tart
wrok

The expected output looks like this:

final.txt
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />

This grep command works as expected:

grep -v -E 'tart|wrok' original.txt

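This is fine when the exclude file contains only 2 or 3 words. The problem is that both the original file and the exclude file contain millions of words.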


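Update:

I forgot to mention that original.txt also contains this line:

<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>

I want to keep this line in the original file: even though the wrong word "tart" is present, it is not in block-list:name.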


Update:

The include file is much smaller (15,000 words) than the exclude file, which has 15 million words.

include.txt
test
work
table
total
exit

The awk and grep + sed commands get killed. I would prefer to use the include file instead of the exclude file, if possible.

Tags: awk, sed, grep

Solution


You can use this grep + sed solution in bash:

grep -vFf <(sed 's/.*/block-list:name="&"/' exclude.txt) original.txt

<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
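  • sed 's/.*/block-list:name="&"/' exclude.txt wraps each word from exclude.txt as block-list:name="<word>".
  • grep -vFf prints every line of original.txt that matches none of the fixed-string patterns read from the process substitution <(...), which runs the sed command.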
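
To see what the two steps do, the pipeline can be split into stages and run on the sample data from the question. This is only a minimal sketch: the scratch directory and the patterns.txt file name are assumptions made for illustration, and the pattern list is written to a file instead of going through process substitution so that the script also runs in plain sh.

```shell
#!/bin/sh
# Scratch directory (hypothetical; any writable location works).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question, plus the line added in the first update.
cat > original.txt <<'EOF'
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="wark" block-list:name="wrok" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>
EOF
printf '%s\n' tart wrok > exclude.txt

# Step 1: wrap every excluded word as the fixed string block-list:name="<word>".
sed 's/.*/block-list:name="&"/' exclude.txt > patterns.txt
cat patterns.txt

# Step 2: -F treats the patterns as fixed strings, -f reads them from a file,
# and -v keeps only the lines that match none of them.
grep -vFf patterns.txt original.txt > final.txt
cat final.txt
```

On this sample, final.txt keeps six lines, including the one where "tart" appears only in block-list:abbreviated-name.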

PS: Per the edited question, this solution only filters out lines of the original file that contain block-list:name="blocked-word".
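
Since the question says an include file would be preferred, the same idea can probably be inverted by dropping -v and building the patterns from include.txt. The sketch below is a variation under assumptions, not part of the original answer: it assumes every line worth keeping has a block-list:name value listed in include.txt, and the file name keep-patterns.txt is made up for illustration. Note that a line whose name appears in neither file is dropped here, whereas the exclude version would keep it.

```shell
#!/bin/sh
# Scratch directory (hypothetical; any writable location works).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample data from the question, plus the line added in the first update.
cat > original.txt <<'EOF'
<block-list:block block-list:abbreviated-name="tost" block-list:name="test" />
<block-list:block block-list:abbreviated-name="tast" block-list:name="tart"/>
<block-list:block block-list:abbreviated-name="werk" block-list:name="work" />
<block-list:block block-list:abbreviated-name="wark" block-list:name="wrok" />
<block-list:block block-list:abbreviated-name="teble" block-list:name="table" />
<block-list:block block-list:abbreviated-name="tetal" block-list:name="total" />
<block-list:block block-list:abbreviated-name="exet" block-list:name="exit" />
<block-list:block block-list:abbreviated-name="tart" block-list:name="test"/>
EOF
printf '%s\n' test work table total exit > include.txt

# Wrap every allowed word as the fixed string block-list:name="<word>".
sed 's/.*/block-list:name="&"/' include.txt > keep-patterns.txt

# Without -v, grep keeps only the lines that match at least one pattern.
grep -Ff keep-patterns.txt original.txt > final.txt
cat final.txt
```

With a pattern list of about 15,000 entries instead of millions, grep has far fewer fixed strings to hold in memory, which may help with the commands being killed; this has not been measured on the real data.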

