Python: sorting and removing duplicates from a list, using re.sub

Problem description

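I am completely new to Python. I am trying to reproduce this bash command: cat domains.txt |sort -u|sed 's/^*.//g' > domains2.txt. The file domains.txt contains a list of domains with and without the wildcard prefix *. for example: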

*.example.com
example2.org

About 300k+ lines.

I wrote this code:

infile = "domains.txt"
outfile = "2"
outfile2 = "3"
with open(infile) as fin, open(outfile, "w+") as fout:
    for line in fin:
       line = line.replace('*.', "")
       fout.write(line)
with open('2', 'r') as r, open(outfile2, "w") as fout2 :
    for line in sorted(r):
        print(line, end='',file=fout2)

It strips the *. as planned and sorts the list, but it does not remove duplicate lines.

I was advised to use re.sub instead of replace to make the pattern stricter (anchored to the start of the line, as I do in sed), but when I tried this:

import re

infile = "domains.txt"
outfile = "2"
outfile2 = "3"
with open(infile) as fin, open(outfile, "w+") as fout:
    for line in fin:
       newline = re.sub('^*.', '', line)
       fout.write(newline)
with open('2', 'r') as r, open(outfile2, "w") as fout2 :
    for line in sorted(r):
        print(line, end='',file=fout2)

It simply fails with an error that I do not understand.

Tags: python, sed, re

Solution


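In a regular expression, * and . are special characters: * is a quantifier ("repeat the previous item") and . matches any character. In the pattern '^*.' the * follows the ^ anchor and has nothing to repeat, which is why Python raises an error. To match these characters literally, escape them with a backslash.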

import re

s = "*.example.com"
re.sub(r'^\*\.', '', s)

> 'example.com'
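
The answer above only covers the regex error. The question also asks why duplicates are not removed: sorted() orders the lines but keeps every copy. Below is a minimal sketch of the whole task, assuming the goal is one unique, sorted domain per line. The names clean_domains and convert are hypothetical helpers introduced here for illustration, not part of the original code. Unlike the bash pipeline, this strips the prefix before deduplicating, so *.example.com and example.com collapse into a single entry.

```python
import re

# Matches a literal "*." only at the very start of a line.
WILDCARD_PREFIX = re.compile(r'^\*\.')


def clean_domains(lines):
    """Strip a leading '*.' from each line, drop duplicates, and sort.

    Hypothetical helper: a set removes duplicates, sorted() orders the result.
    """
    unique = {WILDCARD_PREFIX.sub('', line.strip()) for line in lines}
    unique.discard('')  # ignore blank lines
    return sorted(unique)


def convert(infile, outfile):
    """Read domains from infile and write the cleaned list to outfile."""
    with open(infile) as fin:
        domains = clean_domains(fin)
    with open(outfile, 'w') as fout:
        fout.writelines(domain + '\n' for domain in domains)
```

Calling convert("domains.txt", "domains2.txt") would then produce the cleaned file in one pass. A set holding roughly 300k short strings fits comfortably in memory, so no intermediate file is needed.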
