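python - Best way to replace a list of tokens in a text file
Problem description
I have a text file (without punctuation) that is roughly 100 MB to 1 GB in size. Here are some sample lines: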
please check in here
i have a full hd movie
see you again bye bye
press ctrl c to copy text to clipboard
i need your help
...
I also have a list of replacement tokens, like this:
check in -> check_in
full hd -> full_hd
bye bye -> bye_bye
ctrl c -> ctrl_c
...
After the replacements are applied to the text file, the output I want is:
please check_in here
i have a full_hd movie
see you again bye_bye
press ctrl_c to copy text to clipboard
i need your help
...
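My current approach
import re

replace_tokens = {'ctrl c': 'ctrl_c', ...}  # a python dictionary
for line in open('text_file'):
    for token in replace_tokens:
        # re.sub needs the input string as its third argument
        line = re.sub(r'\b{}\b'.format(token), replace_tokens[token], line)
    # Save line to file
This solution works, but it is very slow with a large number of replacement tokens and a large text file. Is there a better solution?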
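Solution
Use binary files and string replacement, as follows:
- Process the file as binary to reduce the overhead of file conversion (text decoding and encoding)
- Use string replace instead of regular expressions
Code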
def process_binary(filename):
    """ Replace strings using binary and string replace
        Processing follows original code flow except using
        binary files and string replace """
    # Map using binary strings
    replace_tokens = {b'ctrl c': b'ctrl_c', b'full hd': b'full_hd', b'bye bye': b'bye_bye', b'check in': b'check_in'}
    outfile = append_id(filename, 'processed')
    with open(filename, 'rb') as fi, open(outfile, 'wb') as fo:
        for line in fi:
            for token in replace_tokens:
                line = line.replace(token, replace_tokens[token])
            fo.write(line)

def append_id(filename, id):
    " Convenience handler for generating name of output file "
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])
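One behavioural difference is worth keeping in mind: a plain `replace` on bytes matches raw substrings, whereas the `\b` anchors in the question's regex only match at word boundaries. The following is a minimal sketch of that difference. The helper names `replace_plain` and `replace_bounded` and the sample line are illustrative assumptions, not part of the original answer.

```python
import re

# Illustrative single-token mapping (hypothetical, for demonstration only)
TOKENS = {b'check in': b'check_in'}

def replace_plain(line, tokens=TOKENS):
    # Substring replacement: ignores word boundaries
    for old, new in tokens.items():
        line = line.replace(old, new)
    return line

def replace_bounded(line, tokens=TOKENS):
    # Regex replacement: only matches whole tokens, like the \b version
    for old, new in tokens.items():
        line = re.sub(rb'\b' + re.escape(old) + rb'\b', new, line)
    return line

sample = b'please check input here\n'
print(replace_plain(sample))    # b'please check_input here\n'
print(replace_bounded(sample))  # b'please check input here\n'
```

If the tokens can appear as prefixes of longer words in the real data, the plain replace will rewrite those as well, so it is worth checking whether that matters before switching approaches.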
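Performance comparison
On a 124 MB file (generated by replicating the posted strings):
- Posted solution: 82.8 seconds
- Regex that avoids the inner loop (dawg's post): 28.2 seconds
- Current solution: 9.5 seconds
The current solution is:
- ~8.7X faster than the posted solution, and
- ~3X faster than the regex that avoids the inner loop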
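The ratios quoted here follow directly from the elapsed times listed above. A short sketch of the arithmetic (the dictionary keys are labels chosen for this illustration):

```python
# Elapsed times in seconds, copied from the measurements listed above
timings = {'posted': 82.8, 'regex': 28.2, 'binary': 9.5}

speedup_vs_posted = timings['posted'] / timings['binary']
speedup_vs_regex = timings['regex'] / timings['binary']

print(round(speedup_vs_posted, 1))  # 8.7
print(round(speedup_vs_regex, 1))   # 3.0
```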
[Figure: overall trend]
Test code
# Generate Data by replicating posted string
s = """please check in here
i have a full hd movie
see you again bye bye
press ctrl c to copy text to clipboard
i need your help
"""
with open('test_data.txt', 'w') as fo:
    for i in range(1000000):  # Repeat string 1M times
        fo.write(s)
# Time Posted Solution
from time import time
import re
def posted(filename):
    replace_tokens = {'ctrl c': 'ctrl_c', 'full hd': 'full_hd', 'bye bye': 'bye_bye', 'check in': 'check_in'}
    outfile = append_id(filename, 'posted')
    with open(filename, 'r') as fi, open(outfile, 'w') as fo:
        for line in fi:
            for token in replace_tokens:
                line = re.sub(r'\b{}\b'.format(token), replace_tokens[token], line)
            fo.write(line)

def append_id(filename, id):
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])

t0 = time()
posted('test_data.txt')
print('Elapsed time: ', time() - t0)
# Elapsed time: 82.84100198745728
# Time Current Solution
from time import time
def process_binary(filename):
    replace_tokens = {b'ctrl c': b'ctrl_c', b'full hd': b'full_hd', b'bye bye': b'bye_bye', b'check in': b'check_in'}
    outfile = append_id(filename, 'processed')
    with open(filename, 'rb') as fi, open(outfile, 'wb') as fo:
        for line in fi:
            for token in replace_tokens:
                line = line.replace(token, replace_tokens[token])
            fo.write(line)

def append_id(filename, id):
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])

t0 = time()
process_binary('test_data.txt')
print('Elapsed time: ', time() - t0)
# Elapsed time: 9.593998670578003
# Time Processing using Regex
# Avoiding inner loop--see dawg posted answer
import re
def process_regex(filename):
    tokens = {"check in": "check_in", "full hd": "full_hd",
              "bye bye": "bye_bye", "ctrl c": "ctrl_c"}
    regex = re.compile("|".join([r"\b{}\b".format(t) for t in tokens]))
    outfile = append_id(filename, 'regex')
    with open(filename, 'r') as fi, open(outfile, 'w') as fo:
        for line in fi:
            line = regex.sub(lambda m: tokens[m.group(0)], line)
            fo.write(line)

def append_id(filename, id):
    return "{0}_{2}.{1}".format(*filename.rsplit('.', 1) + [id])

t0 = time()
process_regex('test_data.txt')
print('Elapsed time: ', time() - t0)
# Elapsed time: 28.27900242805481
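A possible variation, not timed in the measurements above, is to combine the two ideas: read the file in binary mode but keep a single compiled alternation so that word boundaries are still respected. The sketch below only shows the per-line replacement. The helper names `build_pattern` and `replace_line` are assumptions made for this illustration, and whether this is faster than the text-mode regex would need to be measured on the real data.

```python
import re

def build_pattern(tokens):
    # Longest tokens first, so a longer token wins over a shorter one
    # that shares its prefix; re.escape guards against regex metacharacters
    ordered = sorted(tokens, key=len, reverse=True)
    alternation = b'|'.join(re.escape(t) for t in ordered)
    return re.compile(rb'\b(?:' + alternation + rb')\b')

def replace_line(line, tokens, pattern):
    # Look up the replacement for whichever token matched
    return pattern.sub(lambda m: tokens[m.group(0)], line)

tokens = {b'check in': b'check_in', b'full hd': b'full_hd',
          b'bye bye': b'bye_bye', b'ctrl c': b'ctrl_c'}
pattern = build_pattern(tokens)

print(replace_line(b'press ctrl c to copy text to clipboard\n', tokens, pattern))
# b'press ctrl_c to copy text to clipboard\n'
```

In a file-processing loop, `replace_line` would be applied to each line read from a file opened with mode 'rb' and the result written to a file opened with mode 'wb', mirroring `process_binary`.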