regex - optimizing regex substitution runtime
问题描述
I have some code which does a very large number of regex substitutions. Essentially it boils down to this regex which happens around 50K times on a small example testcase I have:
$string=~s/$pattern/$replacement/g ;
I am already pre-compiling the pattern via qr// across all patterns (there are around 2.5K patterns in this testcase). I have profiled this with NYTProf, and see that the time used by the subroutines which the regex engine is as follows:
# spent 39.7s making 49461026 calls to CORE:regcomp, avg 802ns/call
# spent 7.94s making 49461026 calls to CORE:subst, avg 161ns/call
# spent 6.61s making 49461026 calls to CORE:match, avg 134ns/call
However the time taken for this line across the ~50K calls is ~300s according to the profiler. So this essentially means ~53s is used by regex engine, whereas there is a ~250s overhead?? What would this overhead comprise off? I guess the string needs to be dynamically re-allocated in memory after it is modified, but in reality the regex matches only a very few times, so I don't think thats where the overhead is.
Also what else can I do to reduce this runtime? Both pattern and replacement are simple strings which use none of the power of regexes as such (the only real regex character used is word boundary - \b at the beginning and end of $pattern, otherwise just a sequence of fixed word characters)
EDIT: I actually realized a solution after I asked the question here. Let me clarify how the original code looked like, and then explain the solution if it will help anyone in future.
Simplified original code:
foreach my $string (@strings) {
foreach my $pattern (@patterns) {
my $replacement = $pattern2replacement{$pattern} ;
my $compiled_pattern = $pattern2compiled{$pattern} ;
$string=~s/$compiled_pattern/$replacement/g ;
# do something with $string
}
}
In the actual code, the inner foreach was in a subroutine with other checks/preprocessing going on before entering the foreach. Also the outer foreach wasn't really a single one but interspersed at many places in code.
Solution:
The key point here is that $string only contained real sub-strings ($pattern) which needed to be replaced with other sub-strings ($replacement). The regex was probably overkill. Though I did have multiple sub-strings which I needed to replace, they were guaranteed to be on a word boundary. Also one thing to note is that a replacement may have itself have a sub-string that is a previous pattern in @patterns. eg:
@patterns = ('small', 'blue') ;
%pattern2replacement = ( 'small' => 'big and blue', 'blue' => 'black') ;
i.e. we expect the string small pox
to be replaced by big and black pox
So the following alternative provides a huge runtime improvement:
#Step1: Build complete replacement hash:
my %oneshot_replacement ;
foreach my $pattern (@patterns) {
my $replacement = $pattern2replacement{$pattern} ;
my @splits = split(/\b/, $replacement) ;
@splits = map {exists $oneshot_replacement{$_} ? $oneshot_replacement{$_} : $_} @splits ;
$oneshot_replacement{$pattern} = join("", @splits) ;
}
#Step2: do substitution without regex:
foreach my $string (@strings) {
my @splits = split(/\b/, $string) ;
@splits = map {exists $oneshot_replacement{$_} ? $oneshot_replacement{$_} : $_} @splits ;
$string = join("", @splits) ;
# do something with $string
}
This helped reduce the runtime from ~300s to ~20s.
解决方案
关于您的问题,其余的 300 件用在哪里,答案可能是:在探查器中。
假设模式的数量“相对较少”,并且每个都是一个完整的单词,我猜这个代码比没有正则表达式的替换要快得多:
my $or = join("|",@patterns);
$string =~ s/\b($or)\b/$oneshot_replacement{$1}/g;
#print "$string";
无论如何,在第一个代码部分(#Step1:构建完整的替换哈希:)上面你犯了2个错误:
#my @splits = split(/\b/, $pattern) ;
my @splits = split(/\b/, $replacement) ;
如果您只想对模式数组进行一次迭代,则必须以正确的顺序执行此操作(如果可能的话)。
一种适用于您的示例的解决方案(允许某种扩展)是
#@splits = map {exists $oneshot_replacement{$_} ? $oneshot_replacement{$_} : $_} @splits ;
@splits = map {exists $oneshot_replacement{$_} ? $oneshot_replacement{$_} : exists $pattern2replacement{$_} ? $pattern2replacement{$_} : $_} @splits ;
推荐阅读
- javascript - 为什么这种添加事件监听器的“递归”方式不起作用?
- javascript - 与容器 div 重叠的传单地图
- react-native - React Native - 用户成功登录后从左侧菜单栏中隐藏登录按钮
- c# - “该进程无法访问该文件,因为它正被另一个进程使用” c#
- architecture - 提高基于微服务的系统的可扩展性
- android - Android SQLite 更新表不返回
- python - 如何使用 Django(不使用 Javascript)访问选择标签中的选定选项?
- reactjs - 打字稿:编译失败
- reactjs - 如何摆脱这个 - 消息:{'你没有订阅这个 API。'}。?
- php - Symfony 5 控制器不会验证错误