首页 > 解决方案 > optimizing regex substitution runtime

问题描述

I have some code which does a very large number of regex substitutions. Essentially it boils down to this regex which happens around 50K times on a small example testcase I have:

$string=~s/$pattern/$replacement/g ;

I am already pre-compiling the pattern via qr// across all patterns (there are around 2.5K patterns in this testcase). I have profiled this with NYTProf, and see that the time used by the subroutines which the regex engine is as follows:

# spent  39.7s making 49461026 calls to CORE:regcomp, avg 802ns/call
# spent  7.94s making 49461026 calls to CORE:subst, avg 161ns/call
# spent  6.61s making 49461026 calls to CORE:match, avg 134ns/call

However the time taken for this line across the ~50K calls is ~300s according to the profiler. So this essentially means ~53s is used by regex engine, whereas there is a ~250s overhead?? What would this overhead comprise off? I guess the string needs to be dynamically re-allocated in memory after it is modified, but in reality the regex matches only a very few times, so I don't think thats where the overhead is.

Also what else can I do to reduce this runtime? Both pattern and replacement are simple strings which use none of the power of regexes as such (the only real regex character used is word boundary - \b at the beginning and end of $pattern, otherwise just a sequence of fixed word characters)

EDIT: I actually realized a solution after I asked the question here. Let me clarify how the original code looked like, and then explain the solution if it will help anyone in future.

Simplified original code:

foreach my $string (@strings) {
  foreach my $pattern (@patterns) {
    my $replacement = $pattern2replacement{$pattern} ;
    my $compiled_pattern = $pattern2compiled{$pattern} ;
    $string=~s/$compiled_pattern/$replacement/g ;
    # do something with $string
  }
}

In the actual code, the inner foreach was in a subroutine with other checks/preprocessing going on before entering the foreach. Also the outer foreach wasn't really a single one but interspersed at many places in code.

Solution:

The key point here is that $string only contained real sub-strings ($pattern) which needed to be replaced with other sub-strings ($replacement). The regex was probably overkill. Though I did have multiple sub-strings which I needed to replace, they were guaranteed to be on a word boundary. Also one thing to note is that a replacement may have itself have a sub-string that is a previous pattern in @patterns. eg:

 @patterns = ('small', 'blue') ; 
 %pattern2replacement = ( 'small' => 'big and blue', 'blue' => 'black') ;

i.e. we expect the string small pox to be replaced by big and black pox So the following alternative provides a huge runtime improvement:

#Step1: Build complete replacement hash:
my %oneshot_replacement ; 
foreach my $pattern (@patterns) {
  my $replacement = $pattern2replacement{$pattern} ;
  my @splits = split(/\b/, $replacement) ;
  @splits = map {exists $oneshot_replacement{$_} ? $oneshot_replacement{$_} : $_} @splits ;
  $oneshot_replacement{$pattern} = join("", @splits) ; 
}

#Step2: do substitution without regex:
foreach my $string (@strings) {
  my @splits = split(/\b/, $string) ;
  @splits = map {exists $oneshot_replacement{$_} ? $oneshot_replacement{$_} : $_} @splits ;
  $string = join("", @splits) ;
  # do something with $string
}

This helped reduce the runtime from ~300s to ~20s.

标签: regexperloptimization

解决方案


关于您的问题,其余的 300 件用在哪里,答案可能是:在探查器中。

假设模式的数量“相对较少”,并且每个都是一个完整的单词,我猜这个代码比没有正则表达式的替换要快得多:

my $or = join("|",@patterns);
$string =~ s/\b($or)\b/$oneshot_replacement{$1}/g;
#print "$string";

无论如何,在第一个代码部分(#Step1:构建完整的替换哈希:)上面你犯了2个错误:

#my @splits = split(/\b/, $pattern) ;
my @splits = split(/\b/, $replacement) ;

如果您只想对模式数组进行一次迭代,则必须以正确的顺序执行此操作(如果可能的话)。

一种适用于您的示例的解决方案(允许某种扩展)是

#@splits = map {exists $oneshot_replacement{$_} ? $oneshot_replacement{$_} : $_} @splits ;
@splits = map {exists $oneshot_replacement{$_} ? $oneshot_replacement{$_} : exists $pattern2replacement{$_} ? $pattern2replacement{$_} : $_} @splits ;

推荐阅读