首页 > 解决方案 > 为什么我的正则表达式会随着更多的运营商而呈指数级增长?

问题描述

所以我正在用 perl 正则表达式构建一个解析器。我在模式中间使用代码来实现优先级。我还有一个构建系统,可以自动生成称为构面的“影子”正则表达式,并且不包含任何代码,因此当我乱序匹配时,我可以填充到完全匹配。

无论如何,现在我已经创建了一个手写示例,并显示了我的二进制操作匹配代码。无论表达式中是否存在运算符,它都会以指数方式变慢,我们走的链越高。

例如,如果我从 shift 运算符开始,它将需要大约 1 秒,但如果我从 and 逻辑开始,则需要一整分钟或其他时间(我没有测量时间,但它明显慢得多),因为我'我之前提到过,实际的运算符不包含任何更高优先级的运算符——只有链中的最低优先级——比如加法和乘法。

这是 perl 示例(我来自 pcre2,当像这样启动时它完全失败,匹配计数耗尽):

#!/usr/bin/perl

$text = "7*7*4*5*6*7+(4*4*4+3*7*4);7*7*4*5*6*7+(4*4*4+3*7*4)";

$result = $text =~/((*F) (?# << clever way of declaring subroutines in a regex without actually executing them)

(?<orlogicplusrest>(?>(?&andlogicplusrest))(?>(?<orlogicop>(?<orlogicopraw>\|\|)(?>(?&orlogicplusrest))))
    |(?>(?&andlogicplusrestfacet))(?{print("$+{orlogicopraw}\n")if($+{orlogicopraw});}))

(?<andlogicplusrest>(?>(?&orplusrest))(?>(?<andlogicop>(?<andlogicopraw>&&)(?>(?&andlogicplusrest))))
    |(?>(?&orplusrestfacet))(?{print("$+{andlogicopraw}\n")if($+{andlogicopraw});}))

(?<orplusrest>(?>(?&xorplusrest))(?>(?<orop>(?!\|\|)(?<oropraw>\|)(?>(?&orplusrest))))
    |(?>(?&xorplusrestfacet))(?{print("$+{oropraw}\n")if($+{oropraw});}))

(?<xorplusrest>(?>(?&andplusrest))(?>(?<xorop>(?<xoropraw>\^)(?>(?&xorplusrest))))
    |(?>(?&andplusrestfacet))(?{print("$+{xoropraw}\n")if($+{xoropraw});}))

(?<andplusrest>(?>(?&eqplusrest))(?>(?<andop>(?!&&)(?<andopraw>&)(?>(?&andplusrest))))
    |(?>(?&eqplusrestfacet))(?{print("$+{andopraw}\n")if($+{andopraw});}))

(?<eqplusrest>(?>(?&relplusrest))(?>(?<eqop>(?<eqopraw>==|!=)(?>(?&eqplusrest))))
    |(?>(?&relplusrestfacet))(?{print("$+{eqopraw}\n")if($+{eqopraw});}))

(?<relplusrest>(?>(?&shiftplusrest))(?>(?<relop>(?<relopraw>(?!<<)<|(?!>>)>|<=|>=)(?>(?&relplusrest))))
    |(?>(?&shiftplusrestfacet))(?{print("$+{relopraw}\n")if($+{relopraw});}))

(?<shiftplusrest>(?>(?&addmulplusrest))(?>(?<shiftop>(?<shiftopraw><<|>>)(?>(?&shiftplusrest))))
    |(?>(?&addmulplusrestfacet))(?{print("$+{shiftopraw}\n")if($+{shiftopraw});}))

(?<addmulplusrest>(?>(?&muloprest))(?>(?<addop>(?!\+\+|--)(?<addopraw>[\+\-])(?>(?&addmulplusrest))))
    |(?>(?&muloprestfacet))(?{print("$+{addopraw}\n") if($+{addopraw});}))

(?<muloprest>(?>(?&testxpr))(?>(?<mulop>(?<mulopraw>[\*\/\%])(?>(?&muloprest))))
    |(?>(?&testxprfacet))(?{print("$+{mulopraw}\n") if($+{mulopraw});}))

(?# entry point - replace here)

(?<testentry>(?<innertest>(?>(?&orlogicplusrest)));(?&innertest);)

(?# either in parnetheses or without)

(?<testxpr>(?&inner)|(?&inparenths))

(?# testexpr facet - replace here)

(?<testxprfacet>(?&innerraw)|\s*[(](?&orlogicplusrestfacet)[)]\s*)

(?<inparenths>\s*[(](?&innertest)[)]\s*)


(?# a digit)

(?<inner>(?<innerraw>\d++)(?{print("$+{innerraw}\n") if($+{innerraw});}))


(?#following are facets - you can safely ignore them - they are the same as the above except they don't print anything)

(?<orlogicplusrestfacet>(?>(?&andlogicplusrestfacet))(?>((\|\|)(?>(?&orlogicplusrestfacet))))
    |(?>(?&andlogicplusrestfacet)))

(?<andlogicplusrestfacet>(?>(?&orplusrestfacet))(?>((&&)(?>(?&andlogicplusrestfacet))))
    |(?>(?&orplusrestfacet)))

(?<orplusrestfacet>(?>(?&xorplusrestfacet))(?>((?!\|\|)(\|)(?>(?&orplusrestfacet))))
    |(?>(?&xorplusrestfacet)))

(?<xorplusrestfacet>(?>(?&andplusrestfacet))(?>((\^)(?>(?&xorplusrestfacet))))
    |(?>(?&andplusrestfacet)))

(?<andplusrestfacet>(?>(?&eqplusrestfacet))(?>((?!&&)(&)(?>(?&andplusrestfacet))))
    |(?>(?&eqplusrestfacet)))

(?<eqplusrestfacet>(?>(?&relplusrestfacet))(?>((==|!=)(?>(?&eqplusrestfacet))))
    |(?>(?&relplusrestfacet)))

(?<relplusrestfacet>(?>(?&shiftplusrestfacet))(?>(((?!<<)<|(?!>>)>|<=|>=)(?>(?&relplusrestfacet))))
    |(?>(?&shiftplusrestfacet)))

(?<shiftplusrestfacet>(?>(?&addmulplusrestfacet))(?>((<<|>>)(?>(?&shiftplusrestfacet))))
    |(?>(?&addmulplusrestfacet)))

(?<addmulplusrestfacet>(?>(?&muloprestfacet))(?>((?!\+\+|--)([\+\-])(?>(?&addmulplusrestfacet))))
    |(?>(?&muloprestfacet)))

(?<muloprestfacet>(?>(?&testxprfacet))(?>(([\*\/\%])(?>(?&muloprestfacet))))
    |(?>(?&testxprfacet)))

(?#end facets - follows the entry point)

)|(^(?&testentry))/xx;
                        
                        
print $&;

在此处显示替换的地方,您可以orlogicplusrest使用链中的任何其他运算符更改 以检查性能 - 例如,如果您放置shiftplusrest(and shiftplusrestfacet),它将几乎立即执行。

所以我的问题是 - 有什么方法可以加快速度,同时以正确的顺序考虑每个运营商。我已经尝试在任何地方添加原子组,但它仍然是一样的。

标签: regexperformanceperl

解决方案


所以实际上,老实说,我在以前的版本中有一个实际工作的解决方案——我只需要稍微调整一下。

我不知道它是如何做到的,但它更快并且它正在工作。唯一需要注意的是,您需要指定表达式终止的位置,否则它将缩小到第一个操作。

嗯,您基本上可以在一种情况下对它们进行比较-例如,我们拥有的原始版本:

~/(?<orlogicplusrest>(?>(?&andlogicplusrest))(?>(?<orlogicop>(?<orlogicopraw>\|\|)(?>(?&orlogicplusrest))))
    |(?>(?&andlogicplusrestfacet))(?{print("$+{orlogicopraw}\n")if($+{orlogicopraw});}))/

正如您在此处看到的,我们匹配整个事物 - 右侧和左侧,但另一方面在我更快的解决方案中:

~/(?<orlogicplusrest>(?<orlogicop>(?<orlogicopraw>\|\|)(?&primexprnoternary)(?!(?&andlogicplusrestfacet))
    (?{print("$+{orlogicopraw}\n")if($+{orlogicopraw});}))|(?&andlogicplusrest)(?&orlogicop)?)/

仅作为后缀匹配某物,并用于primexprnoternary制作循环:

~/(?<primexprnoternary>(?&testxpr)((?&orlogicplusrest)?+)*?)/

Asprimexprnoternary也用于运算符内部的右侧。

在另一个解决方案中,我们只写:

~/(?<innertest>(?>(?&orlogicplusrest)))/

这是完整的代码:

#!/usr/bin/perl

$text = "7*7*4*5*6*7+(4*4*4+3*7*4);";

$result = $text =~/((*F) (?# << clever way of declaring subroutines in a regex without actually executing them)


(?<orlogicplusrest>(?<orlogicop>(?<orlogicopraw>\|\|)(?&primexprnoternary)(?!(?&andlogicplusrestfacet))
    (?{print("$+{orlogicopraw}\n")if($+{orlogicopraw});}))|(?&andlogicplusrest)(?&orlogicop)?)

(?<andlogicplusrest>(?<andlogicop>(?<andlogicopraw>&&)(?&primexprnoternary)(?!(?&orplusrestfacet))
    (?{print("$+{andlogicopraw}\n")if($+{andlogicopraw});}))|(?&orplusrest)(?&andlogicop)?)

(?<orplusrest>(?<orop>(?!\|\|)(?<oropraw>\|)(?&primexprnoternary)(?!(?&xorplusrestfacet))
    (?{print("$+{oropraw}\n")if($+{oropraw});}))|(?&xorplusrest)(?&orop)?)

(?<xorplusrest>(?<xorop>(?<xoropraw>\^)(?&primexprnoternary)(?!(?&andplusrestfacet))
    (?{print("$+{xoropraw}\n")if($+{xoropraw});}))|(?&andplusrest)(?&xorop)?)

(?<andplusrest>(?<andop>(?!&&)(?<andopraw>&)(?&primexprnoternary)(?!(?&eqplusrestfacet))
    (?{print("$+{andopraw}\n")if($+{andopraw});}))|(?&eqplusrest)(?&andop)?)

(?<eqplusrest>(?<eqop>(?<eqopraw>==|!=)(?&primexprnoternary)(?!(?&relplusrestfacet))
    (?{print("$+{eqopraw}\n")if($+{eqopraw});}))|(?&relplusrest)(?&eqop)?)

(?<relplusrest>(?<relop>(?<relopraw>(?!<<)<|(?!>>)>|<=|>=)(?&primexprnoternary)(?!(?&shiftplusrestfacet))
    (?{print("$+{relopraw}\n")if($+{relopraw});}))|(?&shiftplusrest)(?&relop)?)

(?<shiftplusrest>(?<shiftop>(?<shiftopraw><<|>>)(?&primexprnoternary)(?!(?&addmulplusrestfacet))
    (?{print("$+{shiftopraw}\n")if($+{shiftopraw});}))|(?&addmulplusrest)(?&shiftop)?)
    
(?<addmulplusrest>(?<addop>(?<addopraw>[\+\-])((?&primexprnoternary)(?!(?&mulopfacet))
    (?{print("$+{addopraw}\n")if($+{addopraw});})))|(?&mulop)(?&addop)?)

(?<mulop>(?<mulopraw>[\*\/\%])(?&primexprnoternary)
    (?{print("$+{mulopraw}\n")if($+{mulopraw});}))

(?# entry point)

(?<testentry>(?<innertest>(?&primexprnoternary));)

(?# replace here)

(?<primexprnoternary>(?&testxpr)((?&orlogicplusrest)?+)*?)

(?# either in parnetheses or without)

(?<testxpr>(?&inner)|(?&inparenths))

(?# primexprnoternary facet - replace here)

(?<primexprnoternaryfacet>(?&testxprfacet)((?&orlogicplusrestfacet)?+)*?)

(?<testxprfacet>(?&innerraw)|\s*[(](?&primexprnoternaryfacet)[)]\s*)

(?<inparenths>\s*[(](?&primexprnoternary)[)]\s*)


(?# a digit)

(?<inner>(?<innerraw>\d++)(?{print("$+{innerraw}\n") if($+{innerraw});}))


(?#following are facets - you can safely ignore them - they are the same as the above except they don't print anything)

(?<orlogicplusrestfacet>(?&andlogicplusrest)((\|\|)(?&primexprnoternaryfacet)(?!(?&andlogicplusrestfacet)))?)

(?<andlogicplusrestfacet>(?&orplusrest)((&&)(?&primexprnoternaryfacet)(?!(?&orplusrestfacet)))?)

(?<orplusrestfacet>(?&xorplusrest)((?!\|\|)(\|)(?&primexprnoternaryfacet)(?!(?&xorplusrestfacet)))?)

(?<xorplusrestfacet>(?&andplusrest)((\^)(?&primexprnoternaryfacet)(?!(?&andplusrestfacet)))?)

(?<andplusrestfacet>(?&eqplusrest)((?!&&)(&)(?&primexprnoternaryfacet)(?!(?&eqplusrestfacet)))?)

(?<eqplusrestfacet>(?&relplusrest)((==|!=)(?&primexprnoternaryfacet)(?!(?&relplusrestfacet)))?)

(?<relplusrestfacet>(?&shiftplusrest)(((?!<<)<|(?!>>)>|<=|>=)(?&primexprnoternaryfacet)(?!(?&shiftplusrestfacet)))?)

(?<shiftplusrestfacet>(?&addmulplusrest)((<<|>>)(?&primexprnoternaryfacet)(?!(?&addmulplusrestfacet)))?)
    
(?<addmulplusrestfacet>(?&mulop)(([\+\-])((?&primexprnoternaryfacet)(?!(?&mulopfacet))))?)

(?<mulopfacet>([\*\/\%])(?&primexprnoternaryfacet))

(?#end facets - follows the entry point)

)|(^(?&testentry))/xx;
                        
                        
print $&;

推荐阅读