Scalability of Regular Expressions (MarkLogic)

Question

I have been looking for ways to use regular expressions in MarkLogic, for both XQuery and SPARQL. For XQuery, fn:matches seems to be the only option. The recommendation also seems to be to narrow the data down with queries before running it through a for loop, as can be seen in this thread. However, what if I cannot narrow it down and need to loop through millions of records? Is there a more scalable way to do this? I'm unsure whether task bot is the option I should be looking at.

On the other hand in SPARQL there are two ways to approach this.

First Method

SELECT ?s ?p ?o
WHERE {?s ?p ?o
  FILTER (regex (?o, ".*Name.*", "i"))
}

Second Method

PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
SELECT ?s ?p ?o
WHERE {?s ?p ?o
  FILTER (fn:matches(?o, ".*Name.*"))
}

Are these two SPARQL options equivalent, or is one slightly better than the other? I would also greatly appreciate any advice or better ways to approach this for both SPARQL and XQuery.

Tags: regex, marklogic

Answer


Essentially, you are doing a substring match on your search string "Name"; for that, fn:contains is sufficient:

fn:contains(?o, "Name")
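Dropped into the shape of the second query from the question, that filter might look like the sketch below. It assumes the same fn: prefix the question declares, and it has not been run against a live MarkLogic instance. The isLiteral guard is an addition of mine so the string function is only applied to literal values.

```sparql
# Sketch only: same fn: prefix as in the question's second method.
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o
  # Skip IRIs and blank nodes; only literal objects are tested.
  FILTER (isLiteral(?o))
  # Plain substring test instead of the regex ".*Name.*".
  FILTER (fn:contains(?o, "Name"))
}
```

Note that the first method in the question passes the "i" flag and the second does not. If you need the case-insensitive behaviour of the first method, one option is to lower-case both sides, for example fn:contains(fn:lower-case(?o), "name").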

Some advice:

Avoid regular expressions whenever you can replace them with simple string-search filters.

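I once had to redo an entire Java project that used regular expressions that were not especially complex, yet even the few lookarounds in them made it very slow. I had to break those regular expressions down into multiple levels of string-search filters, which made a difference. In MarkLogic, functions such as fn:substring-before and fn:substring-after can help you reduce the length of the text as you move through the levels of string-search filters.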
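As a rough illustration of what "levels" of string filters can mean, suppose (hypothetically) the regex you want to replace is "Name.*ID", i.e. "Name" followed somewhere later by "ID". The sketch below is untested and assumes the fn: prefix from the question. The pattern and the "ID" marker are made up for the example.

```sparql
# Sketch only: staged string filters standing in for the hypothetical regex "Name.*ID".
PREFIX fn: <http://www.w3.org/2005/xpath-functions#>
SELECT ?s ?p ?o
WHERE {
  ?s ?p ?o
  FILTER (isLiteral(?o))
  # Level 1: cheap substring test; most values are rejected here.
  FILTER (fn:contains(?o, "Name"))
  # Level 2: look for "ID" only in the text after the first "Name",
  # so the second test runs on a shorter string.
  FILTER (fn:contains(fn:substring-after(?o, "Name"), "ID"))
}
```

For single-line values this should accept the same strings as the regex, because "ID" has to appear somewhere after an occurrence of "Name". Values containing line breaks may differ, since "." does not match a newline by default.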

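Nonetheless, if you must use regular expressions and you run into performance problems, then, apart from parallel computing, it is best to delegate the regex matching to a language or technology that excels at it, such as Perl.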

