Highlight words in a regex match

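Problem description

I am trying to use Regex. I want the regex to return X words before and after the match, and to add highlighting around all occurrences of the search text.

For example, consider the following paragraph. The result should have at least 10 characters before and after, with no words cut off. The search term is "dog".

The Dog is a pet animal. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of them are very friendly while some of them are dangerous. Dogs are of different colors like black, red, white and brown. Some of them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves by guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!

The result I want is an array like the following:

What I have:

I searched around and found the following regular expression, which returns the required results perfectly but adds no extra formatting. I created a couple of methods to handle each piece of functionality: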

private List<List<string>> Search(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<List<string>>();
    // Each space-separated word in searchTerm is searched for separately.
    var searchTerms = searchTerm.Split(' ');
    foreach (var word in searchTerms) {
        var searchResults = ExtractParagraph(text, word, sizeOfResult, searchEntireWord);
        result.Add(searchResults);
        if (searchResults.Count > 0) {
            foreach (var searchResult in searchResults) {
                Response.Write("<strong>Result:</strong> " + searchResult + "<br>");
            }
        }
    }
    return result;
}

private List<string> ExtractParagraph(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<string>();
    searchTerm = searchEntireWord ? @"\b" + searchTerm + @"\b" : searchTerm;
    //var expression = @"((^.{0,30}|\w*.{30})\b" + searchTerm + @"\b(.{30}\w*|.{0,30}$))";
    var expression = @"((^.{0," + sizeOfResult + @"}|\w*.{" + sizeOfResult + @"})" + searchTerm + @"(.{" + sizeOfResult + @"}\w*|.{0," + sizeOfResult + @"}$))";
    var wordMatch = new Regex(expression, RegexOptions.IgnoreCase | RegexOptions.Singleline);

    foreach (Match m in wordMatch.Matches(text)) {
        result.Add(m.Value);
    }
    return result;
}

I can call it like this:

var text = "The Dog is a pet animal. It is one of...";
var searchResults = Search(text, "dog", 10, false);
if (searchResults.Count > 0) {
    foreach (var searchResult in searchResults) {
        foreach (var result in searchResult) {
            Response.Write("<strong>Result:</strong> " + result + "<br>");
        }
    }
}

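I do not yet know what the result will be, or how to handle it, when the term occurs more than once within the 10 characters, e.g. in a sentence like "A dog is of course a dog!". I think I can deal with that later.

Tests: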

var searchResults = Search(text, "dog", 0, false); // should include only the matched word
var searchResults = Search(text, "dog", 1, false); // should include the matched word and only one word preceding and following the matched word (if any)
var searchResults = Search(text, "dog", 10, false); // should include the matched word and up to 10 characters (but not cutting off words in the middle) preceding and following it (if any)
var searchResults = Search(text, "dog", 50, false); // should include the matched word and up to 50 characters (but not cutting off words in the middle) preceding and following it (if any)

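Problem:

The functions I created allow the search to look for searchTerm either as a whole word only or as part of a word.

What I do is a simple Replace(word, "<strong>" + word + "</strong>") on each result when displaying it. This works well when I am searching for part of a word. But when searching for a whole word, if a result contains searchTerm as part of another word, that part of the word is highlighted too.

For example, if I search for "dog" and the result is "All dogs go to dog heaven.", it is highlighted as "All <strong>dog</strong>s go to <strong>dog</strong> heaven." But I want "All dogs go to <strong>dog</strong> heaven."

Question:

How can I get the matched word wrapped in an HTML <strong> tag, or in anything else I want?

Tags: c#, regex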

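Solution


Your solution needs to do two main things: 1) extract the matches, i.e. the keywords/phrases plus the extra left and right context around them, and 2) wrap the search term with tags.

The extraction regex (for example, with 10 characters of context on each side) is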

(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)

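See the regex demo.

Details

  • (?si) - enables the Singleline and IgnoreCase modifiers (. matches any character, including newlines, and the pattern is case-insensitive)
  • (?<!\S) - a left-hand whitespace boundary (the position is at the start of the string or right after whitespace)
  • .{0,10} - 0 to 10 characters
  • (?<!\S) - a left-hand whitespace boundary
  • \S*dog\S* - dog with any 0+ non-whitespace characters around it (note: if searchEntireWord is true, the \S* parts must be removed from this part of the pattern)
  • (?!\S) - a right-hand whitespace boundary (the position is at the end of the string or right before whitespace)
  • .{0,10} - 0 to 10 characters
  • (?!\S) - a right-hand whitespace boundary.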
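As a quick check of how the boundaries keep words intact, here is a minimal sketch that runs the 10-character pattern above on one short sentence. The class and method names (ContextDemo, Extract) are made up for this illustration and are not part of the original code.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ContextDemo
{
    // Returns every match of the 10-character context pattern described above.
    public static string[] Extract(string text)
    {
        const string pattern = @"(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)";
        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        // The right-hand context stops at "a" because taking more characters
        // would end in the middle of "real-pet".
        foreach (var hit in Extract("The Dog is a real-pet animal."))
            Console.WriteLine(hit); // prints: The Dog is a
    }
}
```

The context quantifier is greedy, so it first tries the full 10 characters and backtracks until the whitespace boundary holds; that is why the context can be shorter than 10 characters but never cuts a word.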

In C#, it can be defined as

var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
if (searchEntireWord) { 
    expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
} 

Note that {{ is actually a literal { and }} is a literal } in the format string.
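To see the brace escaping in action, the following sketch builds the partial-word pattern and prints it. PatternDemo and BuildPattern are hypothetical names used only here; the format string itself is the one shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternDemo
{
    // Hypothetical helper: builds the partial-word extraction pattern.
    // "{{" and "}}" become literal braces, {0} is the context size,
    // and {1} is the regex-escaped search term.
    public static string BuildPattern(int sizeOfResult, string searchTerm)
    {
        return string.Format(
            @"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)",
            sizeOfResult,
            Regex.Escape(searchTerm));
    }

    public static void Main()
    {
        Console.WriteLine(BuildPattern(10, "dog"));
        // prints: (?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)
    }
}
```

Regex.Escape matters once the search term contains regex metacharacters: a term such as c++ is turned into c\+\+ so that it is matched literally.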

The second regex, which wraps the key term in strong tags, is much simpler:

Regex.Replace(x.Value, 
            searchEntireWord ? 
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) : 
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)), 
            "<strong>$&</strong>")

Note that $& in the replacement pattern refers to the whole match value.
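The wrapping step can be tried on its own. This is a small sketch under the assumption that the snippet has already been extracted; HighlightDemo and Highlight are illustrative names, while the two patterns and the replacement string are the ones shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HighlightDemo
{
    // Wraps each occurrence of term in <strong> tags.
    // With wholeWord = true, only occurrences delimited by whitespace
    // (or by the start/end of the string) are wrapped.
    public static string Highlight(string snippet, string term, bool wholeWord)
    {
        var pattern = wholeWord
            ? string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(term))
            : string.Format(@"(?i){0}", Regex.Escape(term));
        // $& inserts the matched text itself, so the original casing is kept.
        return Regex.Replace(snippet, pattern, "<strong>$&</strong>");
    }

    public static void Main()
    {
        Console.WriteLine(Highlight("All dogs go to Dog heaven.", "dog", true));
        // prints: All dogs go to <strong>Dog</strong> heaven.
        Console.WriteLine(Highlight("All dogs go to Dog heaven.", "dog", false));
        // prints: All <strong>dog</strong>s go to <strong>Dog</strong> heaven.
    }
}
```

Because the whole-word boundaries are whitespace-based, a term directly followed by punctuation (for example "dog.") is not treated as a whole word by this pattern.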

C# code:

public static List<string> ExtractTexts(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) 
{
    var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
    if (searchEntireWord) { 
        expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
    } 
    return Regex.Matches(text, expression) 
        .Cast<Match>() 
        .Select(x => Regex.Replace(x.Value, 
            searchEntireWord ? 
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) : 
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)), 
            "<strong>$&</strong>"))
        .ToList();
}

Example usage (see the demo):

var text = "The Dog is a real-pet animal. There's an undogging dog that only undogs non-dogs. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of the are very friendly while some of them a dangerous. Dogs are of different color like black, red, white and brown. Some old them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves b) guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!";
var searchTerm = "dog";
var searchEntireWord = false;
Console.WriteLine("======= 10 ========");
var results = ExtractTexts(text, searchTerm, 10, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output:

======= 10 ========
(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)
The <strong>Dog</strong> is a
an un<strong>dog</strong>ging <strong>dog</strong> that
only un<strong>dog</strong>s non-<strong>dog</strong>s.
kinds of <strong>dog</strong>s in the
<strong>Dog</strong>s are of
skin. <strong>Dog</strong>s are
a tail. <strong>Dog</strong>s are
A <strong>dog</strong> is called
world. <strong>Dog</strong>gonit!

Another example:

Console.WriteLine("======= 15 ========");
results = ExtractTexts(text, searchTerm, 15, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output:

======= 15 ========
(?si)(?<!\S).{0,15}(?<!\S)\S*dog\S*(?!\S).{0,15}(?!\S)
The <strong>Dog</strong> is a real-pet
There's an un<strong>dog</strong>ging <strong>dog</strong> that only
un<strong>dog</strong>s non-<strong>dog</strong>s. It is one of
many kinds of <strong>dog</strong>s in the world.
a dangerous. <strong>Dog</strong>s are of
rough skin. <strong>Dog</strong>s are
and a tail. <strong>Dog</strong>s are trained to
animals. A <strong>dog</strong> is called
in the world. <strong>Dog</strong>gonit!
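Both runs above use searchEntireWord = false. For comparison, here is a self-contained sketch of the whole-word mode on the sentence from the question. The ExtractTexts body is copied from the C# code above; the wrapper class WholeWordDemo and the sample sentence are assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class WholeWordDemo
{
    // Same logic as ExtractTexts above, repeated so the sketch compiles on its own.
    public static List<string> ExtractTexts(string text, string searchTerm, int sizeOfResult, bool searchEntireWord)
    {
        var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm));
        if (searchEntireWord) {
            expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm));
        }
        return Regex.Matches(text, expression)
            .Cast<Match>()
            .Select(x => Regex.Replace(x.Value,
                searchEntireWord ?
                    string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) :
                    string.Format(@"(?i){0}", Regex.Escape(searchTerm)),
                "<strong>$&</strong>"))
            .ToList();
    }

    public static void Main()
    {
        // Whole-word mode: "dogs" is neither matched nor highlighted.
        foreach (var result in ExtractTexts("All dogs go to dog heaven.", "dog", 10, true))
            Console.WriteLine(result); // prints: go to <strong>dog</strong> heaven.
    }
}
```

Note that in whole-word mode the left context here does not reach back to "dogs", since taking it would mean either exceeding 10 characters or cutting the word.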
