首页 > 解决方案 > 使用 VB.net 使用正则表达式计算文本摘要中的单词组合

问题描述

我有一本医学术语词典,这些术语可以是以下单词的组合:[乳腺癌前列腺癌]或单个单词,例如 [乳腺癌前列腺癌癌症],甚至 [胰腺 β 细胞肿瘤]。

我需要计算文章摘要中字典中的单词而不计算两次,所以如果我将乳腺癌计算为

  1. 当它们一起出现时,我不应该单独计算乳腺癌癌症
  2. 我从 MS SQL 数据库中提取单词,在该数据库中添加了一个列,该列计算单词之间的空格,并按最大到最小排序,然后是单词。

我需要做的是在计算单词时将其替换为空白或“”,这样就无法单独计算。我不担心摘要中的文本可以随时更新。

我在 .Net WEB API 中的 VB.net 代码是:

While reader.Read '- -pulling words from database

    word = reader("Word").ToString

    Dim regex As Regex

    If word.StartsWith("ER") Then
        regex = New Regex("\s" + word + "\s", RegexOptions.None)
    Else
        regex = New Regex("\s" + word + "\s", RegexOptions.IgnoreCase)
    End If

    Dim regex As Regex = New Regex("\b(" + word + ")\b", RegexOptions.IgnoreCase)
    Dim match As Match = regex.Match(abstractText)
    If match.Success Then
        TotalAbstractCount += regex.Matches(abstractText).Count
        abstractCount += 1
        abstractWords.Add(word)
        abstractWordsCount.Add(word + " (" + count.ToString + ")")
        ' new code added to replace word/word string with blank
        Dim regex2 = New Regex(word, RegexOptions.IgnoreCase)
        abstractText = regex2.Replace(abstractText, " ")

    End If
    match = match.NextMatch()

End While

使用此代码是否可以将匹配更新为空字符串?还是我需要建立一个循环?

更新:我刚刚添加了 regex2 的新代码,但是因为它在每个单词上调用一个新的正则表达式,它似乎减慢了整个过程。最终用户正在实时等待结果。整个过程我没有计时,但似乎从1-1.5秒变成了3-4秒。

此外,如果在 MS SQL 2016 服务器中有更快的方法来执行此操作,我对此持开放态度。

标签: sql-serverregexvb.net

解决方案


这是我的(相对未经测试的)答案。

算法是:

  • 获取短语
  • 获取文本
  • 清理文本将非单词字符的所有内容变成空格
  • 循环短语,查找" phrase "(长度 8)并用空格替换" "(长度 7) - 长度的变化是出现次数
Imports System.Text.RegularExpressions
Imports System.Text
Imports System.Collections.Generic
Imports System
Imports System.Linq

Public Module Module1
    Public Sub Main()

        Dim phrases() as String  = { "brEast", "bREast canCer", "caNCer" }

        Dim text as String = "Breast- cAncer is Cancer!! of .the breAst. we need to keep aBREAST of it as it is CANCERous. Breast Cancer is bad cancer"

        Dim cleaner  = new Regex("\W+")

        'remove all non word characters, replacing them with a single space
        Dim cleanText = cleaner.Replace(text, " ")

        'put the text into a stringbuilder for much faster string manipulation
        'add space at the start and end - spaces delimit words for us
        Dim textSb as New StringBuilder(" " & cleanText.ToLower() & " ")

        'something to hold the counts of phrases
        Dim counts = New Dictionary(Of String, Integer)

        'Sort phrases from long to short, prevents "breast" ruining "breast cancer"
        Dim orderedPhrases = phrases.OrderByDescending(Function(p As String) p.Length)

        For Each phrase as String in orderedPhrases

            'capture the old length - we'll need this
            Dim prevLen as Integer = textSb.Length

            'replace all occurrences of the phrase in the text.
            'tack a space onto either end of the phrase to find whole words only

            'because the replacement str is 1 shorter than the find
            'the count of replacements is simply the change in length

            'also we need the replacement string to be spaces
            'because we rely on spaces at the start and end of a
            'find string to delimit a phrase. removing all spaces
            'could break our logic. If we replace with nothing:
            '"type 1 breast cancer cancer is bad" -> "type 1cancer is bad"
            'then we cannot now find " cancer "

            Dim findPhrase = " " & phrase.ToLower() & " "
            Dim replPhrase = new String(" ", findPhrase.Length - 1)
            textSb.Replace(findPhrase, replPhrase)

            'store the count of occurrences of this phrase
            counts(phrase) = prevLen - textSb.Length

        Next phrase

        'let's print our counts as proof it works
        For Each key as String in counts.Keys
            Console.Out.WriteLine(key  & " count is " & counts(key))
            Next key


    End Sub
End Module

我没有大型数据集可供尝试,但我在 2.5 秒内运行了整个方法 100,000 次

https://dotnetfiddle.net/xxlpde

注意:如果您对大量文本执行此操作(例如,作为文本集合的循环),您可以对短语进行预排序:

Dim orderedPhrases = phrases.OrderByDescending(Function(p As String) p.Length).Select(Function(p as String) " " & s.ToLower() & " ").ToArray()

自然,然后findPhrase就变成phrase了,因为我们已经向它添加了空格并 ToLower()ed 它


推荐阅读