sql-server - 使用 VB.net 使用正则表达式计算文本摘要中的单词组合
问题描述
我有一本医学术语词典,这些术语可以是以下单词的组合:[乳腺癌或前列腺癌]或单个单词,例如 [乳腺癌、前列腺癌和癌症],甚至 [胰腺 β 细胞肿瘤]。
我需要计算文章摘要中字典中的单词而不计算两次,所以如果我将乳腺癌计算为
- 当它们一起出现时,我不应该单独计算乳腺癌或癌症
- 我从 MS SQL 数据库中提取单词,在该数据库中添加了一个列,该列计算单词之间的空格,并按最大到最小排序,然后是单词。
我需要做的是在计算单词时将其替换为空白或“”,这样就无法单独计算。我不担心摘要中的文本可以随时更新。
我在 .Net WEB API 中的 VB.net 代码是:
While reader.Read '- -pulling words from database
word = reader("Word").ToString
Dim regex As Regex
If word.StartsWith("ER") Then
regex = New Regex("\s" + word + "\s", RegexOptions.None)
Else
regex = New Regex("\s" + word + "\s", RegexOptions.IgnoreCase)
End If
Dim regex As Regex = New Regex("\b(" + word + ")\b", RegexOptions.IgnoreCase)
Dim match As Match = regex.Match(abstractText)
If match.Success Then
TotalAbstractCount += regex.Matches(abstractText).Count
abstractCount += 1
abstractWords.Add(word)
abstractWordsCount.Add(word + " (" + count.ToString + ")")
' new code added to replace word/word string with blank
Dim regex2 = New Regex(word, RegexOptions.IgnoreCase)
abstractText = regex2.Replace(abstractText, " ")
End If
match = match.NextMatch()
End While
使用此代码是否可以将匹配更新为空字符串?还是我需要建立一个循环?
更新:我刚刚添加了 regex2 的新代码,但是因为它在每个单词上调用一个新的正则表达式,它似乎减慢了整个过程。最终用户正在实时等待结果。整个过程我没有计时,但似乎从1-1.5秒变成了3-4秒。
此外,如果在 MS SQL 2016 服务器中有更快的方法来执行此操作,我对此持开放态度。
解决方案
这是我的(相对未经测试的)答案。
算法是:
- 获取短语
- 获取文本
- 清理文本将非单词字符的所有内容变成空格
- 循环短语,查找
" phrase "
(长度 8)并用空格替换" "
(长度 7) - 长度的变化是出现次数
Imports System.Text.RegularExpressions
Imports System.Text
Imports System.Collections.Generic
Imports System
Imports System.Linq
Public Module Module1
Public Sub Main()
Dim phrases() as String = { "brEast", "bREast canCer", "caNCer" }
Dim text as String = "Breast- cAncer is Cancer!! of .the breAst. we need to keep aBREAST of it as it is CANCERous. Breast Cancer is bad cancer"
Dim cleaner = new Regex("\W+")
'remove all non word characters, replacing them with a single space
Dim cleanText = cleaner.Replace(text, " ")
'put the text into a stringbuilder for much faster string manipulation
'add space at the start and end - spaces delimit words for us
Dim textSb as New StringBuilder(" " & cleanText.ToLower() & " ")
'something to hold the counts of phrases
Dim counts = New Dictionary(Of String, Integer)
'Sort phrases from long to short, prevents "breast" ruining "breast cancer"
Dim orderedPhrases = phrases.OrderByDescending(Function(p As String) p.Length)
For Each phrase as String in orderedPhrases
'capture the old length - we'll need this
Dim prevLen as Integer = textSb.Length
'replace all occurrences of the phrase in the text.
'tack a space onto either end of the phrase to find whole words only
'because the replacement str is 1 shorter than the find
'the count of replacements is simply the change in length
'also we need the replacement string to be spaces
'because we rely on spaces at the start and end of a
'find string to delimit a phrase. removing all spaces
'could break our logic. If we replace with nothing:
'"type 1 breast cancer cancer is bad" -> "type 1cancer is bad"
'then we cannot now find " cancer "
Dim findPhrase = " " & phrase.ToLower() & " "
Dim replPhrase = new String(" ", findPhrase.Length - 1)
textSb.Replace(findPhrase, replPhrase)
'store the count of occurrences of this phrase
counts(phrase) = prevLen - textSb.Length
Next phrase
'let's print our counts as proof it works
For Each key as String in counts.Keys
Console.Out.WriteLine(key & " count is " & counts(key))
Next key
End Sub
End Module
我没有大型数据集可供尝试,但我在 2.5 秒内运行了整个方法 100,000 次
https://dotnetfiddle.net/xxlpde
注意:如果您对大量文本执行此操作(例如,作为文本集合的循环),您可以对短语进行预排序:
Dim orderedPhrases = phrases.OrderByDescending(Function(p As String) p.Length).Select(Function(p as String) " " & s.ToLower() & " ").ToArray()
自然,然后findPhrase
就变成phrase
了,因为我们已经向它添加了空格并 ToLower()ed 它