vb.net: check whether a word exists in a string and act accordingly

Problem description

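I read a text file, strip all punctuation, and then read the result into a String(). I want to count the words, so I need the String() to hold two fields: the word and its frequency. Before adding a word, I count how many times it occurs in the text with Function CountMyWords. If the word is already in the String(), I don't want to add it again, only increase its frequency.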

Private Sub CreateWordList()

        Dim text As String = File.ReadAllText("C:\Users\Gebruiker\Downloads\shakespear.txt")

        text = Regex.Replace(text, "[^A-Za-z']+", " ")
        Dim words As String() = text.Split(New Char() {" "c})
        Dim i As Integer

        For Each word As String In words
            If Len(word) > 5 Then
                word = word.ToLower()
                'now check if the word already exists
                If words.Contains(word) = True Then

                End If
                i = CountMyWords(text, word)
                Console.WriteLine("{0}", word + " " + i.ToString)
            End If
        Next

    End Sub
    Private Function CountMyWords(input As String, phrase As String) As Integer

        Dim Occurrences As Integer = 0
        Dim intCursor As Integer = 0
        Do Until intCursor >= input.Length

            Dim strCheckThisString As String = Mid(LCase(input), intCursor + 1, (Len(input) - intCursor))
            Dim intPlaceOfPhrase As Integer = InStr(strCheckThisString, phrase)
            If intPlaceOfPhrase > 0 Then
                Occurrences += 1
                intCursor += (intPlaceOfPhrase + Len(phrase) - 1)
            Else
                intCursor = input.Length
            End If
        Loop
        CountMyWords = Occurrences

    End Function

Any ideas on how to do this?

Tags: string, vb.net

Solution


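Like Steve's answer, I suggest using a Dictionary, but you probably don't need the overhead of a class as the value in the dictionary: the word can be the key and its count the Integer value.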
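
As a rough illustration of that idea, here is a minimal standalone sketch that counts words held in memory with a Dictionary(Of String, Integer). The module name, the CountWords helper, and the sample words are made up for this example and are not part of the original answer; TryGetValue is used so that a missing word and an existing word follow the same code path.

```vbnet
Imports System
Imports System.Collections.Generic

Module DictionaryCountSketch

    ' Hypothetical helper (not from the original answer):
    ' returns how often each lower-cased word occurs in the input.
    Function CountWords(words As IEnumerable(Of String)) As Dictionary(Of String, Integer)
        Dim counts As New Dictionary(Of String, Integer)
        For Each w In words
            Dim key = w.ToLowerInvariant()
            Dim current As Integer = 0
            ' TryGetValue leaves current at 0 when the key is not present yet.
            counts.TryGetValue(key, current)
            ' Add the word on first sight, otherwise just increase its frequency.
            counts(key) = current + 1
        Next
        Return counts
    End Function

    Sub Main()
        ' Made-up sample data for illustration only.
        Dim sample = {"beauty", "should", "Beauty", "praise", "beauty"}
        Dim counts = CountWords(sample)

        Console.WriteLine("beauty " & counts("beauty"))   ' beauty 3
        Console.WriteLine("should " & counts("should"))   ' should 1
    End Sub

End Module
```

The word is the key and the frequency is the value, so no separate class or two-field array is needed.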

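Also, if you are working with a fairly large file, you can use the File.ReadLines method to process it one line at a time instead of reading the whole file into RAM.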
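
A small sketch of that pattern, assuming nothing beyond System.IO: File.ReadLines returns a lazily evaluated sequence, so the loop below only holds one line at a time. The temporary file and its three lines are invented so the example can run on its own.

```vbnet
Imports System
Imports System.IO

Module ReadLinesSketch

    Sub Main()
        ' Create a throwaway file so the sketch is self-contained (illustrative data).
        Dim tempFile = Path.GetTempFileName()
        File.WriteAllLines(tempFile, {"first line", "second line", "third line"})

        ' File.ReadLines streams the file; lines are read as the loop asks for them.
        Dim lineCount = 0
        For Each line In File.ReadLines(tempFile)
            lineCount += 1
        Next

        Console.WriteLine(lineCount)   ' 3

        File.Delete(tempFile)
    End Sub

End Module
```

By contrast, File.ReadAllText and File.ReadAllLines load the entire file before returning, which is fine for small inputs but wasteful for large ones.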

You can use a bit of LINQ to make the text processing more concise, like this:

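' System, System.Collections.Generic and System.Linq are usually imported
' at project level in VB.NET; they are listed here so the file stands alone.
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Text.RegularExpressions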

Module Module1

    Sub Main()
        ' using https://raw.githubusercontent.com/brunoklein99/deep-learning-notes/master/shakespeare.txt
        Dim src = "C:\temp\TheSonnets.txt"
        Dim wordsWithCounts As New Dictionary(Of String, Integer)

        For Each line In File.ReadLines(src)
            Dim text = Regex.Replace(line, "[^A-Za-z']+", " ")
            Dim words = text.Split({" "c}).
                Where(Function(s) s.Length > 5).
                Select(Function(t) t.ToLower())

            For Each w In words
                If wordsWithCounts.ContainsKey(w) Then
                    wordsWithCounts(w) += 1
                Else
                    wordsWithCounts.Add(w, 1)
                End If
            Next

        Next

        ' extracting some data as an example...
        Dim mostUsedFirst = wordsWithCounts.
            Where(Function(x) x.Value > 18).
            OrderByDescending(Function(y) y.Value)

        For Each w As KeyValuePair(Of String, Integer) In mostUsedFirst
            Console.WriteLine(w.Key & " " & w.Value)
        Next

        Console.ReadLine()

    End Sub

End Module

With the sample text, the output is:

beauty 52
should 44
though 33
praise 28
love's 26
nothing 19
better 19
