Finding millions of items in a large text file

Question

How can I check whether a large string (over 2 MB) contains any item from a list?

I tried this:

Dim Lit As New List(Of String)
For x As Integer = 0 To 20000
    ' Convert explicitly so the code also compiles with Option Strict On.
    Lit.Add(x.ToString())
Next
If Lit.Any(Function(y) mytext.IndexOf(y, StringComparison.InvariantCulture) >= 0) Then
    ' Code
End If

But this takes 10 seconds. How can I speed it up?

Tags: vb.net

Answer


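This will be faster. Lit is a hash set of the strings to search for in mytext. The mytext string is scanned only once, starting at index 0. At each index, a substring is extracted from mytext for every distinct length that occurs among the search strings, and each substring is looked up in the hash set. Note that the hash set compares strings ordinally (character by character), not with InvariantCulture rules.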

Dim Lit As New HashSet(Of String)
For x As Integer = 0 To 20000
    Lit.Add(x.ToString())
Next
' Collect the distinct lengths of the Lit strings, longest first.
Dim lengths As New HashSet(Of Integer)
For Each s As String In Lit
    lengths.Add(s.Length)
Next
Dim counts As List(Of Integer) = lengths.OrderByDescending(Function(n) n).ToList()
' Scan mytext once from index 0. At each index, extract a substring of each
' possible length and check whether it is in the Lit hash set.
For i As Integer = 0 To mytext.Length - 1
    For Each c As Integer In counts
        ' Skip lengths that would run past the end of mytext, so that short
        ' items near the end of the text are still checked.
        If i + c > mytext.Length Then Continue For
        Dim search As String = mytext.Substring(i, c)
        If Lit.Contains(search) Then
            ' Found search in mytext at index i.
        End If
    Next
Next
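
If only a yes/no result is needed, as in the question, the same scan can stop at the first hit. The following is a minimal sketch of that idea, not a tuned implementation. The module name ContainsAnyDemo and the helper ContainsAnyItem are hypothetical names chosen for illustration, and the sketch assumes that ordinal (exact character) matching is acceptable in place of the InvariantCulture comparison used in the question.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module ContainsAnyDemo
    ' Hypothetical helper: returns True if text contains at least one string
    ' from items. Matching is ordinal, which is an assumption of this sketch.
    Function ContainsAnyItem(text As String, items As IEnumerable(Of String)) As Boolean
        ' Null and empty items are ignored; they cannot be meaningfully searched for.
        Dim lookup As New HashSet(Of String)(
            items.Where(Function(s) Not String.IsNullOrEmpty(s)),
            StringComparer.Ordinal)
        ' Only substrings with one of these lengths can match an item.
        Dim lengths As List(Of Integer) =
            lookup.Select(Function(s) s.Length).Distinct().ToList()

        For i As Integer = 0 To text.Length - 1
            For Each n As Integer In lengths
                ' Skip lengths that would run past the end of text.
                If i + n <= text.Length AndAlso
                   lookup.Contains(text.Substring(i, n)) Then
                    Return True ' stop at the first match
                End If
            Next
        Next
        Return False
    End Function

    Sub Main()
        ' The same items as in the question: "0" through "20000".
        Dim items As List(Of String) =
            Enumerable.Range(0, 20001).Select(Function(n) n.ToString()).ToList()

        Console.WriteLine(ContainsAnyItem("no digits here", items))    ' False
        Console.WriteLine(ContainsAnyItem("order number 1234", items)) ' True

        ' Cross-check against a plain ordinal IndexOf scan over every item.
        Dim sample As String = "ends with 7"
        Dim slow As Boolean =
            items.Any(Function(y) sample.IndexOf(y, StringComparison.Ordinal) >= 0)
        Console.WriteLine(ContainsAnyItem(sample, items) = slow)       ' True
    End Sub
End Module
```

The work per text position depends on the number of distinct item lengths (five for the numbers 0 to 20000), not on the number of items, which is why this approach scales better than calling IndexOf once per item. If culture-sensitive matching is actually required, this sketch does not apply as written.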
