How can I count the occurrences of a character sequence in a text file in the fastest way?

Problem description

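I want to write a method that counts the occurrences of a sequence of characters in a .txt file (C#). I found some related questions here that have valid answers. However, some circumstances restrict the possible solutions: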

Thank you for your help.

Tags: c#, string, character, text-files

Solution


"The method has to work quite fast, because I have to use it more than a hundred times in the program."

According to recent benchmarks, SequenceEqual of Span<T> tends to be the fastest way to compare array slices in .NET nowadays (except for unsafe or P/Invoke approaches).
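To make the slice comparison concrete, here is a minimal sketch. The names SliceComparison and MatchesAt are made up for this illustration; only Slice and SequenceEqual are actual span APIs. It assumes the same setup as the rest of this answer (C# 7.2 or later, with the System.Memory package where spans are not built in).

```csharp
using System;

static class SliceComparison
{
    // Returns true when `pattern` occurs in `data` starting at `offset`.
    // Slice creates a view over the same memory, so no bytes are copied.
    public static bool MatchesAt(ReadOnlySpan<byte> data, int offset, ReadOnlySpan<byte> pattern)
    {
        // Reject offsets at which the pattern would not fit into the data.
        if (offset < 0 || offset > data.Length - pattern.Length)
            return false;

        return data.Slice(offset, pattern.Length).SequenceEqual(pattern);
    }
}
```

A byte[] converts implicitly to ReadOnlySpan<byte>, so the method can be called with plain arrays, for example SliceComparison.MatchesAt(data, 2, pattern).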

"The text in the file is too long to be read into a string."

This issue can easily be tackled using FileStream or StreamReader.

In a nutshell, you need to read the file in chunks: read a fixed-size part of the file, look for occurrences in it, read the next part, look for occurrences, and so on. This can be coded without moving the cursor back; only the leftover of each part needs to be taken into account when dealing with the next part.

Here is my approach using FileStream and Span<T>:

// Namespaces used below: System, System.Buffers, System.IO, System.Text
public static int CountOccurences(Stream stream, string searchString, Encoding encoding = null, int bufferSize = 4096)
{
    if (stream == null)
        throw new ArgumentNullException(nameof(stream));

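    if (searchString == null)
        throw new ArgumentNullException(nameof(searchString));

    // an empty search string would match at every position and break the leftover handling below
    if (searchString.Length == 0)
        throw new ArgumentException("Search string must not be empty.", nameof(searchString));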

    if (!stream.CanRead)
        throw new ArgumentException("Stream must be readable.", nameof(stream));

    if (bufferSize <= 0)
        throw new ArgumentException("Buffer size must be a positive number.", nameof(bufferSize));

    // detecting encoding
    Span<byte> bom = stackalloc byte[4];

    var actualLength = stream.Read(bom);
    if (actualLength == 0)
        return 0;

    bom = bom.Slice(0, actualLength);

    Encoding detectedEncoding;
    if (bom.StartsWith(Encoding.UTF8.GetPreamble()))
        detectedEncoding = Encoding.UTF8;
    else if (bom.StartsWith(Encoding.UTF32.GetPreamble()))
        detectedEncoding = Encoding.UTF32;
    else if (bom.StartsWith(Encoding.Unicode.GetPreamble()))
        detectedEncoding = Encoding.Unicode;
    else if (bom.StartsWith(Encoding.BigEndianUnicode.GetPreamble()))
        detectedEncoding = Encoding.BigEndianUnicode;
    else
        detectedEncoding = null;

    if (detectedEncoding != null)
    {
        if (encoding == null)
            encoding = detectedEncoding;

        if (encoding == detectedEncoding)
            bom = bom.Slice(detectedEncoding.GetPreamble().Length);
    }
    else if (encoding == null)
        encoding = Encoding.ASCII;

    // acquiring a buffer
    ReadOnlySpan<byte> searchBytes = encoding.GetBytes(searchString);

    bufferSize = Math.Max(Math.Max(bufferSize, searchBytes.Length), 128);

    var bufferArray = ArrayPool<byte>.Shared.Rent(bufferSize);
    try
    {
        var buffer = new Span<byte>(bufferArray, 0, bufferSize);

        // looking for occurrences
        bom.CopyTo(buffer);
        actualLength = bom.Length + stream.Read(buffer.Slice(bom.Length));
        var occurrences = 0;
        do
        {
            var index = 0;
            var endIndex = actualLength - searchBytes.Length;
            for (; index <= endIndex; index++)
                if (buffer.Slice(index, searchBytes.Length).SequenceEqual(searchBytes))
                    occurrences++;

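            // A read shorter than the buffer is treated as the end of the stream.
            // Stream.Read is allowed to return fewer bytes than requested before the end,
            // so for streams that do this the read should be repeated until the buffer is full.
            if (actualLength < buffer.Length)
                break;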

            ReadOnlySpan<byte> leftover = buffer.Slice(index);
            leftover.CopyTo(buffer);
            actualLength = leftover.Length + stream.Read(buffer.Slice(leftover.Length));
        }
        while (true);

        return occurrences;
    }
    finally { ArrayPool<byte>.Shared.Return(bufferArray); }
}

This code requires C# 7.2 to compile. You may also have to include the System.Buffers and System.Memory NuGet packages. If you target a .NET Core version lower than 2.1, or a platform other than .NET Core, you also need to include this "polyfill":

static class Compatibility
{
    public static int Read(this Stream stream, Span<byte> buffer)
    {
        // copied over from corefx sources (https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/IO/Stream.cs)
        byte[] sharedBuffer = ArrayPool<byte>.Shared.Rent(buffer.Length);
        try
        {
            int numRead = stream.Read(sharedBuffer, 0, buffer.Length);
            if ((uint)numRead > buffer.Length)
                throw new IOException("Stream was too long.");

            new Span<byte>(sharedBuffer, 0, numRead).CopyTo(buffer);
            return numRead;
        }
        finally { ArrayPool<byte>.Shared.Return(sharedBuffer); }
    }
}
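A related caveat about the Read calls: the loop in CountOccurences treats a read that is shorter than the buffer as the end of the stream, while Stream.Read is documented as free to return fewer bytes than requested even before the end is reached. Below is a hedged sketch of a helper that keeps reading until the requested number of bytes has arrived. StreamHelpers and FillBuffer are hypothetical names and not part of the code above; the sketch uses the byte[] overload of Read so that it works without the polyfill.

```csharp
using System;
using System.IO;

static class StreamHelpers
{
    // Hypothetical helper: reads until `count` bytes are stored in `buffer`
    // (starting at `offset`) or the stream ends. Returns the number of bytes read.
    public static int FillBuffer(Stream stream, byte[] buffer, int offset, int count)
    {
        int total = 0;
        while (total < count)
        {
            int read = stream.Read(buffer, offset + total, count - total);
            if (read == 0)
                break; // Read returns 0 only at the end of the stream

            total += read;
        }
        return total;
    }
}
```

With such a helper, a result smaller than count reliably means that the end of the stream has been reached, which is the condition the counting loop relies on.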

Usage:

using (var fs = new FileStream(@"path-to-file", FileMode.Open, FileAccess.Read, FileShare.Read))
    Console.WriteLine(CountOccurences(fs, "string to search"));

When you don't specify the encoding argument, the encoding is auto-detected by examining the BOM of the file. If no BOM is present, ASCII encoding is assumed as a fallback.
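The ASCII fallback matters as soon as the search string contains non-ASCII characters, because the search string is converted to bytes with the chosen encoding. The sketch below shows the difference; EncodingFallbackDemo and HexBytes are illustrative names that are not part of the code above.

```csharp
using System;
using System.Text;

static class EncodingFallbackDemo
{
    // Hex dump of the bytes that `encoding` produces for `text`.
    public static string HexBytes(Encoding encoding, string text)
        => BitConverter.ToString(encoding.GetBytes(text));
}
```

HexBytes(Encoding.ASCII, "\u00E9") yields "3F", because ASCII cannot represent the character é and substitutes a question mark, whereas HexBytes(Encoding.UTF8, "\u00E9") yields "C3-A9". For a UTF-8 file without a BOM, it is therefore safer to pass the encoding explicitly, for example CountOccurences(fs, "string to search", Encoding.UTF8).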

