c# - 如何以最快的方式计算文本文件中某个字符组合的出现次数?
问题描述
我想制作一种方法来计算 .txt 文件(C#)中一系列字符的出现次数。我在这里找到了一些相关的问题,这些问题有有效的答案。但是,某些情况会限制可能的解决方案:
- 该方法必须非常快,因为我必须在程序中使用它超过一百次。
- 文件中的文本过长,无法以字符串形式读取。
谢谢您的帮助。
解决方案
The method has to work quite fast, because I have to use it more hundred times in the program.
According to recent benchmarks, SequenceEqual of Span<T> tends to be the fastest way to compare array slices in .NET nowadays (except for unsafe or P/Invoke approaches).
The text in the file is overlong to be read in a string.
This issue can easily be tackled using FileStream or StreamReader.
In a nutshell, you need to read the file chunked: read a fixed size part from the file, look for occurences in it, read the next part, look for occurences, and so on. This can be coded without moving back the cursor, just the leftover of each part needs to be taken into account when dealing with the next part.
Here is my approach using FileStream and Span<T>:
public static int CountOccurences(Stream stream, string searchString, Encoding encoding = null, int bufferSize = 4096)
{
if (stream == null)
throw new ArgumentNullException(nameof(stream));
if (searchString == null)
throw new ArgumentNullException(nameof(searchString));
if (!stream.CanRead)
throw new ArgumentException("Stream must be readable.", nameof(stream));
if (bufferSize <= 0)
throw new ArgumentException("Buffer size must be a positive number.", nameof(bufferSize));
// detecting encoding
Span<byte> bom = stackalloc byte[4];
var actualLength = stream.Read(bom);
if (actualLength == 0)
return 0;
bom = bom.Slice(0, actualLength);
Encoding detectedEncoding;
if (bom.StartsWith(Encoding.UTF8.GetPreamble()))
detectedEncoding = Encoding.UTF8;
else if (bom.StartsWith(Encoding.UTF32.GetPreamble()))
detectedEncoding = Encoding.UTF32;
else if (bom.StartsWith(Encoding.Unicode.GetPreamble()))
detectedEncoding = Encoding.Unicode;
else if (bom.StartsWith(Encoding.BigEndianUnicode.GetPreamble()))
detectedEncoding = Encoding.BigEndianUnicode;
else
detectedEncoding = null;
if (detectedEncoding != null)
{
if (encoding == null)
encoding = detectedEncoding;
if (encoding == detectedEncoding)
bom = bom.Slice(detectedEncoding.GetPreamble().Length);
}
else if (encoding == null)
encoding = Encoding.ASCII;
// acquiring a buffer
ReadOnlySpan<byte> searchBytes = encoding.GetBytes(searchString);
bufferSize = Math.Max(Math.Max(bufferSize, searchBytes.Length), 128);
var bufferArray = ArrayPool<byte>.Shared.Rent(bufferSize);
try
{
var buffer = new Span<byte>(bufferArray, 0, bufferSize);
// looking for occurences
bom.CopyTo(buffer);
actualLength = bom.Length + stream.Read(buffer.Slice(bom.Length));
var occurrences = 0;
do
{
var index = 0;
var endIndex = actualLength - searchBytes.Length;
for (; index <= endIndex; index++)
if (buffer.Slice(index, searchBytes.Length).SequenceEqual(searchBytes))
occurrences++;
if (actualLength < buffer.Length)
break;
ReadOnlySpan<byte> leftover = buffer.Slice(index);
leftover.CopyTo(buffer);
actualLength = leftover.Length + stream.Read(buffer.Slice(leftover.Length));
}
while (true);
return occurrences;
}
finally { ArrayPool<byte>.Shared.Return(bufferArray); }
}
This code requires C# 7.2 to compile. You may have to include the System.Buffers and System.Memory NuGet packages, as well. If you use .NET Core version lower than 2.1 or another platform than .NET Core, you need to include this "polyfill", as well:
static class Compatibility
{
public static int Read(this Stream stream, Span<byte> buffer)
{
// copied over from corefx sources (https://github.com/dotnet/corefx/blob/master/src/Common/src/CoreLib/System/IO/Stream.cs)
byte[] sharedBuffer = ArrayPool<byte>.Shared.Rent(buffer.Length);
try
{
int numRead = stream.Read(sharedBuffer, 0, buffer.Length);
if ((uint)numRead > buffer.Length)
throw new IOException("Stream was too long.");
new Span<byte>(sharedBuffer, 0, numRead).CopyTo(buffer);
return numRead;
}
finally { ArrayPool<byte>.Shared.Return(sharedBuffer); }
}
}
Usage:
using (var fs = new FileStream(@"path-to-file", FileMode.Open, FileAccess.Read, FileShare.Read))
Console.WriteLine(CountOccurences(fs, "string to search"));
When you don't specify the encoding argument, the encoding will be auto-detected by examining the BOM of the file. If BOM is not present, ASCII encoding is assumed as a fallback.
推荐阅读
- android - App closes when item on bottom navigation view is selected to replace fragment
- python - 从网页抓取数据后无法生成一些自定义输出
- sql - SQL Server 2016:将 FOR XML 和 STUFF 用于多连接查询
- excel - 在 Excel 2019 中打开“管理数据建模”时出错
- python - 运行功能后数据框未更新
- elasticsearch - Kafka 连接器将主题键映射为 ElasticSearch 中的文档 ID
- python - 将向量/csr矩阵行乘以numpy / scipy中的“梯度下降”权重
- python - 图像未保存在数据库 django 中
- ios - 如何在 App Store 中临时隐藏 IOS App
- datetime - 谷歌脚本删除重复行,根据时间戳保持最新