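c# - How to replace a string in a given text while ignoring spaces, carriage returns, or line feeds using C#
Problem description
I want to replace a specific string in a given text (it will be different every time, so not just the specific example given in this question), subject to one rule: the match must ignore space characters, carriage returns, and line feeds.
Is this possible?
Take the following HTML document as an example.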
<tr>
<td colspan="2" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; border-top-left-radius: 3px; border-top-right-radius: 3px;">
<b>
<a rel="nofollow" target="_blank" href="https://www.monstermmorpg.com"
style="color: rgb(6, 69, 173);
text-decoration-line: none; background: none;"
title="Calyrex (Pokémon)"><span style="color: rgb(0, 0, 0);">←</span></a>
</b>
</td>
</tr>
The goal is to replace the following string in the document above with something else, say AAA.
<td style="text-align: right;"><a rel="nofollow" target="_blank" href="https://www.monstermmorpg.com" style="color: rgb(6, 69, 173); text-decoration-line: none; background: none;" title="Calyrex (Pokémon)"><span style="color: rgb(0, 0, 0);">←</span></a></td>
The expected result is:
<tr>
<td colspan="2" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; border-top-left-radius: 3px; border-top-right-radius: 3px;">
<b>
AAA
</b>
</td>
</tr>
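What I have tried
I tried using HtmlAgilityPack; unfortunately it does not work in my case, because I am not trying to replace a single HTML node. I need to replace a part of the document that may or may not span multiple nodes, and I could not get HtmlAgilityPack to do that.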
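Solution
Since you say that traditional HTML parsing approaches do not seem to fit your use case, have you considered writing a manual parser for this specific use case?
I wrote a short but working example of what you might want to consider.
A few things to keep in mind, though: this implementation was written quickly, has no error handling, assumes the target string fits in memory, and misses critical edge cases. If you want to pursue this solution, you should invest time in filling those gaps.
The solution simply parses the whole document and writes non-target characters through as it goes (ignored characters included); when the target string is recognized, it writes the replacement string instead.
This may not be the best solution, and I encourage you to look for an off-the-shelf HTML parser that offers more functionality.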
// Requires: using System.Collections.Generic; using System.IO; using System.Linq;
public static void RemoveTargetString(TextReader Reader, TextWriter Writer, string TargetString, char[] CharactersToIgnore, string ReplacementString)
{
    HashSet<char> IgnoreCases = CharactersToIgnore.ToHashSet();
    // the buffer only needs to be as large as the target string
    // (TargetString must be non-empty and already stripped of the ignored characters)
    char[] buffer = new char[TargetString.Length];
    // number of target characters matched so far; also the next write position in the buffer
    int currentIndex = 0;
    while (Reader.Peek() > -1)
    {
        // read one character into the buffer at the current index
        if (Reader.Read(buffer, currentIndex, 1) != 0)
        {
            // the character just read (the most recent one in the buffer)
            ref char firstChar = ref buffer[currentIndex];
            // if the char is on the ignore list, write it straight through and continue;
            // the index is unchanged, so the next read overwrites this slot
            // (ignored chars seen during a partial match are therefore written
            // ahead of the buffered chars if that match later fails)
            if (IgnoreCases.Contains(firstChar))
            {
                Writer.Write(firstChar);
                continue;
            }
            // check whether the char is the next expected character of the target string
            if (firstChar == TargetString[currentIndex])
            {
                // if it is, don't write it yet; advance the index so the char is kept without backtracking
                currentIndex++;
                // if the entire target has been found, discard the buffer and write the replacement
                if (currentIndex == TargetString.Length)
                {
                    Writer.Write(ReplacementString);
                    // reset the index so the buffer is overwritten
                    currentIndex = 0;
                }
            }
            else
            {
                // the char that broke the match may itself be the start of the target,
                // so don't write it blindly
                // (only a one-character restart is checked, so a target with a repeating
                // prefix, e.g. "aab" inside "aaab", can still be missed)
                if (firstChar == TargetString[0])
                {
                    // flush the failed partial match, keep the current char as the first match
                    Writer.Write(buffer, 0, currentIndex);
                    buffer[0] = buffer[currentIndex];
                    currentIndex = 1;
                }
                else
                {
                    // the current char is neither the start of the target nor part of it:
                    // write the buffer from 0 through the current index
                    Writer.Write(buffer, 0, currentIndex + 1);
                    // reset the index and start searching for the start of the target again
                    currentIndex = 0;
                }
            }
        }
    }
    // if the input ended in the middle of a partial match, write the buffered characters to the output
    if (currentIndex > 0)
    {
        Writer.Write(buffer, 0, currentIndex);
    }
}
char[] IgnoreCharacters = new char[] { '\n', '\r', ' ' };
string target = "<td style=\"text - align: right; \">\n\r<a rel=\"nofollow\" target=\"_blank\" href=\"https://www.monstermmorpg.com\"\n\r style=\"color: rgb(6, 69, 173);\n\r text-decoration-line: none;\n\r background: none;\"\n\r title=\"Calyrex (Pokémon)\"><span style=\"color: rgb(0, 0, 0);\">←</span></a></td>";
StringReader reader = new($"<tr>\n\r<td colspan=\"2\" style=\"background - image: initial; background - position: initial; background - size: initial; background - repeat: initial; background - attachment: initial; background - origin: initial; background - clip: initial; border - top - left - radius: 3px; border - top - right - radius: 3px; \">\n\r<b>{target}</b>\n\r</td>\n\r</tr>");
foreach (char item in IgnoreCharacters)
{
    target = target.Replace(item.ToString(), "");
}
StringWriter writer = new();
RemoveTargetString(reader, writer, target, IgnoreCharacters, "AAA");
Console.WriteLine(writer.ToString());
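If you are not familiar with TextReader or TextWriter, these are the base classes for common IO functionality such as StreamReader and StreamWriter. You can use them to run the same search and replace over files, like so: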
char[] IgnoreCharacters = new char[] { '\n', '\r', ' ' };
string target = "Hello World";
string replacement = "Hello Globe";
using StreamReader reader = new("Test.txt");
using StreamWriter writer = new("Output.txt");
RemoveTargetString(reader, writer, target, IgnoreCharacters, replacement);
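When the whole input fits in memory as a string, the same whitespace-insensitive match can also be sketched with a regular expression instead of a hand-written parser. The helper below is hypothetical (the name `ReplaceIgnoringWhitespace` and the sample strings are not from the answer's code). It strips white space from the target, escapes each remaining character, and joins the pieces with `\s*`. Unlike `RemoveTargetString`, this sketch drops the white space inside the matched span, and it ignores all white space rather than a caller-supplied character list.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the answer's code.
static string ReplaceIgnoringWhitespace(string input, string target, string replacement)
{
    // Escape every non-white-space character of the target individually.
    var pieces = target
        .Where(c => !char.IsWhiteSpace(c))
        .Select(c => Regex.Escape(c.ToString()))
        .ToArray();
    if (pieces.Length == 0)
    {
        return input; // nothing to search for
    }
    // Allow any amount of white space between consecutive target characters.
    string pattern = string.Join(@"\s*", pieces);
    // A MatchEvaluator keeps '$' in the replacement from being read as a substitution.
    return Regex.Replace(input, pattern, _ => replacement);
}

// Made-up sample input: the anchor tag is split across lines in the document.
string html = "<b>\n<a href=\"x\"\n   title=\"y\">z</a>\n</b>";
string result = ReplaceIgnoringWhitespace(html, "<a href=\"x\" title=\"y\">z</a>", "AAA");
Console.WriteLine(result); // prints three lines: <b>, AAA, </b>
```

This loads the entire text into a string, so it does not share the small memory footprint of the streaming approach; it is a trade-off to weigh, not a drop-in replacement.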
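Edit:
Fixed an issue where, if the target was being recognized but the match then failed, a single character was not written to the output stream, causing lossy transcription. Created test cases for common edge cases.
For those interested in the performance of this solution: processing a 1,278,518,583-byte (1.19 GB) text file takes about 35 seconds and uses 9 MB of memory. If you need additional performance, consider replacing IgnoreCases.Contains(firstChar) with Char.IsWhiteSpace(firstChar), which is about 33% faster.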
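One thing to keep in mind with that swap, shown here as a small sketch: `char.IsWhiteSpace` recognizes more characters than the three-element ignore list used in the usage example (for instance tab and the no-break space U+00A0), so the set of ignored characters changes along with the speed. The sample characters below are chosen purely for illustration.

```csharp
using System;
using System.Collections.Generic;

// The explicit ignore list from the usage example.
var ignoreSet = new HashSet<char> { '\n', '\r', ' ' };

// Sample characters picked for illustration only.
char[] samples = { ' ', '\n', '\t', '\u00A0', 'a' };

foreach (char c in samples)
{
    bool inSet = ignoreSet.Contains(c);
    bool isWhiteSpace = char.IsWhiteSpace(c);
    // Tab and U+00A0 are white space but are not in the explicit list.
    Console.WriteLine($"U+{(int)c:X4}  in list: {inSet,-5}  IsWhiteSpace: {isWhiteSpace}");
}
```

If the input may contain tabs or other white space that must be matched literally, the explicit list is the safer choice.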
static char[] IgnoreCharacters = new char[] { '\n', '\r', ' ', '\t' };
[Theory]
[InlineData("1234", "<div>1234</div>", "<div></div>")]
[InlineData("1234", "<div>1\n\r2\t3\n4\r</div>", "<div>\n\r\t\n\r</div>")]
[InlineData("1234", "\n\r\t1\n\r\t2\r\n\t3\n\t\r4", "\n\r\t\n\r\t\r\n\t\n\t\r")]
[InlineData("1234", "1 2 3 4", " ")]
[InlineData("1234", " \n\r\t1 \n\r\t2 \n\r\t3 \n\r\t4 \n\r\t", " \n\r\t \n\r\t \n\r\t \n\r\t \n\r\t")]
[InlineData("1234", "123412341234", "")]
[InlineData("1234", "4321", "4321")]
[InlineData("1234", "Hello", "Hello")]
[InlineData("1234", "", "")]
[InlineData("1234", "1/2/3/4", "1/2/3/4")]
[InlineData("1", "1111", "")]
[InlineData("1", "12131415", "2345")]
[InlineData("Abcde", "AbcdAbcde", "Abcd")]
[InlineData("Abcde", "AbcdAbcdeAbcd", "AbcdAbcd")]
[InlineData("12345", "121231234123451234123121", "1212312341234123121")]
public void CommonEdgeCases(string Target, string Input, string Expected)
{
    foreach (char item in IgnoreCharacters)
    {
        Target = Target.Replace(item.ToString(), "");
    }
    StringReader reader = new(Input);
    StringWriter writer = new();
    RemoveTargetString(reader, writer, Target, IgnoreCharacters, string.Empty);
    Assert.Equal(Expected, writer.ToString());
}
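Two of the remaining gaps can be seen by tracing `RemoveTargetString` by hand. Ignored characters read during a partial match are written immediately, so if the match later fails they end up ahead of the buffered characters (target `123` with input `1 2x` yields ` 12x`). A target whose prefix repeats can also be missed (`aab` inside `aaab`). The following is a sketch of one possible way to close those gaps, not the answer's implementation: the name `ReplaceIgnoring`, the pending buffer, and the pushback stack are all assumptions made for illustration. It favors simplicity over speed by rescanning after a failed candidate.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical variant for illustration. Assumes target is non-empty and
// contains none of the ignored characters.
static void ReplaceIgnoring(TextReader reader, TextWriter writer, string target,
                            ISet<char> ignore, string replacement)
{
    if (string.IsNullOrEmpty(target))
    {
        throw new ArgumentException("Target must not be empty.", nameof(target));
    }
    var pending = new StringBuilder();  // raw chars of the current candidate, ignored ones included
    var pushback = new Stack<char>();   // chars to rescan after a failed candidate
    int matched = 0;                    // target chars matched so far

    while (true)
    {
        int next = pushback.Count > 0 ? pushback.Pop() : reader.Read();
        if (next == -1) break;
        char c = (char)next;

        if (ignore.Contains(c))
        {
            // Outside a candidate, pass through; inside one, hold to preserve order.
            if (matched == 0) writer.Write(c); else pending.Append(c);
            continue;
        }
        if (c == target[matched])
        {
            pending.Append(c);
            matched++;
            if (matched == target.Length)
            {
                // Full match: keep the ignored chars, drop the target chars.
                foreach (char p in pending.ToString())
                {
                    if (ignore.Contains(p)) writer.Write(p);
                }
                writer.Write(replacement);
                pending.Clear();
                matched = 0;
            }
            continue;
        }
        if (matched == 0)
        {
            writer.Write(c);
            continue;
        }
        // Candidate failed: emit its first char and rescan everything after it.
        pushback.Push(c);
        for (int i = pending.Length - 1; i >= 1; i--) pushback.Push(pending[i]);
        writer.Write(pending[0]);
        pending.Clear();
        matched = 0;
    }
    // Input ended in the middle of a candidate: write it out unchanged.
    writer.Write(pending.ToString());
}

var output = new StringWriter();
ReplaceIgnoring(new StringReader("1 2x12 3"), output, "123", new HashSet<char> { ' ' }, "AAA");
Console.WriteLine(output.ToString()); // 1 2x AAA
```

As in the original, white space inside a successful match is kept and written before the replacement; the rescan makes the worst case slower than the single-pass version.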