How to replace a string in a given text with C# while ignoring spaces, carriage returns, and line feeds

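Problem description

I want to replace a specific string in a given text (it will be different every time, so it is not just the specific example given in this question), under the following rule: the match should ignore space characters, carriage returns, and line feeds.

Is this possible?

Take the following HTML document as an example.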

<tr>
   <td colspan="2" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; border-top-left-radius: 3px; border-top-right-radius: 3px;">
      <b>
         <a rel=\"nofollow\" target=\"_blank\" href=\"https://www.monstermmorpg.com\" 
         style=\"color: rgb(6, 69, 173); 
         text-decoration-line: none;    background: none;\" 
         title=\"Calyrex (Pokémon)\"><span style=\"color: rgb(0, 0, 0);\">&larr;</span></a>
      </b> 
   </td>
</tr>

The goal is to replace the following string in the document above with something else, say AAA.

<td style="text-align: right;"><a rel="nofollow" target="_blank" href="https://www.monstermmorpg.com" style="color: rgb(6, 69, 173); text-decoration-line: none; background: none;" title="Calyrex (Pokémon)"><span style="color: rgb(0, 0, 0);">&larr;</span></a></td>

The expected result would be:

<tr>
   <td colspan="2" style="background-image: initial; background-position: initial; background-size: initial; background-repeat: initial; background-attachment: initial; background-origin: initial; background-clip: initial; border-top-left-radius: 3px; border-top-right-radius: 3px;">
      <b>
         AAA
      </b> 
   </td>
</tr>

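What I have tried

I tried using HtmlAgilityPack, but unfortunately it does not work in my case, because I am not trying to replace a single HTML node. I need to replace a part of the document that may or may not span multiple nodes, and I could not get HtmlAgilityPack to do that.

Tags: c#, regex, replace

Solution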


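Since you say traditional HTML parsing approaches do not seem to fit your use case, have you considered writing a manual parser for this specific case?

I wrote a short but working example of what you might want to consider.

Keep a few things in mind, though: this implementation was written quickly, has no error handling, assumes the target string fits in memory, and misses critical edge cases. If you want to go with this solution, you should invest the time to fill in the gaps.

This solution simply parses the whole document one character at a time. Characters that are not part of the target are written through unchanged, characters on the ignore list are written through without interrupting a match in progress, and when the full target string is recognized the replacement string is written instead.

This may not be the best solution, and I encourage you to look for an off-the-shelf HTML parser that offers more functionality.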

using System.Collections.Generic;
using System.IO;
using System.Linq; // ToHashSet()

public static void RemoveTargetString(TextReader Reader, TextWriter Writer, string TargetString, char[] CharactersToIgnore, string ReplacementString)
{
    HashSet<char> IgnoreCases = CharactersToIgnore.ToHashSet();

    // our buffer need only be the size of the target string
    char[] buffer = new char[TargetString.Length];

    int currentIndex = 0;

    while (Reader.Peek() > -1)
    {
        // read one character to the end of the buffer marked by index
        if (Reader.Read(buffer, currentIndex, 1) != 0)
        {
            // reference the char that was just read (the last filled slot of the buffer)
            ref char firstChar = ref buffer[currentIndex];

            // if the char is on the ignore list, write it straight through and continue
            // don't change the index, so the next read overwrites this slot of the buffer
            if (IgnoreCases.Contains(firstChar))
            {
                // write the char and ignore
                Writer.Write(firstChar);
                continue;
            }

            // check to see if the char is in the right order as the target string
            if (firstChar == TargetString[currentIndex])
            {
                // if it is don't write the buffer, increment index so we keep the char without back tracking
                currentIndex++;

                // if we have found the entire string dump the buffer, write the replacement string
                if (currentIndex == TargetString.Length)
                {
                    // write replacement string instead
                    Writer.Write(ReplacementString);

                    // reset index so we overwrite the buffer
                    currentIndex = 0;
                }
            }
            else
            {
                // check to see if the target string is within something that starts with a partial piece of the target string
                // we should not implicitly assume the character we fail at isn't the start of the target as well
                // if it is we should avoid writing it
                if (firstChar == TargetString[0])
                {
                    Writer.Write(buffer, 0, currentIndex);

                    buffer[0] = buffer[currentIndex];

                    // reset index and start searching for start of target
                    currentIndex = 1;
                }
                else
                {
                    // since the char at the last position of the buffer wasn't
                    // either the start or within the target string
                    // write the buffer from 0 - last index
                    Writer.Write(buffer, 0, currentIndex + 1);

                    // reset index and start searching for start of target
                    currentIndex = 0;
                }
            }
        }
    }

    
    // if the input ends in the middle of a partial match, write the buffered characters to the output
    if (currentIndex > 0)
    {
        Writer.Write(buffer, 0, currentIndex);
    }
}

char[] IgnoreCharacters = new char[] { '\n', '\r', ' ' };

string target = "<td style=\"text - align: right; \">\n\r<a rel=\"nofollow\" target=\"_blank\" href=\"https://www.monstermmorpg.com\"\n\r style=\"color: rgb(6, 69, 173);\n\r text-decoration-line: none;\n\r background: none;\"\n\r title=\"Calyrex (Pokémon)\"><span style=\"color: rgb(0, 0, 0);\">&larr;</span></a></td>";

StringReader reader = new($"<tr>\n\r<td colspan=\"2\" style=\"background - image: initial; background - position: initial; background - size: initial; background - repeat: initial; background - attachment: initial; background - origin: initial; background - clip: initial; border - top - left - radius: 3px; border - top - right - radius: 3px; \">\n\r<b>{target}</b>\n\r</td>\n\r</tr>");

foreach (char item in IgnoreCharacters)
{
    target = target.Replace(item.ToString(), "");
}

StringWriter writer = new();

RemoveTargetString(reader, writer, target, IgnoreCharacters, "AAA");

Console.WriteLine(writer.ToString());

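If you are not familiar with TextReader or TextWriter, they are the base classes for common IO types such as StreamReader and StreamWriter. You can use them to run the same find-and-replace over a file, as shown below: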

char[] IgnoreCharacters = new char[] { '\n', '\r', ' ' };

string target = "Hello World";

string replacement = "Hello Globe";

using StreamReader reader = new("Test.txt");
using StreamWriter writer = new("Output.txt");

RemoveTargetString(reader, writer, target, IgnoreCharacters, replacement);

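Edit:
Fixed an issue where, if a target was being recognized but the match then failed, a single character was not written to the output stream, causing lossy transcription. Created test cases for common edge cases.

For those interested in the performance of this solution: processing a 1,278,518,583-byte (1.19 GB) text file took about 35 seconds and used 9 MB of memory. If you need additional performance, consider replacing IgnoreCases.Contains(firstChar) with Char.IsWhiteSpace(firstChar), which is about 33% faster.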
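Note that swapping the set lookup for Char.IsWhiteSpace is not purely a performance change, because the two checks do not accept the same characters. The following minimal sketch compares them on a few sample characters; the `samples` array is an illustrative choice and is not part of the original answer.

```csharp
using System;
using System.Collections.Generic;

// The explicit ignore list used in the first usage example of the answer.
HashSet<char> ignoreCases = new() { '\n', '\r', ' ' };

// Illustrative sample characters (an assumption for this sketch):
// line feed, carriage return, space, tab, non-breaking space, and a letter.
char[] samples = { '\n', '\r', ' ', '\t', '\u00A0', 'a' };

foreach (char c in samples)
{
    bool inSet = ignoreCases.Contains(c);   // true only for listed characters
    bool isWhite = char.IsWhiteSpace(c);    // true for any Unicode white space
    Console.WriteLine($"U+{(int)c:X4}: set={inSet}, IsWhiteSpace={isWhite}");
}
```

With Char.IsWhiteSpace, tabs and non-breaking spaces are also skipped during matching, so the CharactersToIgnore argument no longer controls what is ignored. Whether that is acceptable depends on the input.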

static char[] IgnoreCharacters = new char[] { '\n', '\r', ' ', '\t' };

[Theory]
[InlineData("1234", "<div>1234</div>", "<div></div>")]
[InlineData("1234", "<div>1\n\r2\t3\n4\r</div>", "<div>\n\r\t\n\r</div>")]
[InlineData("1234", "\n\r\t1\n\r\t2\r\n\t3\n\t\r4", "\n\r\t\n\r\t\r\n\t\n\t\r")]
[InlineData("1234", "1 2 3 4", "   ")]
[InlineData("1234", " \n\r\t1 \n\r\t2 \n\r\t3 \n\r\t4 \n\r\t", " \n\r\t \n\r\t \n\r\t \n\r\t \n\r\t")]
[InlineData("1234", "123412341234", "")]
[InlineData("1234", "4321", "4321")]
[InlineData("1234", "Hello", "Hello")]
[InlineData("1234", "", "")]
[InlineData("1234", "1/2/3/4", "1/2/3/4")]
[InlineData("1", "1111", "")]
[InlineData("1", "12131415", "2345")]
[InlineData("Abcde", "AbcdAbcde", "Abcd")]
[InlineData("Abcde", "AbcdAbcdeAbcd", "AbcdAbcd")]
[InlineData("12345", "121231234123451234123121", "1212312341234123121")]
public void CommonEdgeCases(string Target, string Input, string Expected)
{
    foreach (char item in IgnoreCharacters)
    {
        Target = Target.Replace(item.ToString(), "");
    }

    StringReader reader = new(Input);

    StringWriter writer = new();

    RemoveTargetString(reader, writer, Target, IgnoreCharacters, string.Empty);

    Assert.Equal(Expected, writer.ToString());
}
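Since the question is tagged regex, here is a different technique for inputs that fit in memory. It is not the streaming parser above. This is a minimal sketch that builds a pattern from the target's non-white-space characters and allows any run of white space between them. The helper names `BuildWhitespaceTolerantPattern` and `ReplaceIgnoringWhitespace` are hypothetical, and the sketch treats all white space as ignorable rather than using a custom ignore list.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: escape each non-white-space character of the target
// and join the pieces with \s*, so white space may appear between any two.
string BuildWhitespaceTolerantPattern(string target)
{
    var escaped = target
        .Where(c => !char.IsWhiteSpace(c))
        .Select(c => Regex.Escape(c.ToString()));
    return string.Join(@"\s*", escaped);
}

// Hypothetical helper: replace every white-space-tolerant match of target.
string ReplaceIgnoringWhitespace(string input, string target, string replacement)
{
    string pattern = BuildWhitespaceTolerantPattern(target);
    if (pattern.Length == 0)
    {
        return input; // nothing to search for
    }

    // A MatchEvaluator keeps the replacement literal (no $-substitution).
    return Regex.Replace(input, pattern, _ => replacement);
}

// Usage with a small, made-up document.
string html = "<b>\n   <a href=\"x\"\n      title=\"y\">z</a>\n</b>";
string result = ReplaceIgnoringWhitespace(html, "<a href=\"x\" title=\"y\">z</a>", "AAA");
Console.WriteLine(result);
```

Unlike RemoveTargetString, this variant removes the white space inside the matched span along with the match itself, which is closer to the expected result shown in the question. It does need the whole input as a single string.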
