Extracting a string between two similar (or different) strings in C as fast as possible

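Question

I wrote a program in C that finds two similar or different strings and extracts the string between them. This kind of program has many uses, and it is usually run over a lot of data, so it needs to be fast. I would like tips on making this program as fast and efficient as possible.

I am looking for suggestions that won't make me resort to heavy libraries (such as regex).

The code must:

Below is my code. I am new to C, coming from C++, so I could probably use some advice, especially regarding the efficient/correct use of malloc: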

fast_strbetween.c

/*
   Compile with:
   gcc -Wall -O3 fast_strbetween.c -o fast_strbetween
*/

#include <stdio.h>   // printf
#include <stdlib.h>  // malloc

// inline function if it pleases the compiler gods
inline size_t fast_strlen(char *str)
{
    int i;   // Cannot return 'i' if inside for loop
    for(i = 0; str[i] != '\0'; ++i);

    return i;
}

char *fast_strbetween(char *str, char *str1, char *str2)
{
    // size_t segfaults when incorrect length strings are entered (due to going below 0), so use int instead for increased robustness
    int str0len    = fast_strlen(str);
    int str1len    = fast_strlen(str1);
    int str1pos    = 0;
    int charsfound = 0;

    // Find str1
    do {
        charsfound = 0;

        while (str1[charsfound] == str[str1pos + charsfound])
            ++charsfound; 

    } while (++str1pos < str0len - str1len && charsfound < str1len);

    // '++str1pos' increments past by 1: needs to be set back by one
    --str1pos;

    // Whole string not found or logical impossibility
    if (charsfound < str1len)
        return NULL;

    /* Start searching 2 characters after last character found in str1. This will ensure that there will be space, and logical possibility, for the extracted text to exist or not, and allow immediate bail if the latter case; str1 cannot possibly have anything between it if str2 is right next to it!

       Example:

       str      = 'aa'
       str1     = 'a'
       str2     = 'a'
       returned = '' (should be NULL)

       Without such preventative, str1 and str2 would be found and '' would be returned, not NULL. This also saves 1 do/while loop, one check pertaining to returning null, and two additional calculations:

       Example, if you didn't add +1 str2pos, you would need to change the code to:

       if (charsfound < str2len || str2pos - str1pos - str1len < 1)
           return NULL;

       It also allows for text to be found between three similar strings—what??? I can feel my brain going fuzzy!

       Let this example explain:

       str    = 'aaa'
       str1   = 'a'
       str2   = 'a'
       result = '' (should be 'a')

       Without the aforementioned preventative, the returned string is '', not 'a'; the program takes the first 'a' for str1 and the second 'a' for str2, and tries to return what is between them (nothing).

    */
    int str2pos = str1pos + str1len + 1; // the '1' added to str2pos
    int str2len = fast_strlen(str2);

    // Find str2
    do {
        charsfound = 0;

        while (str2[charsfound] == str[str2pos + charsfound])
            ++charsfound;

    } while (++str2pos < str0len - str2len + 1 && charsfound < str2len);

    // Deincrement due to '++str2pos' over-increment
    --str2pos;

    if (charsfound < str2len)
        return NULL;

    // Only allocate what is needed
    char *strbetween = (char *)malloc(sizeof(char) * str2pos - str1pos - str1len);

    unsigned int tmp = 0;
    for (unsigned int i = str1pos + str1len; i < str2pos; i++)
        strbetween[tmp++] = str[i];

    return strbetween;
}

int main() {
    char str[30] =  { "abaabbbaaaabbabbbaaabbb" };
    char str1[10] = { "aaa" };
    char str2[10] = { "bbb" };

    //Result should be: 'abba' 

    printf("The string between is: \'%s\'\n", fast_strbetween(str, str1, str2));

    // free malloc as we go
    for (int i = 10000000; --i;)
        free(fast_strbetween(str, str1, str2));

    return 0;
}

To have some way of measuring progress, I timed the code above (extracting a small string 10000000 times):

$ time fast_strbetween                                                  
The string between is: 'abba'
    0m11.09s real     0m11.09s user     0m00.00s system

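According to the top command (Linux), the process used 99.3 - 100% CPU. Memory used while running: 3.7Mb. Executable size: 8336 bytes.

Run on a Raspberry Pi 3B+ (4 x 1.4GHz, Arm 6).

If anyone would like to offer code, tips, pointers... I would appreciate it. I will also implement the changes and post timed results for your trouble.

Oh, and one thing I learned is to always free what you malloc; I ran the code above (with the extra loop) right before posting. My computer's RAM filled up and the computer froze. Luckily, Stack made a backup draft! Lesson learned!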

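* EDIT *

Here is the revised code, using chqrlie's suggestions wherever possible. I added extra checks for the end of the string, which ended up costing about a second on the test phrase, but the function can now bail out very quickly if the first string isn't found. Hopefully, passing null or illogical strings no longer causes errors. There are lots of notes in the code, where they can be better understood. If I've left anything out or done something incorrectly, please let me know; it was not intentional.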

fast_strbetween2.c

/*
   Compile with:
   gcc -Wall -O3 fast_strbetween2.c -o fast_strbetween2

   Corrections and additions courtesy of:
   https://stackoverflow.com/questions/55308295/extracting-a-string-between-two-similar-or-different-strings-in-c-as-fast-as-p

*/

#include<stdio.h>  // printf
#include<stdlib.h> // malloc, free

// Strings now set to 'const'
char * fast_strbetween(const char *str, const char *str1, const char *str2)
{
    // string size will now be calculated by the characters picked up
    size_t str1pos    = 0;
    size_t str1chars;

    // Find str1
    do{

        str1chars = 0;

        // Will the do/while str1 check for '\0' suffice?
        // I haven't seen any issues yet, but not sure.
        while(str1[str1chars] == str[str1pos + str1chars]  && str1[str1chars] != '\0')
        {
            //printf("Found str1 char: %i num: %i pos: %i\n", str1[str1chars], str1chars + 1, str1pos);

            ++str1chars;
        }

        // Incrementing whilst not in conditional expression tested faster
        ++str1pos;

    /* There are two checks for "str1[str1chars] != '\0'". Trying to find
       another efficient way to do it in one. */
    }while(str[str1pos] != '\0' && str1[str1chars] != '\0');

    --str1pos;

    //For testing:
    //printf("str1pos: %i str1chars: %i\n", str1pos, str1chars);

    // exit if no chars were found or if didn't reach end of str1
    if(!str1chars || str1[str1chars] != '\0')
    {
        //printf("Bailing from str1 result\n");
        return '\0';
    }

    /* Got rid of the '+1' code which didn't allow for '' returns.
       I agree with your logic of <tag></tag> returning ''. */
    size_t str2pos = str1pos + str1chars;
    size_t str2chars;

    //printf("Starting pos for str2: %i\n", str1pos + str1chars);

    // Find str2
    do{

        str2chars = 0;

        while(str2[str2chars] == str[str2pos + str2chars] && str2[str2chars] != '\0')
        {
            //printf("Found str2 char: %i num: %i pos: %i \n", str2[str2chars], str2chars + 1, str2pos);
            ++str2chars;
        }

        ++str2pos;

    }while(str[str2pos] != '\0' && str2[str2chars] != '\0');

    --str2pos;

    //For testing:
    //printf("str2pos: %i str2chars: %i\n", str2pos, str2chars);

    if(!str2chars || str2[str2chars] != '\0')
    {
        //printf("Bailing from str2 result!\n");
        return '\0';
    }

    /* Trying to allocate strbetween with malloc. Is this correct? */
    char * strbetween = malloc(2);

    // Check if malloc succeeded:
    if (strbetween == '\0') return '\0';

    size_t tmp = 0;

    // Grab and store the string between!
    for(size_t i = str1pos + str1chars; i < str2pos; ++i)
    {
        strbetween[tmp] = str[i];
        ++tmp;
    }

    return strbetween;
}

int main() {

    char str[30]  = { "abaabbbaaaabbabbbaaabbb" };
    char str1[10] = { "aaa" };
    char str2[10] = { "bbb" };

    printf("Searching \'%s\' for \'%s\' and \'%s\'\n", str, str1, str2);
    printf("           0123456789\n\n"); // Easily see the elements
    printf("The word between is: \'%s\'\n", fast_strbetween(str, str1, str2));

    for(int i = 10000000; --i;)
        free(fast_strbetween(str, str1, str2));

    return 0;
}

** Results **

$ time fast_strbetween2                                                 
Searching 'abaabbbaaaabbabbbaaabbb' for 'aaa' and 'bbb'
           0123456789

The word between is: 'abba'
    0m10.93s real     0m10.93s user     0m00.00s system

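According to the top command (Linux), the process used 99.0 - 100% CPU. Memory used while running: 1.8Mb. Executable size: 8336 bytes. Run on a Raspberry Pi 3B+ (4 x 1.4GHz, Arm 6).

chqrlie's answer

I understand that this is just example code showing proper programming practice. Nonetheless, it can serve as a decent control in testing.

Please note that I do not know how to free the malloc'd memory in your code, so this is NOT a fair test. As a result, RAM usage builds up, taking 130Mb+ for the process alone. I was still able to run the test for the full 10000000 loops. I will say that I tried deallocating the way I would in my own code (by bringing the body of simple_strbetween down into main and deallocating with free(strndup(p, q - p));), and the results weren't much different from not deallocating.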

** simple_strbetween.c **

/*
   Compile with:
   gcc -Wall -O3 simple_strbetween.c -o simple_strbetween

   Courtesy of:
   https://stackoverflow.com/questions/55308295/extracting-a-string-between-two-similar-or-different-strings-in-c-as-fast-as-p

*/

#include<string.h>
#include<stdio.h>

char *simple_strbetween(const char *str, const char *str1, const char *str2) {

    const char *q;
    const char *p = strstr(str, str1);

    if (p) {
        p += strlen(str1);
        q = *str2 ? strstr(p, str2) : p + strlen(p);
        if (q)
            return strndup(p, q - p);
    }

    return NULL;
}

int main() {

    char str[30] =  { "abaabbbaaaabbabbbaaabbb" };
    char str1[10] = { "aaa" };
    char str2[10] = { "bbb" };

    printf("Searching \'%s\' for \'%s\' and \'%s\'\n", str, str1, str2);
    printf("           0123456789\n\n"); // Easily see the elements
    printf("The word between is: \'%s\'\n", simple_strbetween(str, str1, str2));

    for(int i = 10000000; --i;)
        simple_strbetween(str, str1, str2);

    return 0;
}

$ time simple_strbetween                                                
Searching 'abaabbbaaaabbabbbaaabbb' for 'aaa' and 'bbb'
           0123456789

The word between is: 'abba'
    0m19.68s real     0m19.34s user     0m00.32s system

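According to the top command (Linux), the process used 100% CPU. Memory used while running: 130Mb (leaking, due to my lack of knowledge). Executable size: 8380 bytes. Run on a Raspberry Pi 3B+ (4 x 1.4GHz, Arm 6).

Results of the above code run with this alternate strndup: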

char *alt_strndup(const char *s, size_t n)
{   
    size_t i;
    char *p; 
    for (i = 0; i < n && s[i] != '\0'; i++)
        continue;
    p = malloc(i + 1);
    if (p != NULL) { 
        memcpy(p, s, i);
        p[i] = '\0';
    }
    return p;
}

$ time simple_strbetween                                                
Searching 'abaabbbaaaabbabbbaaabbb' for 'aaa' and 'bbb'
           0123456789

The word between is: 'abba'
    0m20.99s real     0m20.54s user     0m00.44s system

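I kindly ask that no one pass judgment on the results until the code is run properly. I will revise the results as soon as it is figured out.

* EDIT *

I was able to cut the time by over 25% (11.93 seconds vs. 8.7 seconds). This was done by advancing positions with pointers rather than size_t indexes. Collecting the return string while checking for the last string is likely what made the biggest difference. I feel there is still plenty of room for improvement. A big loss comes from having to free the malloc'd memory. If there is a better way, I'd like to know.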

fast_strbetween3.c:

/*

 gcc -Wall -O3 fast_strbetween.c -o fast_strbetween

*/

#include<stdio.h>  // printf
#include<stdlib.h> // malloc, free

char * fast_strbetween(const char *str, const char *str1, const char *str2)
{
    const char *sbegin = &str1[0];    // String beginning
    const char *spos;

    // Find str1
    do{

        spos = str;
        str1 = sbegin;

        while(*spos == *str1 && *str1)
        {
            ++spos;
            ++str1;
        }

        ++str;

    }while(*str1 && *spos); 

    // Nothing found if spos hasn't advanced
    if (spos == str)
        return NULL;

    char *strbetween = malloc(1);
    if (!strbetween)
        return '\0';

    str = spos;
    int i = 0;
    //char *p = &strbetween[0];   // Alt. for advancing strbetween (slower) 
    sbegin = &str2[0];     // Recycle sbegin

    // Find str2
    do{

        str2 = sbegin;
        spos = str;

        while(*spos == *str2 && *str2)
        {
            ++str2;
            ++spos;
        }

        //*p = *str;
        //++p;

        strbetween[i] = *str;
        ++str;
        ++i;

    }while(*str2 && *spos);

    if (spos == str)
        return NULL;

    //*--p = '\0';

    strbetween[i - 1] = '\0';

    return strbetween;
}

int main() {

    char s[100]  = "abaabbbaaaabbabbbaaabbb";
    char s1[100] = "aaa";
    char s2[100] = "bbb";

    printf("\nString: \'%s\'\n", fast_strbetween(s, s1, s2));

    for(int i = 10000000; --i; )
      free(fast_strbetween(s, s1, s2));

    return 0;
  }

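String: 'abba'
    0m08.70s real     0m08.67s user     0m00.01s system

According to the top command (Linux), the process used 99.0 - 100% CPU. Memory used while running: 1.8Mb. Executable size: 8336 bytes. Run on a Raspberry Pi 3B+ (4 x 1.4GHz, Arm 6).

* EDIT *

This doesn't really count, since it doesn't 'return' a value and therefore goes against my own rules, but it does pass in a variable, which is changed and brought back to main. It runs with 1 library and takes 3.6 seconds. Getting rid of malloc was the key.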

/*

 gcc -Wall -O3 fast_strbetween.c -o fast_strbetween

*/

#include<stdio.h>  // printf

unsigned int fast_strbetween(const char *str, const char *str1, const char *str2, char *strbetween)
{
    const char *sbegin = &str1[0];    // String beginning
    const char *spos;

    // Find str1
    do{

        spos = str;
        str1 = sbegin;

        while(*spos == *str1 && *str1)
        {
            ++spos;
            ++str1;
        }

        ++str;

    }while(*str1 && *spos); 

    // Nothing found if spos hasn't advanced
    if (spos == str)
    {
        strbetween[0] = '\0';
        return 0;
    }

    str = spos;
    sbegin = &str2[0];     // Recycle sbegin

    // Find str2
    do{

        str2 = sbegin;
        spos = str;

        while(*spos == *str2 && *str2)
        {
            ++str2;
            ++spos;
        }

        *strbetween = *str;
        ++strbetween;
        ++str;

    }while(*str2 && *spos);

    if (spos == str)
    {
        strbetween[0] = '\0';
        return 0;
    }

    *--strbetween = '\0';

    return 1;  // Successful (found text)
}

int main() {

    char s[100]  = "abaabbbaaaabbabbbaaabbb";
    char s1[100] = "aaa";
    char s2[100] = "bbb";
    char sret[100];

    fast_strbetween(s, s1, s2, sret);
    printf("String: %s\n", sret);

    for(int i = 10000000; --i; )
      fast_strbetween(s, s1, s2, sret);

    return 0;
}

Tags: c, string, find

Solution

Your code has multiple problems and is probably not as efficient as it should be:

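  • You use the types int and unsigned int to index into the strings. These types may have a smaller range than size_t. You should revise your code to use size_t and avoid mixing signed and unsigned types in comparisons.

  • Your function's string arguments should be declared as const char *, since you do not modify the strings and should be able to pass const strings without a warning.

  • Redefining strlen is a bad idea: your version will be slower than the system's optimized, assembly-coded and very likely inlined version.

  • Computing the length of str is unnecessary and potentially costly: both str1 and str2 may appear close to the beginning of str, in which case scanning for the end of str is wasteful.

  • The while loop inside the first do/while loop is incorrect: while(str1[charsfound] == str[str1pos + charsfound]) charsfound++; may access characters beyond the end of str and str1, because the loop does not stop at the null terminator. If str1 only appears at the end of str, you have undefined behavior.

  • If str1 is an empty string, you will find it at the end of str instead of at the beginning.

  • Why do you initialize str2pos as int str2pos = str1pos + str1len + 1;? If str2 immediately follows str1 inside str, an empty string should be allocated and returned. Your comment regarding this case is unreadable; you should break such long lines to fit a typical screen width, such as 80 columns. It is debatable whether strbetween("aa", "a", "a") should return "" or NULL. IMHO it should return an allocated empty string, which would be consistent with the expected behavior on strbetween("<name></name>", "<name>", "</name>") or strbetween("''", "'", "'"). Your specification preventing strbetween from returning an empty string produces a counter-intuitive border case.

  • The second scanning loop has the same problems as the first one.

  • The line char *strbetween = (char *) malloc(sizeof(char) * str2pos - str1pos - str1len); has multiple problems: no cast is necessary in C; if you insist on specifying the element size sizeof(char), which is 1 by definition, you should parenthesize the number of elements; and last but not least, you must allocate one extra element for the null terminator.

  • You do not test whether malloc() succeeded. If it returns NULL, you will have undefined behavior, whereas you should just return NULL.

  • The copying loop uses a mix of signed and unsigned types, causing potentially counter-intuitive behavior on overflow.

  • You forget to set the null terminator, which is consistent with the allocation size error, but incorrect.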
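
The loop and allocation points in the list above can be illustrated with a minimal sketch. The names match_prefix and dup_range are hypothetical helpers, not part of the original code: the first compares without reading past either terminator, and the second allocates len + 1 bytes, checks malloc(), and sets the null terminator.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns 1 if pat occurs at the start of s.
   The loop ends at pat's terminator, and any mismatch (including
   reaching the end of s) returns before the next character is read. */
int match_prefix(const char *s, const char *pat) {
    size_t i = 0;
    while (pat[i] != '\0') {
        if (s[i] != pat[i])
            return 0;   /* also taken when s[i] == '\0' */
        i++;
    }
    return 1;           /* an empty pat matches at any position */
}

/* Hypothetical helper: copies len bytes starting at begin into a new
   null-terminated buffer. Returns NULL if malloc() fails. */
char *dup_range(const char *begin, size_t len) {
    char *out = malloc(len + 1);   /* +1 for the null terminator */
    if (out == NULL)
        return NULL;
    memcpy(out, begin, len);
    out[len] = '\0';
    return out;
}
```

With size_t used throughout, no signed/unsigned mixing is left in the copy, and an empty result is simply dup_range(p, 0).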

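Before you try to optimize code, you must ensure correctness! Your code is too complicated and has multiple flaws. Optimization is a moot point.

You should first try a very simple implementation using standard C string functions: searching for a string inside another one is performed efficiently by strstr.

Here is a simple implementation using strstr and strndup(), which should be available on your system: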

#include <string.h>

char *simple_strbetween(const char *str, const char *str1, const char *str2) {
    const char *q;
    const char *p = strstr(str, str1);
    if (p) {
        p += strlen(str1);
        q = *str2 ? strstr(p, str2) : p + strlen(p);
        if (q)
            return strndup(p, q - p);
    }
    return NULL;
}

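strndup() is defined in POSIX and is part of the Extensions to the C Library, Part II: Dynamic Allocation Functions, ISO/IEC TR 24731-2:2010. If it is not available on your system, it can be redefined as: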

#include <stdlib.h>
#include <string.h>

char *strndup(const char *s, size_t n) {
    size_t i;
    char *p;
    for (i = 0; i < n && s[i] != '\0'; i++)
        continue;
    p = malloc(i + 1);
    if (p != NULL) {
        memcpy(p, s, i);
        p[i] = '\0';
    }
    return p;
}
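
As a usage sketch: the buffer returned by simple_strbetween comes from strndup(), so the caller owns it and releases it with free(). The wrapper print_between below is a hypothetical name used only for illustration, and the _POSIX_C_SOURCE definition is an assumption about what a strict compiler mode needs in order to declare strndup().

```c
#define _POSIX_C_SOURCE 200809L  /* for strndup() on strict compilers */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* The simple implementation shown earlier, repeated so this compiles alone. */
char *simple_strbetween(const char *str, const char *str1, const char *str2) {
    const char *q;
    const char *p = strstr(str, str1);
    if (p) {
        p += strlen(str1);
        q = *str2 ? strstr(p, str2) : p + strlen(p);
        if (q)
            return strndup(p, q - p);
    }
    return NULL;
}

/* Hypothetical wrapper: prints the extracted text and frees it.
   Returns 1 if a string was found, 0 otherwise. */
int print_between(const char *str, const char *str1, const char *str2) {
    char *res = simple_strbetween(str, str1, str2);
    if (res == NULL) {
        puts("no match");
        return 0;
    }
    printf("The string between is: '%s'\n", res);
    free(res);   /* the caller owns the buffer returned by strndup() */
    return 1;
}
```

In a timing loop the same pattern applies: keep the returned pointer, use it, then pass it to free() on every iteration so memory use stays flat.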

To ensure correctness, write a number of test cases, including border cases such as all combinations of empty strings and identical strings.
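
A small test table along these lines might look like the sketch below. The names strbetween_fn, check_case and run_border_cases are hypothetical, and the expected values assume the conventions of simple_strbetween: adjacent delimiters give an empty string, and an empty str2 matches the end of str.

```c
#define _POSIX_C_SOURCE 200809L  /* for strndup() on strict compilers */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef char *(*strbetween_fn)(const char *, const char *, const char *);

/* Implementation under test: the simple version shown earlier. */
char *simple_strbetween(const char *str, const char *str1, const char *str2) {
    const char *q;
    const char *p = strstr(str, str1);
    if (p) {
        p += strlen(str1);
        q = *str2 ? strstr(p, str2) : p + strlen(p);
        if (q)
            return strndup(p, q - p);
    }
    return NULL;
}

/* Returns 1 if fn(str, s1, s2) yields expected; NULL means "no match". */
int check_case(strbetween_fn fn, const char *str, const char *s1,
               const char *s2, const char *expected) {
    char *got = fn(str, s1, s2);
    int ok;
    if (got == NULL || expected == NULL)
        ok = (got == NULL && expected == NULL);
    else
        ok = (strcmp(got, expected) == 0);
    if (!ok)
        printf("FAIL: strbetween(\"%s\", \"%s\", \"%s\")\n", str, s1, s2);
    free(got);   /* free(NULL) is a no-op */
    return ok;
}

/* Runs a few border cases and returns the number of failures. */
int run_border_cases(strbetween_fn fn) {
    int failures = 0;
    failures += !check_case(fn, "abaabbbaaaabbabbbaaabbb", "aaa", "bbb", "abba");
    failures += !check_case(fn, "<name></name>", "<name>", "</name>", "");
    failures += !check_case(fn, "aa", "a", "a", "");        /* adjacent delimiters */
    failures += !check_case(fn, "abc", "", "c", "ab");      /* empty str1 */
    failures += !check_case(fn, "abc", "a", "", "bc");      /* empty str2 */
    failures += !check_case(fn, "", "", "", "");            /* everything empty */
    failures += !check_case(fn, "abc", "x", "c", NULL);     /* str1 missing */
    failures += !check_case(fn, "abc", "a", "x", NULL);     /* str2 missing */
    failures += !check_case(fn, "aaa", "aaa", "aaa", NULL); /* identical, no room */
    return failures;
}
```

Because the function under test is passed as a pointer, the same table can be run against a hand-written version to check that it agrees with the simple one.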

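Once you have thoroughly tested your strbetween function, you can write a benchmarking framework to test performance. Reliable performance figures are not easy to obtain, as you will find out if you try. Remember to configure your compiler to select the appropriate optimizations, -O3 for example.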
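
A very small timing harness could be sketched as follows. The name bench_strbetween is hypothetical. clock() reports CPU time rather than wall time, and the volatile accumulator is only an attempt to keep the optimizer from discarding the calls, so treat the numbers as rough.

```c
#define _POSIX_C_SOURCE 200809L  /* for strndup() on strict compilers */
#include <stdlib.h>
#include <string.h>
#include <time.h>

typedef char *(*strbetween_fn)(const char *, const char *, const char *);

/* Implementation under test: the simple version shown earlier. */
char *simple_strbetween(const char *str, const char *str1, const char *str2) {
    const char *q;
    const char *p = strstr(str, str1);
    if (p) {
        p += strlen(str1);
        q = *str2 ? strstr(p, str2) : p + strlen(p);
        if (q)
            return strndup(p, q - p);
    }
    return NULL;
}

/* Hypothetical harness: calls fn `iterations` times, frees every result,
   and returns the elapsed CPU time in seconds as measured by clock(). */
double bench_strbetween(strbetween_fn fn, const char *str, const char *str1,
                        const char *str2, long iterations) {
    volatile unsigned long sink = 0;   /* makes each result observable */
    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        char *res = fn(str, str1, str2);
        if (res != NULL) {
            sink += (unsigned char)res[0];
            free(res);
        }
    }
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling bench_strbetween(simple_strbetween, str, "aaa", "bbb", 10000000L) for each candidate, several times and on varied inputs, gives figures that are easier to compare than a single run under time.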

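Only then can you move on to the next step: if you are really restricted from using standard C library functions, you may first recode your own versions of strstr and strlen and still use the same method. Test this new version for both correctness and performance.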
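
Hand-written replacements could look like this naive sketch. The names my_strlen and my_strstr are hypothetical, and a library strstr is typically far more optimized than this character-by-character search.

```c
#include <stddef.h>

/* Hypothetical replacement for strlen: counts bytes before the terminator. */
size_t my_strlen(const char *s) {
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Hypothetical replacement for strstr: returns a pointer to the first
   occurrence of needle in haystack, or NULL. An empty needle matches at
   the start of haystack, as with strstr. */
const char *my_strstr(const char *haystack, const char *needle) {
    for (;; haystack++) {
        const char *h = haystack;
        const char *n = needle;
        while (*n != '\0' && *h == *n) {
            h++;
            n++;
        }
        if (*n == '\0')
            return haystack;   /* the whole needle matched here */
        if (*h == '\0')
            return NULL;       /* haystack exhausted, no later match possible */
    }
}
```

The inner loop tests the needle's terminator before comparing, so it never reads past the end of either string.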

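The redundant parts are the computation of strlen(str1), which strstr must already have determined when it found a match, and the scan in strndup(), which is unnecessary since no null byte is present between p and q. If you have time to waste, you can try to remove these redundancies at the cost of readability and at the risk of non-conformity. I would be surprised if you got any improvement at all on average over a wide variety of test cases. 20% would be remarkable.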
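
One way to remove both redundancies is sketched below. The names find_end and strbetween_memcpy are hypothetical: find_end is a naive search that returns the position just past the match, so strlen(str1) is not needed, and the copy uses memcpy with the already known length instead of strndup(). The naive search may well be slower than the library strstr, so measure before adopting anything like this.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: like strstr, but returns a pointer just past the
   first match of needle in hay, or NULL if there is no match. */
const char *find_end(const char *hay, const char *needle) {
    for (;; hay++) {
        const char *h = hay;
        const char *n = needle;
        while (*n != '\0' && *h == *n) {
            h++;
            n++;
        }
        if (*n == '\0')
            return h;      /* one past the last matched character */
        if (*h == '\0')
            return NULL;   /* haystack exhausted */
    }
}

/* Same contract as simple_strbetween: returns a malloc'd string that the
   caller must free, or NULL if either delimiter is missing. */
char *strbetween_memcpy(const char *str, const char *str1, const char *str2) {
    const char *p = find_end(str, str1);
    if (p == NULL)
        return NULL;
    const char *q = *str2 ? strstr(p, str2) : p + strlen(p);
    if (q == NULL)
        return NULL;
    size_t len = (size_t)(q - p);
    char *out = malloc(len + 1);   /* +1 for the null terminator */
    if (out != NULL) {
        memcpy(out, p, len);       /* no rescan for a null byte */
        out[len] = '\0';
    }
    return out;
}
```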

