Highlight 4 consecutive matching words

Problem description

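I have two strings: one is the model answer and the other is the answer given by a student. I want to highlight, in the student's answer, every run of 4 consecutive words that also appears in the model answer.

I wrote the function below to match and highlight the words in the answer string.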

function getCopiedText($modelAnswer, $answer) {
    $modelAnsArr = explode(' ', $modelAnswer);
    $answerArr = explode(' ', $answer);
    $common = array_intersect($answerArr, $modelAnsArr);
    if (isset($common) && !empty($common)) {
        $common[max(array_keys($common)) + 2] = '';
        $count = 0;
        $word = '';
        for ($i = 0; $i <= max(array_keys($common)); $i++) {
            if (isset($common[$i])) {
                $count++;
                $word .= $common[$i] . ' ';
            } else {
                if ($count >= 4) {
                    $answer = preg_replace("@($word)@i", '<span style="color:blue">$1</span>', $answer);
                }
                $count = 0;
                $word = '';
            }
        }
    }
    return $answer;
}

Sample strings

$modelAnswer = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry`s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

$answer ='Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry`s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

Function call

echo getCopiedText($modelAnswer, $answer);

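Problem: when the $answer string is longer than 300 characters, the function does not return the highlighted string. When $answer is shorter than 300 characters, it does. For example, if $answer is 'Lorem Ipsum is simply dummy text of the printing and typesetting industry.', the highlighted string is returned, but it does not work for more than 300 characters.

I'm not sure, but there seems to be a problem with preg_replace. Perhaps the pattern (the first argument to preg_replace) exceeds a length limit.

Tags: php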
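One way to test that suspicion is to check whether preg_replace actually fails (it returns null on error) or simply finds no match. The sketch below uses a hypothetical helper, highlightPhrase, which is not part of the original code. It assumes two likely trouble spots in a pattern built by concatenating words: the trailing space left after the last word, and unescaped characters such as "." in the phrase.

```php
<?php
// Hypothetical helper for illustration only; not from the original post.
function highlightPhrase(string $phrase, string $subject): string
{
    // trim() drops the trailing space left by appending ' ' after every word;
    // preg_quote() escapes regex metacharacters and the '@' delimiter.
    $pattern = '@(' . preg_quote(trim($phrase), '@') . ')@i';
    $result = preg_replace($pattern, '<span style="color:blue">$1</span>', $subject);
    if ($result === null) {
        // null signals a PCRE failure; preg_last_error() returns the error code.
        throw new RuntimeException('preg_replace failed with code ' . preg_last_error());
    }
    return $result;
}

echo highlightPhrase('simply dummy text of ', 'It is simply dummy text of the industry.');
// prints: It is <span style="color:blue">simply dummy text of</span> the industry.
```

If no exception is thrown and the text comes back unchanged, the pattern was valid but did not match, which points away from a length limit.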

Solution


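I'm adding a separate answer because the OP later commented that they actually want to match phrases of 4 or more words. My original answer was based on the OP's initial comment that they wanted to match sets of 4-word phrases.

I refactored the original answer to use a CachingIterator that iterates over every word rather than only over sets of 4 words. It also adds the ability to specify the minimum number of words per phrase (default 4), handles shortened duplicate phrases, and rewinds when a partial match is encountered.

Example: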

Model: "one two three four one two three four five six seven"
Answer:
    "two three four five two three four five six seven"
Shortened Duplicate:
    "[two three four five] [[two three four five] six seven]"

Answer: 
    "one one two three four"
Partial Match Rewind:
    "one [one two three four]"

Source: https://3v4l.org/AKRTQ


Example: https://3v4l.org/5P2L6
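For readers unfamiliar with CachingIterator, here is a minimal sketch of the look-ahead it provides. The word list is made up for illustration. hasNext() reports whether another element follows the current one, which is how the function further down can detect the last word of the answer.

```php
<?php
// CachingIterator reads one element ahead of the inner iterator,
// so hasNext() is available while looping.
$it = new CachingIterator(new ArrayIterator(['one', 'two', 'three']));

$out = [];
foreach ($it as $i => $word) {
    // Append a comma unless this is the final word.
    $out[] = $word . ($it->hasNext() ? ',' : '.');
}

echo implode(' ', $out); // prints: one, two, three.
```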

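This solution is case-insensitive and takes into account the special characters @ (, ) and the non-printable characters \n\r\t

I suggest stripping all non-alphanumeric characters from both the answer and the model, to sanitize them for comparison and make the detection algorithm more predictable.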

preg_replace(['/[^[:alnum:][:space:]]/u', '/[[:space:]]{2,}/u'], ['', ' '], $answer); https://3v4l.org/Pn6CT
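A minimal sketch of that clean-up step, wrapped in a hypothetical helper named sanitizeText (the name and the sample input are assumptions, not part of the answer). The first pattern removes everything that is neither alphanumeric nor whitespace; the second collapses runs of two or more whitespace characters into a single space.

```php
<?php
// Hypothetical wrapper around the two-pattern preg_replace call shown above.
function sanitizeText(string $text): string
{
    return (string) preg_replace(
        ['/[^[:alnum:][:space:]]/u', '/[[:space:]]{2,}/u'],
        ['', ' '],
        $text
    );
}

echo sanitizeText("industry`s  standard, dummy-text!");
// prints: industrys standard dummytext
```

Applying the same helper to both the model and the answer before comparison keeps the two strings in the same normalized form.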

Alternatively, instead of explode you can use str_word_count($answer, 1, '1234567890') https://3v4l.org/cChjo, which achieves the same result while preserving words that contain hyphens and apostrophes.
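As a small illustration of that alternative (the sample sentence is made up), str_word_count with format 1 returns the words as an array, and the third argument adds the digits to the set of characters treated as part of a word:

```php
<?php
// Format 1 returns an array of the words found in the string.
// The character list makes digits count as word characters.
$words = str_word_count("well-known printer's type 1500s, ok", 1, '1234567890');

print_r($words);
// Array: well-known, printer's, type, 1500s, ok
```

The comma is dropped, while the hyphen and the apostrophe stay inside their words.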

function getCopiedText($model, $answer, $min = 4)
{
    //ensure there are not double spaces
    $model = str_replace('  ', ' ', $model);
    $answer = str_replace('  ', ' ', $answer);
    $test = new CachingIterator(new ArrayIterator(explode(' ', $answer)));
    $words = $matches = [];
    $p = $match = null;
    //test each word
    foreach($test as $i => $word) {
        $words[] = $word;
        $count = count($words);
        if ($count === 2) {
            //save pointer at second word
            $p = $i;
        }
        //check if the phrase of words exists in the model
        if (false !== stripos($model, $phrase = implode(' ', $words))) {
            //only match phrases with the minimum or more words
            if ($count >= $min) {
                //reset back to here for more matches
                $match = $phrase;
                if (!$test->hasNext()) {
                    //reached the last word, so add the phrase to the list of matches
                    $matches[$match] = true;
                    $p = null;
                }
            }
        } else {
            //the phrase of words was no longer found
            if (null !== $match && !isset($matches[$match])) {
                //add the matched phrase to the list of matches
                $matches[$match] = true;
                $p = null;
                $iterator = $test->getInnerIterator();
                if ($iterator->valid()) {
                    //rewind pointer back to the current word since the current word may be part of the next phrase
                    $iterator->seek($i);
                }
            } elseif (null !== $p) {
                //match not found, determine if we need to rewind the pointer
                $iterator = $test->getInnerIterator();
                if ($iterator->valid()) {
                    //rewind pointer back to second word since a partial phrase less than 4 words was matched
                    $iterator->seek($p);
                }
                $p = null;
            }
            //reset testing
            $words = [];
            $match = null;
        }
    }

    //highlight the matched phrases in the answer
    if (!empty($matches)) {
        $phrases = array_keys($matches);
        //sort phrases by the length
        array_multisort(array_map('strlen', $phrases), $phrases);

        //filter the matches as regular expression patterns
        //order by longest phrase first to ensure double highlighting of smaller phrases
        $phrases  = array_map(function($phrase) {
            return '/(' . preg_quote($phrase, '/') . ')/iu';
        }, array_reverse($phrases));

        $answer = preg_replace($phrases, '<span style="color:blue">$0</span>', $answer);
    }

    return $answer;
}
$modelAnswer = 'Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry`s standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.';

$answer ='NOT IN is simply dummy text NOT in when an unknown printer took a galley -this- is simply dummy text of the printing and typesetting industry';

echo getCopiedText($modelAnswer, $answer);

Result:

NOT IN <span style="color:blue">is simply dummy text</span> NOT in <span style="color:blue">when an unknown printer took a galley</span> -this- <span style="color:blue"><span style="color:blue">is simply dummy text</span> of the printing and typesetting industry</span>
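The nested span in the result comes from preg_replace applying an array of patterns one after another, each to the output of the previous one. The sketch below reproduces that effect in isolation, using square brackets instead of span tags and a made-up subject string, so the nesting is easier to see.

```php
<?php
// Longest phrase first, then the shorter phrase it contains.
$patterns = [
    '/(' . preg_quote('is simply dummy text of the printing', '/') . ')/iu',
    '/(' . preg_quote('is simply dummy text', '/') . ')/iu',
];

// Each pattern runs against the result of the previous replacement,
// so the shorter phrase is wrapped again inside the longer match.
$result = preg_replace($patterns, '[$0]', 'It is simply dummy text of the printing');

echo $result; // prints: It [[is simply dummy text] of the printing]
```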

