What is the best-performing algorithm for finding duplicates?

Question

Here is my code:

function checkForDuplicates() {            
           $data = $this->input->post();
           $project_id = $data['project_id'];

           $this->db->where('project_id', $project_id);
           $paper = $this->db->get('paper')->result();

           $paper2 = $paper; // copy of the papers array
           $duplicatesCount = 0;

           foreach($paper as $p){
               $similarity = null;

                foreach($paper2 as $p2){
                    if($p -> status_selection_id !== 4 && $p2 -> status_selection_id !== 4){ 
                        if($p -> paper_id !== $p2 -> paper_id){ 
                            similar_text($p -> title, $p2 -> title, $similarity);

                            if ($similarity > 90) { 
                                $p -> status_selection_id = 4;
                                $this->db->where('paper_id', $p -> paper_id);
                                $this->db->update('paper', $p);
                                $duplicatesCount ++;
                            }
                        }
                    }
                }
            }

            $data = array(
                'duplicatesCount' => $duplicatesCount,
                'message' => 'Duplicates were found!'
            );
            echo json_encode($data);
        }
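  1. similar_text takes 180 seconds to check 1,500 records.
  2. levenshtein takes 101 seconds to check 1,500 records.
  3. if($pp1 === $pp2) takes 45 seconds to check 1,500 records.

What is the fastest way to check for duplicate records and change their status?

Tags: php, mysql, codeigniter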

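Answer

Optimization is often a matter of reducing I/O.

In your case, reducing the number of SQL queries should improve the processing time.

If you need to process a large number of records, split them into chunks. Each chunk should contain a batch of records that fits in memory (RAM).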
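
A minimal sketch of the chunking step, assuming a hypothetical $fetchChunk callable that returns one page of rows for a given offset and limit. The name eachChunk and the callable are illustrative and not part of CodeIgniter; in the controller from the question, the callable could wrap a paged query such as $this->db->get('paper', $limit, $offset)->result().

```php
<?php
/**
 * Yields successive chunks of rows until the source is exhausted.
 *
 * $fetchChunk is a hypothetical callable: fn(int $offset, int $limit): array.
 */
function eachChunk(callable $fetchChunk, int $chunkSize): Generator
{
    $offset = 0;
    do {
        $rows = $fetchChunk($offset, $chunkSize);
        if ($rows) {
            yield $rows; // hand one in-memory batch to the caller
        }
        $offset += $chunkSize;
        // A short (or empty) page means there is nothing left to read.
    } while (count($rows) === $chunkSize);
}
```

Keep in mind that comparing titles only within a chunk misses duplicates that land in different chunks, so chunking helps mainly when the full result set does not fit in memory.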

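Retrieve a chunk from the database. Process the chunk (for example, with a loop) and use an array to keep track of the changes you need to make in the database. Finally, update the database in bulk with as few queries as possible.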

       $data = $this->input->post();
       $project_id = $data['project_id'];

       $this->db->where('project_id', $project_id);
       $paper = $this->db->get('paper')->result();

       $paper2 = $paper; // copy of the papers array
       $duplicatesCount = 0;

       // keep track of updates
       $updates = [];

       foreach($paper as $p){
           $similarity = null;

            foreach($paper2 as $p2){
                if($p -> status_selection_id !== 4 && $p2 -> status_selection_id !== 4){ 
                    if($p -> paper_id !== $p2 -> paper_id){ 
                        similar_text($p -> title, $p2 -> title, $similarity);

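                        if ($similarity > 90) {
                            // Mark the paper in memory too, so it is skipped from now on
                            // and queued only once (the original code did the same).
                            $p->status_selection_id = 4;

                            $updates[] = [
                                'paper_id' => $p->paper_id,
                                'status_selection_id' => 4,
                            ];

                            $duplicatesCount++;
                        }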
                    }
                }
            }
        }

        if (!empty($updates)) {
            // Send every change in one batched statement instead of one UPDATE per row.
            // Assumes the CodeIgniter 3 Query Builder; the third argument names the
            // column used to match rows.
            $this->db->update_batch('paper', $updates, 'paper_id');
        }
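
The nested loops above visit each pair of papers in both orders. As a hedged sketch, the comparison can also be written so that each unordered pair is checked once and the ids to update are collected in memory. The function name findDuplicateIds is hypothetical, the rows are assumed to be objects with paper_id and title properties, and papers already marked with status 4 are assumed to have been filtered out by the caller. Unlike the code in the question, this variant keeps the first paper of a similar pair and flags the later one.

```php
<?php
/**
 * Returns the ids of papers whose title is more than $threshold percent
 * similar to an earlier paper that has not itself been flagged.
 */
function findDuplicateIds(array $papers, float $threshold = 90.0): array
{
    $papers = array_values($papers); // make sure indexes run 0..n-1
    $flagged = [];                   // paper_id => true, gives O(1) lookups
    $count = count($papers);

    for ($i = 0; $i < $count; $i++) {
        if (isset($flagged[$papers[$i]->paper_id])) {
            continue; // already a duplicate of an earlier paper
        }
        // Start at $i + 1 so every unordered pair is compared only once.
        for ($j = $i + 1; $j < $count; $j++) {
            $id = $papers[$j]->paper_id;
            if (isset($flagged[$id])) {
                continue;
            }
            similar_text($papers[$i]->title, $papers[$j]->title, $percent);
            if ($percent > $threshold) {
                $flagged[$id] = true;
            }
        }
    }

    return array_keys($flagged);
}
```

Because every flagged paper receives the same status, the returned ids could then be written with a single statement, for example $this->db->where_in('paper_id', $ids)->update('paper', ['status_selection_id' => 4]), again assuming the CodeIgniter 3 Query Builder and a non-empty $ids array.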
