首页 > 解决方案 > N个大小为M的字符串与K个大小为L的子字符串的比较

问题描述

    Strings:
    "Today is a good day, I would like to go to supermarket for an errand. I would be able to cook for my husband tonight."
    "Today is a rainy day, I would like to go for an errand. Jason would like to cook for me tonight."
    "Yesterday is a good day, I would like to go to Cosco for an errand. I will like to cook for my family tonight."
    "I would like to go to Lawson for an errand. I will try to cook for your family tonight."
    "Today is a good day, she goes to supermarket for an errand. She cook for me tonight."

    Substrings:
    "Today is a good day"
    "I would like to"
    "for my family"

正如我们所看到的,有 N = 5 个字符串,每个字符串最多有 M = 21 个单词,而有 K = 3 个子字符串,每个字符串最多有 L = 6 个单词。我想在比较中得到以下信息:

  1. 哪些字符串不包含特定的子字符串或包含带有“typo”的子字符串(例如,“I would like to”:3 个字符串,第 2 和第 4 个是错字,第 5 个是找到 0)
  2. 找到与给定子串相似的现有子串,相似度百分比和出现百分比,哪个字符串(例如“我想”-> a.“我想”:(相似度:100%), (出现:4/8=50%),(字符串:除最后一个) b.“我能”:(相似度:3/4=75%),(出现:1/8=12.5%), (String: 1) c. "Jason would like to": (相似度: 3/4=75%),(Occurrence:1/8=12.5%), (String: 2) d. "I will like to": (相似度:3/4=75%),(出现次数:1/8=12.5%),(字符串:3) e.“我会尝试”:(相似度:2/4=50%),(出现次数: 1/8=12.5%)), (字符串: 4)

我想到的解决方案:

  1. 对于每个字符串,将其与每个子字符串逐字进行比较,但最差的运行时间将是 N M K*L
  2. 将字符串变成字典树,每个节点包含单词和该节点处的字符串数,通过向下遍历树并逐字比较每个子字符串,运行时间将为 N 形成树 + M K L 但起点不同(第 4 个字符串没有“Today .... day”),并且在某些字符串中缺少部分(“在第 2 个字符串中找不到“去超市”)。其次,无法跟踪哪些字符串找到了 0 或带有“错字” ”</li>

似乎第一个解决方案将是最理想的解决方案,但如果 N 的数量增加,则需要更长的时间。这个问题有更好的解决方案吗?

标签: pythonalgorithm

解决方案


在 python 中,大多数解决方案似乎更适用于 hashmap。

概念伪代码

use a hashmap to convert words to numbers // avoids further factor P for word length
make dict from wordnumber to list of occurances in strings, position in strings

convert substrings to numbers

for substring in substrings // O(K)
  for subword in substring // O(L) - not used in pt. 1.
    for curr in dict[subword] // O(Q) occurrances of word in all strings
      for word in strings[curr.stringNo][cur.position-subword.position...cur.position+substringlength-1-subword.position] // O(L)
        compare with substring // O(1), for pt.1. actually we can start from the next position as we already know the first is a match.
          count match // O(1) or break in case of pt.1. lowering the L to the actual match number.
  • pt。1 : QKL 与 NMKL
  • pt。2 : QKL^2 与 NMKL^2

那么 Q < NM 吗?如果所有字符串仅由 'I's 组成,它们是相等的,在大多数情况下,Q 将比 NM 小一个或两个数量级,但代价是 en hashmap 查找。因此,平均而言,在最坏的情况下,Q 解决方案会更好。

如果您不转换为数字,则所有解决方案都需要额外的因子 P 字长。

部分匹配的额外 L 是因为您需要检查子字符串中的每个单词以找到部分匹配。


推荐阅读