Fastest algorithm for filtering strings

Question

I have 5,000,000 unordered strings formatted like this (Name.Name.Day-Month-Year 24hrTime):

"John.Howard.12-11-2020 13:14"
"Diane.Barry.29-07-2020 20:50"
"Joseph.Ferns.08-05-2020 08:02"
"Joseph.Ferns.02-03-2020 05:09"
"Josephine.Fernie.01-01-2020 07:20"
"Alex.Alexander.06-06-2020 10:10"
"Howard.Jennings.07-07-2020 13:17"
"Hannah.Johnson.08-08-2020 00:49"
...

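What is the fastest way to find all strings whose time t lies between some n and m? (That is, the fastest way to remove all strings where time < n || m < time.)

This filtering will be performed many times with different ranges. A time range always falls within a single day, and the start time is always earlier than the end time.

In Java, this is my current approach, given two time strings M and N and the list of 5 million strings: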

List<String> finalSolution = new ArrayList<>();

// n is the start of the range and m is the end, both formatted as "HH:mm".
String[] startTimeArr = n.split(":");
String[] endTimeArr = m.split(":");
int startMinutes = Integer.parseInt(startTimeArr[0]) * 60 + Integer.parseInt(startTimeArr[1]);
int endMinutes = Integer.parseInt(endTimeArr[0]) * 60 + Integer.parseInt(endTimeArr[1]);

for (String combinedString : ArraySizeOf5Million) {
  // String.split takes a regular expression, so the dot must be escaped.
  String[] arr = combinedString.split("\\.");
  String[] subArr = arr[2].split(" ");
  String[] timeArr = subArr[1].split(":");
  int hour = Integer.parseInt(timeArr[0]);
  int minute = Integer.parseInt(timeArr[1]);

  // Compare minutes since midnight. Checking hour and minute separately
  // would wrongly reject e.g. 14:05 for the range 13:10 to 15:20.
  int time = hour * 60 + minute;
  if (time >= startMinutes && time <= endMinutes) {
    finalSolution.add(combinedString);
  }
}

Java is my native language, but any other language is fine too. Better/faster logic is what I am after.

Tags: algorithm, sorting, data-science, mathematical-optimization, data-scrubbing

Answer


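As @Paddy3118 already pointed out, binary search is probably the way to go.

  1. (If your data is on disk): load the input data and sort it by date/time.
  2. With i0 as the start index of the result set and i1 as its end index (both obtained by binary search): enumerate the result entries.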
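The two steps above can be sketched in Java, the asker's language. This is only a minimal sketch, not a port of the Lisp code in this answer: the class `SortedEntries`, its methods, and the packed `yyyyMMddHHmm` sort key are hypothetical choices made for illustration. It assumes every string ends with a zero-padded `dd-MM-yyyy HH:mm` suffix and carries no surrounding quotes, and it treats both range bounds as inclusive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortedEntries {
    private final String[] entries; // sorted by date/time
    private final long[] keys;      // keys[i] is the packed date/time of entries[i]

    public SortedEntries(List<String> input) {
        entries = input.toArray(new String[0]);
        // Step 1: sort once by date/time. The comparator re-parses the key on
        // every comparison, which keeps the sketch short but is not the fastest option.
        Arrays.sort(entries, Comparator.comparingLong(SortedEntries::key));
        keys = new long[entries.length];
        for (int i = 0; i < entries.length; i++) {
            keys[i] = key(entries[i]);
        }
    }

    /** Packs the trailing "dd-MM-yyyy HH:mm" (assumed format) into yyyyMMddHHmm. */
    public static long key(String s) {
        int p = s.length() - 16; // start of the 16-character date/time suffix
        long day = Long.parseLong(s.substring(p, p + 2));
        long month = Long.parseLong(s.substring(p + 3, p + 5));
        long year = Long.parseLong(s.substring(p + 6, p + 10));
        long hour = Long.parseLong(s.substring(p + 11, p + 13));
        long minute = Long.parseLong(s.substring(p + 14, p + 16));
        return (((year * 100 + month) * 100 + day) * 100 + hour) * 100 + minute;
    }

    /** Returns the first index whose key is >= target (keys.length if there is none). */
    private int lowerBound(long target) {
        int lo = 0;
        int hi = keys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /** Step 2: i0 and i1 come from binary search; entries[i0..i1) is the result set. */
    public List<String> query(long fromKey, long toKey) {
        int i0 = lowerBound(fromKey);
        int i1 = Math.max(i0, lowerBound(toKey + 1)); // +1 makes the upper bound inclusive
        return new ArrayList<>(Arrays.asList(entries).subList(i0, i1));
    }

    public static void main(String[] args) {
        SortedEntries data = new SortedEntries(List.of(
                "John.Howard.12-11-2020 13:14",
                "Diane.Barry.29-07-2020 20:50",
                "Joseph.Ferns.08-05-2020 08:02",
                "Joseph.Ferns.02-03-2020 05:09"));
        // All entries on 08-05-2020 between 08:00 and 09:00.
        System.out.println(data.query(202005080800L, 202005080900L));
        // prints "[Joseph.Ferns.08-05-2020 08:02]"
    }
}
```

After the one-time sort, each query costs two binary searches plus the time needed to copy the matching entries.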

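The code I used (in Lisp) is shown at the end of this answer. It is not optimized in the slightest (I imagine loading and the initial sort could be made faster with some optimization work).

This is what my interactive session looked like (including timing information; my foo.txt input file contains 5 million entries).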

rlwrap sbcl --dynamic-space-size 2048
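This is SBCL 2.1.1.debian, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses.  See the CREDITS and COPYING files in the
distribution for more information.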
(ql:quickload :cl-ppcre)
To load "cl-ppcre":
  Load 1 ASDF system:
    cl-ppcre
; Loading "cl-ppcre"
..
(:CL-PPCRE)
(load "fivemillion.lisp")
T
(time (defparameter data (load-input-for-queries "foo.txt")))
"sorting..."
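Evaluation took:
  32.091 seconds of real time
  32.090620 seconds of total run time (31.386722 user, 0.703898 system)
  [ Run times consist of 2.641 seconds GC time, and 29.450 seconds non-GC time. ]
  100.00% CPU
  15 lambdas converted
  115,308,171,684 processor cycles
  6,088,198,752 bytes consed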
DATA
(time (defparameter output (query-interval data '(2018 1 1) '(2018 1 2))))
Evaluation took:
  0.000 seconds of real time
  0.000111 seconds of total run time (0.000109 user, 0.000002 system)
  100.00% CPU
  395,172 processor cycles
  65,536 bytes consed
OUTPUT
(time (defparameter output (query-interval data '(2018 1 1) '(2018 1 2 8))))
Evaluation took:
  0.000 seconds of real time
  0.000113 seconds of total run time (0.000110 user, 0.000003 system)
  100.00% CPU
  399,420 processor cycles
  65,536 bytes consed
OUTPUT
(time (defparameter output (query-interval data '(2018 1 1) '(2019 1 1))))
Evaluation took:
  0.020 seconds of real time
  0.022469 seconds of total run time (0.022469 user, 0.000000 system)
  110.00% CPU
  80,800,092 processor cycles
  15,958,016 bytes consed
OUTPUT

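So, while the load and sort time (paid once) is nothing to write home about (but could be optimized), the (query-interval ...) calls are very fast. The larger a query's result set, the longer the list the function returns (more conses, more run time). I could have been smarter and returned only the start and end indices of the result set, leaving the collection of the entries to the caller.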
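The last point, returning only the start and end indices, might look like the following in Java. This is an illustrative sketch rather than a translation of `query-interval`: the class `IndexRange`, the method `rangeOf`, and the use of minutes since midnight as sort keys are all assumptions made for the example.

```java
import java.util.Arrays;

public class IndexRange {
    /** Returns the first index in sortedKeys whose value is >= target. */
    static int lowerBound(int[] sortedKeys, int target) {
        int lo = 0;
        int hi = sortedKeys.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sortedKeys[mid] < target) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    /**
     * Returns {i0, i1} such that sortedKeys[i0..i1) holds exactly the keys in
     * [from, to]. No result list is allocated; the caller walks the range.
     */
    public static int[] rangeOf(int[] sortedKeys, int from, int to) {
        int i0 = lowerBound(sortedKeys, from);
        int i1 = lowerBound(sortedKeys, to + 1);
        return new int[] {i0, Math.max(i0, i1)};
    }

    public static void main(String[] args) {
        // Minutes since midnight for 05:09, 08:02, 13:14 and 20:50, already sorted.
        int[] keys = {309, 482, 794, 1250};
        int[] range = rangeOf(keys, 8 * 60, 14 * 60);
        System.out.println(Arrays.toString(range)); // prints "[1, 3]"
        for (int i = range[0]; i < range[1]; i++) {
            System.out.println(keys[i]); // the caller collects or processes entries here
        }
    }
}
```

With this shape the query cost no longer depends on the size of the result set; only the caller's iteration does.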

Here is the source code, which also includes the code that generates the test data set I used:

(defun random-uppercase-character ()
  (code-char (+ (char-code #\A) (random 26))))
(defun random-lowercase-character ()
  (code-char (+ (char-code #\a) (random 26))))
(defun random-name-part (nchars)
  (with-output-to-string (stream)
    (write-char (random-uppercase-character) stream)
    (loop repeat (- nchars 1) do
      (write-char (random-lowercase-character) stream))))
(defun random-day-of-month ()
  "Assumes every month has 31 days, because it does not matter
for this exercise."
  (+ 1 (random 31)))
(defun random-month-of-year ()
  (+ 1 (random 12)))
(defun random-year ()
  "Some year between 2017 and 2021 (inclusive)"
  (+ 2017 (random 5)))
(defun random-hour-of-day ()
  (random 24))
(defun random-minute-of-hour ()
  (random 60))
(defun random-entry (stream)
  (format stream "\"~a.~a.~d-~d-~d ~d:~d\"~%"
      (random-name-part 10)
      (random-name-part 10)
      (random-day-of-month)
      (random-month-of-year)
      (random-year)
      (random-hour-of-day)
      (random-minute-of-hour)))
(defun generate-input (entry-count file-name)
  (with-open-file (stream
           file-name
           :direction :output
           :if-exists :supersede)
    (loop repeat entry-count do
      (random-entry stream))))

(defparameter *line-scanner*
  (ppcre:create-scanner
   "\"(\\w+).(\\w+).(\\d+)-(\\d+)-(\\d+)\\s(\\d+):(\\d+)\""))
;;      0       1      2      3      4        5      6
;;      fname   lname  day    month  year     hour   minute

(defun decompose-line (line)
  (let ((parts (nth-value
        1
        (ppcre:scan-to-strings
         *line-scanner*
         line))))
    (make-array 7 :initial-contents
        (list (aref parts 0)
              (aref parts 1)
              (parse-integer (aref parts 2))
              (parse-integer (aref parts 3))
              (parse-integer (aref parts 4))
              (parse-integer (aref parts 5))
              (parse-integer (aref parts 6))))))
(defconstant +fname-index+ 0)
(defconstant +lname-index+ 1)
(defconstant +day-index+ 2)
(defconstant +month-index+ 3)
(defconstant +year-index+ 4)
(defconstant +hour-index+ 5)
(defconstant +minute-index+ 6)
(defvar *compare-<-criteria*
  (make-array 5 :initial-contents
          (list +year-index+
            +month-index+
            +day-index+
            +hour-index+
            +minute-index+)))

(defun compare-< (dl1 dl2)
  (labels ((comp (i)
         (if (= i 5)
         nil
         (let ((index (aref *compare-<-criteria* i)))
           (let ((v1 (aref dl1 index))
             (v2 (aref dl2 index)))
             (cond
               ((< v1 v2) t)
               ((= v1 v2) (comp (+ i 1)))
               (t nil)))))))
    (comp 0)))
           
(defun time-stamp-to-index (hours minutes)
  (+ minutes (* 60 hours)))

(defun load-input-for-queries (file-name)
  (let* ((decomposed-line-list
       (with-open-file (stream file-name :direction :input)
         (loop for line = (read-line stream nil nil)
           while line
           collect (decompose-line line))))
     (number-of-lines (length decomposed-line-list))
     (decomposed-line-array (make-array number-of-lines
                        :initial-contents
                        decomposed-line-list)))
    (print "sorting...") (terpri)
    (sort decomposed-line-array #'compare-<)))

(defun unify-date-list (date)
  (let ((date-length (length date)))
    (loop
      for i below 5
      collecting (if (> date-length i) (nth i date) 0))))

(defun decomposed-line-date<date-list (decomposed-line date-list)
  (labels ((comp (i)
         (if (= i 5)
         nil
         (let ((index (aref *compare-<-criteria* i)))
           (let ((v1 (aref decomposed-line index))
             (v2 (nth i date-list)))
             (cond
               ((< v1 v2) t)
               ((= v1 v2) (comp (+ i 1)))
               (t nil)))))))
    (comp 0)))

(defun index-before (data key predicate
             &key (left 0) (right (length data)))
  (if (and (< left right) (> (- right left) 1))
      (if (funcall predicate (aref data left) key)
      (let ((mid (+ left (floor (- right left) 2))))
        (if (funcall predicate (aref data mid) key)
        (index-before data key predicate
                  :left mid
                  :right right)
        (index-before data key predicate
                  :left left
                  :right mid)))
      left)
      right))

(defun query-interval (data start-date end-date)
  "start-date and end-date are given as lists of the form:
'(year month day hour minute) or shorter versions e.g.
'(year month day hour), omitting trailing values which will be
appropriately defaulted."
  (let ((d0 (unify-date-list start-date))
    (d1 (unify-date-list end-date)))
    (let* ((start-index (index-before
             data
             d0
             #'decomposed-line-date<date-list))
       (end-index (index-before
               data
               d1
               #'decomposed-line-date<date-list
               :left (cond
                   ((< start-index 0) 0)
                   ((>= start-index (length data))
                (length data))
                   (t start-index)))))
      (loop for i from start-index below end-index
        collecting (aref data i)))))

