Chicken Scheme: reading lines takes too long

Problem description

Is there a fast way to read and tokenize a largeish corpus? I'm trying to read a moderately sized text file, and the compiled CHICKEN program seems to simply hang (I killed the process after about 2 minutes), while Racket's performance, for example, is acceptable (about 20 seconds). Is there anything I can do to get comparable performance on CHICKEN? Below is the code I use to read the file; all suggestions are welcome.

(import (chicken io)       ; read-line
        (chicken string))  ; string-split

(define *corpus*
  (call-with-input-file "largeish_file.txt"
    (lambda (input-file)
      (let loop ([line (read-line input-file)]
                 [tokens '()])
        (if (eof-object? line)
            tokens
            (loop (read-line input-file)
                  (append tokens (string-split line))))))))

Tags: io, racket, chicken-scheme

Solution


If you can afford to read the whole file into memory at once, you can use something like the code below, which should be faster:

(import (chicken io)
        (chicken string))

(let loop ((lines (with-input-from-file "largeish_file.txt"
                    read-lines)))
  (if (null? lines)
      '()
      (append (string-split (car lines))
              (loop (cdr lines)))))
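
The main reason the original loop is slow is that (append tokens (string-split line)) copies the whole accumulated token list on every line, so the total work (and the garbage it creates) grows roughly quadratically with the number of lines. If you'd rather not hold all the lines in memory at once, something along these lines should also avoid the re-copying while still reading one line at a time (just a sketch: it only ever prepends to the accumulator and reverses once at the end):

(import (chicken io)
        (chicken string))

(define *corpus*
  (call-with-input-file "largeish_file.txt"
    (lambda (input-file)
      (let loop ((line (read-line input-file))
                 (tokens '()))
        (if (eof-object? line)
            (reverse tokens)   ; restore file order once, at the end
            (loop (read-line input-file)
                  ;; prepend this line's tokens in reverse; the final
                  ;; reverse puts everything back in order
                  (append (reverse (string-split line)) tokens)))))))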

Here is some quick benchmark code:

(import (chicken io)
        (chicken string))

;; Warm-up
(with-input-from-file "largeish_file.txt" read-lines)

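;; Original approach from the question: append each line's tokens
;; onto the end of the growing accumulator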
(time
 (with-output-to-file "a.out"
   (lambda ()
     (display
      (call-with-input-file "largeish_file.txt"
        (lambda (input-file)
          (let loop ([line (read-line input-file)]
                     [tokens '()])
            (if (eof-object? line)
                tokens
                (loop (read-line input-file)
                      (append tokens (string-split line)))))))))))

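;; Faster approach: read all lines at once, then split each line and
;; prepend its tokens onto the recursive result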
(time
 (with-output-to-file "b.out"
   (lambda ()
     (display
      (let loop ((lines (with-input-from-file "largeish_file.txt"
                          read-lines)))
        (if (null? lines)
            '()
            (append (string-split (car lines))
                    (loop (cdr lines)))))))))

Here are the results on my system:

$ csc bench.scm && ./bench
28.629s CPU time, 13.759s GC time (major), 68772/275 mutations (total/tracked), 4402/14196 GCs (major/minor), maximum live heap: 4.63 MiB
0.077s CPU time, 0.033s GC time (major), 68778/292 mutations (total/tracked), 10/356 GCs (major/minor), maximum live heap: 3.23 MiB

Just to make sure we get the same results from both code snippets:

$ cmp a.out b.out && echo They contain the same data
They contain the same data

largeish_file.txt was generated by cat'ing a ~100KB system log file until it reached roughly 10000 lines (mentioned so you have an idea of what the input file looks like):

$ ls -l largeish_file.txt
-rw-r--r-- 1 mario mario 587340 Aug  2 11:55 largeish_file.txt

$ wc -l largeish_file.txt
5790 largeish_file.txt

These results were obtained with CHICKEN 5.2.0 on a Debian system.
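
For reference, the whole-file version can also be written more compactly using append-map from SRFI 1 (a sketch, assuming the srfi-1 egg is installed); it builds the same flat token list:

(import (chicken io)
        (chicken string)
        srfi-1)

;; Read all lines at once, split each line into tokens, and concatenate.
(define *corpus*
  (with-input-from-file "largeish_file.txt"
    (lambda ()
      (append-map string-split (read-lines)))))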
