How to split lines at the first period character based on word count, and repeat the process on the resulting lines (in the pattern space)

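Problem description

I am trying to split a text document so that any line with more than 10 words (a word being anything delimited by spaces on both sides) is split at the first period character, scanning from left to right. Any resulting line that still has more than 10 words should be split again in the same way.

Sample input data: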

1I got from Dr. Smith, the OK to keep working.
2I got from Dr. Smith, the O.K. to keep working.
3I got from Dr. Smith, the OK to keep working more.
4I got from Dr. Smith, the O.K. to keep working more.
5I got from Dr. Smith, the O.K. to keep working more, although I'm sick.
6I got from Dr. Smith, the O.K. to keep working more, although I'm so sick.

Desired output data:

1I got from Dr. Smith, the OK to keep working.
2I got from Dr. Smith, the O.K. to keep working.
3I got from Dr.
Smith, the OK to keep working more.
4I got from Dr.
Smith, the O.K. to keep working more.
5I got from Dr.
Smith, the O.K. to keep working more, although I'm sick.
6I got from Dr.
Smith, the O.K.
to keep working more, although I'm so sick.

I tried the following code:

sed -r ':a; /((\w)+[., ]+){11}/s/\./\r\n/; ta' grab.txt | tr '\r' '.' > output.txt

This code produces the following incorrect result:

1I got from Dr. Smith, the OK to keep working.
2I got from Dr.
 Smith, the O.K. to keep working.
3I got from Dr.
 Smith, the OK to keep working more.
4I got from Dr.
 Smith, the O.K. to keep working more.
5I got from Dr.
 Smith, the O.K. to keep working more, although I'm sick.
6I got from Dr.
 Smith, the O.K. to keep working more, although I'm so sick.

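Note that lines 1 and 2 both have 10 words, yet line 2 gets split. It seems that periods inside a word (for example, O.K.) make the pattern count more words in the line than there really are.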

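Note also that line 6 should end up as 3 lines, because the second resulting line has 11 words, but for some reason it is not split again.

I am looking for a solution that I can pipe input into and output out of.

Thank you.

Tags: linux, sed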

Solution


A simple solution using awk:

awk '{
  while (NF>10) {
    if (!(i=index($0,".")))
      break
    print substr($0,1,i)
    $0=substr($0,i+1)
    # trim leading blank(s)
    $1=$1
  }
  if ($0!="")
    print
}' file

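As long as a line has more than ten words, it is split in two at the first period; the first part is printed, the line is replaced by the second part, and the process repeats.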
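One caveat: index($0, ".") finds the first period anywhere in the line, so for sample line 6 the second split lands inside O.K. (giving "Smith, the O." followed by "K. to keep working more, although I'm so sick.") rather than after it, as the desired output shows. The following is a minimal sketch of a variant, not part of the original answer, which assumes that a valid split point is a period followed by a space or tab. The function name split_long_lines is made up for illustration.

```shell
# Hypothetical helper: split lines of more than 10 words at the first
# period that is followed by a blank, repeating on the remainder.
split_long_lines() {
  awk '{
    while (NF > 10) {
      # Assumption: only a period followed by a space or tab is a split point,
      # so the periods inside "O.K." are skipped until the last one.
      if (!match($0, /\.[ \t]/))
        break
      # RSTART is the position of the period; keep it on the first part.
      print substr($0, 1, RSTART)
      $0 = substr($0, RSTART + 1)
      # Rebuild $0 from its fields: trims leading blanks
      # (and also squeezes runs of blanks into single spaces).
      $1 = $1
    }
    if ($0 != "")
      print
  }' "$@"
}
```

With no file argument the function reads standard input, so it can sit in a pipeline, for example: split_long_lines < grab.txt > output.txt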

By the way, doing this with sed is not a good idea at all.

