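linux - How to split lines at the first period character based on word count, and repeat the process on the resulting lines (in the pattern space)
Problem description
I am trying to split a text document so that any line with more than 10 words (a word being any run of characters delimited by spaces) is split at the first period character, reading from left to right. Any resulting line that still has more than 10 words should be split again in the same way.
Sample input data: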
1I got from Dr. Smith, the OK to keep working.
2I got from Dr. Smith, the O.K. to keep working.
3I got from Dr. Smith, the OK to keep working more.
4I got from Dr. Smith, the O.K. to keep working more.
5I got from Dr. Smith, the O.K. to keep working more, although I'm sick.
6I got from Dr. Smith, the O.K. to keep working more, although I'm so sick.
Desired output data:
1I got from Dr. Smith, the OK to keep working.
2I got from Dr. Smith, the O.K. to keep working.
3I got from Dr.
Smith, the OK to keep working more.
4I got from Dr.
Smith, the O.K. to keep working more.
5I got from Dr.
Smith, the O.K. to keep working more, although I'm sick.
6I got from Dr.
Smith, the O.K.
to keep working more, although I'm so sick.
I tried the following code:
sed -r ':a; /((\w)+[., ]+){11}/s/\./\r\n/; ta' grab.txt | tr '\r' '.' > output.txt
That code produces the following incorrect result:
1I got from Dr. Smith, the OK to keep working.
2I got from Dr.
Smith, the O.K. to keep working.
3I got from Dr.
Smith, the OK to keep working more.
4I got from Dr.
Smith, the O.K. to keep working more.
5I got from Dr.
Smith, the O.K. to keep working more, although I'm sick.
6I got from Dr.
Smith, the O.K. to keep working more, although I'm so sick.
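Note that lines 1 and 2 both have 10 words, yet line 2 gets split (it seems that periods inside a word, as in O.K., make the pattern think the line has more words than it actually does).
Note also that line 6 should actually be split into 3 lines, because its second line has 11 words, but for some reason it is not.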
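The observation about line 2 can be checked directly. The sketch below compares the whitespace-delimited word count of line 2 with the number of "word plus separator" groups that a pattern of the form `\w+[., ]+` finds in it. It uses the POSIX class `[[:alnum:]_]` in place of `\w`, which is an assumption made for portability; the variable names are illustrative only.

```shell
line='2I got from Dr. Smith, the O.K. to keep working.'

# Words as defined in the question: runs of non-blank characters.
words=$(printf '%s\n' "$line" | awk '{ print NF }')

# Groups as the sed pattern sees them: word characters followed by one or
# more of ".", "," or " ". "O.K. " yields two groups, "O." and "K. ".
groups=$(printf '%s\n' "$line" | grep -oE '[[:alnum:]_]+[., ]+' | wc -l | tr -d ' ')

echo "words=$words groups=$groups"
# prints: words=10 groups=11
```

Because the line contains 11 consecutive groups, the `{11}` quantifier matches even though the line has only 10 words.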
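I am looking for a solution that can be used in a pipeline, reading its input from a pipe and writing its output to one.
Thank you.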
Solution
A simple solution using awk:
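awk '{
    # keep splitting while the current record has more than 10 fields
    while (NF > 10) {
        # stop if there is no period left to split at
        if (!(i = index($0, ".")))
            break
        # print everything up to and including the first period
        print substr($0, 1, i)
        # continue with the remainder of the line
        $0 = substr($0, i + 1)
        # trim leading blank(s) by rebuilding the record
        $1 = $1
    }
    # print whatever is left, unless it is empty
    if ($0 != "")
        print
}' file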
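As long as a line has more than ten words, it is split in two at the first period; the first part is printed, the line is replaced by the second part, and so on.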
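Note that a literal first-period split cuts inside an abbreviation such as `O.K.`: for sample line 6, the second piece is split after `O.`, giving `Smith, the O.` and `K. to keep working more, although I'm so sick.`, whereas the desired output keeps `O.K.` intact. Below is a minimal sketch of one possible variant. It assumes that only a period followed by a blank or by the end of the line counts as a split point; this rule is inferred from the desired output and is not stated in the question. The function name `split_long_lines` is hypothetical. It reads standard input, so it can sit in a pipeline.

```shell
# Hypothetical helper: split lines of more than 10 words at the first
# period that is followed by a blank or ends the line (assumed rule).
split_long_lines() {
    awk '{
        while (NF > 10) {
            # match() sets RSTART to the position of the period; 0 means none
            if (!match($0, /\.([ \t]|$)/))
                break
            print substr($0, 1, RSTART)
            $0 = substr($0, RSTART + 1)
            $1 = $1    # rebuild the record, dropping leading blanks
        }
        if ($0 != "")
            print
    }'
}

printf '%s\n' "6I got from Dr. Smith, the O.K. to keep working more, although I'm so sick." |
    split_long_lines
# prints:
# 6I got from Dr.
# Smith, the O.K.
# to keep working more, although I'm so sick.
```

Abbreviations such as `Dr.` are still treated as split points under this rule, which is consistent with the desired output in the question.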
By the way, sed is not a good tool for this job.