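shell - Quick lookups of structured text data from the command line?
Problem description
Let's say I have a predictable text document containing structures with some ID called X: and combinations of known attributes such as Y:, with a known number of instances per class (e.g., within the series there is always exactly 1 Y: after each X:):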
X:37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
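I would like to retrieve a list of item IDs for all blue things. I don't care whether there are duplicate IDs, only which ID values are present in the document. I then want to sort the list and compare it with the list of blue-thing IDs from another structured text document that has exactly the same structure ("Which blue things are common to both documents?" "Which blue things are in document 1 but not in document 2?").
I know I can easily grep for all the Y:BLUE lines, but what extra commands do I need to find the preceding X: for each such instance and pass the sorted resulting lists to a diff? I haven't used the command line much since AmiShell... sorry :-( Is there an online cookbook for this kind of use case?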
Solution
Assume you have the following 2 input documents:
$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
X:1
# more data pertaining to item 37
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:2
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:3
# more data pertaining to item 37
# more data pertaining to item 37
Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:4
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
X:4
# more data pertaining to item 37
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:3
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:2
# more data pertaining to item 37
# more data pertaining to item 37
Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
X:1
# more data pertaining to item 37
# more data pertaining to item 37
Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per preceding "X:"
You can use the following awk command on each document to get the IDs:
$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1
1
2
4
$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2
1
3
4
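Explanation:
- -F':' defines : as the field separator
- /X:[0-9]+$/{tmp=$2} saves the value of the ID in the tmp variable (assuming the ID consists only of digits, with nothing after it); if that is not your case, you can adapt the filtering regex /X:[0-9]+$/ to suit your needs
- /Y:BLUE$/{a[NR]=tmp} adds the value saved in tmp to the array when we reach a line matching the pattern Y:BLUE (assumption: the end of the line comes right after BLUE)
- at the end of processing, we sort the array and print it; asort is a gawk extension, so note that you can also change the awk command to awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}' doc1 | sort -n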
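The sort -n variant mentioned above can be wrapped in a small shell function so it is easier to reuse on several documents. This is only a sketch: the function name blue_ids and its optional color argument are hypothetical and not part of the original answer, and it keeps the same assumptions (IDs made of digits only, the color value ending its line). It also drops duplicate IDs with sort -u, since the question only cares about which ID values are present.

```shell
# blue_ids FILE [COLOR]: print the numerically sorted, de-duplicated IDs
# whose Y: value is COLOR (default BLUE).
# Hypothetical helper; assumes "X:<digits>" and "Y:<COLOR>" each end their line.
blue_ids() {
    awk -F':' -v color="${2:-BLUE}" '
        /X:[0-9]+$/           { id = $2 }     # remember the most recent ID
        $0 ~ ("Y:" color "$") { print id }    # emit it when the color matches
    ' "$1" | sort -n -u
}
```

With the sample documents above, blue_ids doc1 should print 1, 2 and 4, and blue_ids doc1 RED should print 3.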
You can then combine them as follows to find the differences in blue IDs between the 2 documents:
$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)
2c2
< 2
---
> 3
Or to find the blue IDs they have in common:
$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{asort(a); for(i in a){print a[i]}}' doc2)
1
4
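The question also asked which blue IDs are in document 1 but not in document 2. comm can answer that as well by suppressing different columns. The following is a minimal sketch under the same assumptions as above; the function names extract_blue and blue_comm are hypothetical. comm expects its inputs in the collation order produced by plain sort, which differs from sort -n once IDs have more than one digit, so this sketch sorts lexically. It uses a temporary file instead of process substitution so that it also runs in a plain POSIX shell.

```shell
# Hypothetical helpers, not from the original answer.

# extract_blue FILE: print the blue IDs of FILE, sorted lexically and
# de-duplicated, which is the input order comm expects.
extract_blue() {
    awk -F':' '/X:[0-9]+$/{tmp=$2} /Y:BLUE$/{print tmp}' "$1" | sort -u
}

# blue_comm FLAGS FILE1 FILE2: run comm with FLAGS on the two blue ID lists.
#   -12  IDs that are blue in both files
#   -23  IDs that are blue only in FILE1
#   -13  IDs that are blue only in FILE2
blue_comm() {
    _second=$(mktemp) || return 1
    extract_blue "$3" > "$_second"
    extract_blue "$2" | comm "$1" - "$_second"   # "-" reads the first list from stdin
    rm -f "$_second"
}
```

With the sample documents above, blue_comm -23 doc1 doc2 should print 2, blue_comm -13 doc1 doc2 should print 3, and blue_comm -12 doc1 doc2 should print 1 and 4.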