Quickly looking up structured text data from the command line?

Problem description

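Suppose I have a predictable text document containing structures made up of an ID called X: and a combination of known attributes, such as a category Y: with a known number of instances (for example, each X: in the series is always followed by exactly 1 Y:):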

  X:37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

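I want to retrieve the list of item IDs of all blue things. I don't care whether an ID is duplicated, only which ID values are present in the document. I then want to sort the list and compare it with the list of blue-thing IDs from another structured text document with exactly the same structure ("Which blue things do the two documents have in common?" "Which blue things are in document 1 but not in document 2?").

I know I can easily grep for all the Y:BLUE lines, but what additional commands do I need to find the X: preceding each such instance and pass the sorted list of results to a diff? I haven't used the command line much since AmiShell... sorry :-( Is there an online cookbook for this kind of use case?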

Tags: shell, awk, data-structures, grep

Solution


Suppose you have the following 2 input documents:

$ more doc*
::::::::::::::
doc1
::::::::::::::
doc 1
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
::::::::::::::
doc2
::::::::::::::
doc 2
  X:4
#  more data pertaining to item 37
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:3
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:2
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:RED
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
  X:1
#  more data pertaining to item 37
#  more data pertaining to item 37
  Y:BLUE
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"
# more serialized data items including exactly 1 occurrence of "Y:" per   preceding "X:"

You can run the following awk command on each document to get the IDs:

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1
1
2
4

$ awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2
1
3
4

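Explanation:

  • -F':' sets : as the field separator.
  • /X:[0-9]+$/{tmp=$2} saves the value of the ID in the tmp variable (assuming the ID consists of digits only, with nothing else after it). If that is not your case, adjust the filtering regular expression /X:[0-9]+$/ to suit your needs.
  • /Y:BLUE$/{a[NR]=tmp} runs when we reach a line matching the pattern Y:BLUE (assumption: the end of the line immediately follows BLUE) and adds the value saved in tmp to the array.
  • At the end of processing, we sort the array and print it. Note that asort is a gawk extension. Alternatively, you can drop the END block and sort externally: awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{print tmp}' doc1 | sort -n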
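
The sort-externally variant from the last bullet can be wrapped in a small shell function. This is only a sketch: the name blue_ids is made up for illustration, and it assumes, like the answer, that IDs are digits only and that each Y: line follows its X: line. It also adds -u to drop duplicate IDs, which the question says are not of interest.

```shell
# blue_ids FILE
# Print the IDs of all BLUE items in FILE, sorted numerically, duplicates removed.
# Hypothetical helper name; assumes digit-only IDs and the X:/Y: layout shown above.
blue_ids() {
    awk -F':' '
        /X:[0-9]+$/ { id = $2 }    # remember the most recent ID
        /Y:BLUE$/   { print id }   # emit it when the current item is blue
    ' "$1" | sort -n -u
}
```

With the sample files above, blue_ids doc1 should print 1, 2 and 4 on separate lines. Because it needs no gawk extension, it should also work with other awk implementations.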

You can then combine them as follows to find the differences in blue IDs between the 2 documents:

$ diff <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2)
2c2
< 2
---
> 3

Or find the blue IDs they have in common:

$ comm -1 -2 <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc1) <(awk -F':' '/X:[0-9]+$/{tmp=$2}/Y:BLUE$/{a[NR]=tmp}END{n=asort(a); for(i=1;i<=n;i++){print a[i]}}' doc2)
1
4
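
One caveat: comm expects its inputs in collation (lexical) order, so numerically sorted lists stop being valid input once IDs have different lengths (2 sorts before 10 numerically, but after it lexically). The sketch below works around that by sorting lexically for comm and numerically only for display. It also answers the "in document 1 but not in document 2" question directly with comm -23. The function names and mode keywords are invented for illustration, and temporary files are used instead of <( ) so that it does not depend on bash.

```shell
# Hypothetical helpers; names and mode keywords are illustrative only.

# blue_ids_lex FILE: BLUE IDs of FILE, de-duplicated, in the byte order comm expects.
blue_ids_lex() {
    awk -F':' '/X:[0-9]+$/ { id = $2 } /Y:BLUE$/ { print id }' "$1" | LC_ALL=C sort -u
}

# compare_blue MODE FILE1 FILE2
#   MODE: common (in both), only1 (only in FILE1), only2 (only in FILE2)
compare_blue() {
    case "$1" in
        common) flags=-12 ;;   # suppress columns 1 and 2, keep shared lines
        only1)  flags=-23 ;;   # keep lines unique to FILE1
        only2)  flags=-13 ;;   # keep lines unique to FILE2
        *) echo "unknown mode: $1" >&2; return 2 ;;
    esac
    tmp1=$(mktemp) || return 1
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 1; }
    blue_ids_lex "$2" > "$tmp1"
    blue_ids_lex "$3" > "$tmp2"
    # comm sees lexically sorted input; the result is re-sorted numerically for reading
    LC_ALL=C comm "$flags" "$tmp1" "$tmp2" | sort -n
    rm -f "$tmp1" "$tmp2"
}
```

With the sample files above, compare_blue common doc1 doc2 should print 1 and 4, compare_blue only1 doc1 doc2 should print 2, and compare_blue only2 doc1 doc2 should print 3.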
