pyparsing: grouping text between dates

Question

My log file contains a date/time stamp, followed by a varying number of lines before the next date/time stamp.

For example:

Time-date
2/07/18 13:55:00.983

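import pyparsing

# one to three digits of fractional seconds
msecVal = pyparsing.Word(pyparsing.nums, max=3)
# exactly two digits
numPair = pyparsing.Word(pyparsing.nums, exact=2)
dateStr = pyparsing.Combine(numPair + '/' + numPair + '/' + numPair)

# parentheses replace the broken backslash continuation
timeString = pyparsing.Combine(numPair + ':' + numPair + ':' + numPair
                               + '.' + msecVal)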

The log file would look like this:

time:date:  line of text
    possible 2nd line of text
    possible 3rd line of text...
    time:date:  line of text
time:date: line of text
    possible 2nd line of text
    possible 3rd line of text...
    possible <n> line of text...
time:date:  line of text

The input will be a large text log file in the format above. I want to produce a list of grouped elements:

[[time],[all text until next time]],[[time],[all text until next time]...

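I can do this if each time/date entry is a single line. What I'm having trouble with is an entry that spans an arbitrary number of lines before the next time/date entry.

Tags: pyparsing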

Answer


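Here is how I interpret your definition of a log entry:

"A date-time at the start of a line, followed by a colon, followed by everything up to the next date-time at the start of a line, even if date-times are embedded in the lines in between."

You need two pyparsing features to solve this:

  • LineStart - to distinguish a date-time at the start of a line from a date-time inside the body of a line

  • SkipTo - a quick way to skip over unstructured text until a matching expression is found

I added these expressions to your code (I imported pyparsing as "pp", because I am a lazy typist):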

dateTime = dateStr + timeString

# log entry date-time keys only match if they are at the start of the line
dateTimeKey = pp.LineStart() + dateTime

# define a log entry as a date-time key, followed by everything up to the next 
# date-time key, or to the end of the input string
# (use results names to make it easy to get at the parts of the log entry)
logEntry = pp.Group(dateTimeKey("time") + ':' + pp.Empty()
                    + pp.SkipTo(dateTimeKey | pp.StringEnd())("body"))

I changed your sample to use different date-times for testing, and we get this:

sample = """\
2/07/18 13:55:00.983:  line of text
    possible 2nd line of text
    possible 3rd line of text...
    2/07/19 13:55:00.983:  line of text
2/07/20 13:55:00.983: line of text
    possible 2nd line of text
    possible 3rd line of text...
    possible <n> line of text...
2/07/21 13:55:00.983:  line of text
"""

print(pp.OneOrMore(logEntry).parseString(sample).dump())

Gives:

[['2/07/18', '13:55:00.983', ':', 'line of text\n    possible 2nd line of text\n    possible 3rd line of text...\n    2/07/19 13:55:00.983:  line of text'], ['2/07/20', '13:55:00.983', ':', 'line of text\n    possible 2nd line of text\n    possible 3rd line of text...\n    possible <n> line of text...'], ['2/07/21', '13:55:00.983', ':', 'line of text']]
[0]:
  ['2/07/18', '13:55:00.983', ':', 'line of text\n    possible 2nd line of text\n    possible 3rd line of text...\n    2/07/19 13:55:00.983:  line of text']
  - body: 'line of text\n    possible 2nd line of text\n    possible 3rd line of text...\n    2/07/19 13:55:00.983:  line of text'
  - time: ['2/07/18', '13:55:00.983']
[1]:
  ['2/07/20', '13:55:00.983', ':', 'line of text\n    possible 2nd line of text\n    possible 3rd line of text...\n    possible <n> line of text...']
  - body: 'line of text\n    possible 2nd line of text\n    possible 3rd line of text...\n    possible <n> line of text...'
  - time: ['2/07/20', '13:55:00.983']
[2]:
  ['2/07/21', '13:55:00.983', ':', 'line of text']
  - body: 'line of text'
  - time: ['2/07/21', '13:55:00.983']
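
As a cross-check on the grouping shown above, here is a minimal sketch of the same entry definition using only the standard library `re` module instead of pyparsing. It is a swapped-in technique, not part of the answer's method. The names `KEY` and `group_entries` are hypothetical, and the key pattern assumes the date-time layout of the sample (`D/MM/YY HH:MM:SS.mmm` followed by a colon).

```python
import re

# Assumed key format: a date and a time at the very start of a line, then a colon.
# With re.MULTILINE, "^" matches at the start of every line, so an indented
# date-time inside an entry body is not treated as a key.
KEY = re.compile(
    r"^(\d{1,2}/\d{1,2}/\d{1,2}) (\d{2}:\d{2}:\d{2}\.\d{1,3}):",
    re.MULTILINE,
)


def group_entries(text):
    """Return [[date, time], body] pairs, one per line-start date-time key."""
    keys = list(KEY.finditer(text))
    entries = []
    for i, key in enumerate(keys):
        # The body runs up to the next key, or to the end of the input.
        end = keys[i + 1].start() if i + 1 < len(keys) else len(text)
        body = text[key.end():end].strip()
        entries.append([[key.group(1), key.group(2)], body])
    return entries


sample = """\
2/07/18 13:55:00.983:  line of text
    possible 2nd line of text
    2/07/19 13:55:00.983:  line of text
2/07/20 13:55:00.983: line of text
"""

for (date, time), body in group_entries(sample):
    print(date, time, repr(body))
```

The indented `2/07/19` line stays inside the first entry's body, which mirrors what `LineStart` is used for in the pyparsing version.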

I also had to change your numPair to:

numPair = pp.Word(pp.nums, max=2)

Otherwise it would not match the leading single digit "2" in your sample date.
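
The difference between the two `Word` definitions can be checked in isolation. The sketch below assumes pyparsing is installed. The helper `parses` and the names `exact_pair` and `flexible_pair` are hypothetical and used only for illustration.

```python
import pyparsing as pp

exact_pair = pp.Word(pp.nums, exact=2)   # the question's original definition
flexible_pair = pp.Word(pp.nums, max=2)  # the relaxed definition: one or two digits


def parses(expr, text):
    """Return True if expr matches at the start of text."""
    try:
        expr.parseString(text)
        return True
    except pp.ParseException:
        return False


# "2/07/18" starts with a single digit, so exact=2 cannot match it.
print(parses(exact_pair, "2/07/18"))
print(parses(flexible_pair, "2/07/18"))
# A two-digit field satisfies both definitions.
print(parses(exact_pair, "07/18"))
```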

