Test and extract multi-line text using Regex and Python

Problem description

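I want to test for a particular feature and, using a regular expression in Python, retrieve the block of data that contains it. In short, this pseudocode explains what I am trying to achieve.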

If (Color feature is in the block message):
   bring that block

Here is a sample of my data, from the file str.txt:

.
.
This file contains various types of data formats and blocks

Country of the survey
CONTRY CODE: AAAA
POPULATION: 11111
GDP RANK: 22222

.
BLOCK MESSAGE
      BLOCK A:
LENGTH(M): 1.6
WEIGHT(KG):    76
    DISSABLITIY STATUS(Y/N): N
CHRONIC DISEASE: NONE

FAMILY MEMBERS: 3

END BLOCK

BLOCK MESSAGE

    BLOCK B:
EYE COLOR: BLACK

LENGTH(M): 1.9
     WEIGHT(KG): 89
DISSABLITIY STATUS(Y/N): N
   CHRONIC DISEASE: NONE
           FAMILY MEMBERS: 1
END BLOCK
BLOCK MESSAGE
BLOCK C:
     LENGTH(M): 17
WEIGHT(KG): 90
        DISSABLITIY STATUS(Y/N): Y

CHRONIC DISEASE: Yes
FAMILY MEMBERS: 4
END BLOCK

BLOCK MESSAGE
   BLOCK D:
   LENGTH(M): 195
   WEIGHT(KG): 90
   EYE COLOR: BROWN
DISSABLITIY STATUS(Y/N): N
CHRONIC DISEASE: NONE
FAMILY MEMBERS: 2
END BLOCK

.
.

What I expect to get is:

BLOCK MESSAGE
BLOCK B:
EYE COLOR: BLACK
LENGTH(M): 1.9
WEIGHT(KG): 89
DISSABLITIY STATUS(Y/N): N
CHRONIC DISEASE: NONE
FAMILY MEMBERS: 1
END BLOCK

BLOCK MESSAGE
BLOCK D:
LENGTH(M): 195
WEIGHT(KG): 90
EYE COLOR: BROWN
DISSABLITIY STATUS(Y/N): N
CHRONIC DISEASE: NONE
FAMILY MEMBERS: 2
END BLOCK

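My question is: how can I get the block messages, from "BLOCK MESSAGE" to "END BLOCK", that have the eye color feature? The following criteria apply:

  1. The text may contain different blocks of data.
  2. It may contain many spaces and newlines.
  3. The required feature, "EYE COLOR", may appear at different positions within the message.

Any explanation of the ideas and code for this problem would be highly appreciated.

Thank you all.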

Tags: python, regex

Solution


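A simple approach is to use a loop:

  1. Open the text file and start reading it line by line
  2. Read lines until the start of a block is found
  3. Read lines until the end of that block
  4. Check whether the block contains COLOR
  5. If step 4 holds, add the block to the output
  6. Go back to step 2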

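Notes:

  • You can check whether a line contains a string simply by using the "in" operator.
  • I use the regex module to remove the whitespace at the start of each line (just for prettier output). As a side effect, the same substitution also drops blank lines inside a block.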
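
The two notes can be tried in isolation before reading the full loop. The sketch below is only an illustration: the string `block` is a hypothetical, shortened block shaped like BLOCK B in the sample data, not something read from the file.

```python
import re

# Hypothetical block text, shaped like BLOCK B in the sample data
block = (
    "BLOCK MESSAGE\n"
    "\n"
    "    BLOCK B:\n"
    "EYE COLOR: BLACK\n"
    "\n"
    "     WEIGHT(KG): 89\n"
    "END BLOCK\n"
)

# Note 1: the "in" operator performs a plain substring test
has_color = "COLOR" in block

# Note 2: r"\n\s+" matches a newline plus all whitespace that follows it
# (including further newlines), so indentation and blank lines both collapse
cleaned = re.sub(r"\n\s+", "\n", block)

print(has_color)
print(cleaned)
```

Because the pattern starts with a newline, whitespace in front of the very first line of the block would not be removed; in the sample data the "BLOCK MESSAGE" lines are not indented, so this does not matter here.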

Code:

# Import the regex module
import re

# Collect the matching blocks in a list
output = []
# Open the file (the sample data from the question, saved as str.txt)
with open("str.txt", "r") as f:
    # Read the file line by line
    line = f.readline()
    # Loop until the end of the file
    while line:
        # A block starts at a line containing "BLOCK MESSAGE"
        if "BLOCK MESSAGE" in line:
            # Initialize the block variable
            block = ""

            # Read until a line containing "END BLOCK" (or the end of the file)
            while line and "END BLOCK" not in line:
                # Add the line to the block
                block += line
                # Read the next line
                line = f.readline()

            # If COLOR is in the block
            if "COLOR" in block:
                # Add the last line ("END BLOCK")
                block += line
                # Remove leading whitespace (and blank lines) inside the block
                block = re.sub(r'\n\s+', '\n', block)
                # Add the block to the output
                output.append(block)
        # Read the next line
        line = f.readline()

Output:


print(output)
# ['BLOCK MESSAGE\nBLOCK B:\nEYE COLOR: BLACK\nLENGTH(M): 1.9\nWEIGHT(KG): 89\nDISSABLITIY STATUS(Y/N): N\nCHRONIC DISEASE: NONE\nFAMILY MEMBERS: 1\nEND BLOCK\n',
#  'BLOCK MESSAGE\nBLOCK D:\nLENGTH(M): 195\nWEIGHT(KG): 90\nEYE COLOR: BROWN\nDISSABLITIY STATUS(Y/N): N\nCHRONIC DISEASE: NONE\nFAMILY MEMBERS: 2\nEND BLOCK\n']

for o in output:
    print(o)
# BLOCK MESSAGE
# BLOCK B:
# EYE COLOR: BLACK
# LENGTH(M): 1.9
# WEIGHT(KG): 89
# DISSABLITIY STATUS(Y/N): N
# CHRONIC DISEASE: NONE
# FAMILY MEMBERS: 1
# END BLOCK

# BLOCK MESSAGE
# BLOCK D:
# LENGTH(M): 195
# WEIGHT(KG): 90
# EYE COLOR: BROWN
# DISSABLITIY STATUS(Y/N): N
# CHRONIC DISEASE: NONE
# FAMILY MEMBERS: 2
# END BLOCK
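
Since the question asks about regular expressions specifically, here is a hedged alternative that replaces the line-by-line loop with a single pattern. It is a minimal sketch, not the method used in the answer above: the names `BLOCK_RE`, `blocks_with_feature`, and `sample` are hypothetical, and the sketch assumes the whole text fits in memory and that every "BLOCK MESSAGE" is followed by its own "END BLOCK".

```python
import re

# Non-greedy match from "BLOCK MESSAGE" to the nearest "END BLOCK";
# re.DOTALL lets "." match newlines so the pattern can span several lines
BLOCK_RE = re.compile(r"BLOCK MESSAGE.*?END BLOCK", re.DOTALL)


def blocks_with_feature(text, feature="EYE COLOR"):
    """Return the blocks of `text` that contain `feature`, whitespace-normalized."""
    result = []
    for match in BLOCK_RE.finditer(text):
        block = match.group(0)
        if feature in block:
            # Strip every line and drop the blank ones
            lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
            result.append("\n".join(lines))
    return result


# Hypothetical, shortened sample in the shape of the question's data
sample = (
    "This file contains various types of data formats and blocks\n"
    "BLOCK MESSAGE\n      BLOCK A:\nLENGTH(M): 1.6\n\nEND BLOCK\n"
    "BLOCK MESSAGE\n\n    BLOCK B:\nEYE COLOR: BLACK\n\n     WEIGHT(KG): 89\nEND BLOCK\n"
)

for found in blocks_with_feature(sample):
    print(found)
```

To run it on the real file, the text could be read with open("str.txt").read() and passed to the function. The non-greedy ".*?" is what keeps each match within one block; if an "END BLOCK" line were missing, the match would run on into the next block, which is a case the loop version handles differently.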
