Extracting patterns with a Python regular expression

Problem description

Example string 1:

::SCOPE:Confidentiality:SCOPE:Access Control:SCOPE:AuthorizationTECHNICAL IMPACT:Gain Privileges::

Example string 2:

::SCOPE:ConfidentialityTECHNICAL IMPACT:Read Data::

Example string 3:

::SCOPE:AvailabilityTECHNICAL IMPACT:Unreliable Execution::SCOPE:Confidentiality:SCOPE:Integrity:SCOPE:AvailabilityTECHNICAL IMPACT:Execute Unauthorized Commands:NOTE:Confidentiality Integrity Availability Execute Unauthorized Commands Run Arbitrary Code::SCOPE:ConfidentialityTECHNICAL IMPACT:Read Data::SCOPE:IntegrityTECHNICAL IMPACT:Modify Data::SCOPE:Confidentiality:SCOPE:Access Control:SCOPE:AuthorizationTECHNICAL IMPACT:Gain Privileges::

For example string 1, I want to extract:

Confidentiality
Access Control
Authorization
Gain Privileges

For example string 2, I want to extract:

Confidentiality
Read Data

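For example string 3, I want to extract:

1 - Availability, Unreliable Execution

2 - Confidentiality, Integrity, Availability, Execute Unauthorized Commands

3 - Confidentiality, Read Data

4 - Integrity, Modify Data

5 - Confidentiality, Access Control, Authorization, Gain Privileges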

I started by writing a simple regular expression:

::SCOPE:([\w\s]+)TECHNICAL IMPACT:([\w\s]+) 

This extracts string 2.

Then I wrote this regular expression:

::SCOPE:([\w\s]+):SCOPE:([\w\s]+):SCOPE:([\w\s]+)TECHNICAL IMPACT:([\w\s]+)

This extracts a record with exactly three SCOPE fields, as in string 1 and in parts of string 3.

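However, these expressions are static: each one only works for a fixed number of SCOPE fields.

The general case I see is ::SCOPE: [part 1 to extract] TECHNICAL IMPACT: [part 2 to extract]. This general pattern may occur several times within a given string, and [part 1 to extract] varies in length.

How can I find this general pattern multiple times in a string and then use a regular expression to extract the parts from each occurrence?

Tags: python, regex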

Solution


I would use re.split for this task, as follows:

import re
s1 = '::SCOPE:Confidentiality:SCOPE:Access Control:SCOPE:AuthorizationTECHNICAL IMPACT:Gain Privileges::'
s2 = '::SCOPE:ConfidentialityTECHNICAL IMPACT:Read Data::'
s3 = '::SCOPE:AvailabilityTECHNICAL IMPACT:Unreliable Execution::SCOPE:Confidentiality:SCOPE:Integrity:SCOPE:AvailabilityTECHNICAL IMPACT:Execute Unauthorized Commands:NOTE:Confidentiality Integrity Availability Execute Unauthorized Commands Run Arbitrary Code::SCOPE:ConfidentialityTECHNICAL IMPACT:Read Data::SCOPE:IntegrityTECHNICAL IMPACT:Modify Data::SCOPE:Confidentiality:SCOPE:Access Control:SCOPE:AuthorizationTECHNICAL IMPACT:Gain Privileges::'
ext1 = [i for i in re.split(r'[:A-Z ]*:', s1) if i]
ext2 = [i for i in re.split(r'[:A-Z ]*:', s2) if i]
ext3 = [i for i in re.split(r'[:A-Z ]*:', s3) if i]

Then:

  • ext1 is ['Confidentiality', 'Access Control', 'Authorization', 'Gain Privileges']
  • ext2 is ['Confidentiality', 'Read Data']
  • ext3 is ['Availability', 'Unreliable Execution', 'Confidentiality', 'Integrity', 'Availability', 'Execute Unauthorized Commands', 'Confidentiality Integrity Availability Execute Unauthorized Commands Run Arbitrary Code', 'Confidentiality', 'Read Data', 'Integrity', 'Modify Data', 'Confidentiality', 'Access Control', 'Authorization', 'Gain Privileges']

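I simply split at substrings that consist of :, spaces, and uppercase letters and that end with :, and then remove the empty strs from the lists that re.split produces. Note that this returns one flat list per input string: the values are not grouped per ::SCOPE:...TECHNICAL IMPACT:... record, and the NOTE text is kept as an ordinary item.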
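If the values need to be grouped per record, as the question asks for string 3, one possible variation is sketched here. It is not part of the answer above. It assumes that records are separated by :: and that no extracted value contains a colon; the function name extract_records and the shortened sample string are hypothetical, introduced only for illustration. The NOTE text is dropped, which matches the output the question lists.

```python
import re

# Each SCOPE value ends where the next ":SCOPE:" or "TECHNICAL IMPACT:" begins.
SCOPE_RE = re.compile(r'SCOPE:([^:]+?)(?=:SCOPE:|TECHNICAL IMPACT:)')
# The impact value runs up to the next colon (for example ":NOTE:") or the end.
IMPACT_RE = re.compile(r'TECHNICAL IMPACT:([^:]+)')

def extract_records(s):
    """Return one list per '::'-delimited record: SCOPE values, then the impact."""
    records = []
    for chunk in s.split('::'):
        if not chunk:
            continue  # skip the empty pieces around leading/trailing '::'
        values = SCOPE_RE.findall(chunk)
        impact = IMPACT_RE.search(chunk)
        if impact:
            values.append(impact.group(1))
        records.append(values)
    return records

s2 = '::SCOPE:ConfidentialityTECHNICAL IMPACT:Read Data::'
# Shortened sample: the first two records of example string 3.
sample = ('::SCOPE:AvailabilityTECHNICAL IMPACT:Unreliable Execution'
          '::SCOPE:Confidentiality:SCOPE:Integrity:SCOPE:Availability'
          'TECHNICAL IMPACT:Execute Unauthorized Commands'
          ':NOTE:Confidentiality Integrity Availability::')

print(extract_records(s2))
print(extract_records(sample))
```

Splitting on :: first keeps each regular expression small, and the lookahead lets the number of SCOPE fields vary without writing one static pattern per count.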

