Python: extracting text into a dictionary

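Problem description

I'm new to coding. Can anyone help me convert the following set of text into a dictionary, using regular expressions or any other technique?

The text "Bus Number: Departure" appears in every message/block.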

KPN_Sleeper: Bus Number: Departure 
Bus code: Kpn-866489 KA-01-7233 Bangalore 
AC Sleeper/56 Seats
24 Seats booked 

SRS: Bus Number: Departure 
Bus code: SRS-5858 KA-31-5985 Bangalore 


SAM: Bus Number: Departure 
Bus code: SAM-0077 TN-23-0777 Chennai 

Expected output:

{0:{
  "Bus_name": "KPN_Sleeper",
  "Bus code":"Kpn-866489",
  "Bus Number": "KA-01-7233",
  "Departure": "Bangalore",
  "others": "AC Sleeper/56 Seats 24 Seats booked "
},
1:{
  "Bus_name": "SRS",
  "Bus code":"SRS-5858",
  "Bus Number": "KA-31-5985",
  "Departure": "Bangalore",
  "others": ""
}}

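Since I'm new to both coding and regular expressions, I'm finding this hard to build.

Tags: python, regex

Solution


Going by your comments, I think you could try the following pattern: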

^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?


Sample code:

import re

regex = r"^(.*):\s*Bus Number: Departure\s*\nBus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)((?:[^\n]+(?:\n|$))+)?"

test_str = ("KPN_Sleeper: Bus Number: Departure \n"
    "Bus code: Kpn-866489 KA-01-7233 Bangalore dfdf\n"
    "AC Sleeper/56 Seats\n"
    "24 Seats booked \n\n"
    "SRS: Bus Number: Departure \n"
    "Bus code: SRS-5858 KA-31-5985 Bangalore dfdf dfd\n\n\n"
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
    "asdfadf ;kasdjlfads;f lkadsjf")

matches = re.finditer(regex, test_str, re.MULTILINE)


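for match in matches:
    # Groups 1-4 are always set on a match; group 4 may carry trailing spaces, so strip it.
    print("Bus Name: "+match.group(1)+" Bus Code: "+match.group(2)+" Bus No: "+match.group(3)+" Departure: "+match.group(4).strip())


# The "others" value is available as match.group(5) inside the loop.
# That group is optional, so it is None when a block has no extra lines.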
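
To get from the matches to the dictionary asked for in the question, the groups can be collected into a dict keyed by match index. The following is a minimal sketch, not part of the original answer. The function name parse_buses and the sample string are made up for illustration. It assumes blocks are separated by at least one blank line (otherwise group 5 would swallow the next block's header), and it normalises whitespace in "others", so the trailing space shown in the expected output is dropped.

```python
import re

# Same pattern as above, split over several lines for readability.
PATTERN = re.compile(
    r"^(.*):\s*Bus Number: Departure\s*\n"
    r"Bus code:\s*([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*(?:\n|$)"
    r"((?:[^\n]+(?:\n|$))+)?",
    re.MULTILINE,
)


def parse_buses(text):
    """Return {index: record} for every bus block found in text (hypothetical helper)."""
    result = {}
    for index, match in enumerate(PATTERN.finditer(text)):
        name, code, number, departure, others = match.groups()
        result[index] = {
            "Bus_name": name.strip(),
            "Bus code": code,
            "Bus Number": number,
            "Departure": departure.strip(),
            # Group 5 is None when no extra lines follow the block.
            "others": " ".join((others or "").split()),
        }
    return result


# Illustrative input, modelled on the text in the question.
sample = (
    "KPN_Sleeper: Bus Number: Departure \n"
    "Bus code: Kpn-866489 KA-01-7233 Bangalore \n"
    "AC Sleeper/56 Seats\n"
    "24 Seats booked \n"
    "\n"
    "SRS: Bus Number: Departure \n"
    "Bus code: SRS-5858 KA-31-5985 Bangalore \n"
    "\n\n"
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
)

buses = parse_buses(sample)
for index, record in buses.items():
    print(index, record)
```

Keying the result by an integer index mirrors the expected output in the question; a plain list of records would work just as well if the index carries no meaning.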

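Explanation:

  1. ^(.*):\s* --> (.*) is the first capture group and grabs the bus name; \s* covers any whitespace after the colon

  2. Bus Number: Departure\s*\n --> the literal text "Bus Number: Departure" followed by optional whitespace and a newline

  3. Bus code:\s* --> the next line starts with "Bus code", a colon and optional whitespace

  4. ([^ ]+)\s([^ ]+)\s([^\n]+)[ \t]*

    a) ([^ ]+) --> bus code; \s --> a whitespace character

    b) ([^ ]+) --> bus number; \s --> a whitespace character

    c) ([^\n]+) --> departure, which may be several words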

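    d) [ \t]* --> meant to cover trailing whitespace after the departure; because ([^\n]+) is greedy it already takes that whitespace, so strip group 4 if you need a clean value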

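  5. (?:\n|$) --> matches a newline or the end of the string

  6. ((?:[^\n]+(?:\n|$))+)?

    a) [^\n]+(?:\n|$) --> matches one or more non-newline characters followed by a newline or the end of the string, i.e. one non-empty line

    b) ?: makes the inner group non-capturing

    c) + means there can be several such lines

    d) the outer () captures all of those lines together as the "others" value

    e) the trailing ? makes the whole "others" part optional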
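
The breakdown above can also be written directly into the pattern. This is a sketch using re.VERBOSE and named groups; the group names (name, code, number, departure, others) and the constant BUS_BLOCK are illustrative choices, not from the original answer. Under re.VERBOSE, unescaped whitespace outside character classes is ignored, so literal spaces are written as [ ].

```python
import re

# Commented version of the same pattern; numbers refer to the steps above.
BUS_BLOCK = re.compile(
    r"""
    ^(?P<name>.*):\s*                 # 1. bus name, colon, optional whitespace
    Bus[ ]Number:[ ]Departure\s*\n    # 2. fixed header text, end of first line
    Bus[ ]code:\s*                    # 3. start of the second line
    (?P<code>[^ ]+)\s                 # 4a. bus code
    (?P<number>[^ ]+)\s               # 4b. bus number
    (?P<departure>[^\n]+)[ \t]*       # 4c/4d. departure, rest of the line
    (?:\n|$)                          # 5. newline or end of string
    (?P<others>(?:[^\n]+(?:\n|$))+)?  # 6. optional run of non-empty lines
    """,
    re.MULTILINE | re.VERBOSE,
)

# Illustrative input: a single block with no extra lines.
text = (
    "SAM: Bus Number: Departure \n"
    "Bus code: SAM-0077 TN-23-0777 Chennai \n"
)

match = BUS_BLOCK.search(text)
record = match.groupdict()  # "others" maps to None because group 6 did not match
print(record)
```

Named groups make the later dictionary-building step less dependent on group positions, at the cost of a longer pattern.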

