Parsing a generated file in Python

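Question

I am trying to parse generated files into a list of objects.

Unfortunately, the generated files do not always have the same structure, but they do contain the same fields (along with a lot of other garbage).

For example: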

    function foo();              # Don't Care
    function maybeanotherfoo();  # Don't Care
    int maybemoregarbage;        # Don't Care

    
    product_serial = "CDE1102"; # I want this <---------------------
    unnecessary_info1 = 10;     # Don't Care
    unnecessary_info2 = "red"   # Don't Care
    product_id = 1134412;       # I want this <---------------------
    unnecessary_info3 = "88"    # Don't Care

    product_serial = "DD1232";  # I want this <---------------------
    product_id = 3345111;       # I want this <---------------------
    unnecessary_info1 = "22"    # Don't Care
    unnecessary_info2 = "panda" # Don't Care

    product_serial = "CDE1102"; # I want this <---------------------
    unnecessary_info1 = 10;     # Don't Care
    unnecessary_info2 = "red"   # Don't Care
    unnecessary_info3 = "bear"  # Don't Care
    unnecessary_info4 = 119     # Don't Care
    product_id = 1112331;       # I want this <---------------------
    unnecessary_info5 = "jj"    # Don't Care

I want a list of objects, where each object has a serial number and an ID.

I tried the following:


import re

class Product:
    def __init__(self, id, serial):
        self.product_id = id
        self.product_serial = serial

linenum = 0
first_string = "product_serial"
second_string = "product_id"
with open('products.txt', "r") as products_file:
    for line in products_file:
        linenum += 1
        if line.find(first_string) != -1:
            product_serial = re.search('\"([^"]+)', line).group(1)
            #How do I proceed?                


Any suggestions would be greatly appreciated! Thanks!

Tags: python, parsing

Answer


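I'm using inline data here via io.StringIO(), but you can replace data with your products_file.

The idea is to collect key/value pairs into current_object. As soon as it holds all the data we know a single object needs (the two keys), we push it onto the objects list and start a new current_object.

You could use something like if line.startswith('product_serial') instead of the admittedly complicated regular expression.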
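As a rough sketch of that startswith alternative: the helper below, parse_field, and the WANTED tuple are hypothetical names introduced only for illustration. It assumes that any '#' on a line starts a trailing comment (as in the question's sample) and that values never contain '#' themselves.

```python
WANTED = ("product_serial", "product_id")  # hypothetical name for the two keys we care about


def parse_field(line):
    """Return (key, value) for a wanted field, or None for any other line."""
    # Assumption: everything after '#' is a comment, so cut it off first.
    line = line.split("#", 1)[0].strip()
    # str.startswith accepts a tuple of prefixes.
    if not line.startswith(WANTED):
        return None
    key, sep, value = line.partition("=")
    key = key.strip()
    # Reject lines without '=' and keys that merely share a prefix.
    if not sep or key not in WANTED:
        return None
    # Drop the optional trailing semicolon, then whitespace and quotes.
    value = value.strip().rstrip(";").strip().strip('"')
    return key, value


print(parse_field('    product_serial = "CDE1102"; # I want this'))
print(parse_field('    unnecessary_info1 = 10;     # Don\'t Care'))
```

A helper like this could stand in for the re.match call in the loop; the regex-based version of the full program follows.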

import io
import re

data = io.StringIO("""
    function foo();             
    function maybeanotherfoo(); 
    int maybemoregarbage;       

    
    product_serial = "CDE1102"; 
    unnecessary_info1 = 10;     
    unnecessary_info2 = "red"   
    product_id = 1134412;       
    unnecessary_info3 = "88"    

    product_serial = "DD1232";  
    product_id = 3345111;       
    unnecessary_info1 = "22"    
    unnecessary_info2 = "panda" 

    product_serial = "CDE1102"; 
    unnecessary_info1 = 10;     
    unnecessary_info2 = "red"   
    unnecessary_info3 = "bear"  
    unnecessary_info4 = 119     
    product_id = 1112331;       
    unnecessary_info5 = "jj"    
""")

objects = []

current_object = {}
for line in data:
    line = line.strip()  # Remove leading and trailing whitespace
    m = re.match(r'^(product_id|product_serial)\s*=\s*(\d+|"(?:.+?)");?$', line)

    if m:
        key, value = m.groups()
        current_object[key] = value.strip('"')
        if len(current_object) == 2:  # Got the two keys we want, ship the object
            objects.append(current_object)
            current_object = {}

print(objects)
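
The loop above yields plain dictionaries with string values, while the question asks for objects. One possible way to bridge that gap is sketched below. It uses a dataclass as a stand-in for the question's Product class, and to_products is a hypothetical helper name. The sample list simply repeats what the loop collects from the inline data.

```python
from dataclasses import dataclass


@dataclass
class Product:
    # Stand-in for the question's Product class (an assumption, not the asker's code).
    product_id: int
    product_serial: str


def to_products(records):
    """Convert parsed dicts into Product objects, casting the ID string to int."""
    return [
        Product(product_id=int(r["product_id"]), product_serial=r["product_serial"])
        for r in records
    ]


# The dictionaries collected by the parsing loop for the sample data.
objects = [
    {"product_serial": "CDE1102", "product_id": "1134412"},
    {"product_serial": "DD1232", "product_id": "3345111"},
    {"product_serial": "CDE1102", "product_id": "1112331"},
]

products = to_products(objects)
print(products[0])
```

Casting with int() raises ValueError on a malformed ID, which may or may not be what you want for generated files of uncertain quality.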
