How to capture and separate text using regular expressions in Python

Problem description

I am trying to build a dataframe from a dataset stored as plain text. The text file is formatted as follows:

product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!

product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/profileName: Steven Martz
review/helpfulness: 0/0
review/score: 5.0
review/time: 1191456000
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
review/text: Great product- tried others and this is a ten compared to them. Real easy to use and sync's easily. Definite recommended buy to transfer data to and from your Cell.

So I need to build a dataframe that has ProductID, Title, Price, and so on as column headers, with the corresponding data from each record.

The final dataframe I want is

ID          Title                        Price      UserID          ProfileName     Helpfulness     Score   Time        summary
B000JVER7W  Mobile Action MA730          unknown    A1RXYH9ROBAKEZ  A. Igoe         0/0             1.0     1233360000  Don't buy!
            Handset Manager - Bluetooth 
            Data Suite

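and so on for every review in the dataset, using regular expressions. Since I am a beginner with regular expressions, I could not get this to work. I tried the following (assume the variable dataset holds the entire contents of the text file):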

pattern = "product\productId:\s(.*)\s"
a = re.search(pattern, dataset)

With this, I get the output

>> a.group(1)
 "B000JVER7W product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite product/price: unknown review/userId: A1RXYH9ROBAKEZ review/profileName: A. Igoe review/helpfulness: 0/0 review/score: 1.0 review/time: 1233360000 review/summary: Dont buy! review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!"

But what I want is

>> a.group(1)
"["B000JVER7W", "A000123js" ...]"

and the same for every other field.

Is this possible, and if so, how?

Thanks in advance

Tags: python, regex

Solution

Even without any regular expressions, you can do this by building a dictionary and then passing it to pandas.DataFrame().

Try this:

import pandas as pd

with open("your_file_name") as file:
    # Records are separated by a blank line; strip() avoids an empty trailing record.
    product_details = file.read().strip().split("\n\n")

product_dict = {"ID":[],"Title":[],"Price":[],"UserID":[],
                "ProfileName":[],"Helpfulness":[],"Score":[],"Time":[],"summary":[]}

for product in product_details:
    fields = product.split("\n")
    # Split on the first ":" only, so values that contain ":" stay intact,
    # and strip the space that follows the field name.
    product_dict["ID"].append(fields[0].split(":", 1)[1].strip())
    product_dict["Title"].append(fields[1].split(":", 1)[1].strip())
    product_dict["Price"].append(fields[2].split(":", 1)[1].strip())
    product_dict["UserID"].append(fields[3].split(":", 1)[1].strip())
    product_dict["ProfileName"].append(fields[4].split(":", 1)[1].strip())
    product_dict["Helpfulness"].append(fields[5].split(":", 1)[1].strip())
    product_dict["Score"].append(fields[6].split(":", 1)[1].strip())
    product_dict["Time"].append(fields[7].split(":", 1)[1].strip())
    product_dict["summary"].append(fields[8].split(":", 1)[1].strip())

# Each dictionary key becomes a column; review/text (fields[9]) is left out.
dataframe = pd.DataFrame(product_dict)
print(dataframe)

Output

The first row looks the way you wanted:

ID          Title                        Price      UserID          ProfileName     Helpfulness     Score   Time        summary
B000JVER7W  Mobile Action MA730          unknown    A1RXYH9ROBAKEZ  A. Igoe         0/0             1.0     1233360000  Don't buy!
            Handset Manager - Bluetooth 
            Data Suite
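
Since the question asked specifically about regular expressions, here is a minimal sketch of a regex-based alternative. It assumes the text still contains its original line breaks, with one "name: value" pair per line and a blank line between records. The inline sample string, the names field_pattern and records, and the choice to use the part after the slash as the key are illustrative assumptions, not part of the original answer.

```python
import re

# Two abbreviated records in the format from the question (sample data only).
dataset = """product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/summary: Don't buy!

product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
"""

# With re.MULTILINE, ^ and $ match at every line boundary, and "." does not
# match a newline, so each capture stops at the end of its own line.
product_ids = re.findall(r"^product/productId: (.*)$", dataset, flags=re.MULTILINE)
print(product_ids)  # ['B000JVER7W', 'B000JVER7W']

# One generic pattern for every field: group 1 is the name after the slash,
# group 2 is the value up to the end of the line.
field_pattern = re.compile(r"^(?:product|review)/(\w+): (.*)$", flags=re.MULTILINE)

# findall returns (name, value) tuples, which dict() turns into one record.
records = [
    dict(field_pattern.findall(block))
    for block in dataset.strip().split("\n\n")
]
print(records[0]["userId"])  # A1RXYH9ROBAKEZ
```

Passing records to pd.DataFrame(records) should give one row per review, with columns named productId, title, price, and so on. re.findall returns every match as a list, which is the list of IDs the question asked for, whereas re.search stops at the first match.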
