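python - How to capture and separate text using regular expressions in Python
Question
I am trying to build a DataFrame from a dataset stored as plain text. The text file is formatted as follows: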
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/profileName: Steven Martz
review/helpfulness: 0/0
review/score: 5.0
review/time: 1191456000
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
review/text: Great product- tried others and this is a ten compared to them. Real easy to use and sync's easily. Definite recommended buy to transfer data to and from your Cell.
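I need to build a DataFrame whose column headers are ProductID, Title, Price and so on, with the corresponding data from each record in the rows.
The final DataFrame I want looks like this:
ID          Title                                                       Price    UserID          ProfileName  Helpfulness  Score  Time        summary
B000JVER7W  Mobile Action MA730 Handset Manager - Bluetooth Data Suite  unknown  A1RXYH9ROBAKEZ  A. Igoe      0/0          1.0    1233360000  Don't buy!
and so on for every review in the dataset, using regular expressions. I am a beginner with regular expressions and could not get this to work. This is what I tried (assuming the variable dataset holds the entire contents of the text file):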
pattern = "product\/productId:\s(.*)\s"
a = re.search(pattern, dataset)
Doing this, I get the output:
>> a.group(1)
"B000JVER7W product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite product/price: unknown review/userId: A1RXYH9ROBAKEZ review/profileName: A. Igoe review/helpfulness: 0/0 review/score: 1.0 review/time: 1233360000 review/summary: Dont buy! review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!"
But what I want is:
>> a.group(1)
"["B000JVER7W", "A000123js" ...]"
The same goes for all the other fields.
Is this possible, and if so, how?
Thanks in advance.
Answer
You can do this even without regular expressions, by building a dictionary and then passing it to pandas.DataFrame().
Try this:
import pandas as pd

# Assumes the records in the file are separated by a blank line.
with open("your_file_name") as file:
    product_details = file.read().strip().split("\n\n")

# One list per output column, in the same order as the first nine fields of a record.
product_dict = {"ID": [], "Title": [], "Price": [], "UserID": [],
                "ProfileName": [], "Helpfulness": [], "Score": [], "Time": [], "summary": []}

for product in product_details:
    fields = product.split("\n")
    # Pair each column with its line; zip stops after the nine columns, so review/text is skipped.
    for column, field in zip(product_dict, fields):
        # Split on the first colon only, so values containing ":" stay intact,
        # and strip the space that follows the colon.
        product_dict[column].append(field.split(":", 1)[1].strip())

dataframe = pd.DataFrame(product_dict)
print(dataframe)
Output
The first row looks like what you wanted:
ID          Title                                                       Price    UserID          ProfileName  Helpfulness  Score  Time        summary
B000JVER7W  Mobile Action MA730 Handset Manager - Bluetooth Data Suite  unknown  A1RXYH9ROBAKEZ  A. Igoe      0/0          1.0    1233360000  Don't buy!
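Since the question asked specifically about regular expressions, here is a minimal sketch of a regex-based alternative. It assumes the text still contains its line breaks and that every field sits on a single line of the form prefix/name: value. The names FIELD, column and records are hypothetical helpers introduced for illustration, and the review text in the sample is shortened. The idea is to anchor the pattern to line boundaries with re.MULTILINE, so that each capture stops at the end of its own line instead of running on through the following fields.

```python
import re

# Two records in the format from the question (review text shortened for brevity).
dataset = """product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money ...
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A7L6E1KSJTAJ6
review/profileName: Steven Martz
review/helpfulness: 0/0
review/score: 5.0
review/time: 1191456000
review/summary: Mobile Action Bluetooth Mobile Phone Tool Software MA-730
review/text: Great product- tried others ...
"""

# With re.MULTILINE, ^ and $ match at every line boundary, and "." never
# crosses a newline, so each capture ends with its own line.
FIELD = re.compile(r"^(?:product|review)/(\w+):[ \t]*(.*)$", re.MULTILINE)


def column(name, text):
    """Return every value of one field, e.g. all user IDs (hypothetical helper)."""
    return [value for key, value in FIELD.findall(text) if key == name]


def records(text):
    """Group fields into one dict per review; a productId line starts a new record."""
    rows = []
    for key, value in FIELD.findall(text):
        if key == "productId" or not rows:
            rows.append({})
        rows[-1][key] = value
    return rows


user_ids = column("userId", dataset)
rows = records(dataset)
print(user_ids)
print(rows[1]["summary"])
```

The list returned by column is the per-field list the question asked for, and the list of dictionaries returned by records can be passed to pandas.DataFrame(rows) to get one row per review. Unlike the split-based version, this sketch does not depend on blank lines between records.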