python - Python Regex: How to select lines between two patterns
Problem description
Consider the following typical live chat data:
Peter (08:16):
Hi
What's up?
;-D
Anji Juo (09:13):
Hey, I'm using WhatsApp!
Peter (11:17):
Could you please tell me where is the feedback?
Anji Juo (19:13):
I don't know where it is.
Anji Juo (19:14):
Do you by any chance know where I can catch a taxi ?
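To convert this raw text file into a DataFrame, I need to write a regular expression that identifies the column names and then extracts the corresponding values.
See https://regex101.com/r/X3ubqF/1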
Index(time)   Name       Message
08:16         Peter      Hi
                         What's up?
                         ;-D
09:13         Anji Juo   Hey, I'm using WhatsApp!
11:17         Peter      Could you please tell me where is the feedback?
19:13         Anji Juo   I don't know where it is.
19:14         Anji Juo   Do you by any chance know where I can catch a taxi ?
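The regular expression r"(?P<Name>.*?)\s*\((?P<Index>(?:\d|[01]\d|2[0-3]):[0-5]\d)\)"
extracts the values of the time and name columns perfectly, but I don't know how to capture and extract the messages that belong to each sender and time index.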
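As a quick check of what that pattern captures, here is a minimal sketch that runs the question's regex unchanged over a shortened copy of the chat. The names `HEADER`, `chat`, and `headers` are only illustrative and do not come from the question.

```python
import re

# The pattern from the question, unchanged.
HEADER = re.compile(
    r"(?P<Name>.*?)\s*\((?P<Index>(?:\d|[01]\d|2[0-3]):[0-5]\d)\)"
)

# Shortened sample in the same shape as the chat data above.
chat = """Peter (08:16):
Hi
Anji Juo (09:13):
Hey, I'm using WhatsApp!
"""

# groupdict() maps each named group to the text it captured.
headers = [m.groupdict() for m in HEADER.finditer(chat)]
print(headers)
```

The pattern finds the header lines only. It is not anchored to a line, so a message that itself contains something like "(12:30)" would presumably also match.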
Solution
You can use the re module to parse the string (regex101):
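import re

import pandas as pd

# The pattern below ends a message at a blank line (or at the end of the
# string), so the messages in the sample are separated by blank lines.
s = """
Peter (08:16):
Hi
What's up?
;-D

Anji Juo (09:13):
Hey, I'm using WhatsApp!

Peter (11:17):
Could you please tell me where is the feedback?

Anji Juo (19:13):
I don't know where it is.

Anji Juo (19:14):
Do you by any chance know where I can catch a taxi ?
"""

all_data = []
# Groups are captured in the order: name, time, message.
for name, time, message in re.findall(
    r"^\s*(.*?)\s+\(([^)]+)\):\s*(.*?)(?:\n\n|\Z)", s, flags=re.M | re.S
):
    # Reorder to match the column labels and trim stray whitespace.
    all_data.append((time, name, message.strip()))

df = pd.DataFrame(all_data, columns=["Index(time)", "Name", "Message"])
# to_string() prints every column at full width instead of truncating it.
print(df.to_string())
Prints:
  Index(time)      Name                                               Message
0       08:16     Peter                                   Hi\nWhat's up?\n;-D
1       09:13  Anji Juo                              Hey, I'm using WhatsApp!
2       11:17     Peter       Could you please tell me where is the feedback?
3       19:13  Anji Juo                             I don't know where it is.
4       19:14  Anji Juo  Do you by any chance know where I can catch a taxi ?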
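The pattern above ends each message at a blank line (`\n\n`) or at the end of the string, so it depends on the messages being separated by blank lines. If an export has no such separators, a line-by-line pass is one possible alternative. The sketch below is not part of the original answer: `HEADER_LINE` and `parse_chat` are hypothetical names, and it assumes every header has the shape "<name> (<HH:MM>):" and sits alone on its line.

```python
import re

# Assumed header shape: "<name> (<HH:MM>):" alone on a line.
HEADER_LINE = re.compile(r"^(.*?)\s*\((\d{1,2}:\d{2})\):\s*$")


def parse_chat(text):
    """Group message lines under the most recent header line."""
    rows = []  # each entry is [time, name, list_of_message_lines]
    for line in text.splitlines():
        match = HEADER_LINE.match(line)
        if match:
            name, time = match.groups()
            rows.append([time, name, []])
        elif rows and line.strip():
            # Non-empty line after a header: part of the current message.
            rows[-1][2].append(line.strip())
    return [(time, name, "\n".join(lines)) for time, name, lines in rows]


chat = """Peter (08:16):
Hi
What's up?
;-D
Anji Juo (09:13):
Hey, I'm using WhatsApp!

Peter (11:17):
Could you please tell me where is the feedback?
"""

parsed = parse_chat(chat)
for row in parsed:
    print(row)
```

Passing the result to `pd.DataFrame(parsed, columns=["Index(time)", "Name", "Message"]).set_index("Index(time)")` would then give a frame indexed by time, as in the desired output.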