python - How do I create a Python dictionary from unstructured text?
Problem description
I have a set of broken link checker results stored in a text file:
Getting links from: https://www.foo.com/
├───OK─── http://www.this.com/
├───OK─── http://www.is.com/
├─BROKEN─ http://www.broken.com/
├───OK─── http://www.set.com/
├───OK─── http://www.one.com/
5 links found. 0 excluded. 1 broken.

Getting links from: https://www.bar.com/
├───OK─── http://www.this.com/
├───OK─── http://www.is.com/
├─BROKEN─ http://www.broken.com/
3 links found. 0 excluded. 1 broken.

Getting links from: https://www.boo.com/
├───OK─── http://www.this.com/
├───OK─── http://www.is.com/
2 links found. 0 excluded. 0 broken.
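I am trying to write a script that reads the file and creates a list of dictionaries, with each root link as the key and its children as the value (including the summary line).
The output I am trying to achieve looks like this: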
{"Getting links from: https://www.foo.com/": ["├───OK─── http://www.this.com/", "├───OK─── http://www.is.com/", "├─BROKEN─ http://www.broken.com/", "├───OK─── http://www.set.com/", "├───OK─── http://www.one.com/", "5 links found. 0 excluded. 1 broken."],
"Getting links from: https://www.bar.com/": ["├───OK─── http://www.this.com/", "├───OK─── http://www.is.com/", "├─BROKEN─ http://www.broken.com/", "3 links found. 0 excluded. 1 broken."],
"Getting links from: https://www.boo.com/": ["├───OK─── http://www.this.com/", "├───OK─── http://www.is.com/", "2 links found. 0 excluded. 0 broken."] }
This is what I have so far:
result_list = []
with open('link_checker_result.txt', 'r') as f:
    temp_list = f.readlines()
    for line in temp_list:
        result_list.append(line)
This gives me the output:
['Getting links from: https://www.foo.com/', '├───OK─── http://www.this.com/', '├───OK─── http://www.is.com/', '├─BROKEN─ http://www.broken.com/', '├───OK─── http://www.set.com/', '├───OK─── http://www.one.com/', '5 links found. 0 excluded. 1 broken.', 'Getting links from: https://www.bar.com/', '├───OK─── http://www.this.com/', '├───OK─── http://www.is.com/', '...' ]
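I know that each of these sets shares some common features, for example, that there is a blank line between them, or that each one starts with "Getting...". Is this something I should try to split on before building the dictionary?
I am new to Python, so I admit I am not even sure whether I am heading in the right direction. I would really appreciate an expert opinion on this. Thanks in advance!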
Solution
This can actually be quite short, within 4 lines of code:
finalDict = {}
with open('link_checker_result.txt', 'r') as f:
    # strip() removes a trailing newline so the last block does not end with an empty string.
    lines = list(map(lambda line: line.split('\n'), f.read().strip().split('\n\n')))
    finalDict = dict((elem[0], elem[1:]) for elem in lines)
print(finalDict)
Output:
{'Getting links from: https://www.foo.com/': ['+---OK--- http://www.this.com/', '+---OK--- http://www.is.com/', '+-BROKEN- http://www.broken.com/', '+---OK--- http://www.set.com/', '+---OK--- http://www.one.com/'], 'Getting links from: https://www.bar.com/': ['+---OK--- http://www.this.com/', '+---OK--- http://www.is.com/', '+-BROKEN- http://www.broken.com/'], 'Getting links from: https://www.boo.com/': ['+---OK--- http://www.this.com/', '+---OK--- http://www.is.com/']}
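What the code above does is read the input file and split it on two consecutive newline characters (\n\n), so that each URL and its links end up in one block. Each block is then split on single newlines into a list of lines. Finally, it creates a tuple from the first element and the rest of each list, and turns those tuples into key-value pairs in the finalDict dictionary.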
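The splitting steps described above can be tried without a file. The following is only a sketch: the `sample` string stands in for the contents of link_checker_result.txt and assumes that blocks are separated by exactly one blank line; the names `sample`, `blocks`, `split_blocks` and `final_dict` are illustrative and not part of the original answer.

```python
# Stand-in for the file contents (assumption: one blank line between blocks).
sample = (
    "Getting links from: https://www.bar.com/\n"
    "├───OK─── http://www.this.com/\n"
    "├───OK─── http://www.is.com/\n"
    "├─BROKEN─ http://www.broken.com/\n"
    "3 links found. 0 excluded. 1 broken.\n"
    "\n"
    "Getting links from: https://www.boo.com/\n"
    "├───OK─── http://www.this.com/\n"
    "├───OK─── http://www.is.com/\n"
    "2 links found. 0 excluded. 0 broken.\n"
)

# Step 1: split on blank lines; strip() drops the trailing newline first.
blocks = sample.strip().split("\n\n")

# Step 2: split each block into its lines (header line first, then children).
split_blocks = [block.split("\n") for block in blocks]

# Step 3: the header line becomes the key, the remaining lines the value.
final_dict = {lines[0]: lines[1:] for lines in split_blocks}

for header, children in final_dict.items():
    print(header, children)
```

With this approach the summary line of each block stays in the value list, because it is simply the last line of the block.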
An easier-to-follow version is the following:
finalDict = {}
with open('link_checker_result.txt', 'r') as f:
    # Read the data and split it on blank lines so that each URL and its links form one list element.
    # strip() removes a trailing newline so the last block does not end with an empty string.
    data = f.read().strip().split('\n\n')
    # Split each of the elements created above into its individual lines.
    links = [line.split('\n') for line in data]
    # Turn each list into a key-value pair (first line -> remaining lines) and build the dictionary.
    finalDict = dict((elem[0], elem[1:]) for elem in links)
print(finalDict)
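The question also mentions that every set starts with "Getting...". If the blank lines between sets are missing or inconsistent, grouping on that prefix may be more robust than splitting on '\n\n'. This is a different technique from the answer above, shown only as a sketch: the function name `parse_link_report` and its `prefix` parameter are hypothetical, and the sample lines are copied from the question.

```python
def parse_link_report(lines, prefix="Getting links from:"):
    """Group report lines under the most recent line that starts with prefix."""
    result = {}
    current = None  # list of children for the header seen most recently
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no data
        if line.startswith(prefix):
            # A new header starts a new group.
            current = result.setdefault(line, [])
        elif current is not None:
            # Any other line (link or summary) belongs to the current header.
            current.append(line)
    return result


# Lines as they would come from iterating over a file, newline included.
sample_lines = [
    "Getting links from: https://www.bar.com/\n",
    "├───OK─── http://www.this.com/\n",
    "├───OK─── http://www.is.com/\n",
    "├─BROKEN─ http://www.broken.com/\n",
    "3 links found. 0 excluded. 1 broken.\n",
    "Getting links from: https://www.boo.com/\n",
    "├───OK─── http://www.this.com/\n",
    "├───OK─── http://www.is.com/\n",
    "2 links found. 0 excluded. 0 broken.\n",
]

parsed = parse_link_report(sample_lines)
print(parsed)
```

Because the function accepts any iterable of lines, an open file object can be passed directly, for example inside `with open('link_checker_result.txt', encoding='utf-8') as f:`. The explicit encoding is an assumption that the file is UTF-8, which the box-drawing characters suggest.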