How to put information from text into a CSV using Python

Problem description

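I am using re to search an HTML document and output all the text between two given strings. The returned text contains water usage figures for different towns, and I need to put them into a CSV table format automatically. How can I do this? Here is my code so far: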

import re
from bs4 import BeautifulSoup as BS

# Read the HTML file; the with block closes it automatically
with open(r'PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.html', 'r', encoding='utf8') as file:
    contents = file.read()
# Convert the HTML into a soup object
soup = BS(contents, 'html.parser')
rawText = soup.get_text(strip=True, separator='\n')

# Search the extracted text for two given strings and return the text in between
finalText = re.search(r'19\s+domestic and stock rights(.*?)20\s+native title rights', rawText, flags=re.S|re.I).group(1)

print(finalText)

This is the output text that needs to be scraped for information and put into a CSV:

The water requirements of persons entitled to domestic and stock rights in these water
sources are estimated to total 4,385 megalitres per year (hereafter
ML/year
),
distributed as follows:
(a)
91 ML/year in the Adjungbilly/Bombowlee/Brungle Water Source,
(b)
75 ML/year in the Billabung Water Source,
(c)
72 ML/year in the Bredbo Water Source,
Page 25
Water Sharing Plan for the Murrumbidgee Unregulated River Water Sources 2012
(d)
82 ML/year in the Burkes/Bullenbung Water Source,
(e)
45 ML/year in the Burrinjuck Dam Catchment Water Source,
(f)
0 ML/year the Burrumbuttock Water Source,
(g)
67 ML/year in the Gilmore/Sandy Water Source,
(h)
39 ML/year in the Goobarragandra Water Source,
(i)
28 ML/year in the Goodradigbee Water Source,
(j)
113 ML/year in the Hillas Water Source,
(k)
100 ML/year in the Houlaghans Water Source,
(l)
291 ML/year in the Jugiong Water Source,
(m)  75 ML/year in the Kyeamba Water Source,
(n)
178 ML/year in the Lake George Water Source,
(o)
44 ML/year the Lower Billabong Water Source,
(p)
169 ML/year the Lower Billabong Anabranch Water Source,
(q)
156 ML/year the Middle Billabong Water Source,
(r)
103 ML/year in the Molonglo Water Source,
(s)
73 ML/year the Mountain Water Source,
(t)
2 ML/year in the Murrumbidgee (Balranald to Weimby) Water Source,
(u)
34 ML/year in the Murrumbidgee (Gogeldrie to Waldaira) Water Source,
(v)
92 ML/year in the Murrumbidgee Central (Burrinjuck to Gogeldrie) Water
Source,
(w)  218 ML/year in the Murrumbidgee I Water Source,
(x)
133 ML/year in the Murrumbidgee II Water Source,
(y)
116 ML/year in the Murrumbidgee III Water Source,
(z)
73 ML/year in the Murrumbidgee North Water Source,
(aa)  476 ML/year in the Murrumbidgee Western Water Source,
(ab)  92 ML/year in the Muttama Water Source,
(ac)  150 ML/year in the Numeralla East Water Source,

This is what the table should look like:

Town       Water Usage
Billabung  75 ML
Muttama    92 ML

etc.

Tags: python, html, csv, beautifulsoup, re

Solution


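From the output you provided, it is not entirely clear what exactly you want to extract. If finalText is the text you posted, and if I have understood what you really want, you can do the following: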

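import re

# Each match is a tuple: (amount, unit, water source name).
# The optional "in the " is skipped so it does not end up in the name.
rows = re.findall(r'(\d+)\s(ML)/year\s(?:in the\s)?(.*) Water', finalText)
header = [('Water', 'Usage', 'Town')]

data = header + rows

with open('your.csv', 'w') as f:
    for line in data:
        # Write the name first, then the amount and the unit
        f.write(f"{line[2]},{line[0]},{line[1]}\n")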

your.csv then looks like this:

Town,Water,Usage
Adjungbilly/Bombowlee/Brungle,91,ML
Billabung,75,ML
Bredbo,72,ML
Burkes/Bullenbung,82,ML
Burrinjuck Dam Catchment,45,ML
the Burrumbuttock,0,ML
Gilmore/Sandy,67,ML
Goobarragandra,39,ML
Goodradigbee,28,ML
Hillas,113,ML
Houlaghans,100,ML
Jugiong,291,ML
Kyeamba,75,ML
Lake George,178,ML
the Lower Billabong,44,ML
the Lower Billabong Anabranch,169,ML
the Middle Billabong,156,ML
Molonglo,103,ML
the Mountain,73,ML
Murrumbidgee (Balranald to Weimby),2,ML
Murrumbidgee (Gogeldrie to Waldaira),34,ML
Murrumbidgee Central (Burrinjuck to Gogeldrie),92,ML
Murrumbidgee I,218,ML
Murrumbidgee II,133,ML
Murrumbidgee III,116,ML
Murrumbidgee North,73,ML
Murrumbidgee Western,476,ML
Muttama,92,ML
Numeralla East,150,ML
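
If you would rather have the two-column layout from the question (Town, Water Usage) and no leading "the" on some names, something along these lines might work. It is only a sketch: sample_text is a short excerpt copied from the question's output and stands in for finalText, the names PATTERN, extract_rows and write_csv are made up for this example, and the pattern assumes every entry ends in "Water Source". It uses the standard csv module, so a name containing a comma would be quoted correctly.

```python
import csv
import io
import re

# Short excerpt in the same shape as finalText (copied from the question's output)
sample_text = """(a)
91 ML/year in the Adjungbilly/Bombowlee/Brungle Water Source,
(f)
0 ML/year the Burrumbuttock Water Source,
(v)
92 ML/year in the Murrumbidgee Central (Burrinjuck to Gogeldrie) Water
Source,
(ab)  92 ML/year in the Muttama Water Source,
"""

# Amount, then an optional "in" and an optional "the", then the name
# up to "Water Source" (which may be split across a line break).
PATTERN = re.compile(
    r'(\d+)\s+ML/year\s+(?:in\s+)?(?:the\s+)?(.*?)\s+Water\s+Source'
)


def extract_rows(text):
    """Return (town, usage) pairs such as ('Muttama', '92 ML')."""
    return [(name, f"{amount} ML") for amount, name in PATTERN.findall(text)]


def write_csv(rows, out):
    """Write a header line and the rows to an open text file object."""
    writer = csv.writer(out)
    writer.writerow(["Town", "Water Usage"])
    writer.writerows(rows)


rows = extract_rows(sample_text)

# Written to an in-memory buffer here so the result can be printed directly
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

To write a real file instead of the buffer, pass a file opened with open('your.csv', 'w', newline='', encoding='utf8') to write_csv, and call extract_rows(finalText) rather than using the excerpt.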
