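python - Web scraping: can't write multiple paragraphs; it sticks to just one
Question
Website: https://www.osa.ind.in/life-members.php
I am trying to write every paragraph of this page to a .csv file with the following code: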
x = []
for text in soup.tr.stripped_strings:
    row = []
    for i in soup.p.stripped_strings:
        row.append(i)
    x.append(row)

with open('sample.csv', 'a', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(x)
Output: only one paragraph is kept.
I want to store all of the information as a table.
soup.find('p').get_text()  # doesn't help
Thanks in advance.
Solution
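One likely reason the loop in the question keeps producing the same paragraph: `soup.p` is shorthand for `soup.find('p')`, which always returns the first `<p>` in the document, whichever iteration of the outer loop is running. A minimal sketch of the difference, using a made-up two-row HTML snippet (an assumption for illustration, not the real page's markup):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page is laid out differently.
html = """
<table>
  <tr><td><p>LM-001</p></td><td><p>First address</p></td></tr>
  <tr><td><p>LM-002</p></td><td><p>Second address</p></td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

# soup.p always resolves to the first <p>, so every pass yields the same text.
repeated = [list(soup.p.stripped_strings) for _ in soup.find_all('tr')]

# Walking each row and each <p> inside that row reaches every paragraph.
rows = [
    [p.get_text(strip=True) for p in tr.find_all('p')]
    for tr in soup.find_all('tr')
]

print(repeated)
print(rows)
```

On the real page the IDs and addresses are spread over several text nodes, which is why the code below works on the table's flattened text instead of on individual `<p>` tags.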
import re
import csv
from itertools import groupby

import requests
from bs4 import BeautifulSoup

url = 'https://www.osa.ind.in/life-members.php'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

# Flatten the whole table into a list of text fragments.
data = soup.table.get_text(strip=True, separator='|').split('|')

# Fragments containing "LM" are membership IDs; the run of fragments
# that follows an ID (key None) is joined into that member's address.
all_data, last = {}, ''
for v, g in groupby(data, lambda k: re.search(r'(?:^|\n|-| )LM', k)):
    if v:
        # Collapse whitespace and drop spaces around hyphens in the ID.
        last = re.sub(r'\s+', ' ', ''.join(g))
        last = re.sub(r' ?- ?', '-', last)
    else:
        all_data[last] = ' '.join(g).replace('\r\n', ' ')

# print it to screen:
for lm, address in all_data.items():
    print('{:<15}{}'.format(lm, address))

# save it to csv:
with open('data.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for lm, address in all_data.items():
        writer.writerow([lm, address])
This prints:
OSA-LM-001 Late Dr. Lal Krishna Dutta Khalihamari, Dibrugarh
OSA-LM-002 Dr. Iralu Ningusalie, Civil Hospital, Kohima,Nagaland
OSA-LM-003 Dr. Nareswar Dutta, Dutta`s Eye Clinic,Rangagora Road, P.O. Tinsukia, Pin- 786125 E-mail: M- 094350-35502
OSA-LM-004 Dr. (Mrs) Nirmali Bujarbarua, P.U.B. Nursing Home Laokhewa Road, Nagaon Ph: 9435537514
OSA-LM-005 Dr. Rajendra Prasad Sarma
OSA-LM-006 Dr. Gopal Chandra Das, Bihutoli Road,Natun Bazar , Hojai Assam M-09435168004 drgcdas@yahoo.com
OSA-LM-007 Dr. Premeswar Nath, Madhab Kandali Path Sankarpur, Gopinath Nagar, Guwahati -781016 premeswar.nath@gmail.com Ph-0361-2471387
OSA-LM-008 Dr. ( Mrs) Dipali Deka Regional Institute of Ophthalmology Guwahati Medical College Bhangagarh Guwahati 781032 dipali_deka@yahoo.com Ph: 9864067474
OSA-LM-009 Dr. T.K.Sarma, Eye Spl , DIMS Hospital , Zoo-Narengi Road, (Near Rly. Yard ), Guwahati 0361-2656980
OSA-LM-010 Dr. ( Mrs) Rani Dutta Sundarpur 18 east lane R.G. Barua Road,Guwahati - 781005
OSA-LM-011 Dr. Birendra Kumar Sarma Ratnagiri Path,Bamunimaidan Guwahati 781021
OSA-LM-012 Dr. Jayanta Baroowa "Kantashree" Tilak Deka Road Nagaon, Assam, Pin 782001 jboroowa@rediffmail.com, jboroowa@gmail.com P: STD -03672-232827 (R) M-9435063195
OSA-LM-013 Dr. Rup Kumar Phukan Milon Nagar, Ward no.10, North Lakhimpur, Assam-787001 drrupkumarphukun@yahoo.co.in M-09435085334
OSA-LM-014 Late Dr. Nabin ch.Bordoloi,Jorhat-1
OSA-LM-015 Dr. Girish Chandra Borgohain Gar-Ali, Jorhat, Assam
OSA-LM-016 Dr. Narayan Bordoloi Chandraprabha Eye hospital , KK Handique Road,Jorhat, Assam drnbordoloi@rediffmail.com M919435051807
OSA-LM-017 Dr. Prabin Bora A.T. Road, Tarajan (Near puja mandir) Jorhat ,Assam-785001 0376-237223 (C) 2372096 (R) M-94350-50658
OSA-LM-018 Dr. Mukul Barthakur Borthakur Eye Clinic B.G. Road Jorhat 785001 M-09954936089 09435051726 nivedita_borthakur@yahoo.co.in
OSA-LM-019 Dr. Padum Kumar Gogoi Kushal Kumar Path Jorhat, Assam 9435050819
OSA-LM-020 Dr. Jayanta Ghosh,UshaEye Clinic, B.G. Road ,Jorhat , Assam-785001 M- 9435351780
OSA-LM-021 Dr. Kumud Nath Jail Road , Jorhat -785001 nathkumud@gmail.com Ph-0376-2320988/2300608 M-94350-51791
OSA-LM-022 Dr. Hiren Saikia Assam Netralaya, Jail Road , Jorhat drhirensaikia@gmail.com M--9435091088 R-0376-2322531
OSA-LM-023 Dr. Nawab M. Rahman Eye Care Contact Lans clinic Gar Ali ,Jorhat 785001 dr.nmrahman@yahoo.com Ph-0376-2304004/2323575/ M: 94350-52042
... and so on.
It also saves the same rows to data.csv.
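The grouping step above can be tried without touching the network. The sketch below runs the same `groupby` pattern on a short hand-written list of strings. The `pieces` values and the `is_id` helper are made up for illustration, and the regex test is wrapped in `bool()` here (a small departure from the code above) so that the keys are plain `True`/`False`:

```python
import re
from itertools import groupby

# Hypothetical fragments, shaped like get_text(separator='|').split('|') output.
pieces = [
    'OSA - LM - 001', 'Name One', 'Town A',
    'OSA-LM-002', 'Name Two', 'Street B', 'Town B',
]


def is_id(text):
    """True when 'LM' appears at the start or right after a newline, hyphen, or space."""
    return bool(re.search(r'(?:^|\n|-| )LM', text))


members, last = {}, ''
for matched, group in groupby(pieces, key=is_id):
    if matched:
        # Normalize the ID: collapse whitespace, drop spaces around hyphens.
        last = re.sub(r'\s+', ' ', ''.join(group))
        last = re.sub(r' ?- ?', '-', last)
    else:
        # Everything up to the next ID belongs to the most recent ID.
        members[last] = ' '.join(group)

for lm, address in members.items():
    print('{:<15}{}'.format(lm, address))
```

Note that the pattern relies on address text never containing `LM` after a space or hyphen; an address that did would be mistaken for an ID.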