Web scraping: unable to write multiple paragraphs; it sticks to only one

Problem description

Website: https://www.osa.ind.in/life-members.php

I am trying to write every paragraph of this page to a .csv file using the following approach:

x = []
for text in soup.tr.stripped_strings:
    row = []
    for i in soup.p.stripped_strings:
        row.append(i)
    x.append(row)

with open('sample.csv', 'a', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerows(x)

Output:

1

I want to store all of the information in tabular form.

soup.find('p').get_text()    ##doesn't help

Thanks in advance.

Tags: python, html, csv, web-scraping, beautifulsoup

Solution
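
Some context on why the loop in the question keeps writing the same value: `soup.tr` and `soup.p` are shorthand for `soup.find('tr')` and `soup.find('p')`, so they always return the first matching tag in the document, no matter how often the loop runs. The sketch below illustrates the difference on a small, made-up table (the HTML and the names `first_only` and `rows` are assumptions for illustration, not the real page's markup). The real page may be less regular, which is presumably why the answer that follows splits the text on the membership numbers instead.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page.
html = """
<table>
  <tr><td><p>OSA-LM-001</p></td><td><p>Dr. A, Dibrugarh</p></td></tr>
  <tr><td><p>OSA-LM-002</p></td><td><p>Dr. B, Kohima</p></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# soup.p is the same as soup.find("p"): always the first <p> in the document.
first_only = list(soup.p.stripped_strings)

# find_all() visits every <tr>; each row contributes one list of cell texts.
rows = [list(tr.stripped_strings) for tr in soup.find_all("tr")]

print(first_only)
print(rows)
```

A list of lists like `rows` can be passed directly to `csv.writer(...).writerows(...)`.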


import re
import csv
from itertools import groupby

import requests
from bs4 import BeautifulSoup

url = 'https://www.osa.ind.in/life-members.php'
headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

data = soup.table.get_text(strip=True, separator='|').split('|')

all_data, last = {}, ''
for v, g in groupby(data, lambda k: re.search(r'(?:^|\n|-| )LM', k)):
    if v:
        last = re.sub(r'\s+', ' ', ''.join(g))
        last = re.sub(r' ?- ?', '-', last)
    else:
        all_data[last] = ' '.join(g).replace('\r\n', ' ')

# print it to screen:
for lm, address in all_data.items():
    print('{:<15}{}'.format(lm, address))

# save it to csv:
with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for lm, address in all_data.items():
        writer.writerow([lm, address])

This prints:

OSA-LM-001     Late Dr. Lal Krishna Dutta Khalihamari, Dibrugarh
OSA-LM-002     Dr. Iralu Ningusalie, Civil Hospital, Kohima,Nagaland
OSA-LM-003     Dr. Nareswar Dutta, Dutta`s Eye Clinic,Rangagora Road, P.O. Tinsukia, Pin-  786125 E-mail: M- 094350-35502
OSA-LM-004     Dr.  (Mrs) Nirmali Bujarbarua, P.U.B. Nursing Home Laokhewa Road, Nagaon Ph:  9435537514
OSA-LM-005     Dr. Rajendra Prasad Sarma
OSA-LM-006     Dr. Gopal Chandra Das, Bihutoli Road,Natun Bazar , Hojai  Assam M-09435168004 drgcdas@yahoo.com
OSA-LM-007     Dr. Premeswar Nath, Madhab Kandali Path Sankarpur, Gopinath Nagar, Guwahati  -781016 premeswar.nath@gmail.com Ph-0361-2471387
OSA-LM-008     Dr. ( Mrs) Dipali Deka Regional Institute of Ophthalmology Guwahati Medical  College Bhangagarh Guwahati 781032 dipali_deka@yahoo.com Ph: 9864067474
OSA-LM-009     Dr. T.K.Sarma, Eye Spl , DIMS Hospital , Zoo-Narengi Road, (Near Rly. Yard ), Guwahati 0361-2656980
OSA-LM-010     Dr. ( Mrs) Rani Dutta Sundarpur 18 east lane R.G. Barua Road,Guwahati - 781005
OSA-LM-011     Dr. Birendra Kumar Sarma Ratnagiri Path,Bamunimaidan  Guwahati 781021
OSA-LM-012     Dr. Jayanta Baroowa "Kantashree"  Tilak Deka Road Nagaon, Assam, Pin 782001 jboroowa@rediffmail.com, jboroowa@gmail.com P: STD -03672-232827 (R) M-9435063195
OSA-LM-013     Dr.  Rup Kumar Phukan Milon Nagar, Ward no.10, North Lakhimpur, Assam-787001 drrupkumarphukun@yahoo.co.in M-09435085334
OSA-LM-014     Late Dr. Nabin  ch.Bordoloi,Jorhat-1
OSA-LM-015     Dr. Girish Chandra Borgohain Gar-Ali, Jorhat, Assam
OSA-LM-016     Dr. Narayan Bordoloi Chandraprabha Eye hospital , KK Handique Road,Jorhat, Assam drnbordoloi@rediffmail.com  M919435051807
OSA-LM-017     Dr. Prabin Bora A.T.  Road, Tarajan (Near puja mandir) Jorhat ,Assam-785001 0376-237223 (C) 2372096 (R) M-94350-50658
OSA-LM-018     Dr. Mukul Barthakur Borthakur Eye Clinic B.G. Road Jorhat 785001 M-09954936089 09435051726 nivedita_borthakur@yahoo.co.in
OSA-LM-019     Dr. Padum Kumar  Gogoi Kushal Kumar Path Jorhat, Assam 9435050819
OSA-LM-020     Dr. Jayanta Ghosh,UshaEye Clinic, B.G. Road ,Jorhat , Assam-785001 M-  9435351780
OSA-LM-021     Dr. Kumud Nath Jail Road , Jorhat  -785001 nathkumud@gmail.com Ph-0376-2320988/2300608 M-94350-51791
OSA-LM-022     Dr. Hiren Saikia Assam Netralaya, Jail Road , Jorhat drhirensaikia@gmail.com M--9435091088 R-0376-2322531
OSA-LM-023     Dr. Nawab M.  Rahman Eye Care Contact Lans clinic Gar Ali ,Jorhat 785001 dr.nmrahman@yahoo.com Ph-0376-2304004/2323575/ M: 94350-52042

... and so on.

and saves data.csv (screenshot from LibreOffice):

[Screenshot: data.csv opened in LibreOffice]
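
To try the grouping step from the solution without fetching the page, here is a minimal sketch that runs the same `groupby` logic on a handful of made-up text fragments. The list `fragments` and the helper `is_member_id` are hypothetical; the fragments only imitate what `soup.table.get_text(strip=True, separator='|').split('|')` might return. One small deviation: the key is wrapped in `bool(...)`, so adjacent fragments that both look like a membership number fall into the same group, whereas the solution's key returns a distinct match object for each fragment.

```python
import re
from itertools import groupby

# Hypothetical fragments, shaped like the split table text in the solution.
fragments = [
    "OSA - LM - 001", "Dr. A", "Dibrugarh",
    "OSA-LM-002", "Dr. B,", "Kohima",
]

def is_member_id(text):
    # Same pattern as the solution: "LM" at the start or after a newline, dash, or space.
    return bool(re.search(r"(?:^|\n|-| )LM", text))

records, last_id = {}, ""
for is_id, group in groupby(fragments, key=is_member_id):
    if is_id:
        # Normalise the membership number: collapse whitespace, tighten dashes.
        last_id = re.sub(r"\s+", " ", "".join(group))
        last_id = re.sub(r" ?- ?", "-", last_id)
    else:
        # Everything up to the next membership number belongs to last_id.
        records[last_id] = " ".join(group)

for member_id, address in records.items():
    print("{:<15}{}".format(member_id, address))
```

Because the results are stored in a dictionary keyed by membership number, a number that appears twice on the page keeps only its last address.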

