How do I handle complex web-scraping edge cases with bs4?

Problem description

Please take a look at this HTML structure:

<span style="text-decoration: underline; color: #0000ff;"><a style="color: #0000ff; text-decoration: underline;" href="https://www.biharjobportal.com/muzaffarpur-indian-army-rally-recruitment-online-form/" target="_blank" rel="noopener">Muzaffarpur Indian Army Rally Recruitment 2020 <span style="color: #ff0000; text-decoration: underline;">Last Date – 14.01.2021</span></a></span>

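The href link and the text both sit inside the span tag. Here is a screenshot of the website, just in case:

[screenshot of the website]

The highlighted item is the one causing all the problems.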

I managed to get all the data I need (text and links). To get all the links, I wrote the following code:

    import requests
    from bs4 import BeautifulSoup

    # headers is assumed to be a dict holding a browser User-agent string
    r = requests.get('https://www.biharjobportal.com/', headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    first_column = soup.find('div', {'class': 'elementor-column elementor-col-33 elementor-top-column elementor-element elementor-element-b892ae7'})
    # only matches <a> tags that carry this exact style attribute
    link = first_column.find_all('a', {'style': 'color: #0000ff; text-decoration: underline;'})
    for i in link:
        links = i['href']
        print(len(links))

To get the names of all the links, I wrote the following code:

title = first_column.find_all('span', {'style': 'text-decoration: underline; color: #0000ff;'})
for item in title:
    MainTitle = item.text
    print(len(MainTitle))

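Later I realized that only one of the items (the one highlighted in the screenshot) does not follow the same structure as the other links. This is the structure of that single link: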

<a href="https://www.biharjobportal.com/nsp-pre-and-post-matric-scholarship-online-form/" target="_blank" rel="noopener"><span style="text-decoration: underline; color: #0000ff;">NSP Pre and Post Matric Scholarship Form 2020 <span style="text-decoration: underline; color: #ff0000;">Last Date 30.12.2020</span></span></a>

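As you can see, its structure is exactly the opposite. Here the span tag is inside the a tag, and the a tag itself carries no style attribute, so this link is left out when I scrape the links, which throws off all the data I extract from this site.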

Now len(title) = 36 and len(link) = 35, which a pandas DataFrame does not accept (it raises an error saying that all lengths must be the same).
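
The mismatch can be reproduced on just the two snippets quoted above. The following is a minimal sketch, assuming bs4 with the built-in html.parser; the wrapping div and the variable names are hypothetical, added only for illustration, and the target and rel attributes are omitted for brevity.

```python
from bs4 import BeautifulSoup

# The two link structures from the question, wrapped in a hypothetical <div>.
html = """
<div>
<span style="text-decoration: underline; color: #0000ff;"><a style="color: #0000ff; text-decoration: underline;" href="https://www.biharjobportal.com/muzaffarpur-indian-army-rally-recruitment-online-form/">Muzaffarpur Indian Army Rally Recruitment 2020 <span style="color: #ff0000; text-decoration: underline;">Last Date – 14.01.2021</span></a></span>
<a href="https://www.biharjobportal.com/nsp-pre-and-post-matric-scholarship-online-form/"><span style="text-decoration: underline; color: #0000ff;">NSP Pre and Post Matric Scholarship Form 2020 <span style="text-decoration: underline; color: #ff0000;">Last Date 30.12.2020</span></span></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Same filters as in the question: both depend on an exact style string.
links = soup.find_all("a", {"style": "color: #0000ff; text-decoration: underline;"})
titles = soup.find_all("span", {"style": "text-decoration: underline; color: #0000ff;"})

# Selecting every <a> regardless of style finds both links.
all_anchors = soup.find_all("a")

print(len(links), len(titles), len(all_anchors))  # prints: 1 2 2
```

The second a tag has no style attribute, so the style-based filter skips it, while the blue span matches in both structures. That is where the off-by-one between the two lists comes from.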

What should I do in this case? I know there are many experienced developers here. Please guide me. Thanks.

Tags: python, pandas, beautifulsoup

Solution


You can remove all the spans that use the red color and then extract the text:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

r = requests.get('https://www.biharjobportal.com/', headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

allColumns = soup.findAll('div', {'class': 'elementor-widget-wrap'})
for column in allColumns:
    _headerObj = column.find('h2')
    if _headerObj and 'latest update' in _headerObj.text.lower():
        _allLinks = column.find('ul')

        for link in _allLinks.findAll('a'):
            # find all spans and remove the one using red color!
            _spans = link.findAll('span')

            for span in _spans:
                if '#ff0000' in span.get('style', ''):
                    span.extract()

            _text = link.text
            print(link['href'])
            print(_text)
            print("")

Output:

https://www.biharjobportal.com/muzaffarpur-indian-army-rally-recruitment-online-form/
Muzaffarpur Indian Army Rally Recruitment 2020

https://www.biharjobportal.com/lnmu-pg-admission-online-form/
LNMU PG Admission 1st Selection List 2020

https://www.biharjobportal.com/ssc-chsl-recruitment-online-form/
SSC CHSL Recruitment Online Form 2020

https://www.biharjobportal.com/bihar-board-12th-exam-date-sheet/
Bihar Board Inter Exam Date Sheet 2021

https://www.biharjobportal.com/bpsc-project-manager-recruitment/
BPSC Project Manager Recruitment 2020

https://www.biharjobportal.com/munger-university-ug-admission-online-form/
Munger University UG Spot Admission 2020

https://www.biharjobportal.com/biscomaun-various-post-recruitment/
BISCOMAUN Various Post Admit Card

https://www.biharjobportal.com/bihar-police-constable-bharti/
Bihar Police Constable Bharti 2020

https://www.biharjobportal.com/bseb-crossword/
BSEB Crossword Competition 2020-21

https://www.biharjobportal.com/bssc-anuwadak-recruitment-online-form/
BSSC Anuwadak New Exam Date 2020 Released

https://www.biharjobportal.com/ekalyan-bihar-scholarship/
Ekalyan Bihar 10th पास बालक/बालिका स्कॉलरशिप 2020

https://www.biharjobportal.com/nsp-pre-and-post-matric-scholarship-online-form/
NSP Pre and Post Matric Scholarship Form 2020

https://www.biharjobportal.com/bpsc-66th-combined-exam-online-form/
BPSC 66th Combined Vacancy Increased & Rejected List

https://www.biharjobportal.com/jawahar-navodaya-vidyalaya-6th-class-online-admission-form/
JNV 6th Class Admission Form 2021

https://www.biharjobportal.com/jnv-class-9th-admission-online-form/
JNV Class 9th Admission Online Form 2021

https://www.biharjobportal.com/lnmu-integrated-b-ed-cet-online-form/
LNMU B.Ed Document Verification Call Letter 2020

https://www.biharjobportal.com/jai-prakash-university-graduation-admission-online-form/
JPU UG Admission 1st Allotment List 2020

https://www.biharjobportal.com/bssc-inter-level-exam-online-form/
BSSC Inter(10+2) Level Mains Exam 2020

https://www.biharjobportal.com/csbc-bihar-police-driver-constable-recruitment/
CSBC Bihar Police Driver Constable Admit Card 2020

https://www.biharjobportal.com/bihar-rajya-fasal-sahayta-yojana/
बिहार राज्य फसल सहायता योजना 2020 –

https://www.biharjobportal.com/bihar-bseb-ofss-inter-admission-online-form/
BSEB OFSS Inter Admission 2020

https://www.biharjobportal.com/brabu-graduation-admission-online-form/
BRABU UG Admission 2020

https://www.biharjobportal.com/bsusc-assistant-professor-recruitment/
BSUSC Assistant Professor Recruitment 2020

https://www.biharjobportal.com/simultala-awasiya-vidyalaya-admit-card/
Simultala Awasiya Vidyalaya Admit Card 2020

https://www.biharjobportal.com/indian-air-force-cat-online-form/
Indian Air Force CAT Online Form 2020

https://www.biharjobportal.com/lnmu-ug-admission-online-form/
LNMU UG Admission Re-Open 2020

https://www.biharjobportal.com/bihar-bcece-board-city-manager-bharti/
Bihar BCECE Board City Manager Admit Card 2020

https://www.biharjobportal.com/bihar-iti-admission-online-form/
Bihar ITI Admit Card 2020

https://www.biharjobportal.com/bihar-polytechnic-dcece-online-form/
Bihar Polytechnic DCECE Admit Card 2020

https://www.biharjobportal.com/lnmu-pg-admission-online-form/
LNMU PG Admission Online Form 2020 –

https://www.biharjobportal.com/bceceb-amin-recruitment/
BCECEB Amin Recruitment 2020

https://www.biharjobportal.com/bihar-csbc-forest-guard-recruitment/
Bihar CSBC Forest Guard Admit Card 2020

https://www.biharjobportal.com/bihar-bseb-deled-exam-date-sheet/
Bihar BSEB D.El.Ed Exam Date 2020

https://www.biharjobportal.com/bihar-bseb-deled-exam-date-sheet/
Latest News

https://www.biharjobportal.com/magadh-university-graduation-admission-online-form/
Magadh University UG Spot Admission

https://www.biharjobportal.com/veer-kunwar-singh-university-ug-admission-online-form/
VKSU UG Spot Admission 2020

https://www.biharjobportal.com/bihar-rajya-swasthaya-samiti-anm-bharti-online-form/
Bihar SHSB (Advt No-03/2020) ANM Admit Card 2020

http://biharjobportal.com/bihar-scert-ntse-scholarship-exam/
Bihar SCERT NTSE Scholarship Exam 2021

https://www.biharjobportal.com/bihar-scert-nmmss-scholarship-examination/
Bihar SCERT NMMSS Scholarship Examination 2021 –

https://www.biharjobportal.com/bihar-police-home-guard-recruitment/
Bihar Police Sepoy (Sipahi) New Exam Date Notice

https://www.biharjobportal.com/ssc-delhi-police-constable-recruitment/
SSC Constable in Delhi Police Admit Card 2020

https://www.biharjobportal.com/bpsc-acf-recruitment-online-form/
BPSC ACF Recruitment Exam Date 2020

https://www.biharjobportal.com/csbc-bihar-police-constable-recruitment/
CSBC Bihar Police Constable New PET Exam Date
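
Because each title and its href are read from the same a tag, they can also be collected as one record per link, so the two columns cannot end up with different lengths. The following is a minimal sketch of that idea, not part of the original answer: the helper name extract_rows and the sample HTML wrapper are hypothetical, and it assumes the red "Last Date" spans can be recognized by '#ff0000' in their style.

```python
from bs4 import BeautifulSoup

# Both link structures from the question, wrapped in a hypothetical <div>.
html = """
<div>
<span style="text-decoration: underline; color: #0000ff;"><a style="color: #0000ff; text-decoration: underline;" href="https://www.biharjobportal.com/muzaffarpur-indian-army-rally-recruitment-online-form/">Muzaffarpur Indian Army Rally Recruitment 2020 <span style="color: #ff0000; text-decoration: underline;">Last Date – 14.01.2021</span></a></span>
<a href="https://www.biharjobportal.com/nsp-pre-and-post-matric-scholarship-online-form/"><span style="text-decoration: underline; color: #0000ff;">NSP Pre and Post Matric Scholarship Form 2020 <span style="text-decoration: underline; color: #ff0000;">Last Date 30.12.2020</span></span></a>
</div>
"""


def extract_rows(container):
    """Return one {'title', 'href'} dict per <a> tag (hypothetical helper).

    Note: this removes the red spans from the parsed tree as a side effect.
    """
    rows = []
    for a in container.find_all("a", href=True):
        for span in a.find_all("span"):
            # .get() avoids a KeyError for spans without a style attribute
            if "#ff0000" in span.get("style", ""):
                span.extract()
        # Collapse runs of whitespace left behind after removing the span
        title = " ".join(a.get_text().split())
        rows.append({"title": title, "href": a["href"]})
    return rows


soup = BeautifulSoup(html, "html.parser")
rows = extract_rows(soup)
for row in rows:
    print(row["title"], "->", row["href"])
```

A list of dicts like rows can be passed directly to pandas.DataFrame, which builds the title and href columns from the same records and so avoids the length error described in the question.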
