Scraping text from unstructured webpages with BeautifulSoup

Problem description

I want to grab all of the relevant text sections from certain webpages and parse them into a structured format, such as a CSV file, for later use. However, the pages I want to pull information from do not strictly follow the same format, for example these pages:

http://www.cs.bham.ac.uk/research/groupings/machine-learning/
http://www.cs.bham.ac.uk/research/groupings/robotics/
http://www.cs.bham.ac.uk/research/groupings/reasoning/

I have been using BeautifulSoup, which works well for pages that follow a well-defined format, but these particular sites don't follow a standard layout. How can I write code to extract the body text from these pages? Could I extract all of the text and then strip out the irrelevant/common parts? Or could I somehow select these larger bodies of text even though they don't appear in uniform positions? The sites are formatted differently, but not in a way I would consider impossibly complex.

Initially I had this code to handle the structured pages:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import sqlite3

conn = sqlite3.connect('/Users/tom/PycharmProjects/tmc765/Parsing/MScProject.db')
c = conn.cursor()


### Specify URL
programme_list = ["http://www.cs.bham.ac.uk/internal/programmes/2017/0144",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/9502",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/452B",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/4436",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/5914",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/9503",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/9499",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/5571",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/5955",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/4443",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/9509",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/5576",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/9501",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/4754",
              "http://www.cs.bham.ac.uk/internal/programmes/2017/5196"]

for programme_page in programme_list:
    # Query page, return html to a variable
    page = urlopen(programme_page)

    soupPage = BeautifulSoup(page, 'html.parser')

    name_box = soupPage.find('h1')
    Programme_Identifier = name_box.text.strip()

    Programme_Award = soupPage.find("td", text="Final Award").find_next_sibling("td").text
    Interim_Award = soupPage.find("td", text="Interim Award")
    if Interim_Award is not None:
        Interim_Award = Interim_Award.find_next_sibling("td").text
    Programme_Title = soupPage.find("td", text="Programme Title").find_next_sibling("td").text
    School_Department = soupPage.find("td", text="School/Department").find_next_sibling("td").text
    Banner_Code = soupPage.find("td", text="Banner Code").find_next_sibling("td").text
    Programme_Length = soupPage.find("td", text="Length of Programme").find_next_sibling("td").text
    Total_Credits = soupPage.find("td", text="Total Credits").find_next_sibling("td").text
    UCAS_Code = soupPage.find("td", text="UCAS Code").find_next_sibling("td").text
    Awarding_Institution = soupPage.find("td", text="Awarding Institution").find_next_sibling("td").text
    QAA_Benchmarking_Groups = soupPage.find("td", text="QAA Benchmarking Groups").find_next_sibling("td").text

    # SQL code for inserting into database
    with conn:
        c.execute("INSERT INTO Programme_Pages VALUES (?,?,?,?,?,?,?,?,?,?,?,?)",
                  (Programme_Identifier, Programme_Award, Interim_Award, Programme_Title,
                   School_Department, Banner_Code, Programme_Length, Total_Credits,
                   UCAS_Code, Awarding_Institution, QAA_Benchmarking_Groups, programme_page))

    print("Programme Identifier:    ", Programme_Identifier)
    print("Program Award:           ", Programme_Award)
    print("Interim Award:           ", Interim_Award)
    print("Program Title:           ", Programme_Title)
    print("School/Department:       ", School_Department)
    print("Banner Code:             ", Banner_Code)
    print("Length of Program:       ", Programme_Length)
    print("Total Credits:           ", Total_Credits)
    print("UCAS Code:               ", UCAS_Code)
    print("Awarding Institution:    ", Awarding_Institution)
    print("QAA Benchmarking Groups: ", QAA_Benchmarking_Groups)
    print("~~~~~~~~~~\n~~~~~~~~~~")

    # Pull the "Educational Aims" block and store each bullet point
    Educational_Aims = soupPage.find('div', {"class": "programme-text-block"})
    Educational_Aims_Title = Educational_Aims.find('h2').text.strip()

    Educational_Aims_List = Educational_Aims.find_all("li")
    print(Educational_Aims_Title)
    for el in Educational_Aims_List:
        text = el.text.strip()
        with conn:
            c.execute("INSERT INTO Programme_Info VALUES (?,?,?,?)",
                      (Programme_Identifier, text, Educational_Aims_Title, programme_page))
        print(text)

However, I haven't yet found a way to script the extraction of the relevant text from the unstructured pages linked above. I was thinking of trying to pull out all of the tagged sections and then dealing with them as they come; I just thought someone might have insight into a simpler approach.
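
For reference, the "pull out all tagged sections" idea I have in mind looks roughly like the sketch below; the choice of tags and the min_length cutoff are arbitrary guesses on my part, not tested values:

from urllib.request import urlopen
from bs4 import BeautifulSoup

def extract_text_blocks(url, min_length=80):
    # Collect the text of every <p> and <li> element; very short strings
    # are usually navigation links, footers or headings, so drop them
    # (min_length=80 is an arbitrary cutoff, not a tested value)
    soup = BeautifulSoup(urlopen(url), 'html.parser')
    return [el.get_text(strip=True)
            for el in soup.find_all(['p', 'li'])
            if len(el.get_text(strip=True)) >= min_length]

for block in extract_text_blocks('http://www.cs.bham.ac.uk/research/groupings/machine-learning/'):
    print(block, '\n')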

Tags: python, text, web-scraping, beautifulsoup, nlp

Solution


It all depends on what kind of information you want to extract. In my example, I extract the title, the text, and the list of staff (if present). You can add extra parsing rules to extract more information:

urls = ['http://www.cs.bham.ac.uk/research/groupings/machine-learning/',
        'http://www.cs.bham.ac.uk/research/groupings/robotics/',
        'http://www.cs.bham.ac.uk/research/groupings/reasoning/']

from bs4 import BeautifulSoup
import requests
from pprint import pprint

for url in urls:
    soup = BeautifulSoup(requests.get(url).text, 'lxml')

    # parse title
    title = soup.select_one('h1.title').text

    # parse academic staff (if any); clear the list items and their heading
    # afterwards so the names don't reappear in the page text parsed below
    staff_list = []
    if soup.select('h2 ~ ul'):
        for li in soup.select('h2 ~ ul')[-1].find_all('li'):
            staff_list.append(li.text)
            li.clear()
        soup.select('h2')[-1].clear()

    # parse the text
    text = ''
    for t in soup.select('nav ~ *'):
        text += t.text.strip() + '\n'

    print(title)
    print(text)
    print('Staff list = ', staff_list)
    print('-' * 80)

This will print (abbreviated):

Intelligent Robotics Lab


Welcome to the Intelligent Robotics Lab in the School of Computer Science at the University of Birmingham. ...

Staff list =  []
--------------------------------------------------------------------------------
Reasoning

Overview
This grouping includes research on various forms of reasoning, including theorem proving and uncertain reasoning, with particular application to mathematical knowledge management, mathematical document recognition, computer algebra, natural language processing, and multi-attribute and multi-agent decision-making. The research is relevant both to understanding how human reasoning works and to designing useful practical tools...


Staff list =  ['John Barnden', 'Richard Dearden', 'Antoni Diller', 'Manfred Kerber', 'Mark Lee', 'Xudong Luo', 'Alan Sexton', 'Volker Sorge']
--------------------------------------------------------------------------------
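
If you then want the result in a structured CSV file, as mentioned in the question, the same parsing can feed Python's csv module. A minimal follow-on sketch (the groupings.csv filename and the column layout are arbitrary choices):

import csv
import requests
from bs4 import BeautifulSoup

urls = ['http://www.cs.bham.ac.uk/research/groupings/machine-learning/',
        'http://www.cs.bham.ac.uk/research/groupings/robotics/',
        'http://www.cs.bham.ac.uk/research/groupings/reasoning/']

# groupings.csv is an arbitrary output filename
with open('groupings.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title', 'text', 'staff'])
    for url in urls:
        soup = BeautifulSoup(requests.get(url).text, 'lxml')
        title = soup.select_one('h1.title').text.strip()
        # same trick as above: record the staff names, then clear them
        # out of the tree so they don't reappear in the page text
        staff_list = []
        if soup.select('h2 ~ ul'):
            for li in soup.select('h2 ~ ul')[-1].find_all('li'):
                staff_list.append(li.text)
                li.clear()
            soup.select('h2')[-1].clear()
        text = '\n'.join(t.text.strip() for t in soup.select('nav ~ *'))
        writer.writerow([url, title, text, '; '.join(staff_list)])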
