Web scraping with BS4: how to limit the search scope

Problem description

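I am trying to scrape the "Events" section of this Wikipedia page: https://en.wikipedia.org/wiki/2020. The page's HTML is not the easiest to navigate, because most of the tags are siblings rather than nested.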

I want to make sure that the only data I scrape lies between the two h2 tags shown below.
Here is the condensed HTML:

<h2>                  #I ONLY WANT TO SEARCH BETWEEN HERE
    <span id="Events">Events</span>
</h2>
<h3>...</h3>
<ul>...</ul>
<h3>...</h3>
<ul>
    <li>
        <a title="June 17">...</a>   #My code below is looking for this; if not found, it jumps to another section
    </li>
</ul>
<h3>...</h3>
<ul>...</ul>
<h2>                 #AND HERE. DON'T WANT TO GO PAST HERE
    <span id="Predicted_and_scheduled_events">Predicted_and_scheduled_events</span>
</h2>

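In case it is not clear: every tag (except the span) is a sibling. My code currently works if the date exists between the two h2 tags, but if the date is not there, it moves on to another section of the page to pull data, which I don't want.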

Here is my code:

import sys
import requests
import bs4
res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text,"lxml")
todaysNews = soup.find('a', {"title": "June 17"}) #goes to date's stories

Tags: python, web-scraping, beautifulsoup

Solution


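BS has many useful functions and parameters. The whole documentation is worth reading.

It has functions to get an element's parent, its next siblings, elements that have a title attribute with any value, and so on.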
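
As a small illustration of those three features, here is a sketch run on a made-up HTML fragment rather than the live page. The fragment and the variable names are assumptions for illustration only, and html.parser is used so that no extra parser has to be installed.

```python
import bs4

# Hypothetical fragment that mimics the sibling layout described in the question
html = """
<h2><span id="Events">Events</span></h2>
<h3>June</h3>
<ul><li><a title="June 17" href="#">June 17</a> text</li><li><a href="#">no title</a></li></ul>
<h2><span id="Other">Other</span></h2>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')

span = soup.find('span', {'id': 'Events'})
heading = span.parent  # parent element: the enclosing <h2>

# next_siblings also yields whitespace strings, so keep only real tags
sibling_names = [s.name for s in heading.next_siblings if isinstance(s, bs4.Tag)]

# {'title': True} matches any <a> that has a title attribute, whatever its value
titled = soup.find_all('a', {'title': True})

print(heading.name)
print(sibling_names)
print([a['title'] for a in titled])
```

Note that the sibling list includes the second h2, which is why the loop further down has to stop explicitly when it reaches one.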


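First I search for <span id="Events">Events</span>, then I get its parent element <h2>. That gives me the start of the data.

Next I get next_siblings and iterate over them in a for-loop until I reach an item whose name is h2. That is the end of the data.

Inside the for-loop, I check every ul element and search only for its direct li children, not li elements nested deeper (recursive=False). Inside each li I take the first a element that has a title attribute with any text ({"title": True}).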

import requests
import bs4

res = requests.get('https://en.wikipedia.org/wiki/2020')
res.raise_for_status()

soup = bs4.BeautifulSoup(res.text, 'lxml')

# found start of data `h2`
start = soup.find('span', {'id': 'Events'}).parent

# check sibling items
for item in start.next_siblings:

    # found end of data `h2`
    if item.name == 'h2': 
        break

    if item.name == 'ul':

        # only direct `li` without nested `li`
        for li in item.find_all('li', recursive=False): 

            # `a` which have `title`
            a = li.find('a', {'title': True}) 

            if a:
                print(a['title'])

Result:

January 1
January 2
January 3
January 5
January 7
January 8
January 9
January 10
January 12
January 16
January 18
January 28
January 29
January 30
January 31
February 5
February 11
February 13
February 27
February 28
February 29
March 5
March 8
March 9
March 11
March 12
March 13
March 14
March 16
March 17
March 18
March 20
March 23
March 24
March 26
March 27
March 30
April 1
April 2
April 4
April 5
April 6
April 7
April 8
April 9
April 10
April 12
April 14
April 15
April 17
April 18
April 19
April 20
April 21
April 22
April 23
April 25
April 26
April 27
April 28
April 29
April 30
May 1
May 3
May 4
May 5
May 6
May 7
May 9
May 10
May 11
May 12
May 14
May 15
May 16
May 18
May 19
May 21
May 22
May 23
May 24
May 26
May 27
May 28
May 30
May 31
June 1
June 2
June 3
June 4
June 6
June 7
June 8
June 9
June 16
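
To come back to the original question (look up one specific date, and get nothing when it is not inside the section), the same sibling walk can be wrapped in a helper. This is only a sketch: find_in_section is a hypothetical name, the HTML is a made-up fragment, and it assumes the page uses the <h2><span id="..."> layout shown in the question, which the live page may not keep.

```python
import bs4

def find_in_section(soup, section_id, title):
    """Return the first <a title=...> between the <h2> holding `section_id`
    and the next <h2>, or None if it is not in that section."""
    span = soup.find('span', {'id': section_id})
    if span is None:
        return None
    for item in span.parent.next_siblings:
        if not isinstance(item, bs4.Tag):
            continue  # skip whitespace strings between tags
        if item.name == 'h2':
            break     # reached the next section: stop searching
        if item.name == 'ul':
            a = item.find('a', {'title': title})
            if a:
                return a
    return None

# Hypothetical fragment: "June 17" exists only after the second <h2>
html = """
<h2><span id="Events">Events</span></h2>
<h3>June</h3>
<ul><li><a title="June 16">June 16</a></li></ul>
<h2><span id="Predicted_and_scheduled_events">Predicted and scheduled events</span></h2>
<ul><li><a title="June 17">June 17</a></li></ul>
"""
soup = bs4.BeautifulSoup(html, 'html.parser')

print(find_in_section(soup, 'Events', 'June 16'))
print(find_in_section(soup, 'Events', 'June 17'))  # not inside the Events section
```

A plain soup.find('a', {'title': 'June 17'}) on the same fragment would return the link from the later section, which is the behaviour the question wants to avoid.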
