How can I limit the collected data so it doesn't include the entire element tree below it?

Problem description

When scraping, the `href` extraction also ends up collecting the layers below, e.g. `level-3`, but I want to collect only `level-2`. What should I change to prevent this?

This is the website:
https://int.soccerway.com/international/europe/european-championships/2020/group-stage/r38188/

The relevant part of the code:

ls = soup.find('ul', class_='level-2').findAll('li')
for i in ls:
    print(i.find('a')['href'])
print('\n')

Full code:

import bs4 as bs
import requests

url = 'https://int.soccerway.com/international/europe/european-championships/2020/group-stage/r38188/'
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'lxml')
ls = soup.find('ul', class_='level-2').findAll('li')
for i in ls:
    print(i.find('a')['href'])
print('\n')

Expected output:

/international/europe/european-championships/2020/group-stage/r38188/
/international/europe/european-championships/2020/s13030/final-stages/

Tags: python, web-scraping, beautifulsoup, python-requests

Solution


Since you only want the direct `<li>` children of that specific tag, just add the argument `recursive=False`:

import bs4 as bs
import requests

url = 'https://int.soccerway.com/international/europe/european-championships/2020/group-stage/r38188/'
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}

resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'lxml')
ls = soup.find('ul', class_='level-2').findAll('li', recursive=False)
for i in ls:
    print(i.find('a')['href'])
print('\n')

Output:

/international/europe/european-championships/2020/group-stage/r38188/
/international/europe/european-championships/2020/s13030/final-stages/
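The effect of `recursive=False` can be reproduced offline with a minimal sketch; the nested `<ul>` markup below is a hypothetical stand-in for the page's menu, not the site's actual HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the nested level-2 / level-3 menu structure.
html = """
<ul class="level-2">
  <li><a href="/level-2-a/">A</a>
    <ul class="level-3">
      <li><a href="/level-3-x/">X</a></li>
    </ul>
  </li>
  <li><a href="/level-2-b/">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
ul = soup.find("ul", class_="level-2")

# Default: findAll searches the whole subtree, so the level-3 item leaks in.
all_items = ul.find_all("li")
print(len(all_items))  # 3 (two level-2 items plus the nested level-3 item)

# recursive=False: only direct children of the level-2 <ul> are matched.
direct_items = ul.find_all("li", recursive=False)
print([li.find("a")["href"] for li in direct_items])
```

With `recursive=False` the second `print` shows only the two level-2 links, which mirrors how the fix above keeps `/final-stages/`-style siblings while dropping deeper entries.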
