How to exclude certain content from being scraped with Python

Problem description

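I am trying to scrape English questions from a website with Python (I obtained permission to do so beforehand); I am using BeautifulSoup.

The English questions are nested between the <div class="question_body"> and </div> tags. Below is the Python code I wrote to extract all of the English questions: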

import requests
import pandas as pd
from bs4 import BeautifulSoup

for p in range(1,10):
    web_page = requests.get('https://www.helpteaching.com/search/index.htm?grade=90&question_type=1&keyword=&entity=7&pageNum={}'.format(p))
   
    # Parse web_page
    soup = BeautifulSoup(web_page.text, 'html.parser')
    
    # Create set of results based on HTML tags with desired data
    results = soup.find_all('div', attrs={'class':'question_body'})

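The simple code above has a problem, though, because I do not want to scrape any 'group questions'. The content of a 'group question' (one of a set of different questions based on the same question text) is also nested between the <div class="question_body"> and </div> tags, but the difference between a 'group question' and a 'non-group question' is that, in the HTML source, a 'group question' is preceded by: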

            <p class="group_instructions">
                This question is a part of a group with common instructions.
                <a style="text-decoration:underline;" href="/groups/4913/making-bread">View group &raquo;</a>
            </p>

For example, here is the HTML source of one of the group questions on the site:

            <p class="group_instructions">
                This question is a part of a group with common instructions.
                <a style="text-decoration:underline;" href="/groups/4913/making-bread">View group &raquo;</a>
            </p>
        
        <div class="question_body">
            
            
        <a href="/questions/128621/which-is-not-an-ingredient-the-mother-put-in-the-bread">Which is NOT an ingredient the mother put in the bread?</a>
            <ol>

                    <li class="answer correct">
                        Sugar               
                    </li>

                    <li class="answer">
                        Salt    
                    </li>

                    <li class="answer">
                        Yeast
                    </li>

                    <li class="answer">
                        Flour    
                    </li>        
            </ol>              
        </div>
    </div>

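Note how <p class="group_instructions"> precedes <div class="question_body">. Non-group questions are not preceded by a block starting with <p class="group_instructions">.

Is there any way to exclude group questions from the scrape? I do not need to stick with BeautifulSoup if something else is necessary.

Thanks,

Tags: python, html, web-scraping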

Solution


If you have to select nodes that do not contain certain tags, I think XPath is easier to use. Here is an lxml solution, if you would like one.

import requests
from lxml import html

web_page = requests.get('https://www.helpteaching.com/search/index.htm?grade=90&question_type=1&keyword=&entity=7&pageNum=1')

# Parse the response body into an lxml element tree
tree = html.fromstring(web_page.text)

# Keep only questions whose <div class="question"> wrapper contains no <p>,
# i.e. no group_instructions block, then take the question link text.
results = tree.xpath('//div[@class="question"][not(.//p)]/div[@class="question_body"]/a/text()')

for result in results:
    print(result)
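
If you would rather stay with BeautifulSoup, the same filter can probably be written by checking each question body's preceding siblings. The following is only a minimal sketch: the function name non_group_questions, the sample markup, the second (non-group) question and the outer <div class="question"> wrapper are all assumptions made for illustration, modelled on the snippet in the question rather than taken from the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the snippet in the question.
# The <div class="question"> wrapper and the second question are assumptions.
SAMPLE_HTML = """
<div class="question">
    <p class="group_instructions">
        This question is a part of a group with common instructions.
        <a href="/groups/4913/making-bread">View group &raquo;</a>
    </p>
    <div class="question_body">
        <a href="/questions/128621/which-is-not-an-ingredient-the-mother-put-in-the-bread">Which is NOT an ingredient the mother put in the bread?</a>
    </div>
</div>
<div class="question">
    <div class="question_body">
        <a href="/questions/1/example">Which word is a noun?</a>
    </div>
</div>
"""


def non_group_questions(page_html):
    """Return the link text of each question body that has no
    <p class="group_instructions"> among its earlier siblings."""
    soup = BeautifulSoup(page_html, "html.parser")
    questions = []
    for body in soup.find_all("div", class_="question_body"):
        # A group question is preceded by a group_instructions paragraph.
        if body.find_previous_sibling("p", class_="group_instructions") is not None:
            continue
        link = body.find("a")
        if link is not None:
            questions.append(link.get_text(strip=True))
    return questions


print(non_group_questions(SAMPLE_HTML))
```

To use it on the real site, pass web_page.text from each requests.get call to non_group_questions. This relies on the group_instructions paragraph being a sibling of the question body, as the posted HTML suggests; if the live page nests it differently, the sibling check would need adjusting.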
