How to extract only the text inside a given class or tag using BeautifulSoup?

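Problem description

I am scraping a website with BeautifulSoup4. Only a few parts of the site interest me, and most of them are inside an article tag. Some pages, however, have no article tag; there the content sits inside a div tag with a class name.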

Source: an example of the site I am trying to scrape is https://news.3m.com/English/3m-stories/3m-details/2020/3M-Foundation-awards-1M-to-four-local-organizations-in-support-of-racial-equity/default.aspx

On that page, the article text I care about is not inside an article tag; it is inside a div tag with the class name module_body.

Here is what I have done so far:

Helper functions

import os

from bs4 import BeautifulSoup


def parse_article(response, tag):
    
    article = [e.get_text() for e in response.find_all(tag)]
   
    article = '\n'.join(article)

    return article



def check_article(response):
    tags_classess_query = [
        ('article'), 
        ('div', {'class': 'module_body'})
    ]
    
    for item in tags_classess_query: 
        print('checking for {}'.format(item))
        
        if response.find(item):
            return item

    return None



# list all html files downloaded
html_files = [file for file in os.listdir(path) if '.html' in file]

# loop html_files to process each file
for file in html_files:
    
    filepath = os.path.join(path,file)
    article_file = os.path.splitext(filepath)[0]
    
    # file name to store the extracted text using BS4
    article_file = article_file + '.txt'
    
    
    with open(filepath, 'r', encoding='utf-8') as f:
    
        html = BeautifulSoup(f, 'html.parser')
        

    
    # check if selected tag exists in HTML. 
    
    tag = check_article(html)
        
    if tag is not None:
        #This is where I'm running into this issue where it still saves all of html page not just the text inside the selected tag/class

        article = parse_article(html, tag)
        
        w = open(article_file, 'w+', encoding='utf-8') 
        w.write(article)
        w.close()

    else:
        print("tag not found for %s" % file)
    
    

The problem I am running into is that this extracts everything on the page, not just the text inside the selected tag. What am I doing wrong?

Tags: python, web-scraping, beautifulsoup

Solution


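You are passing ('div', {'class': 'module_body'}) as a single tuple instead of 'div', {'class': 'module_body'}. Note that the latter is two separate arguments: the tag name and the attribute filter. Passed as one tuple, the dictionary never reaches find_all as its attrs argument, so the class restriction is not applied. Just replace this line in your parse_article function: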

def parse_article(response, tag):
    # pass the tag name and the attribute dict as two separate arguments
    article = [e.text for e in response.find_all(tag[0], tag[1])]
    return '\n'.join(article)
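To make the effect of the two-argument call concrete, here is a minimal sketch. The SAMPLE_HTML string is a made-up stand-in for a downloaded page, not the real 3M markup, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page (not the real site's HTML).
SAMPLE_HTML = """
<html><body>
  <div class="nav">Site navigation</div>
  <div class="module_body"><p>Story text.</p></div>
  <div class="footer">Footer links</div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

query = ("div", {"class": "module_body"})

# Two separate arguments: the tag name, then the attribute filter.
matches = soup.find_all(query[0], query[1])
texts = [m.get_text(strip=True) for m in matches]
print(texts)

# For comparison, a name-only search matches every div on the page.
all_divs = soup.find_all("div")
print(len(all_divs))
```

With the attribute dictionary passed as the second argument, only the module_body div is returned, while the name-only search picks up the navigation and footer divs as well.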

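Your other query has only one element, so indexing tag[1] would raise an index error for it. The unpacking operator * handles both cases. Note that ('article') is just the string 'article', so write that entry as a one-element tuple, ('article',), for unpacking to work: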

def parse_article(response, tag):
    # *tag unpacks the tuple into positional arguments for find_all
    article = [e.text for e in response.find_all(*tag)]
    return '\n'.join(article)
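Putting the pieces together, the following is one possible sketch of both helpers using unpacking throughout. The QUERIES list, the extract helper, and the three tiny HTML snippets are assumptions made up for illustration; they are not part of the original post.

```python
from bs4 import BeautifulSoup

# Each query is a real tuple so it can be unpacked with *.
QUERIES = [
    ("article",),                       # trailing comma makes this a tuple
    ("div", {"class": "module_body"}),
]


def check_article(soup):
    """Return the first query that matches something in soup, else None."""
    for query in QUERIES:
        if soup.find(*query) is not None:
            return query
    return None


def parse_article(soup, query):
    """Join the text of every element matching the query."""
    return "\n".join(e.get_text(strip=True) for e in soup.find_all(*query))


def extract(soup):
    """Hypothetical wrapper: pick a query, then extract its text."""
    query = check_article(soup)
    return None if query is None else parse_article(soup, query)


# Made-up snippets standing in for downloaded pages.
page_with_article = BeautifulSoup(
    "<div class='nav'>Menu</div><article>Story A</article>", "html.parser")
page_with_div = BeautifulSoup(
    "<div class='nav'>Menu</div><div class='module_body'>Story B</div>",
    "html.parser")
page_without = BeautifulSoup("<div class='nav'>Menu</div>", "html.parser")

print(extract(page_with_article))
print(extract(page_with_div))
print(extract(page_without))
```

Using the same unpacked query in both find and find_all keeps the existence check and the extraction consistent, so the navigation div is excluded in every case.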
