Memory leak while parsing HTML page source using BeautifulSoup and Requests

Problem description

So the basic idea is to make GET requests to a list of URLs and parse the text out of those page sources, stripping the HTML tags and scripts with BeautifulSoup. Python version is 2.7.

The problem is that the parser function keeps accumulating memory on every single request, and the usage grows steadily.

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    # page_source is the raw HTML (e.g. response.content); the original code
    # wrapped it in open(), which only works when page_source is a local file path
    soup = BeautifulSoup(page_source, 'html.parser')
#     soup = BeautifulSoup(page_source, "lxml")
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text
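For reference, a quick self-contained run of the function on a made-up HTML snippet (the sample string is purely illustrative, not from the original post):

from bs4 import BeautifulSoup

sample = "<html><body><script>var x = 1;</script><p>Hello   world</p></body></html>"
print(get_text_from_page_source(sample))
# the script element is stripped and the double-space split yields two lines:
# Hello
# world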

The memory leak occurs even when the input is a local text file. For example:

# request 1
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 100 MB

# request 2
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 150 MB

# request 3
response = requests.get(url, timeout=timeout)
parsed_string_from_html_source = get_text_from_page_source(response.content)  # 300 MB
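One way to verify numbers like these is to print the process's resident set size after each request. A minimal sketch, assuming psutil is installed and that urls and timeout are defined as in the question:

import os
import psutil
import requests

process = psutil.Process(os.getpid())

for url in urls:
    response = requests.get(url, timeout=timeout)
    parsed = get_text_from_page_source(response.content)
    # resident set size in MB after this request
    print('RSS: %d MB' % (process.memory_info().rss / (1024 * 1024)))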


Tags: python, memory-leaks, beautifulsoup, python-requests

Solution


You can try calling the garbage collector:

import gc

response.close()   # close the response to release the underlying connection
response = None    # drop the last reference so the object becomes collectable
gc.collect()       # force a garbage collection pass
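Put together, here is a hedged sketch of how those calls might sit inside the request loop (urls and timeout stand in for the values from the question):

import gc
import requests

for url in urls:
    response = requests.get(url, timeout=timeout)
    text = get_text_from_page_source(response.content)
    response.close()   # release the connection
    response = None    # drop the reference to the response object
    gc.collect()       # give the collector a chance between requests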

This may also help: Python high memory usage with BeautifulSoup
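That linked question points in the same direction: the parse tree itself can hold a lot of memory, so explicitly destroying it once the text has been extracted may help. A minimal sketch of that variant (the whitespace cleanup from the original function is omitted for brevity):

from bs4 import BeautifulSoup

def get_text_from_page_source(page_source):
    soup = BeautifulSoup(page_source, 'html.parser')
    for script in soup(["script", "style"]):
        script.decompose()
    text = soup.get_text()
    soup.decompose()   # destroy the whole tree so its nodes can be reclaimed
    return text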

