How can I handle a RecursionError and keep the Jupyter notebook kernel from dying?

Problem description

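I am trying to parse web pages with BeautifulSoup 4 and write the information I need to a YAML file. The site I am parsing is the Protein Data Bank, and the HTML file for each target structure is stored in my directory. I have done this many times in earlier work, with more data (extracting more information in one pass, and so on), but now it fails with "RecursionError: maximum recursion depth exceeded while getting the str of an object". I tried raising the limit with sys.setrecursionlimit(50000) and several other values, but each time my Jupyter notebook kernel died, so that did not help.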

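Can anyone suggest where the problem might be and how to fix it? I am using a Jupyter notebook. What puzzles me is that I previously ran the same code on more data, extracting more information, and it worked without problems. Could it be related to memory somehow? Is it running out of memory?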

Here is my code:

from bs4 import BeautifulSoup
import yaml


# etc.; there are about 110 of these files/pages. My earlier work used the same number and it ran fine.
PDB_ID = "1J5A, 1JZX, 1JZY, 1JZZ, 1K01, 1ML5, 1NKW, 1NWX, 1NWY, 1SM1, 1VVJ,..."
PDB_input = PDB_ID.split(", ")

for ID in PDB_input:
    dir_path = "/Documents/DATA/{}/" .format(ID)
    datafile = dir_path + "{}.html" .format(ID)
    
    with open(datafile) as f:
        soup = BeautifulSoup("".join(f.readlines()), "html.parser")

        rna_a = soup.find("td", text="23S rRNA")
        rna_b = soup.find("td", text="23S RIBOSOMAL RNA")
        rna_c = soup.find("td", text="23S RRNA")
        rna_d = soup.find("td", text="23S ribosomal RNA")
        rna_e = soup.find("td", text="23s RNA")
        rna_f = soup.find("td", text="23S ribsomal RNA")
        rna_g = soup.find("td", text="23S Ribosomal RNA")
        rna_h = soup.find("td", text="50S 23S RIBOSOMAL RNA")
        rna_i = soup.find("td", text="RIBOSOMAL 23S RNA")
        rna_j = soup.find("td", text="LSU rRNA")
        
    desired_info = {}
    if rna_a:
        rna_23S_chain = rna_a.next_sibling.contents[0]
    elif rna_b:
        rna_23S_chain = rna_b.next_sibling.contents[0]
    elif rna_c:
        rna_23S_chain = rna_c.next_sibling.contents[0]
    elif rna_d:
        rna_23S_chain = rna_d.next_sibling.contents[0]
    elif rna_e:
        rna_23S_chain = rna_e.next_sibling.contents[0]
    elif rna_f:
        rna_23S_chain = rna_f.next_sibling.contents[0]
    elif rna_g:
        rna_23S_chain = rna_g.next_sibling.contents[0] 
    elif rna_h:
        rna_23S_chain = rna_h.next_sibling.contents[0]
    elif rna_i:
        rna_23S_chain = rna_i.next_sibling.contents[0]
    elif rna_j:
        rna_23S_chain = rna_j.next_sibling.contents[0]
    else:
        print("{} id missing 23S rRNA alternative name" .format(ID))
        
    desired_info["rRNA_23S"] = rna_23S_chain
    
    yaml_file = dir_path + "{}.yaml" .format(ID)
    with open(yaml_file, "w") as outfile:
        yaml.dump(desired_info, outfile, default_flow_style=False)

Tags: python, python-3.x, recursion, beautifulsoup, jupyter-notebook

Solution
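
One likely cause, assuming the traceback ends in the yaml.dump call: rna_a.next_sibling.contents[0] is a Beautiful Soup NavigableString, not a plain str. A NavigableString keeps references to its parent and neighbouring elements, so a serializer that inspects object state can end up walking the whole parse tree and hit the recursion limit. Raising the limit with sys.setrecursionlimit does not remove that recursion, and a very high limit can let Python overflow the underlying stack, which may be why the kernel dies. The sketch below converts the value to a plain str before dumping. The names find_chain and RRNA_23S_NAMES are hypothetical helpers, and find_next_sibling("td"), the string= argument and yaml.safe_dump are substitutions for the question's next_sibling, text= and yaml.dump, not the asker's code.

```python
from bs4 import BeautifulSoup
import yaml

# Alternative labels copied from the question's code (hypothetical constant name).
RRNA_23S_NAMES = [
    "23S rRNA", "23S RIBOSOMAL RNA", "23S RRNA", "23S ribosomal RNA",
    "23s RNA", "23S ribsomal RNA", "23S Ribosomal RNA",
    "50S 23S RIBOSOMAL RNA", "RIBOSOMAL 23S RNA", "LSU rRNA",
]


def find_chain(soup, names=RRNA_23S_NAMES):
    """Return the text of the cell after the first matching <td> as a plain str.

    Returns None when no label matches, so the caller can skip the file.
    """
    for name in names:
        cell = soup.find("td", string=name)
        if cell is None:
            continue
        sibling = cell.find_next_sibling("td")
        if sibling is not None:
            # str(...) guarantees a built-in str that is detached from the parse tree.
            return str(sibling.get_text(strip=True))
    return None


# Small inline page standing in for one of the saved HTML files (made-up data).
html = "<table><tr><td>23S rRNA</td><td>A</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

chain = find_chain(soup)
if chain is None:
    print("missing 23S rRNA alternative name")
else:
    # safe_dump only accepts basic Python types, so a stray Beautiful Soup
    # object raises a RepresenterError instead of recursing.
    yaml_text = yaml.safe_dump({"rRNA_23S": chain}, default_flow_style=False)
    print(yaml_text)
```

The smallest change to the question's code along the same lines would be wrapping each value in str(...), for example str(rna_a.next_sibling.contents[0]), before storing it in desired_info.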

