Extracting a nested dictionary from HTML

Problem description

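I have an HTML file (a KEGG Mapper result), and I want to build a table with three columns, "Pathway", "KO", and "Query". The "Pathway" column would contain "01100 Metabolic pathways", the "KO" column should contain "K00166", and the "Query" column should contain "Trinity_GG_60253_c0_g1_i9.p2". This is the HTML source: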

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- saved from url=(0050)https://www.genome.jp/kegg-bin/find_pathway_object -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"><title>KEGG Mapper Reconstruction Result</title>
<meta name="http-equiv" content="Content-Type">
<script type="text/javascript" src="./KEGGResult_files/jquery.min.js.download"></script>
<script type="text/javascript" src="./KEGGResult_files/jquery-ui.min.js.download"></script>
<link rel="stylesheet" type="text/css" href="./KEGGResult_files/jquery-ui.css">
<link rel="stylesheet" type="text/css" href="./KEGGResult_files/mapper2.css">
<script language="JavaScript">
<!---

</style></head>
<body>
<h3>KEGG Mapper Reconstruction Result</h3>
<div class="box1">
<ul class="menu">
<form method="POST" name="form2">
<li class="on"><a href="https://www.genome.jp/kegg-bin/find_pathway_object#">Pathway (4)</a></li>
<li class="off"><a href="javascript:submit_mapper(&#39;find_brite_object&#39;,2)">Brite (2)</a></li>
<li class="off"><a href="javascript:submit_mapper(&#39;find_britetable_object&#39;,1)">Brite Table (1)</a></li>
<li class="off"><a href="javascript:submit_mapper(&#39;find_module_object&#39;,0)">Module (0)</a></li>
<input type="hidden" name="uploadfile" value="1631631910128046/mapper.args">
<input type="hidden" name="module_complete_file" value="1631631910128046/module_complete.list">
<input type="hidden" name="target" value="">
<input type="hidden" name="pathway_count" value="4">
<input type="hidden" name="brite_count" value="2">
<input type="hidden" name="brite_table_count" value="1">
<input type="hidden" name="module_count" value="0">
<input type="hidden" name="pathway_module_count" value="0">
</form>
</ul>
</div>
<div class="box2">
<form method="POST" name="form1" action="https://www.genome.jp/kegg-bin/find_pathway_object">
<input type="hidden" name="uploadfile" value="1631631910128046/mapper.args">
<input type="hidden" name="module_complete_file" value="1631631910128046/module_complete.list">
<input type="hidden" name="sort" value="object">
<input type="hidden" name="target" value="">
<input type="hidden" name="pathway_count" value="4">
<input type="hidden" name="brite_count" value="2">
<input type="hidden" name="brite_table_count" value="1">
<input type="hidden" name="module_count" value="0">
<input type="hidden" name="pathway_module_count" value="0">
</form>
<p>
</p><div id="all_status"><a href="javascript:display_all(&#39;none&#39;)">Hide matched objects</a></div>
<p>
</p><div id="list">
<!-- -->
<b>Metabolism</b>
<ul>
 Global and overview maps
  <ul>
<li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map01100.coords+reference" target="_blank">01100</a> Metabolic pathways&nbsp;(<a href="javascript:display(&#39;map01100&#39;)">1</a>)
<div id="objectmap01100" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li><li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map01110.coords+reference" target="_blank">01110</a> Biosynthesis of secondary metabolites&nbsp;(<a href="javascript:display(&#39;map01110&#39;)">1</a>)
<div id="objectmap01110" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li>  </ul>
 Carbohydrate metabolism
  <ul>
<li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map00640.coords+reference" target="_blank">00640</a> Propanoate metabolism&nbsp;(<a href="javascript:display(&#39;map00640&#39;)">1</a>)
<div id="objectmap00640" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li>  </ul>
 Amino acid metabolism
  <ul>
<li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map00280.coords+reference" target="_blank">00280</a> Valine, leucine and isoleucine degradation&nbsp;(<a href="javascript:display(&#39;map00280&#39;)">1</a>)
<div id="objectmap00280" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li></ul></ul></div></div>
</body></html>

Tags: beautifulsoup, html-parsing

Solution


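I treated your data as HTML and looked up the text by tag.

The pathway name (the first one is "Metabolic pathways") sits outside any tag, so I find the tag that follows it and then take the text node immediately before that tag.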
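To see why this navigation works, here is a minimal sketch on a single list item trimmed from the question's HTML. The `snippet` string and the variable names are illustrative, the `href` values are shortened, and Python's built-in `html.parser` is used instead of `lxml` only to avoid the extra dependency. `find_next("a")` returns the next `<a>` tag in document order, and `previous_element` is whatever was parsed immediately before that tag, which here is the bare text node holding the pathway name.

```python
from bs4 import BeautifulSoup

# One pathway entry trimmed from the question's HTML (illustrative snippet)
snippet = (
    '<li><a href="#">01100</a> Metabolic pathways&nbsp;'
    '(<a href="#">1</a>)</li>'
)
li = BeautifulSoup(snippet, "html.parser").find("li")

id_link = li.find("a")                  # <a>01100</a>
count_link = id_link.find_next("a")     # next <a> in document order: <a>1</a>
raw_name = count_link.previous_element  # text node right before that tag

# Drop the trailing "(" and the surrounding whitespace,
# including the non-breaking space produced by &nbsp;
name = raw_name.replace("(", "").strip()
```

In this layout the same text node is also the ID link's direct `next_sibling`, which may be the simpler way to reach it.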

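from bs4 import BeautifulSoup
import pandas as pd

# html is assumed to hold the page source shown in the question
soup = BeautifulSoup(html, "lxml")

rows = []
# Only the <li> items inside <div id="list"> are pathways;
# the menu <li> items at the top of the page have no <dt>/<dd>
for li in soup.select("div#list li"):
    # The pathway name is the bare text node just before the second <a>
    name = li.find("a").find_next("a").previous_element
    rows.append([
        name.replace("(", "").strip(),  # drop the trailing "(" and whitespace
        li.find("dt").get_text(),       # KO identifier
        li.find("dd").get_text(),       # query sequence name
    ])

df = pd.DataFrame(rows, columns=["Pathway", "KO", "Query"])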

Output:

    Pathway                                     KO      Query
0   Metabolic pathways                          K00166  Trinity_GG_60253_c0_g1_i9.p2
1   Biosynthesis of secondary metabolites       K00166  Trinity_GG_60253_c0_g1_i9.p2
2   Propanoate metabolism                       K00166  Trinity_GG_60253_c0_g1_i9.p2
3   Valine, leucine and isoleucine degradation  K00166  Trinity_GG_60253_c0_g1_i9.p2
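
The question asks for the pathway ID in the "Pathway" column ("01100 Metabolic pathways"), and the title mentions a nested dictionary. The following is a sketch of one way to get both, not the original answer's method. It assumes a pathway entry may list several `<dt>`/`<dd>` pairs, which the question's file does not show; the second pair in `sample` (K00000 / hypothetical_query) is made up for illustration, and the `href` values are shortened. It uses the built-in `html.parser` so that only `bs4` is required.

```python
from bs4 import BeautifulSoup

# Trimmed sample in the shape of the question's HTML. The second <dt>/<dd>
# pair (K00000 / hypothetical_query) is made up to show the multi-entry case.
sample = """
<div id="list">
<ul>
<li><a href="#">01100</a> Metabolic pathways&nbsp;(<a href="#">2</a>)
<div class="object"><dl>
<dt><a href="#">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
<dt><a href="#">K00000</a></dt>
<dd>hypothetical_query</dd>
</dl></div></li>
<li><a href="#">00640</a> Propanoate metabolism&nbsp;(<a href="#">1</a>)
<div class="object"><dl>
<dt><a href="#">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl></div></li>
</ul>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

nested = {}  # {pathway: {KO: query}}
rows = []    # flat (pathway, KO, query) tuples for a table
for li in soup.select("div#list li"):
    id_link = li.find("a")
    # Text node right after the ID link, e.g. " Metabolic pathways\xa0("
    name = id_link.next_sibling.replace("(", "").strip()
    pathway = f"{id_link.get_text(strip=True)} {name}"
    for dt in li.find_all("dt"):
        ko = dt.get_text(strip=True)
        # Pair each <dt> with the <dd> that follows it
        query = dt.find_next_sibling("dd").get_text(strip=True)
        nested.setdefault(pathway, {})[ko] = query
        rows.append((pathway, ko, query))
```

Passing `rows` to `pd.DataFrame(rows, columns=["Pathway", "KO", "Query"])` would give the three-column table, while `nested` keeps the pathway-to-KO-to-query grouping.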
