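beautifulsoup - Extracting a nested dictionary from HTML
Problem description
I have an HTML file (a KEGG Mapper result) and I want to build a table with three columns, "Pathway", "KO", and "Query". The "Pathway" column should contain "01100 Metabolic pathways", the "KO" column should contain "K00166", and the "Query" column should contain "Trinity_GG_60253_c0_g1_i9.p2". This is the HTML source: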
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<!-- saved from url=(0050)https://www.genome.jp/kegg-bin/find_pathway_object -->
<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"><title>KEGG Mapper Reconstruction Result</title>
<meta name="http-equiv" content="Content-Type">
<script type="text/javascript" src="./KEGGResult_files/jquery.min.js.download"></script>
<script type="text/javascript" src="./KEGGResult_files/jquery-ui.min.js.download"></script>
<link rel="stylesheet" type="text/css" href="./KEGGResult_files/jquery-ui.css">
<link rel="stylesheet" type="text/css" href="./KEGGResult_files/mapper2.css">
<script language="JavaScript">
<!---
</style></head>
<body>
<h3>KEGG Mapper Reconstruction Result</h3>
<div class="box1">
<ul class="menu">
<form method="POST" name="form2">
<li class="on"><a href="https://www.genome.jp/kegg-bin/find_pathway_object#">Pathway (4)</a></li>
<li class="off"><a href="javascript:submit_mapper('find_brite_object',2)">Brite (2)</a></li>
<li class="off"><a href="javascript:submit_mapper('find_britetable_object',1)">Brite Table (1)</a></li>
<li class="off"><a href="javascript:submit_mapper('find_module_object',0)">Module (0)</a></li>
<input type="hidden" name="uploadfile" value="1631631910128046/mapper.args">
<input type="hidden" name="module_complete_file" value="1631631910128046/module_complete.list">
<input type="hidden" name="target" value="">
<input type="hidden" name="pathway_count" value="4">
<input type="hidden" name="brite_count" value="2">
<input type="hidden" name="brite_table_count" value="1">
<input type="hidden" name="module_count" value="0">
<input type="hidden" name="pathway_module_count" value="0">
</form>
</ul>
</div>
<div class="box2">
<form method="POST" name="form1" action="https://www.genome.jp/kegg-bin/find_pathway_object">
<input type="hidden" name="uploadfile" value="1631631910128046/mapper.args">
<input type="hidden" name="module_complete_file" value="1631631910128046/module_complete.list">
<input type="hidden" name="sort" value="object">
<input type="hidden" name="target" value="">
<input type="hidden" name="pathway_count" value="4">
<input type="hidden" name="brite_count" value="2">
<input type="hidden" name="brite_table_count" value="1">
<input type="hidden" name="module_count" value="0">
<input type="hidden" name="pathway_module_count" value="0">
</form>
<p>
</p><div id="all_status"><a href="javascript:display_all('none')">Hide matched objects</a></div>
<p>
</p><div id="list">
<!-- -->
<b>Metabolism</b>
<ul>
Global and overview maps
<ul>
<li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map01100.coords+reference" target="_blank">01100</a> Metabolic pathways (<a href="javascript:display('map01100')">1</a>)
<div id="objectmap01100" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li><li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map01110.coords+reference" target="_blank">01110</a> Biosynthesis of secondary metabolites (<a href="javascript:display('map01110')">1</a>)
<div id="objectmap01110" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li> </ul>
Carbohydrate metabolism
<ul>
<li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map00640.coords+reference" target="_blank">00640</a> Propanoate metabolism (<a href="javascript:display('map00640')">1</a>)
<div id="objectmap00640" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li> </ul>
Amino acid metabolism
<ul>
<li><a href="https://www.genome.jp/kegg-bin/show_pathway?1631631910128046/map00280.coords+reference" target="_blank">00280</a> Valine, leucine and isoleucine degradation (<a href="javascript:display('map00280')">1</a>)
<div id="objectmap00280" class="object" style="display: inline;"><p>
</p><dl>
<dt><a href="https://www.genome.jp/dbget-bin/www_bget?K00166" target="_blank">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li></ul></ul></div></div>
</body></html>
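Solution
I treated your data as HTML and looked up the text by tag.
The pathway name (for example the first one, "Metabolic pathways") is a bare text node that sits outside any tag of its own, so I find the tag that follows it and then take the text element immediately before that tag.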
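from bs4 import BeautifulSoup
import pandas as pd

# html is assumed to hold the page source shown above, as a string
soup = BeautifulSoup(html, "lxml")

lst = []
# Only the <li> items inside <div id="list"> are pathways; the menu <li> items have no <dt>/<dd>
for i in soup.select("div#list li"):
    data_lst = []
    # The pathway name is the text node just before the second <a>; drop the trailing "("
    data_lst.append(i.find("a").find_next().previous_element.replace("(", "").strip())
    data_lst.append(i.find("dt").get_text())  # KO identifier
    data_lst.append(i.find("dd").get_text())  # query sequence name
    lst.append(data_lst)

df = pd.DataFrame(columns=["Pathway", "KO", "Query"], data=lst)
print(df)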
Output:
                                      Pathway      KO                         Query
0                          Metabolic pathways  K00166  Trinity_GG_60253_c0_g1_i9.p2
1       Biosynthesis of secondary metabolites  K00166  Trinity_GG_60253_c0_g1_i9.p2
2                       Propanoate metabolism  K00166  Trinity_GG_60253_c0_g1_i9.p2
3  Valine, leucine and isoleucine degradation  K00166  Trinity_GG_60253_c0_g1_i9.p2
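The question asks for the pathway number in the "Pathway" column as well ("01100 Metabolic pathways"), and a pathway could in principle list more than one KO. The following is a minimal sketch of one way to cover both cases, not the answer's original method. It assumes that each `<dt>` (KO) is followed by its own `<dd>` (query), as in the sample. `SAMPLE_HTML` is a trimmed copy of the `<div id="list">` block with placeholder links, and `extract_rows` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed sample of the <div id="list"> block from the question (hrefs shortened).
SAMPLE_HTML = """
<div id="list">
<b>Metabolism</b>
<ul>
Global and overview maps
<ul>
<li><a href="#">01100</a> Metabolic pathways (<a href="#">1</a>)
<div id="objectmap01100" class="object"><dl>
<dt><a href="#">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li>
</ul>
Carbohydrate metabolism
<ul>
<li><a href="#">00640</a> Propanoate metabolism (<a href="#">1</a>)
<div id="objectmap00640" class="object"><dl>
<dt><a href="#">K00166</a></dt>
<dd>Trinity_GG_60253_c0_g1_i9.p2</dd>
</dl>
</div></li>
</ul>
</ul>
</div>
"""


def extract_rows(html):
    """Return one (pathway, ko, query) tuple per <dt>/<dd> pair."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for li in soup.select("div#list li"):
        link = li.find("a")
        if link is None:
            continue
        # The pathway name is the bare text node right after the first <a>.
        name_node = link.next_sibling
        name = str(name_node) if isinstance(name_node, NavigableString) else ""
        # Drop the "(" that opens the match count, then surrounding whitespace.
        name = name.rstrip().rstrip("(").strip()
        pathway = f"{link.get_text(strip=True)} {name}".strip()
        # Pair each <dt> (KO) with the <dd> (query) that follows it.
        for dt in li.find_all("dt"):
            dd = dt.find_next_sibling("dd")
            if dd is not None:
                rows.append(
                    (pathway, dt.get_text(strip=True), dd.get_text(strip=True))
                )
    return rows


rows = extract_rows(SAMPLE_HTML)
for row in rows:
    print(row)
```

The resulting list of tuples can be passed to `pd.DataFrame(rows, columns=["Pathway", "KO", "Query"])` to get the same kind of table as above.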