Getting HTML end tags between table tr tags when scraping with Beautiful Soup in Python

Problem description

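I am trying to extract data from an HTML page source. The data I need sits inside a table tag that contains a list of tr tags. I cannot iterate over the tr tags, and when I print the prettified soup I get the output below, with HTML closing tags in the middle of the table.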

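from urllib import request

from bs4 import BeautifulSoup

main_url = "https://www.ugc.ac.in/"
link = "stateuniversitylist.aspx?id=1&Unitype=2"
# Download the page and parse it with Python's built-in parser.
state_src = request.urlopen(main_url + link)
state_soup = BeautifulSoup(state_src, "html.parser")
# Intended next step: iterate over the rows of the first table.
# univ_table = state_soup.table
# out = univ_table.find_all("tr")
# prettify is a method, so it has to be called.
print(state_soup.prettify())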

The output around the table's first tr tag is shown below. The document is closed right after it, even though more tr tags follow this HTML:

<tr>
<td>
<div class="panel panel-default">
<div class="panel-body">
<div class="col-md-12">
<font color="#006699"><b>
                                Acharaya N.G.Ranga Agricultural University</b></font><br/>
<a href="http://www.angrau.ac.in">
                                http://www.angrau.ac.in</a><br>
<div class="box100">
<font color="#006699">Address:
                                </font>
</div>
<div class="box200">
                                Lam, Gantur<br/>
</div>
<div class="clear">
</div>
<div class="box100">
<font color="#006699">State:</font></div>
<div class="box200">
                                Andhra Pradesh
                                -
                                522034
                            </div>
</br></div>
<div class="col-md-12">
<div class="panel-heading">
<h4 class="panel-title">
<i aria-hidden="true" class="fa fa-plus-square orange-text"></i><a data-toggle="collapse" href="#collapse10"> View More</a>
</h4>
</div>
<div class="panel-collapse collapse" id="collapse10">
<div class="panel-body">
<div id="ctl00_bps_homeCPH_dluniversity_ctl02_UpdatePanel1">
<ul class="nav nav-pills">
<li class="active" style="font-size: 12px; border: 1px solid;"><a data-toggle="tab" href="#menu10">Student Enrolment Details</a></li>
<li style="font-size: 12px; border: 1px solid;"><a data-toggle="tab" href="#menu20">Faculty Details</a></li>
<li style="font-size: 12px; border: 1px solid;"><a data-toggle="tab" href="#menu30">M.Phils and Ph.Ds Awarded</a></li>
<li style="font-size: 12px; border: 1px solid;"><a data-toggle="tab" href="#menu40">Grant Allocation Details</a></li>
<li style="font-size: 12px; border: 1px solid;"><a data-toggle="tab" href="#menu50">More</a></li>
</ul>
<div class="tab-content">
<div class="tab-pane fade in active" id="menu10">
<iframe frameborder="0" id="myFrame" src="uni_stuinfo.aspx?id=185" width="100%">
</iframe>
</div>
<div class="tab-pane fade" id="menu20">
<iframe frameborder="0" id="myFrame" src="uni_faculty.aspx?id=185" width="100%">
</iframe>
</div>
<div class="tab-pane fade" id="menu30">
<iframe frameborder="0" id="myFrame" src="uni_phd.aspx?id=185" width="100%">
</iframe>
</div>
<div class="tab-pane fade" id="menu40">
<iframe frameborder="0" id="myFrame" src="uni_grantinfo.aspx?id=185" width="100%">
</iframe>
</div>
<div class="tab-pane fade" id="menu50">
<iframe frameborder="0" id="myFrame" src="uni_contactinfo.aspx?id=185" width="100%"></iframe>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</td></tr></table></div>
</div></div></div></div></div></div></div></form></body></html>
<tr>

Tags: python, html, web-scraping, beautifulsoup

Solution


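If anyone runs into this problem, try changing the parser to lxml. The page source appears to have an extra </body> before the actual end of the document, and html.parser closes the table at that point, so the remaining tr tags end up outside it.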
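
The parser difference can be sketched on a small made-up document. SAMPLE below is an assumption, not the real UGC page: it only imitates a premature </body></html> in the middle of a table. The helper name count_table_rows is hypothetical as well. The lxml branch assumes the lxml package is installed, and since the tree lxml builds can depend on the underlying libxml2 version, no output is claimed for it here.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Made-up markup (an assumption, not the real page): a stray </body></html>
# appears before the second row and before the real end of the table.
SAMPLE = """
<html><body>
<table>
<tr><td>University A</td></tr>
</body></html>
<tr><td>University B</td></tr>
</table>
"""


def count_table_rows(markup, parser):
    """Return how many <tr> tags end up inside the first <table>."""
    soup = BeautifulSoup(markup, parser)
    return len(soup.table.find_all("tr"))


# html.parser closes the open table and body when it meets the early
# </body>, so rows after that point land outside the table.
print("html.parser:", count_table_rows(SAMPLE, "html.parser"))

try:
    # lxml is a third-party parser; Beautiful Soup raises FeatureNotFound
    # when it is not installed.
    print("lxml:", count_table_rows(SAMPLE, "lxml"))
except FeatureNotFound:
    print("lxml is not installed; try: pip install lxml")
```

For the script in the question, the corresponding change would be passing "lxml" instead of "html.parser" as the second argument to BeautifulSoup.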

