python - BeautifulSoup: Elements Won't Show on Children List. Parser problem?
Problem description
Here is my code:
import bs4 as bs
from urllib.request import urlopen
page = urlopen("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/#1/").read()
soup = bs.BeautifulSoup(page, "lxml")
div_lista_locacao = soup.select("div#lista-locacao")[0]
ul_tags = list(div_lista_locacao.children)
print("ul_tags = ",ul_tags)
(You can see I printed a list containing the children of the div_lista_locacao).
The output:
ul_tags = ['\n']
(And it only shows a line break, even though there are actual children to it as you can see below).
This is the HTML of my source:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" style="" class=" js flexbox flexboxlegacy canvas canvastext webgl no-touch geolocation postmessage no-websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients no-cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent video audio localstorage sessionstorage webworkers applicationcache svg inlinesvg smil svgclippaths"
lang="pt">
<head></head>
<body id="topo_geral" itemscope="" itemtype="http://schema.org/WebPage">
<div id="container-hero" class="container-fluid"></div>
<div id="resultado" class="container-fluid page-container">
<!-- DESKTOP -->
<div id="banner-resultado" class="col col-xs-12 col-sm-12 col-md-12col-lg-12 text-center hide"></div>
<div class="row hidden-xs hidden-sm">
<div class="col col-xs-12 col-sm-12 col-md-3 col-lg-3 filtro-resultado"></div>
<div class="col col-xs-12 col-sm-12 col-md-9 col-lg-9 box-resultado-hidden-xs hidden-sm"></div>
<button id="btn-ordenacao-por-valor" data-ordenar="asc" class="btnbtn-valor btn-branco"></button>
<ul class="nav nav-tabs" role="tablist" id="myTab"></ul>
<div class="tab-content">
<div role="tabpanel" class="tab-pane active" id="locacao">
#Currently manipulating this tag beneath. This is the "div_lista_locacao" variable.
<div id="lista-locacao" class="col col-xs-12 col-sm-12 col-md-12 col-lg-12 nopadmar">
##Need to iterate between these 'ul' tags beneath and parse the text internally.
## But they won't show up in the .children list.
<ul class="ul-resultado paginacao paginacao_numero_1" style="display: block;"></ul>
<ul class="ul-resultado paginacao paginacao_numero_2" style="display: block;"></ul>
<ul class="ul-resultado paginacao paginacao_numero_3" style="display: none;"></ul>
</div>
</div>
</div>
</div>
</div>
</body>
</html>
##I can reply with the contents inside the 'ul' tags if requested.
##But I just thought it wouldn't be necessary for this particular question.
I'm using "lxml" to parse it, but I've already tried switching to "html.parser", "html5lib" and "xml", all giving similar results.
So, is it the parser? Is it the library I used to download the web page? Did it not download this section? Or maybe a BS bug? IDK.
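Solution
As @facelessuser's answer already mentioned, the content is loaded dynamically using JavaScript.
The good news is that you can make the same AJAX request from Python and get a JSON response, which contains all the data you need. Here I just print the prices.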
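To see why the parser is not at fault, here is a minimal sketch that parses a hand-written stand-in for the HTML the server sends before any JavaScript runs. The `static_html` string is an assumption made for illustration (it is not the site's real response); it only mirrors the situation in the question, where the container exists but holds nothing except whitespace.

```python
import bs4 as bs

# Hypothetical stand-in for the server's initial HTML: the container div is
# present, but the <ul> elements have not been injected by JavaScript yet.
static_html = """
<div id="locacao">
<div id="lista-locacao" class="col nopadmar">
</div>
</div>
"""

soup = bs.BeautifulSoup(static_html, "html.parser")
container = soup.select("div#lista-locacao")[0]

# The only child is the newline between the opening and closing tags.
children = list(container.children)
ul_count = len(container.find_all("ul"))

print("children =", children)
print("ul count =", ul_count)
```

Any parser gives the same picture here, because the `ul` tags are simply absent from the downloaded markup; the browser's inspector shows them only because it displays the DOM after the scripts have run.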
import json
from urllib.request import urlopen
# Call the same AJAX endpoint the page uses; it returns JSON rather than HTML.
page = urlopen("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/?pagina=1&busca=%7B%22valorMinimo%22%3Anull%2C%22valorMaximo%22%3Anull%2C%22quartos%22%3Anull%2C%22suites%22%3Anull%2C%22banhos%22%3Anull%2C%22vagas%22%3Anull%2C%22idadeMinima%22%3Anull%2C%22areaMinima%22%3Anull%2C%22areaMaxima%22%3Anull%2C%22bairros%22%3A%5B%22santo-antonio%22%5D%2C%22ordenar%22%3Anull%7D&outrasPags=true&quantidadeDeRegistro=20&first=false").read()
# The listings are stored under the 'lista' key of the response.
properties = json.loads(page)['lista']
for item in properties:
    print(item['valorLocacaoFormat'])
Output
R$ 1.490,00
R$ 2.300,00
R$ 1.480,00
R$ 1.600,00
R$ 1.700,00
R$ 2.100,00
R$ 1.600,00
...
Note: to find the AJAX URL I'm using, open the Network tab in your browser's developer tools and load the page. You can see the XHR requests being made there.
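Once the XHR URL has been spotted, its query string does not have to be pasted in percent-encoded form. The sketch below rebuilds the same URL from readable parts using only the standard library. The parameter names and values are copied from the URL above; the helper name `build_ajax_url` is hypothetical, and it is an assumption (not verified against the site) that changing `pagina` returns the other result pages.

```python
import json
from urllib.parse import urlencode

BASE_URL = ("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/"
            "bairros/santo-antonio/apartamento/")


def build_ajax_url(page_number=1, per_page=20):
    """Rebuild the XHR URL seen in the Network tab (hypothetical helper)."""
    # The 'busca' parameter is a JSON object; None is serialized as null.
    filters = {
        "valorMinimo": None,
        "valorMaximo": None,
        "quartos": None,
        "suites": None,
        "banhos": None,
        "vagas": None,
        "idadeMinima": None,
        "areaMinima": None,
        "areaMaxima": None,
        "bairros": ["santo-antonio"],
        "ordenar": None,
    }
    params = {
        "pagina": page_number,
        # Compact separators reproduce the JSON exactly as the browser sent it.
        "busca": json.dumps(filters, separators=(",", ":")),
        "outrasPags": "true",
        "quantidadeDeRegistro": per_page,
        "first": "false",
    }
    # urlencode percent-encodes the braces, quotes, colons and commas.
    return BASE_URL + "?" + urlencode(params)


print(build_ajax_url(1))
```

The resulting string can be passed to `urlopen` in place of the hard-coded URL, which makes it easier to adjust a filter or loop over page numbers.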