首页 > 解决方案 > BeautifulSoup:

问题描述

Here is my code:

import bs4 as bs
from urllib.request import urlopen

page = urlopen("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/#1/").read()

soup = bs.BeautifulSoup(page, "lxml")

div_lista_locacao = soup.select("div#lista-locacao")[0]

ul_tags = list(div_lista_locacao.children)

print("ul_tags = ",ul_tags)

(You can see I printed a list containing the children of the div_lista_locacao).

The output:

ul_tags =  ['\n']

(And it only shows a line break, even though there are actual children to it as you can see below).

This is the HTML of my source:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" style="" class=" js flexbox flexboxlegacy canvas canvastext webgl no-touch geolocation postmessage no-websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients no-cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent video audio localstorage sessionstorage webworkers applicationcache svg inlinesvg smil svgclippaths"
  lang="pt">
<head></head>
<body id="topo_geral" itemscope="" itemtype="http://schema.org     
   /WebPage">
  <div id="container-hero" class="container-fluid"></div>
  <div id="resultado" class="container-fluid page-container">
    <!-- DESKTOP -->
    <div id="banner-resultado" class="col col-xs-12 col-sm-12 col-
       md-12col-lg-12 text-center hide"></div>
    <div class="row hidden-xs hidden-sm">
      <div class="col col-xs-12 col-sm-12 col-md-3 col-lg-3 filtro-  
         resultado"></div>
      <div class="col col-xs-12 col-sm-12 col-md-9 col-lg-9 box-
         resultado-hidden-xs hidden-sm"></div>
      <button id="btn-ordenacao-por-valor" data-ordenar="asc" class="btnbtn-valor btn-branco"></button>
      <ul class="nav nav-tabs" role="tablist" id="myTab"></ul>
      <div class="tab-content">
        <div role="tabpanel" class="tab-pane active" id="locacao">
          #Currently manipulating this tag beneath. This is the "div_lista_locacao" variable.
          <div id="lista-locacao" class="col col-xs-12 col-sm-12 col-
            md-12 col-lg-12 nopadmar">
            ##Need to iterate between these 'ul' tags beneath and parse the text internally.
            ## But they won't show up in the .children list.
              <ul class="ul-resultado paginacao paginacao_numero_1" style="display: block;"></ul>
              <ul class="ul-resultado paginacao paginacao_numero_2" style="display: block;"></ul>
              <ul class="ul-resultado paginacao paginacao_numero_3" style="display: none;"></ul>
          </div>
        </div>
      </div>
    </div>
  </div>
</body>

</html>

##I can reply with the contents inside the 'ul' tags if requested. 
##But I just thought it wouldn't be necessary for this particular question.

I'm using "lxml" to parse it, but I've already tried changing it to "html.parser","html5lib" and "xml". All giving similar results.

So, is it the parser? Is it the library I used to download the web page? Did it not download this section? Or maybe a BS bug? IDK.

标签: pythonhtmlparsingweb-scrapingbeautifulsoup

解决方案


正如@facelessuser 的回答中已经提到的那样,内容是使用 Javascript 动态加载的。

好消息是您可以通过 python 发出相同的 ajax 请求并获得 json 响应。这包含您需要的所有数据。我只是打印价格。

import bs4 as bs
from urllib.request import urlopen
import json
page = urlopen("https://www.netimoveis.com/locacao/minas-gerais/belo-horizonte/bairros/santo-antonio/apartamento/?pagina=1&busca=%7B%22valorMinimo%22%3Anull%2C%22valorMaximo%22%3Anull%2C%22quartos%22%3Anull%2C%22suites%22%3Anull%2C%22banhos%22%3Anull%2C%22vagas%22%3Anull%2C%22idadeMinima%22%3Anull%2C%22areaMinima%22%3Anull%2C%22areaMaxima%22%3Anull%2C%22bairros%22%3A%5B%22santo-antonio%22%5D%2C%22ordenar%22%3Anull%7D&outrasPags=true&quantidadeDeRegistro=20&first=false").read()
properties=json.loads(page)['lista']
for item in properties:
    print(item['valorLocacaoFormat'])

输出

R$ 1.490,00
R$ 2.300,00
R$ 1.480,00
R$ 1.600,00
R$ 1.700,00
R$ 2.100,00
R$ 1.600,00
...

注意:要查找我正在使用的 ajax url,请在浏览器开发人员工具中打开网络选项卡并转到该 url。您可以看到正在发出的 xhr 请求。

在此处输入图像描述


推荐阅读