Values not present in tags when extracting data from a dashboard with Python BeautifulSoup

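Question

How can I use Python to scrape data from this page, specifically from the charts? I have tried BeautifulSoup, but when I inspect the page's HTML, the data does not appear to be in any tag I could scrape.

I cannot find the numbers from the charts in my requests response, and I cannot find them when inspecting the HTML either (see the image below).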

Input

from bs4 import BeautifulSoup
import requests
url = "https://viz.saude.gov.br/extensions/CobVac_MOV/CobVac_MOV.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html")
print(soup.prettify())

Output

<!DOCTYPE html>
<html>
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <title>
   MS-SUS COVID-19 Distribuição de Vacinas
  </title>
  <meta charset="utf-8"/>
  <meta content="True" name="HandheldFriendly"/>
  <meta content="320" name="MobileOptimized"/>
  <meta content="width=device-width, initial-scale=1.0, maximum-scale=1.0, minimum-scale=1.0, user-scalable=no" name="viewport"/>
  <meta content="yes" name="apple-mobile-web-app-capable"/>
  <meta content="black" name="apple-mobile-web-app-status-bar-style"/>
  <meta content="on" http-equiv="cleartype"/>
  <!--Polymer stuff -->
  <script src="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/webcomponentsjs/webcomponents-lite.min.js">
  </script>
  <script src="https://kit.fontawesome.com/a076d05399.js">
  </script>
  <link href="qliksense-card.html" rel="import"/>
  <link href="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/iron-flex-layout/iron-flex-layout-classes.html" rel="import"/>
  <link href="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/paper-header-panel/paper-header-panel.html" rel="import"/>
  <link href="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/paper-toolbar/paper-toolbar.html" rel="import"/>
  <link href="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/paper-drawer-panel/paper-drawer-panel.html" rel="import"/>
  <link href="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/paper-icon-button/paper-icon-button.html" rel="import"/>
  <link href="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/iron-icons/iron-icons.html" rel="import"/>
  <link href="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/iron-pages/iron-pages.html" rel="import"/>
  <link href="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/paper-menu/paper-menu.html" rel="import"/>
  <link href="https://cdn.rawgit.com/download/polymer-cdn/1.7.0.2/lib/paper-item/paper-item.html" rel="import"/>
  <link href="polymer-mixins.html" rel="import"/>
  <style include="iron-flex iron-positioning" is="custom-style">
  </style>
  <style include="polymer-mixins" is="custom-style">
  </style>
  <!-- Bootstrap css -->
  <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" rel="stylesheet"/>
  <!-- Font Awesome -->
  <link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
  <!-- Qlik -->
  <link href="../../resources/autogenerated/qlik-styles.css" rel="stylesheet"/>
  <script src="../../resources/assets/external/requirejs/require.js">
  </script>
  <!-- Bootstrap js -->
  <script crossorigin="anonymous" src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/js/bootstrap.min.js">
  </script>
  <!-- google fonts -->
  <link href="https://fonts.googleapis.com/css?family=Source+Sans+Pro" rel="stylesheet"/>
  <!-- Project code -->
  <link href="CobVac_MOV.css" rel="stylesheet"/>
  <script src="CobVac_MOV.js">
  </script>
  <!-- fontawesome -->
  <link href="https://maxcdn.bootstrapcdn.com/font-awesome/4.7.0/css/font-awesome.min.css" rel="stylesheet"/>
 </head>
 <body class="fullbleed vertical layout">
  <paper-drawer-panel disable-edge-swipe="true" force-narrow="true" right-drawer="" z-index="1000">
   <!-- FILTROS INI ============================================================ -->
   <div drawer="">
    <div class="drawer-title">
     Filtros
    </div>
    <div class="filter-container">
     <div class="qvobject" id="qvfilters">
     </div>
    </div>
   </div>
   <!-- FILTROS FIM ============================================================ -->
   <!-- PAGINA INI ============================================================ -->
   <paper-header-panel main="">
    <!-- HEADER INI ============================================================ -->
    <div class="paper-header">
     <paper-toolbar style="background-color: #306BBC; color: #ffffff;">
      <paper-icon-button class="visible-xs-block" icon="menu" id="nav-menu-button">
      </paper-icon-button>
      <img src="LOGO_TOPO.png" style="height:33px; width:161px;"/>
      <div class="title" style="font-size:18px;">
       <b>
        COVID-19 Vacinação
        <br/>
        Distribuição de Vacinas
       </b>
      </div>
      <!--TITLE-->
      <paper-icon-button class="filter-drawer-toggle" icon="search" paper-drawer-toggle="">
      </paper-icon-button>
      <paper-icon-button class="filter-drawer-toggle" data-target="#basic" data-toggle="modal" icon="help">
      </paper-icon-button>
     </paper-toolbar>
     <!-- BARRA DE FILTROS =================== more-vert -->
     <div class="qvobjects" id="CurrentSelections" style="position:relative; top:0; left:0; width:100%; height:38px;">
     </div>
    </div>
    <!-- HEADER FIM ============================================================ -->
    <!-- PAGINA UTIL INI ============================================================ -->
    <paper-drawer-panel drawer-width="0px" id="nav-drawer">
     <!-- PAGINAS ## INI ============================================================ -->
     <iron-pages main="" selected="0" style="background-color:#eee;">
      <!-- Each .paper-body contained within <iron-pages> is a view. Copy and paste to add more views. -->
      <!-- Don't forget to add a <paper-item> in the <paper-menu> above to be able to navigate to any view you add -->
      <!-- ========================== -->
      <!-- PAGINA 0 -->
      <!-- ========================== -->
      <div class="paper-body">
       <div class="container-fluid">
        <!-- A .qvplaceholder will become a droppable area in the dev-hub -->
        <!-- Each .qvplaceholder must have a unique id -->
        <!-- These .qvplaceholder objects below have an extra class, .kpi, which applies some simple styles intended for kpi objects -->
        <!--
                            <div class="row">
                            <p style="color:red">
                             <b>IMPORTANTE: As informações mostradas neste painel referem-se apenas às doses enviadas a partir do Ministério da Saúde.</b>
                            </p>
                            </div>
-->
        <!--<div class="row">
                             <b>DOSES ENVIADAS PELO MINISTÉRIO DA SAÚDE AOS ESTADOS</b>
                            </div>-->
        <!-- =================== -->
        <!-- KPIS -->
        <!-- =================== -->
        <!-- ====================================================== -->
        <!-- Ver os icones em https://fontawesome.com/v4.7.0/icons/ -->
        <!-- ====================================================== -->
        <div class="row kpi-row">
         <div class="col-xs-12 col-sm-12 col-lg-7">
          <div class="kpi-side">
           <i class="fas fa-syringe">
           </i>
          </div>
          <div class="kpi corkpi01 qvobject" id="KPI-01">
          </div>
         </div>
         <!--<div class="col-xs-12 col-sm-6  col-lg-3">
                                    <div class="kpi-side"><i class="fas fa-syringe"></i></div>
                                    <div class="kpi corkpi02 qvobject" id="KPI-02"></div>
                                </div>
                                <div class="col-xs-12 col-sm-6 col-lg-4">
                                    <div class="kpi-side"><i class="fas fa-syringe"></i></div>
                                    <div class="kpi corkpi03 qvobject" id="KPI-03"></div>
                                </div>
                                <div class="col-xs-12 col-sm-6 col-lg-8">
                                    <div class="kpi-side"><i class="fas fa-syringe"></i></div>
                                    <div class="kpi corkpi04 qvobject" id="KPI-04"></div>
                                </div>-->
        </div>
        <div class="row">
         <p>
          <b>
           <a href="https://sage.saude.gov.br/sistemas/vacina/documentosVacina.php">
            Acesse aqui
           </a>
           os arquivos com os comprovantes de recebimento pelos Estados.
           <br/>
          </b>
          <!--<br>
                             Esclarecimento: Doses em Trânsito são aquelas que estão sendo enviadas pelos Estados aos seus Municípios.
                             </p>-->
         </p>
        </div>
        <!-- =================== -->
        <!-- GRAFICOS 0.1  UF, MAPA  -->
        <!-- =================== -->
        <div class="row">
         <div class="col-xs-12 col-sm-8">
          <!-- Placing a .qvplaceholder within a <qliksense-card> will create a cardified object -->
          <qliksense-card content-height="300px">
           <div class="with-title qvobject" id="QV1-G01A">
           </div>
          </qliksense-card>
         </div>
         <div class="col-xs-12 col-sm-4">
          <qliksense-card content-height="300px">
           <div class="with-title qvobject" id="QV1-G01B">
           </div>
          </qliksense-card>
         </div>
        </div>
        <!-- =================== -->
        <!-- GRAFICOS 0.2 VACINA, TEMPO -->
        <!-- =================== -->
        <div class="row">
         <div class="col-xs-12 col-sm-6">
          <!-- Placing a .qvplaceholder within a <qliksense-card> will create a cardified object -->
          <qliksense-card content-height="400px">
           <div class="with-title qvobject" id="QV1-G02A">
           </div>
          </qliksense-card>
         </div>
         <div class="col-xs-12 col-sm-6">
          <qliksense-card content-height="400px">
           <div class="with-title qvobject" id="QV1-G02B">
           </div>
          </qliksense-card>
         </div>
        </div>
        <!-- =================== -->
        <!-- GRAFICOS 0.3 TABELA_UF PERCENTUAL_REPASSE -->
        <!-- =================== -->
        <div class="row">
         <div class="col-xs-12 col-sm-5">
          <!-- Placing a .qvplaceholder within a <qliksense-card> will create a cardified object -->
          <qliksense-card content-height="300px">
           <div class="with-title qvobject" id="QV1-G03A">
           </div>
          </qliksense-card>
         </div>
         <div class="col-xs-12 col-sm-7">
          <!-- Placing a .qvplaceholder within a <qliksense-card> will create a cardified object -->
          <qliksense-card content-height="300px">
           <div class="with-title qvobject" id="QV1-G03B">
           </div>
          </qliksense-card>
         </div>
        </div>
        <!-- ================================================================================================================== -->
        <!--<div class="row">
                             <b>DOSES REPASSADAS PELOS ESTADOS AOS MUNICÍPIOS</b>
                            </div>-->
        <!-- =================== -->
        <!-- KPIS -->
        <!-- =================== -->
        <!-- ====================================================== -->
        <!-- Ver os icones em https://fontawesome.com/v4.7.0/icons/ -->
        <!-- ====================================================== -->
        <div class="row kpi-row">
         <!--<div class="col-xs-12 col-sm-12 col-lg-3">
                                    <div class="kpi-side"><i class="fas fa-syringe"></i></div>
                                    <div class="kpi corkpi01 qvobject" id="KPI-01B"></div>
                                </div>
                                <div class="col-xs-12 col-sm-6  col-lg-3">
                                    <div class="kpi-side"><i class="fas fa-syringe"></i></div>
                                    <div class="kpi corkpi02 qvobject" id="KPI-02B"></div>
                                </div>
                                <div class="col-xs-12 col-sm-6 col-lg-3">
                                    <div class="kpi-side"><i class="fas fa-syringe"></i></div>
                                    <div class="kpi corkpi03 qvobject" id="KPI-03B"></div>
                                </div>-->
         <div class="col-xs-12 col-sm-6 col-lg-6">
          <div class="kpi-side">
           <i class="fas fa-syringe">
           </i>
          </div>
          <div class="kpi corkpi04 qvobject" id="KPI-04B">
          </div>
         </div>
        </div>
        <!--<div class="row">
                             <b>Esclarecimento: Doses em Trânsito são aquelas que estão sendo enviadas pelos Estados aos seus Municípios.</b>
                            </div>-->
        <!-- =================== -->
        <!-- MAPAS 0.4   MN RELOGIO -->
        <!-- =================== -->
        <div class="row">
         <div class="col-xs-12 col-sm-8">
          <qliksense-card content-height="300px">
           <div class="with-title qvobject" id="QV1-G04A">
           </div>
          </qliksense-card>
         </div>
         <div class="col-xs-12 col-sm-4">
          <qliksense-card content-height="300px">
           <div class="with-title qvobject" id="QV1-G04B">
           </div>
          </qliksense-card>
         </div>
        </div>
        <!-- =================== -->
        <!-- GRAFICOS 0.5 VACINA TEMPO   -->
        <!-- =================== -->
        <div class="row">
         <div class="col-xs-12 col-sm-6">
          <qliksense-card content-height="300px">
           <div class="with-title qvobject" id="QV1-G05A">
           </div>
          </qliksense-card>
         </div>
         <div class="col-xs-12 col-sm-6">
          <qliksense-card content-height="300px">
           <div class="with-title qvobject" id="QV1-G05B">
           </div>
          </qliksense-card>
         </div>
        </div>
        <!-- =================== -->
        <!-- GRAFICOS 0.5 TABELA   -->
        <!-- =================== -->
        <div class="row">
         <div class="col-xs-12 col-sm-6">
          <qliksense-card content-height="300px">
           <div class="with-title qvobject" id="QV1-G06A">
           </div>
          </qliksense-card>
         </div>
        </div>
        <!-- ====================================================== -->
        <!-- EXPORT -->
        <!-- ====================================================== -->
        <div class="row kpi-row">
         <div class="col-xs-12 col-sm-12 col-md-4">
          <div class="kpi white-2 qvobject" id="TXT-Origem" style="box-shadow:none">
          </div>
         </div>
         <div class="col-xs-12 col-sm-6 col-md-4">
          <div class="kpi white-2 qvobject" id="TXT-DTATU" style="box-shadow:none">
          </div>
         </div>
         <div class="col-xs-12 col-sm-6 col-md-4">
          <div class="kpi white-2 qvplaceholder" id="BT-EXPO" style="box-shadow:none">
          </div>
         </div>
        </div>
       </div>
      </div>
     </iron-pages>
     <!-- PAGINAS ## FIM ============================================================ -->
    </paper-drawer-panel>
    <!-- PAGINA UTIL FIM ============================================================ -->
   </paper-header-panel>
   <!-- PAGINA FIM ============================================================ -->
  </paper-drawer-panel>
  <!-- MODAL HELP INI ============================================================ -->
  <!-- Modal -->
  <div aria-hidden="true" class="modal fade" id="basic" role="basic" tabindex="-1">
   <div class="modal-dialog">
    <div class="modal-content">
     <div class="modal-header">
      <button aria-hidden="true" class="close" data-dismiss="modal" type="button">
      </button>
      <h4 class="modal-title">
       SOBRE ESTE PAINEL
      </h4>
     </div>
     <div class="modal-body">
      <p>
       Este painel apresenta informações sobre a distribuição de Vacinas contra a Covid-19, a partir do Ministério da Saúde.
       <br/>
       <br/>
       A fonte dos dados é a Secretaria de Vigilância Sanitária (SVS).
       <br/>
       <br/>
       Informações adicionais podem ser encontradas no site do
       <a href="https://saude.gov.br/">
        Ministério da Saúde
       </a>
       .
       <br/>
       ___________________________
       <br/>
       <br/>
       <img src="UsoPainel.png" style="width:565px;"/>
      </p>
     </div>
     <div class="modal-footer">
      <button class="btn dark btn-outline" data-dismiss="modal" type="button">
       Close
      </button>
     </div>
    </div>
    <!-- /.modal-content -->
   </div>
   <!-- /.modal-dialog -->
  </div>
  <!-- End Modal -->
  <!-- MODAL HELP FIM ============================================================ -->
  <div class="footer" style="z-index: 20000; height:34px; background-color:#ccc;">
   <div style="position:absolute; height:25px; top:10px; left:10px; text-align:left; color:#333;">
    Versão Beta - Maiores informações no site do
    <a href="https://saude.gov.br/">
     Ministério da Saúde
    </a>
   </div>
   <img src="LOGO_BASE.png" style="position:absolute; height:30px; width:145px; bottom:2px; right:10px;"/>
  </div>
  <script>
   var root = this.root;
        $(document).ready(function() {
            $("#nav-drawer paper-menu paper-item").click(function() {
                var index = $(this).index();
                Polymer.dom(root).querySelector("iron-pages").selectIndex(index);
            });
            $("#nav-menu-button").click(function() {
                Polymer.dom(root).querySelector("#nav-drawer").togglePanel();
            });
            $(window).resize(function() {
                Polymer.updateStyles();
            });
        });
  </script>
 </body>
</html>

[Image: screenshot of the inspected page HTML, not reproduced here]

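If I search for that div element, I do not get the data I need.

What I need is a dictionary like this:

{"MG": 655588, "RJ":758120, ...}

The figures in my example may change as the dashboard is updated.

How can I extract the data from these charts, given that it is not in any HTML tag?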

Tags: python, web-scraping, beautifulsoup, charts

Answer


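Approach:

One approach is to use selenium to collect all the values for the chart you pointed out. Navigate to the page, then scroll down to the table that lists the values shown in the chart of interest. Click the square icon in its top-right corner to expand it and, with the window maximized, grab the list of elements and build your dictionary.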
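As a quick check of why requests plus BeautifulSoup is not enough here, the sketch below reads an excerpt of the static HTML from the question and collects the text inside each qvobject placeholder. The placeholders turn out to be empty, which is consistent with the values being filled in by the page's scripts only after it loads in a browser (an inference from the markup, not something confirmed against the site). It uses only the standard library; the class and function names are made up for this illustration, and the parser does not handle unclosed void tags such as a bare img.

```python
from html.parser import HTMLParser

# Excerpt copied from the static HTML shown in the question.
STATIC_HTML = """
<div class="row kpi-row">
 <div class="kpi-side"><i class="fas fa-syringe"></i></div>
 <div class="kpi corkpi01 qvobject" id="KPI-01">
 </div>
</div>
<qliksense-card content-height="300px">
 <div class="with-title qvobject" id="QV1-G01A">
 </div>
</qliksense-card>
"""


class PlaceholderText(HTMLParser):
    """Collect the text inside each element whose class list contains 'qvobject'."""

    def __init__(self):
        super().__init__()
        self.text_by_id = {}   # placeholder id -> accumulated text
        self._current = None   # id of the placeholder being read, if any
        self._depth = 0        # nesting depth inside that placeholder

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            if "qvobject" in (attrs.get("class") or "").split():
                self._current = attrs.get("id", "")
                self._depth = 1
                self.text_by_id[self._current] = ""
        else:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.text_by_id[self._current] += data.strip()


def placeholder_text(html):
    """Return {placeholder id: text} for every qvobject element in html."""
    parser = PlaceholderText()
    parser.feed(html)
    return parser.text_by_id


print(placeholder_text(STATIC_HTML))  # {'KPI-01': '', 'QV1-G01A': ''}
```

Every placeholder maps to an empty string, so there is nothing for a static parser to extract, which is why a real browser session is used instead.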

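The tricky part is identifying the right block. That takes an XPath which finds the right text in a title attribute, then moves up to a level from which the icon that expands the card can be reached: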

//paper-card[*//h1[@title='Resumo das Doses Enviadas aos Estados']]//*[@id='icon']
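To make the logic of that XPath concrete, here is a rough stdlib-only sketch of the same two steps on a made-up snippet: pick the paper-card that has, below one of its children, an h1 with the wanted title, then descend from that card to the element whose id is icon. The snippet, the name attributes, and the find_icon helper are all hypothetical; the real page's markup is more deeply nested, and Selenium evaluates the XPath directly rather than walking the tree like this.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified markup: two cards, each with a title and an expand icon.
SNIPPET = """
<div>
  <paper-card>
    <div><h1 title="Other chart">Other chart</h1></div>
    <button id="icon" name="wrong"/>
  </paper-card>
  <paper-card>
    <div><h1 title="Resumo das Doses Enviadas aos Estados">Resumo</h1></div>
    <button id="icon" name="right"/>
  </paper-card>
</div>
"""


def find_icon(root, title):
    """Mimic //paper-card[*//h1[@title=title]]//*[@id='icon'] step by step."""
    for card in root.iter("paper-card"):
        # Predicate [*//h1[@title=...]]: some child of the card has a
        # descendant h1 whose title attribute matches.
        has_title = any(
            h1.get("title") == title
            for child in card
            for h1 in child.findall(".//h1")
        )
        if has_title:
            # Tail //*[@id='icon']: first descendant of that card with id="icon".
            return card.find(".//*[@id='icon']")
    return None


root = ET.fromstring(SNIPPET)
icon = find_icon(root, "Resumo das Doses Enviadas aos Estados")
print(icon.get("name"))  # right
```

Anchoring on the card that contains the title is what keeps the click from landing on the icon of a different chart, since every card has an element with the same id.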

I would welcome input from anyone who can find a way to get rid of the hard-coded time.sleep(4).


Py:
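One possible direction for dropping the fixed pause, offered as an untested idea rather than a verified fix: poll the cell texts and continue once the same non-empty result has been seen on several consecutive polls. The helper below is a generic sketch with hypothetical names (wait_until_stable, fetch, stable_polls); it is demonstrated with a fake data source and has not been run against the dashboard.

```python
import time


def wait_until_stable(fetch, timeout=10.0, poll=0.5, stable_polls=3):
    """Call fetch() until it returns the same non-empty value on
    stable_polls consecutive polls, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    last, streak = None, 0
    while True:
        current = fetch()
        if current and current == last:
            streak += 1                    # same non-empty value as last poll
        else:
            streak = 1 if current else 0   # changed or empty: restart the count
        last = current
        if streak >= stable_polls:
            return current
        if time.monotonic() >= deadline:
            raise TimeoutError("value did not stabilize in time")
        time.sleep(poll)


# Fake source standing in for a table that fills in gradually.
_snapshots = iter([[], ["MG"], ["MG", "655.588"], ["MG", "655.588"], ["MG", "655.588"]])
print(wait_until_stable(lambda: next(_snapshots), poll=0))  # ['MG', '655.588']
```

With Selenium, fetch could be a function that returns the non-empty texts of the elements found by the cell XPath via d.find_elements(By.XPATH, ...). Elements may go stale while the table re-renders, so such a function would probably need to catch that exception and return an empty list.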

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

d = webdriver.Chrome()
d.maximize_window()
d.get('https://viz.saude.gov.br/extensions/CobVac_MOV/CobVac_MOV.html')
# Wait for the card title to be present, then scroll it into view.
target = WebDriverWait(d, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '[title="Resumo das Doses Enviadas aos Estados"]')))
coord = target.location_once_scrolled_into_view
# Click the expand icon of the card that contains that title.
WebDriverWait(d, 10).until(EC.element_to_be_clickable((By.XPATH, "//paper-card[*//h1[@title='Resumo das Doses Enviadas aos Estados']]//*[@id='icon']"))).click()
# Hard-coded pause while the expanded table renders.
time.sleep(4)
# Collect the text of every non-empty cell in the expanded table.
all_elements = [i.text for i in
                WebDriverWait(d, 5).until(EC.presence_of_all_elements_located((By.XPATH, "//paper-card[*//h1[@title='Resumo das Doses Enviadas aos Estados']]//*[@class='qv-st-value-overflow']/span[@ng-bind='cell.text']")))
                if i.text]

# Pair alternating cells as key and value; the last even-indexed cell is dropped.
results = dict(zip(all_elements[0::2][:-1], all_elements[1::2]))
print(results)
d.quit()

Output:

[Image: screenshot of the printed dictionary, not reproduced here]
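The script collects cell text, so the values in results are strings, while the question asks for integers such as {"MG": 655588}. A minimal sketch of the pairing and conversion follows, assuming the cells alternate label, value and that each value is a whole number written with grouping separators (for example "655.588"). That number format and the to_state_totals name are assumptions, not something confirmed from the dashboard.

```python
def to_state_totals(texts):
    """Pair alternating label/value strings and convert the values to int.

    Assumes each value is a whole number; any non-digit characters
    (such as grouping dots or commas) are discarded.
    """
    labels = texts[0::2]   # even positions: state abbreviations
    values = texts[1::2]   # odd positions: the matching figures
    totals = {}
    for label, value in zip(labels, values):
        digits = "".join(ch for ch in value if ch.isdigit())
        if digits:         # skip cells that contain no number at all
            totals[label] = int(digits)
    return totals


sample = ["MG", "655.588", "RJ", "758.120"]
print(to_state_totals(sample))  # {'MG': 655588, 'RJ': 758120}
```

If the table turns out to contain decimal values, stripping every non-digit character would give wrong numbers, so the conversion would need to follow the actual format.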


Reference: I read about specifying parent and child elements in XPath in @lavinio's answer, https://stackoverflow.com/a/1457668

